Manhattan is Twitter’s internally developed distributed key-value store. It’s the default storage at Twitter for persistent real-time serving needs and is currently used to serve all core nouns at Twitter including Tweets, Users, and Direct Messages (DMs). Manhattan was designed from its inception to have a pluggable storage engine. The storage engine is the lowest layer of Manhattan's stack and it is responsible for storing and retrieving data within an individual node.
As outlined in an earlier blog post, Manhattan had two read-write storage engines:
Most of Manhattan's workloads in production are read-heavy. As a result, the majority of our clusters were running on MhBtree. While MhBtree gave good performance for read-heavy workloads, it had a few shortcomings:
Due to the above shortcomings, our team explored alternative storage engines and discovered RocksDB, which is an LSM-based storage engine that was gaining increased adoption. In particular, the tunable amplification looked very promising and presented an opportunity to unify our two read-write storage engines into a single one that could be tuned differently to different workloads.
In 2017, we kickstarted an evaluation of RocksDB as a new storage engine for Manhattan. Since RocksDB is also an LSM-based storage engine, we were reasonably confident that its performance could match that of SSTable for write workload. Our biggest uncertainty was whether RocksDB could match MhBtree's performance for read-heavy workloads. We wanted to evaluate RocksBD’s performance on these workloads before investing a lot of time into integrating RocksDB into Manhattan and making it production-ready.
So we created a prototype RocksDB storage engine with reduced functionality and relaxed correctness requirements. We set up an experiment cluster with our prototype RocksDB backend as the backing store and ran a variety of synthetic workloads, predominantly the read-heavy workload of our largest cluster. After several rounds of tuning (RocksDB has a *lot* of options to tune), the results looked promising:
These results gave us the confidence to move forward with RocksDB and invest in production-grade integration.
The data model for Manhattan is described in this blog post from 2018:
"Manhattan provides a data model where each record contains a partition key (PKey) and a local key (LKey) mapping to a value. A partition key, as the name suggests, is used to partition a record onto different nodes. A partition key can have multiple local keys that are sorted."
Additionally, because Manhattan is a multi-tenant system – that is, a single cluster is shared by multiple applications/datasets – a third key is needed to disambiguate between different datasets within the same cluster. So together, a Manhattan key consists of 3 components: dataset ID, PKey, LKey.
In addition to the user-provided value payload, Manhattan also maintains its own internal fields such as the timestamp of the last write/modification, an expiration timestamp (also known as “time to live”, or TTL), and a boolean flag that indicates whether a value is soft-deleted (called a “tombstone” flag). Manhattan by design does not rely on the storage engine to provide these values and instead maintains them itself at a higher abstraction level. This design choice is what enables Manhattan to have a pluggable storage engine.
RocksDB only supports raw bytes as keys and values. We encode the multi-component keys and values described above using a simple length-prefixed encoding: For each component, we first write out its size/length in bytes, followed by the serialized bytes of the component.
RocksDB's default key-ordering is based on a byte-wise comparison of keys. Since we have a custom encoding scheme for keys, the default byte-wise comparison of keys would not be logically correct. For example, consider two PKeys, abc and d, for the same dataset. In their physical encodings, the dataset ID would be the same for both. After that, the length-encoding bytes for abc would have the value 3, while the same bytes for d would have the value 1. A byte-wise comparison of length bytes would result in abc being greater than d since 3 is greater than 1, even though a logical comparison would normally have d be “greater than” abc.
To correctly implement the desired logical behaviour, we make use of RocksDB’s key comparator hook and override it with an implementation that decodes the logical components of a key and performs a component-wise logical comparison.
As described in the Value section above, Manhattan maintains its own application-level timestamp. This application-level timestamp is used to provide “last write wins” semantics to the user – In other words, a value with a higher Manhattan timestamp is “greater than” a value with a lower Manhattan timestamp, irrespective of which one was written last in RocksDB.
To achieve these semantics, we needed to define and use the Merge operator in RocksDB.
RocksDB doesn’t let you override the basic put/delete operations; instead, it provides an operation called Merge that allows applications to define custom read-modify-write semantics. The desired read-modify-write semantic for Manhattan can be defined in pseudocode as:
Replace existingValue if (existingValue.timestamp < newValue.timestamp)
In order to apply the above semantics for all writes, Manhattan exclusively uses the Merge operation for any write or mutate operations, and never uses the put/delete operation.
In order to make the Merge operation work, we implement our own custom Merge operator that implements the desired semantics. Given a list of candidate values (merge operands) for a key, it picks and returns the “latest” value, with "latest" being defined as the value with the largest Manhattan timestamp.
Most read queries to Manhattan are of the form:
Get everything under a (dataset ID, PKey)
Get a range/slice within a (dataset ID, PKey) from [LKey_start, LKey_end]
As you might have observed, both the query types share the same prefix (dataset, pkey). RocksDB allows you to configure a Prefix Extractor, and doing so enables RocksDB to use more efficient data structures that improve performance for such short-read use cases.
We built a prefix extractor that decodes the key and returns the bytes up until the end of the PKey as the “prefix”. In other words, the prefix consists of the dataset ID length, dataset ID bytes, PKey length, and PKey bytes.
As described in the Value section above, Manhattan maintains its own tombstone flag to indicate if a value is deleted at the application level, and a TTL field. As a result, it is Manhattan’s responsibility to implement its own “garbage collection” mechanism. RocksDB conveniently provides a compaction filter hook that allows applications to perform their own garbage collection.
Manhattan makes use of the hook and implements its own compaction filter that inspects values for tombstone and TTL fields, using this data to decide whether to keep or remove a value.
Barring the compaction filter, all of the components described above were implemented in C++ for performance reasons. The compaction filter, however, was implemented in Java for two reasons:
Java Native Interface(JNI) Attachment Mechanics
Unfortunately, our first Java implementation caused a slowdown in compaction performance. With significant help from the amazing JVM team at Twitter, we tracked this slowdown to the thread “attach” and “detach” steps that occur during JNI handoff:
For every key-value pair in the compaction input, the C++ compaction thread makes a JNI call that first “attaches” to the JVM. It then calls FilterV2, and finally “detaches” from the JVM. This attachment and detachment process is controlled via a mutex lock, which is the source of the slowdown.
To mitigate this, we removed the “detach” phase completely. This works because:
(Note, however, that the latter point is not true if you enable subcompactions, since subcompactions are executed on ad-hoc thread pools instead of from a fixed background thread pool.)
More details on this experimental approach can be found in this RocksDB pull request.
It took about a year for us to implement everything described above and finish productionizing the new RocksDB storage engine; this work was complete by mid-2018. Over the next two years, we migrated several Manhattan clusters to RocksDB, including clusters that hosted several of the core nouns at Twitter: Tweets, User Profiles, and Direct Messages (DMs). All of this data amounts to trillions of keys and Petabytes of storage.
Through this migration, we:
While the migration ended in a happy place, the journey has not been entirely smooth. Future blog posts will describe
It took a village to get this large migration project across the finish line.
Eng: Ben Clay, Vijay Teja Gottipati, Sowmya Pary, Jigar Bhati, Sahib Pandori, Donghua Liu, Kevin Yang, Justin Hendryx, Sean Ellis, Dhaval Patel
EMs: Rashmi Ramesh, Unmesh Jagtap, Karthik Katooru
Special thanks to Andrew Kryczka, who provided us with deep RocksDB knowledge and helped solve some of our technical challenges during the migration.
Did someone say … cookies?