Libcrunch and CRUSH Maps

Wednesday, 19 June 2013

When we introduced our in-house photo storage system Blobstore to the world, we discussed a mapping framework called libcrunch for Blobstore that maps virtual buckets to storage nodes. The libcrunch implementation was heavily inspired by the seminal paper on the CRUSH algorithm. Today, we are open-sourcing libcrunch on GitHub under the Apache Public License 2.0 to share some of our code with the greater community.

Why libcrunch?

In developing Blobstore, we knew we wanted a mapping framework that can:

support a flexible topology definition
support placement rules such as rack diversity
handle replication factor (RF)
distribute data with a good balance
cause stable data movement as a result of topology change (when a small change in the topology occurs, the resulting data movement should be similarly small)
exhibit a good mean time to recovery when a node fails

In addition to these fairly standard requirements, we zeroed in on another critical factor, which we call Replica Distribution Factor (RDF). In our initial research, we didn’t find any open source libraries that met our needs on the JVM, so we developed our own. There are mapping algorithms that may satisfy some of these criteria (such as consistent hashing) but none satisfies all of them, especially RDF.

What is RDF?

RDF is defined as the number of data nodes that share any data with a node. To understand RDF, you can look at one data node and how many other data nodes share any data with that node. In an extreme case, you can imagine data mapping where each data node is completely replicated by another data node. In this case, RDF would be 1. In another extreme case, every single data node may participate in having replicas of every other node. In that case, RDF would be as large as the size of the entire cluster.

The key concern RDF seeks to address is the (permanent) loss of any data. It would be useful to think about the following scenario. Suppose the replication factor is 2 (2 replicas for each datum). And suppose that we lost one data node for any reason (disk failures and loss of racks). Then the number of replicas for any data that was on that lost data node is down to 1. At this point, if we lose any of those replicas, that piece of data is permanently lost. Assuming that the probability of losing one data node is small and independent (a crude but useful approximation), one can recognize that the probability of losing any data increases proportionally with the number of data nodes that share data with the lost node. And that is the definition of RDF. The bigger the RDF the bigger the probability of losing any data in case of data node loss. By tuning RDF down to a smaller number, one can mitigate the probability of permanent data loss to an acceptable level.

As you can imagine, RDF becomes much more relevant if the replication factor (RF) is small. One can adopt a larger RF to address the risk of data loss but it would come at a cost, and a prohibitively expensive one if the data size is large.

Libcunch is designed to deliver these functionalities, including RDF.

How does it work?

The libcrunch implementation uses the basic CRUSH algorithm as the building block of how it computes mapping. The CRUSH algorithm provides a number of functionalities that are mentioned in the paper. By using this algorithm to store and retrieve data, we can avoid a single point of failure and scale easily.

To be able to limit the size of the RDF, we use a two-pass approach. In the first pass, we compute what we call the RDF mapping using the same cluster topology but using each data node (or its identifier) as the data. This way, we can come up with a fairly well-defined RDF set from which data mapping can be handled later. In the second pass, we compute the actual data mapping. But for a given data object, we don’t use the full cluster to select the data nodes. Instead, we limit the selection to one of those RDF set we computed in the first pass.

Example code

A mapping function is needed when you have a number of data objects you want to distribute to a number of nodes or containers. For example, you may want to distribute files to a number of storage machines. But it may not need to be limited to physical storage. Any time logical data is mapped to a logical container, you can use a mapping function.

Creating and using the libcrunch mapping functions are pretty straightforward. The key part is to implement the placement rules you desire (such as rack isolation rules), and set up your cluster topology in terms of the type Node provided by libcrunch. Then you get the mapping result via the MappingFunction.computeMapping() method. For example:

// set up the placement rules
PlacementRules rules = createPlacementRules();

// instantiate the mapping function
MappingFunction mappingFunction = new RDFMapping(rdf, rf, rules, targetBalance);

// prepare your data
List<Long> data = prepareYourDataIds();

// set up the topology
Node root = createTopology();

// compute the mapping
Map<Long,List<Node>> mapping = mappingFunction.computeMapping(data, root);

Future work

In the near future, we look forward to improving documentation and nurturing a community around libcrunch. We are also constantly looking for ways to improve various aspects of the algorithm such as balance and stability. We are also planning to adopt libcrunch in other storage systems we are developing at Twitter.

If you’d like to help work on any features or have any bug fixes, we’re always looking for contributions or people to join the flock to build out our core storage technology. Just submit a pull request to say hello or reach out to us on the mailing list. If you find something broken or have feature request ideas, report it in the issue tracker.

Acknowledgements

Libcrunch is a team effort by Twitter’s Core Storage team (@corestorage). It was primarily authored by Jerry Xu (@jerryxu) and Sangjin Lee (@sjlee). The idea for libcrunch came out of discussions by Peter Schuller (@scode), Boaz Avital (@bx) and Chris Goffinet (@lenn0x), who also has made a number of direct contributions to the current manifestation. We’d also like to thank Stu Hood (@stuhood) for his invaluable feedback and contributions.