Caching with Twemcache

Tuesday, 10 July 2012

Update - July 11, 2012, 9:45am

We want to correct an error regarding the slab calcification problem we mentioned in the original post. This problem only applied to our v1.4.4 fork of Memcached; this correction is reflected below. The recent Memcached version has addressed some of these problems.

We built Twemcache because we needed a more robust and manageable version of Memcached, suitable for our large-scale production environment. Today, we are open-sourcing Twemcache under the New BSD license. As one of the largest adopters of Memcached, a popular open source caching system, we have used Memcached over the years to help us scale our ever-growing traffic. Today, we have hundreds of dedicated cache servers keeping over 20TB of data from over 30 services in-memory, including crucial data such as user information and Tweets. Collectively these servers handle almost 2 trillion queries on any given day (that’s more than 23 million queries per second). As we continued to grow, we needed a more robust and manageable version of Memcached suitable for our large scale production environment.

We have been running Twemcache in production for more than a year and a half. Twemcache is based on a fork of Memcached v1.4.4 that is heavily modified to improve maintainability and help us monitor our cache servers better. We improved performance, removed code that we didn’t find necessary, refactored large source files and added observability related features. The following sections will provide more details on why we did this and what those new features are.

Motivation

Almost all of our cache use cases fall into two categories:

as an optimization for disk where cache is used as the in-memory serving layer to shed load from databases.
as an optimization for cpu where cache is used as a buffer to store items that are expensive to recompute.

An example of these two optimizations is “caching of Tweets”. All Tweets are persisted to disk when they are created, but most Tweets requested by users need to be served out of memory for performance reasons. We use Twemcache to store recent and frequently accessed Tweets, as an optimization for disk. When a Tweet shows up in a particular client, it takes a particular presentation - rendered Tweet - which has other metadata like number of retweets, favorites etc. We also use Twemcache to store the recently rendered Tweets, as an optimization for cpu.

To effectively address the use cases mentioned above, it’s extremely important that caches are always available and have predictable performance with respect to item hit rate even when operating at full capacity. Caches should also be able to adapt to changing item sizes on-the-fly as application data size grows or shrinks over time. Finally, it is critical to have observability into caches to monitor the health and effectiveness of our cache clusters. It turns out that all these problems are interrelated because adapting to changing item sizes usually requires a cache reconfiguration — which impacts availability and predictability. Twemcache tries to address these needs with the help of the following features:

Random Eviction

The v1.4.4 implementation of Memcached, which Twemcache is based on, suffers from a problem we call slab calcification. In Memcached, a slab can only store items of a given maximum size and once a slab has been allocated to a slab class, it cannot be reassigned to another slab class. In other words, slabs once allocated are locked to their respective slab classes. This is the crux of the slab calcification problem. When items grow or shrink in size, new slabs must be to allocated to store them. Over time, when caches reach full memory capacity, to store newer items we must rely on evicting existing items in the same slab class. If the newer items are of a size with no slabs allocated, write requests may fail completely. Meanwhile, slabs allocated to a different slab class may sit idle. Slab calcification leads to loss of capacity and efficiency.

To solve this problem without resorting to periodically restarting the server instances, we introduced a new eviction strategy called random eviction. In this strategy, when a new item needs to be inserted and it cannot be accommodated by the space occupied by an expired item or the available free memory, we’ll simply pick a random slab from the list of all allocated slabs, evict all items within that slab, and reallocate it to the slab class that fits the new item.

It turns out that this feature is quite powerful for two reasons:

Cache servers can now gracefully move on-the-fly from one slab size to another for a given application. This enables our cache servers to adapt to changing item sizes and have a predictable long term hit rate by caching an application’s active working set of items.
Application developers don’t have to worry about reconfiguring their cache server when they add or delete fields from their cache item structures or if their item size grows over time.

By providing a stable hit rate, random eviction prevents performance degradation due to data pattern change and system instability associated with restarts. The video below illustrates how over time Twemcache is able to adapt to a shifting size pattern and still remain effective.

Lock-less Stats Collection

Cache observability enables us to monitor the health of our cache clusters and ensure that applications are using them effectively. To address this need, we redesigned the Memcached stats module. Similar to the findings in Facebook’s attempt to scale Memcached, we found that the global statistics lock was a main contention point.

This motivated us to use an updater-aggregator model of thread synchronization, in which worker threads always update thread-local metrics, and a background aggregator thread asynchronously collects metrics from all threads periodically holding only one thread-local lock at a time. Once aggregated, stats polling comes for free. Removing a global lock reduces the time Twemcache spends in a unresponsive state. There is a slight trade-off between how up-to-date stats are and how much burden stats collection puts on the system. However, the difference in total mutex wait time between aggregating once and 100 times per second is under 20%, and the impact on performance is totally predictable and thread-local. On top of making stats collection scalable, we also systematically reviewed the metrics, and came up with a more comprehensive list of metrics: Memcached provides 48 global metrics, 18 slab metrics and 10 item stats; Twemcache, on the other hand, provides 74 global metrics and 39 slab metrics. We merged item metrics into slab metrics to further simplify stats collection.

Asynchronous Command Logger

When using Memcached, one of the hardest problems we faced was the hit-rate and memory-footprint trade-off - the sweet spot for achieving the desired performance gain with reasonable resources, as it is typically not possible to keep the entire data set in memory. To pinpoint the minimum memory requirement for a given hit rate, we needed a way to systematically analyze an application’s data access pattern. To address this need, we implemented a new feature called command logger in Twemcache. When turned on, each worker thread will record a time stamped command header as well as return status, as shown below:

Each line of the command log gives precise information on the client, the time when a request was received, the command header including the command, key, flags and data length, a return code, and reply message length. In fact, the only thing missing is the item value itself, which turns out to be unimportant for our analysis.

The command logger supports lockless read/write into ring buffers. Each worker thread logs into a thread-local buffer as they process incoming queries, and a background thread asynchronously dumps buffer contents to either a file or a socket. Thus the overhead on worker threads is minimal and so would not affect the service availability. The logger has been tested to log at about 100k requests-per-second. To control the speed of log generation, the command logger also supports sampling. Once we know what keys are accessed, the way they are accessed, and their return status, we can perform offline data analysis to estimate optimal working set size, item heat map, etc.

Future work

Twemcache is the result of our effort to turn Memcached into a reliable building block in Twitter’s data infrastructure. We kept the simplicity of the Memcached protocol intact, but made the service more dependable and more informative with Twemcache, without sacrificing performance. While we initially focused on the challenging goal of making Memcached work extremely well within the Twitter infrastructure, we look forward to sharing our code and ideas with the Memcached community in the long term.

In the near future, we plan to evolve Twemcache in the open, address the hashtable lock contention issue that would further improve scalability, support more eviction strategies, support bootstrapping the cache from disk and provide a complete set of real-time data analysis tools. To view the source code and share feedback, please visit the Twemcache GitHub page. You can also follow Twemcache’s Twitter account (@Twemcache) for updates. We would love to hear any ideas you have in improving Twemcache via pull requests or issues. Or better yet, why not consider joining the flock (@jointheflock) if you want to help build a world class caching system?

Other work: Twemproxy

Twemcache is one of the building blocks that comprise the caching system at Twitter. Another fundamental building block in our caching system is Twemproxy, a proxy for memcached protocol that we recently open sourced. Twemproxy minimizes the connections to our backend caching servers and enables us to scale horizontally. Furthermore, we are also actively developing the client side of our caching system on top of the Twitter Finagle stack.

Acknowledgements

Twemcache was primarily engineered by Manju Rajashekhar (@manju) and Yao Yue (@thinkingfish). In addition, we’d like to acknowledge the following folks who contributed to the project either directly or indirectly and its deployment and maintenance in our datacenters: Anirudh Srinivas (@asrin), David Lam (@kkdlam), Krishna Gade (@krishnagade), Joshua Coats (@shu), Owen Vallis (@o_e_bert), Rob Benson (@rgbenson), Brandon Mitchell (@bitbckt) and Xin Xiang (@xiangxin72).

- Chris Aniszczyk, Manager of Open Source (@cra)