ZooKeeper is a critical piece of Twitter’s infrastructure, one that is fundamental to the operation of our services. Like many systems at Twitter, ZooKeeper operates at a very large scale, with hundreds of ZooKeeper instances active at a given time.
How did we get there? It wasn’t overnight and it wasn’t always a smooth ride. This blog post describes how we use ZooKeeper, along with the challenges we’ve faced and lessons learned while operating ZooKeeper at large scale.
Apache ZooKeeper is a system for distributed coordination. It enables the implementation of a variety of primitives and mechanisms that are critical for safety and liveness in distributed settings, such as distributed locks, master election, group membership, and configuration management.
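One of those primitives, leader election, is commonly built on ZooKeeper's ephemeral sequential znodes: each candidate creates one, and the candidate holding the lowest sequence number is the leader. The sketch below is illustrative only (it is not Twitter's implementation, and the names are made up); a tiny in-memory stand-in replaces a live ensemble so the recipe's core logic can run anywhere.

```python
# Illustrative sketch of the classic ZooKeeper leader-election recipe:
# each candidate creates an ephemeral sequential znode, and the lowest
# sequence number wins. FakeZk is a hypothetical in-memory stand-in for
# a real ensemble, used here only to make the logic runnable.

class FakeZk:
    """Minimal stand-in for an ensemble: hands out sequential znode names."""
    def __init__(self):
        self.seq = 0
        self.znodes = {}  # path -> owning session

    def create_sequential(self, prefix, session):
        # A real ensemble appends a monotonically increasing sequence number.
        path = f"{prefix}{self.seq:010d}"
        self.seq += 1
        self.znodes[path] = session
        return path

    def children(self, parent):
        return sorted(p for p in self.znodes if p.startswith(parent))

def elect_leader(zk, candidates, parent="/election/n_"):
    # Each candidate registers an ephemeral sequential znode.
    owned = {zk.create_sequential(parent, c): c for c in candidates}
    # The smallest sequence number is the leader; in the full recipe, every
    # other candidate watches the znode immediately below its own to avoid
    # a thundering herd when the leader dies.
    return owned[zk.children(parent)[0]]

zk = FakeZk()
leader = elect_leader(zk, ["svc-a", "svc-b", "svc-c"])
print(leader)  # svc-a registered first and holds the lowest sequence number
```

Because the winning znode is ephemeral, a real ensemble deletes it automatically when the leader's session ends, which is what triggers re-election.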
ZooKeeper is used at Twitter as the source of truth for storing critical metadata. It serves as a coordination kernel to provide distributed coordination services, such as leader election and distributed locking.
Some concrete examples of ZooKeeper in action include:
Besides these major use cases, ZooKeeper is occasionally misused as a generic, strongly consistent, in-memory key-value store. This is an anti-pattern, and we have built tooling to detect it when it occurs. Clients that use ZooKeeper this way quickly hit scalability bottlenecks and are forced onto a different solution. Our experience is that ZooKeeper performs best when it stores only small amounts of metadata and serves read-mostly workloads outside the performance-critical path of client applications.
Like most other services at Twitter, and as mentioned earlier, ZooKeeper runs at large scale. We run many ZooKeeper clusters (known as “ensembles”), and each ensemble has many servers. However, unlike many higher-level services, ZooKeeper is stateful, and scaling a stateful service is not trivial.
One of the key design decisions in ZooKeeper is the concept of a session. A session encapsulates the state between a client and a server. Sessions are the first bottleneck for scaling out ZooKeeper, because the cost of establishing and maintaining them is non-trivial: each session establishment and removal is a write request that must go through the consensus pipeline. ZooKeeper does not scale well with write workload, and with hundreds of thousands of clients per ZooKeeper host, the session establishment requests alone would keep the ensemble busy, preventing it from serving normal read requests. We solved this with the local session feature introduced by engineers at Facebook, who faced similar issues. On all of Twitter's ensembles, local sessions are now used by default and are automatically upgraded to global sessions when needed (e.g., when a client creates an ephemeral node).
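The upgrade behavior described above can be sketched as follows. This is a hypothetical model, not ZooKeeper's internal API: reads stay on a cheap local session, and the session is promoted to a global (consensus-backed) one only when the client needs ephemeral nodes, which must survive a reconnect to a different server.

```python
# Hypothetical sketch of local-session semantics: local sessions serve
# reads without consensus traffic; creating an ephemeral node forces an
# upgrade to a global session. Class and counter names are illustrative.

class Session:
    def __init__(self):
        self.is_global = False
        self.consensus_writes = 0  # rounds through the quorum pipeline

    def read(self, path):
        # Local-session reads are served by the connected server alone.
        return f"data@{path}"

    def create(self, path, ephemeral=False):
        if ephemeral and not self.is_global:
            # Ephemeral-node ownership must be visible to the whole
            # ensemble, so the session is promoted via a consensus write.
            self.is_global = True
            self.consensus_writes += 1
        self.consensus_writes += 1  # the create itself is also a write
        return path

s = Session()
s.read("/config/foo")                    # no consensus traffic at all
s.create("/locks/lock-1", ephemeral=True)
print(s.is_global, s.consensus_writes)   # True 2
```

The payoff is that the common case, many short-lived read-only clients, never touches the consensus pipeline at all.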
The next scaling bottleneck is serving read requests from hundreds of thousands of clients. In ZooKeeper, each client maintains a TCP connection with the host that serves its requests, so as the number of clients grows, a host will sooner or later hit its TCP connection limit. One part of the solution is the Observer, a non-voting ensemble member that scales out read capacity without increasing the amount of consensus work. To further improve the stability of our ensembles, we often allow only Observers to serve direct client traffic, by removing the quorum members' records from our DNS.
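For reference, an ensemble with Observers is declared in zoo.cfg roughly as follows (hostnames here are placeholders). Observers are marked with the `:observer` suffix in every server's config, and each Observer additionally sets `peerType=observer` in its own config. The client-facing DNS record would then list only obs1 and obs2.

```properties
# Voting quorum members
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
# Observers: receive committed updates but do not vote in consensus
server.4=obs1.example.com:2888:3888:observer
server.5=obs2.example.com:2888:3888:observer
```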
Besides these two major scaling improvements, we deployed the mixed read/write workload improvements and other small fixes from the community.
Another critical aspect of scaling is operation scaling. Based on our experience, the keys to scaling operations are:
We currently run Apache ZooKeeper 3.5.2-alpha with many internal patches on top. We are rebasing our ZooKeeper code base onto trunk and contributing some of our internal patches upstream. As for feature work, we are most interested in improving the availability of our ZooKeeper ensembles: shortening leader election time, adding rate limiting so an ensemble degrades gracefully under high traffic, and ultimately supporting multi-tenancy so we can consolidate our ensembles and make them more cost effective. We look forward to working with the Apache ZooKeeper community to push Apache ZooKeeper to the next level.