Infrastructure

ZooKeeper at Twitter

By Michael Han
Thursday, 11 October 2018

ZooKeeper is a critical piece of Twitter’s infrastructure, one that is fundamental to the operation of our services. Like many systems at Twitter, ZooKeeper operates at a very large scale, with hundreds of ZooKeeper instances active at a given time.

How did we get there? It wasn’t overnight and it wasn’t always a smooth ride. This blog post describes how we use ZooKeeper, along with the challenges we’ve faced and lessons learned while operating ZooKeeper at large scale.

This Tweet is unavailable
This Tweet is unavailable.

What is ZooKeeper?

Apache ZooKeeper is a system for distributed coordination. It enables the implementation of a variety of primitives and mechanisms that are critical for safety and liveness in distributed settings, such as distributed locks, master election, group membership, and configuration management.

This Tweet is unavailable
This Tweet is unavailable.

Use cases

ZooKeeper is used at Twitter as the source of truth for storing critical metadata. It serves as a coordination kernel to provide distributed coordination services, such as leader election and distributed locking.

Some concrete examples of ZooKeeper in action include:

  • ZooKeeper is used to store service registry, which is used by Twitter’s naming service for service discovery.
  • Manhattan, Twitter’s in-house key-value database, stores its cluster topology information in ZooKeeper.
  • EventBus, Twitter’s pub-sub messaging system, stores critical metadata in ZooKeeper and uses ZooKeeper for leader election.
  • Mesos, Twitter’s compute platform, uses ZooKeeper for leader election.

Besides these major use cases, ZooKeeper is occasionally misused as a generic, strongly consistent, in-memory key-value store. However, this use case is usually an anti-pattern, and we have built tools to identify this use case when it occurs.

When ZooKeeper is used in this way, clients quickly hit scalability bottlenecks, forcing a different solution. Our experience indicates that ZooKeeper performs best when used only for storing small amounts of metadata, and serving mostly read workload out of the performance-critical path of client applications.

This Tweet is unavailable
This Tweet is unavailable.

Scaling ZooKeeper at Twitter

Like most other services at Twitter, and as mentioned earlier, ZooKeeper runs at large scale. We run many ZooKeeper clusters (known as “ensembles”), and each ensemble has many servers. However, unlike many higher-level services, ZooKeeper is stateful, and scaling a stateful service is not trivial.

One of the key design decisions made in ZooKeeper involves the concept of a session. A session encapsulates the states between a client and server. Sessions are the first bottleneck for scaling out ZooKeeper, because the cost of establishing and maintaining sessions is non-trivial. Each session establishment and removal is a write request and must go through the consensus pipeline. ZooKeeper does not scale well with write workload, and with hundreds of thousands of clients on each ZooKeeper host, the session establishment requests alone would keep the ensemble busy, preventing it from serving the usual read requests. We solved this with the local session feature introduced by engineers at Facebook, who faced similar issues. Currently on all Twitter’s ensembles, local sessions are used by default and will automatically be upgraded to global session when needed (e.g., client creates ephemeral nodes).

The next scaling bottleneck is serving read requests from hundreds of thousands of clients. In ZooKeeper, each client needs to maintain a TCP connection with the host that serves its requests. Sooner or later, the host will hit the TCP connection limit, given the growing number of clients. One part of solving this problem is the Observer, which allows for scaling out capacity for read requests without increasing the amount of consensus work. To further improve the stability of our ensembles, we often only allow Observers to serve direct client traffic, by removing the quorum members’ records from our DNS.

Besides these two major scaling improvements, we deployed the mixed read/write workload improvements and other small fixes from the community.

 

This Tweet is unavailable
This Tweet is unavailable.

Operating ZooKeeper at Twitter

Another critical aspect of scaling is operation scaling. Based on our experience, the keys to scaling operations are:

  • Observability: Good observability with multiple sources of insight into the ensembles is a must. We can’t operate ZooKeeper as a black box; such a complicated system requires good visibility to allow us to understand the cause of incidents when they occur. To gain insight into the state of the ensemble hosts, we have added many metrics to ZooKeeper, which are published to Twitter’s observability stack through Finagle. The metrics are integrated with our alerting system, which will notify engineers with various levels of alert severity.
  • Topology changes: Growing and shrinking an ensemble must be operationally easy, impose little impact on production traffic, and be safe. Twitter’s ZooKeeper ensembles run on bare metal hardware, however, and hardware can fail. When a host fails, we need to remove it from the ensemble and replace it with a new host. Another use case is adding new observer hosts to an ensemble to serve more traffic. How do we change the configuration of the ZooKeeper ensemble itself while continuing to serve traffic? Dynamic reconfiguration is here to help. It’s one of our favorite features in ZooKeeper 3.x. We have built tools for performing topology changes around dynamic reconfiguration. The tools are a great time-saver, and spare us from the disruption caused by the old way of performing these changes, through rolling restarts.
  • Backup: Data must be backed up such that we can restore the ensembles from total ensemble loss or accidental deletion. We back up by pushing the locally stored snapshots and transaction logs to remote storage (HDFS in our case), and using them as the source of restore. The backup module is coded as a relatively self-contained module in our ZooKeeper codebase, and the backup is running as a dedicated observer. This observer runs on Mesos and does not serve client requests. The backups provide a useful way to analyze how the ZooKeeper data tree’s structure has changed over time, for instance, in cases of runaway data growth due to a misbehaving client.
This Tweet is unavailable
This Tweet is unavailable.

What’s next

Currently we are running Apache ZooKeeper 3.5.2-alpha version, with many internal patches on top of it. We are working on rebasing our ZooKeeper code base to trunk, and working on contributing some of our internal patches upstream. As for feature work, we are very interested in improving availability of the ZooKeeper ensemble, including improving leader election time, adding rate limiting so the ensemble can degrade gracefully under high traffic, and ultimately supporting multi-tenancy so we can consolidate our ZooKeeper ensemble and make it more cost effective. We look forward to working with the Apache ZooKeeper community to push Apache ZooKeeper to the next level.

This Tweet is unavailable
This Tweet is unavailable.
@hanmsf

Michael Han

‎@hanmsf‎

Michael Han is a software engineer on the Twitter Platform Infrastructure team.