The real-time nature of Twitter poses unique and challenging engineering problems. We need to quickly surface breaking news, serve relevant advertisements to users, and address many other real-time use cases. Twitter’s Pub/Sub system provides the infrastructure for teams across the company to handle this workload.
The Messaging team at Twitter has been running an in-house Pub/Sub system, EventBus (built on top of Apache DistributedLog), for the last few years but we’ve recently made the decision to pivot toward Apache Kafka, both migrating existing use cases and onboarding new use cases. In this blog post, we will discuss why we chose to adopt Kafka as Twitter’s Pub/Sub system as well as the different challenges we faced along the way.
Apache Kafka is an open-source distributed streaming platform that enables data to be transferred at high throughput with low latency. Originally conceived at LinkedIn and open-sourced in 2011, Kafka has since seen broad adoption across the industry, making it the de facto real-time messaging system of choice.
At its core, Kafka is a Pub/Sub system built on top of logs, with many desirable properties, such as horizontal scalability and fault tolerance. Kafka has since evolved from a messaging system to a full-fledged streaming platform (see Kafka Streams).
You may be wondering why Twitter chose to build an in-house messaging system in the first place. Twitter actually used Kafka (0.7) a few years ago, but we found several issues that made it unsuitable for our use cases at the time, mainly the number of I/O operations required during catchup reads and the lack of durability and replication. However, both hardware and Kafka have come a long way since then, and these concerns have been addressed. SSDs have become cheap enough to use, which eliminates the random-read I/O problems we saw on HDDs, and server NICs have much more bandwidth, making it less compelling to split the serving and storage layers (as EventBus does). Additionally, newer versions of Kafka support data replication, providing the durability guarantees we wanted.
Migrating all of Twitter’s Pub/Sub use cases to a completely new system is an expensive process, so the decision to move to Kafka was not spontaneous, but rather carefully planned and data-driven. The motivation can be summarized by two main factors: cost and community.
Before the decision to move to Kafka was announced across the company, our team spent several months evaluating Kafka under workloads similar to those we run on EventBus: durable writes, tailing reads, catchup reads, and high-fanout reads, as well as some grey-failure scenarios (e.g., slowing down a specific broker in the cluster).
In terms of performance, Kafka had significantly lower latency than EventBus at every throughput level we tested, where latency is measured as the difference between the timestamp at which a message was created and the time the consumer read it.
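As a concrete illustration of the metric, here is a minimal sketch of computing p99 end-to-end latency from (created, consumed) timestamp pairs. The sample data is invented for the example, not taken from our benchmark results.

```python
# Hypothetical sketch: end-to-end Pub/Sub latency measured as the gap
# between a message's creation timestamp and the time a consumer read it.

def p99_latency_ms(samples):
    """samples: list of (created_ms, consumed_ms) pairs."""
    latencies = sorted(consumed - created for created, consumed in samples)
    # 99th percentile via the nearest-rank method.
    idx = max(0, int(len(latencies) * 0.99) - 1)
    return latencies[idx]

# Example: 100 messages, most read ~5 ms after creation, a few stragglers.
samples = [(t, t + 5) for t in range(95)] + [(t, t + 40) for t in range(95, 100)]
print(p99_latency_ms(samples))  # -> 40
```

In a real benchmark the pairs would come from a producer stamping each record at creation time and a consumer recording its read time, rather than from a synthetic list.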
Figure: P99 latency comparison between EventBus and Kafka at different throughputs
This can be attributed to several factors, including that EventBus’s separate serving and storage layers add extra hops on the read and write paths, and that EventBus fsyncs data to disk before acknowledging a write while Kafka does not (both discussed in more detail below).
Figure: Architecture comparison between EventBus (left) and Kafka (right)
From a cost standpoint, EventBus requires hardware for both the serving layer (optimized for high network throughput) and the storage layer (optimized for disk), while Kafka uses a single host to provide both. As a result, EventBus requires more machines than Kafka to serve the same workload. For single-consumer use cases we saw a 68% resource savings, and for fanout cases with multiple consumers we saw a 75% resource savings. One caveat is that for extremely bandwidth-heavy workloads (very high fanout reads), EventBus could in theory be more efficient, since its serving layer can be scaled out independently. In practice, however, we’ve found that our fanout is not extreme enough to merit a separate serving layer, especially given the bandwidth available on modern hardware.
As mentioned above, Kafka has been widely adopted. This helps us by letting us leverage the bug fixes, improvements, and new features that hundreds of developers contribute to the Kafka project, as opposed to the eight or so engineers working on EventBus/DistributedLog. Furthermore, many of the features our customers at Twitter have wanted in EventBus, such as a streaming library, an at-least-once HDFS pipeline, and exactly-once processing, have already been built in Kafka.
Additionally, when we run into problems on either the client or the server, we can often find solutions quickly by searching the web, since it is likely that someone else has encountered the same issue. Similarly, the documentation for well-adopted projects is usually far more exhaustive than that for less popular ones.
Adopting and contributing back to a popular project such as Kafka also helps with recruiting. On one hand, contributing back to Kafka gives people visibility into Twitter’s engineering. On the other hand, hiring engineers for the team is much easier when candidates are already familiar with the technology, which eliminates the ramp-up time that EventBus would require.
As great as moving to Kafka sounds, it wasn’t smooth sailing. We faced numerous technical challenges as well as adaptive challenges in the process.
From a technical standpoint, some of the challenges we encountered involved configuration tuning and the Kafka Streams library. Like many distributed systems, Kafka has an enormous number of configurations that needed to be fine-tuned to support Twitter’s real-time use cases. While running Kafka Streams, we also found issues with metadata size in the library, caused by stale clients persisting their metadata even after they were shut down.
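To give a flavor of the kind of tuning involved, here are a few standard Kafka client settings that trade batching efficiency against latency. The values are purely illustrative, not Twitter’s production configuration:

```properties
# Producer: favor latency over batching.
linger.ms=0            # send immediately instead of waiting to fill a batch
batch.size=16384       # cap on bytes batched per partition

# Consumer: return fetches promptly even when little data is available.
fetch.min.bytes=1
fetch.max.wait.ms=100
```

Each of these interacts with throughput, broker load, and tail latency, which is why tuning them for a real-time workload takes sustained experimentation.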
Kafka also has architectural differences from EventBus that required us to configure the system and debug issues differently. One example is replication: EventBus uses quorum writes, sending write requests to all replicas in parallel, while Kafka uses leader-follower replication, in which followers replicate a write only after the leader has received it. The durability models also differ: EventBus acknowledges a write only once the data is persisted (fsync’d) to disk, while Kafka takes the position that replication itself guarantees durability and acknowledges a write before the data is durably stored on disk. We had to rethink our definition of data durability to account for the fact that the likelihood of all three replicas of the data failing at once is low enough that fsync’ing data on each replica is not necessary to provide the durability guarantees we want to offer.
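In Kafka terms, this durability trade-off is expressed through replication and acknowledgement settings rather than per-write fsyncs. A sketch with illustrative values, not our production settings:

```properties
# Broker/topic side: durability comes from having copies on multiple
# brokers; flushing to disk is left to the OS page cache (Kafka's default).
default.replication.factor=3
min.insync.replicas=2

# Producer side: with acks=all, an acknowledged write is in memory on every
# in-sync replica, but not necessarily persisted to disk on any of them.
acks=all
```

With three replicas and `min.insync.replicas=2`, an acknowledged write survives the loss of any single broker, which is the property we rely on in place of per-write fsyncs.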
In the coming months, our plan is to migrate our customers from EventBus to Kafka, which will reduce the cost of operating Twitter’s Pub/Sub system and enable our customers to use the additional features Kafka provides. We will continue to keep an eye on the different messaging and streaming systems in the ecosystem and make sure our team is making the right decision for our customers and Twitter, even when it’s a hard one.
If you are interested in working with teams that maintain or use this project, take a look at our open roles.