SuperRoot: Launching a High-SLA Production Service at Twitter

Friday, 22 July 2016

Our Search Infrastructure team is building a new information retrieval system called Omnisearch to power Twitter’s next generation of relevance-based, personalized products. We recently launched the first major architectural component of Omnisearch, the SuperRoot. We thought it would be interesting to share what’s involved with building, productionizing, and launching a new high-scale, high-SLA distributed system at Twitter.

For a person using Twitter’s consumer search product, it might look like Twitter has one search engine, but we actually have five indexes working together to serve many different products:

  • A high-scale in-memory index serving recent, public Tweets
  • A separate in-memory index to securely serve protected Tweets
  • A large SSD-based archival index serving all public Tweets ever
  • An in-memory typeahead index, for autocompleting queries
  • An in-memory user search index

We maintain several separate search indexes because each differs in terms of scale, latency requirements, usage patterns, index size and content. While they are all Finagle-based Thrift services that implement a shared API and query language, interfacing with multiple indexes is inconvenient for internal customers because they must understand which index to query for their use case and how to correctly merge results from different indexes.

[Figure 1: each product queries multiple search indexes directly]

For example, Twitter’s consumer search product displays a variety of types of content based on the user’s intent: recent Tweets, high-quality older Tweets, relevant accounts, popular images, and so on. As shown in Figure 1, we query multiple indexes to get all of the necessary content to build the page. Once we have results from each type of index, we merge them, removing duplicates. This isn’t as easy as it sounds: after removing duplicates, there may not be enough results to fill the page. With a static index, we could simply retry the query, asking for more results. However, some of our indexes are updated in realtime, meaning a retry of the same query may return different (fresher) results, which must also be merged and deduplicated. Correctly implementing pagination across the boundaries of rapidly changing realtime indexes is even more challenging. With the architecture in Figure 1, the code to manage merging and pagination had to be replicated for each product, slowing product development.
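As a toy illustration of why this merging is harder than it sounds, the sketch below fills a page of results while deduplicating and going back for more whenever duplicates leave the page short. The types, the fetch callback, and the doubling heuristic are assumptions made for illustration; this is not our production logic.

```scala
case class Result(id: Long, score: Double)

// fetch(n) asks an index (or a set of indexes) for up to n candidate results.
def fillPage(pageSize: Int, alreadyShown: Set[Long], fetch: Int => Seq[Result]): Seq[Result] = {
  var page = Vector.empty[Result]
  var seen = alreadyShown
  var ask = pageSize
  var exhausted = false
  while (page.size < pageSize && !exhausted) {
    val batch = fetch(ask)
    exhausted = batch.isEmpty
    // Drop anything already shown to the user, then top up the page.
    val fresh = batch.filterNot(r => seen.contains(r.id))
    seen = seen ++ fresh.map(_.id)
    page = page ++ fresh.take(pageSize - page.size)
    ask *= 2 // ask for more next time to compensate for duplicates
  }
  page
}
```

With a realtime index, each call to fetch can return fresher results than the previous one, which is exactly what makes consistent pagination across index boundaries so difficult.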

Finally, consider the challenges of modifying the system shown in Figure 1: to add another index, merge two indexes, or change our API, we would have to work with every customer individually to avoid breaking their product. This overhead was slowing down the Search Infrastructure team’s progress on Omnisearch, which we believe is an important technology for increasing the pace of development of relevance-based products over the coming months.

Enter SuperRoot

“All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections.”

David J. Wheeler

Recently, we launched SuperRoot, a scatter-gather Thrift aggregation service that sits between our customers and our indexes:

[Figure 2: SuperRoot sits between customers and the individual indexes]

SuperRoot adds a layer of indirection to our architecture, presenting a consistent, logical view of the underlying decomposed, physical indexes. By choosing to use the same API as each individual index’s root service, we were able to migrate existing customers seamlessly. It also adds less than 5ms of overhead to each request, which we think is a fair price given the functionality, usability and flexibility it affords.
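To make the scatter-gather idea concrete, here is a minimal Scala sketch built on Finagle’s com.twitter.util.Future. The request and response types, the interface, and the merge policy are simplified stand-ins, not SuperRoot’s actual API or ranking logic.

```scala
import com.twitter.util.Future

// Hypothetical stand-ins for the shared Thrift API that every index root implements.
case class IndexSearchRequest(serializedQuery: String, numResults: Int)
case class IndexSearchResult(tweetId: Long, score: Double)
case class IndexSearchResponse(results: Seq[IndexSearchResult])

trait IndexRoot {
  def search(request: IndexSearchRequest): Future[IndexSearchResponse]
}

// A naive scatter-gather aggregator: query every index in parallel, then
// flatten, deduplicate by Tweet id, and return the best-scoring results.
class ScatterGatherRoot(indexes: Seq[IndexRoot]) extends IndexRoot {
  def search(request: IndexSearchRequest): Future[IndexSearchResponse] =
    Future.collect(indexes.map(_.search(request))).map { responses =>
      val merged = responses
        .flatMap(_.results)
        .groupBy(_.tweetId)
        .values.map(_.maxBy(_.score)) // keep one copy of each Tweet
        .toSeq
        .sortBy(r => -r.score)        // best results first
        .take(request.numResults)
      IndexSearchResponse(merged)
    }
}
```

Because the aggregator implements the same interface as the individual roots, callers cannot tell whether they are talking to one index or many.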

The new functionality available in SuperRoot includes:

  • Efficient query execution, by only hitting the necessary indexes for a query
  • Merging across indexes of the same document type (e.g., Tweets)
  • Precise pagination across temporally tiered indexes (i.e., real-time to archive)
  • Per-customer quotas and rate limiting (sketched in the example after this list)
  • Search-wide feature degradation for better availability (e.g. reduced quality under heavy load)
  • Improved access control, monitoring, security and privacy protections
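As one concrete example from the list above, per-customer quota enforcement can be as simple as a counter per customer per time window. The sketch below uses made-up customer names and quota numbers and a naive fixed window; it is not SuperRoot’s actual implementation.

```scala
import java.util.concurrent.atomic.AtomicLong

// A fixed-window limiter: allow at most quotaPerSecond requests per one-second window.
final class QuotaLimiter(quotaPerSecond: Long) {
  private val windowStartMs = new AtomicLong(System.currentTimeMillis())
  private val used = new AtomicLong(0L)

  def tryAcquire(): Boolean = {
    val now = System.currentTimeMillis()
    val start = windowStartMs.get()
    // Roll over to a new window once a second has elapsed.
    if (now - start >= 1000 && windowStartMs.compareAndSet(start, now)) {
      used.set(0L)
    }
    used.incrementAndGet() <= quotaPerSecond
  }
}

// One limiter per internal customer (the names and quotas here are invented):
object Quotas {
  val byCustomer: Map[String, QuotaLimiter] = Map(
    "consumer_search" -> new QuotaLimiter(50000L),
    "home_timeline"   -> new QuotaLimiter(20000L)
  )
}
```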

For our internal customers, the SuperRoot means faster and easier development of new products. They no longer have to understand which indexes are available, write code to talk to each of them, reconcile duplicates, and implement complex pagination logic. Prototyping and experimentation using our search indexes are closer to self-serve. Debugging is also easier, both because their code is simpler and because they interface with fewer downstream services.

For the Search Infrastructure team, it creates a much needed layer of abstraction that will allow us to iterate faster on Omnisearch. We can now introduce a new, simple API to search in a single place. We can add or remove indexes without coordinating with every internal customer. We can change index boundaries (e.g., increasing or decreasing the depth of the real-time Tweet index) without breaking consumer products. In the extreme, we can even experiment with new architectures and search technologies behind the scenes.

The Road to Production

Now that you understand what SuperRoot is and why we wanted it, you might be thinking that adding an additional layer of indirection to a distributed system isn’t a new idea. This is true! This project was not about novel architectures, creative data structures or complex distributed algorithms. What was interesting about the SuperRoot project was how we used the technologies available at Twitter to develop and deploy a new high-scale distributed system without impacting our customers.

SuperRoot has capacity to serve over 85,000 queries per second (QPS) per datacenter. It services not only Twitter’s Search product, but also our Home Timeline, our data products, and many other core Twitter features. With this much traffic, small mistakes in implementation, scaling issues, or deploy problems would have an outsized impact on our users. Yet, we were able to launch SuperRoot to 100% of customers without a single production incident or unhappy internal customer. This is delicate work and it’s a major aspect of what we do as Platform engineers at Twitter.

SLAs: Latency, Throughput, and Success Rate

In addition to functionality, our internal customers will usually specify basic requirements for serving their product, which we commit to in the form of a Service-Level Agreement (SLA). For SuperRoot customers, the most important aspects of the SLA are query latency (measured in milliseconds), query throughput (measured in QPS) and success rate (the percentage of queries that succeed per second).

Minimizing Overhead with Finagle and Thrift

SuperRoot aggregates the results from different indexes, some of which serve from RAM with very low latencies. For example, the in-memory index that serves recent, public Tweets has p50 latencies of under 10ms (for simpler queries) and p999 latencies under 100ms (for expensive queries). Since low latencies are required for many of our products, it was important that SuperRoot not add a lot of overhead.

SuperRoot relies on Finagle’s RPC system and the Thrift protocol to ensure low overhead. One way that Finagle minimizes overhead is through RPC multiplexing, meaning there is only one network connection per client-server session, regardless of the number of requests from the client. This minimizes use of bandwidth and reduces the need to open and close network sockets. Finagle clients also implement load balancing, meaning requests are distributed across the many SuperRoot instances in an attempt to maximize success rate and minimize tail latencies. For SuperRoot, we chose the Power of Two Choices (P2C) + Peak EWMA load balancing algorithm, which is designed to quickly move traffic off of slow endpoints. Finally, the choice of Thrift reduces overhead in terms of bandwidth and CPU for serialization and deserialization, compared to a non-binary protocol like JSON.
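For reference, choosing this balancer on a Finagle ThriftMux client looks roughly like the sketch below. The timeout value, destination path, and generated interface name are hypothetical; only the Balancers.p2cPeakEwma() choice reflects what the text above describes.

```scala
import com.twitter.conversions.time._
import com.twitter.finagle.ThriftMux
import com.twitter.finagle.loadbalancer.Balancers

object IndexClientConfig {
  // A multiplexed Thrift client using Power of Two Choices + Peak EWMA:
  // pick two endpoints at random and route to the one with the lower
  // latency-weighted load estimate.
  val client = ThriftMux.client
    .withLoadBalancer(Balancers.p2cPeakEwma())
    .withRequestTimeout(100.milliseconds) // illustrative, not our actual SLA

  // With Thrift-generated code in hand, a typed interface could then be built, e.g.:
  // val earlybird = client.newIface[EarlybirdService.FutureIface](
  //   "/s/search/earlybird-root", "earlybird-root")
}
```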

Ensuring Sufficient Throughput

Not all queries are created equal: simple realtime queries take only a few milliseconds while hard tail queries can take hundreds of milliseconds and hit multiple indexes. At Twitter, we do “redline testing,” using real traffic to accurately measure the real-world throughput of a service. These tests work by incrementally increasing load balancer weights for instances under test until key metrics (e.g. latency, CPU or error rate) hit a predetermined limit. Redline tests at Twitter can be run via a self-serve UI and are available to all teams. Using these tests, we determined (and later adjusted) the number of instances required for SuperRoot to serve 85,000 real-world QPS.
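The control loop behind a redline test can be sketched conceptually as follows. The weight-setting and metric-reading hooks are hypothetical placeholders, not Twitter’s actual tooling.

```scala
case class KeyMetrics(p99LatencyMs: Double, cpuUtilization: Double, errorRate: Double)

// Keep increasing the load-balancer weight of the instances under test until a
// key metric crosses its limit, then report the last weight that was healthy.
def redline(
    setWeight: Double => Unit,
    readMetrics: () => KeyMetrics,
    withinLimits: KeyMetrics => Boolean,
    step: Double = 0.1
): Double = {
  var weight = 1.0
  while (withinLimits(readMetrics())) {
    weight += step
    setWeight(weight)
    Thread.sleep(60 * 1000L) // let traffic shift and metrics settle before re-checking
  }
  weight - step
}
```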

Reliability

Once SuperRoot had the required functionality and was provisioned to handle the estimated load, we needed to make sure it was stable. We measure the stability of our systems with a metric called “success rate,” which is simply the number of successful requests divided by the total number of requests, expressed as a percentage. For SuperRoot, we were looking for a 99.97% success rate, which means we tolerate no more than 3 failures per 10,000 queries. During normal operation, our success rate is usually significantly higher, but we consider anything lower an “incident.”
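Concretely, the success-rate definition above works out as follows; this trivial snippet is only meant to make the 3-in-10,000 figure explicit.

```scala
// Success rate: successful requests divided by total requests, as a percentage.
def successRate(successes: Long, total: Long): Double =
  if (total == 0L) 100.0 else 100.0 * successes.toDouble / total

// 3 failures out of 10,000 queries is exactly the 99.97% threshold.
assert(successRate(successes = 9997L, total = 10000L) >= 99.97)
```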

A shared cloud is a critical tool for reliably operating large distributed systems. Twitter uses Apache Mesos and Apache Aurora to schedule and run stateless services like SuperRoot. Mesos abstracts CPU, memory, storage, and other compute resources away from the actual hardware. Aurora is responsible for scheduling (and rescheduling) jobs onto healthy machines and keeping them running indefinitely. Given the number of SuperRoot instances we need, hardware failures are expected on a daily basis. We would not be able to attain a 99.97% success rate if we had to take manual corrective action for every hardware failure. Thankfully, Finagle routes traffic away from bad instances while Mesos and Aurora reschedule jobs quickly and automatically.

One interesting and unexpected issue we encountered when building SuperRoot manifested itself as a periodic success rate dip, occurring every few minutes. On the surface, it looked like the dips were caused by multiple sequential timeouts when querying in-memory Earlybird indexes. We naturally suspected an issue with Earlybird, but Earlybird’s graphs showed consistent response times of under 10ms. On deeper inspection, we found that the requests were timing out before they were sent to Earlybird, implicating SuperRoot itself.

Oftentimes, when debugging distributed systems, it is hard to tell the difference between cause and effect without some experimentation. For example, we noticed slight CPU throttling by Mesos around the timeout events, so we tried allocating more CPU. This did not help, indicating it was an effect, not the cause. With our long-running JVM-based services, we often suspect garbage collection (GC), but we didn’t see any correlation between GC events and timeout events in our logs. However, when inspecting the logs, we did notice that the log lines themselves were being printed during the events! From this observation, we were able to trace the issue back to a release of a new Twitter-specific JVM. With the smoking gun in hand, we worked with our VM team to identify synchronous GC logging in the JVM as the culprit. The VM team implemented asynchronous logging and the issue disappeared, clearing the SuperRoot for launch.

Getting to Perfect

For SuperRoot to reach its full potential, we needed every customer of the search infrastructure to use it. For this to happen, we needed to guarantee that the results from SuperRoot exactly matched what each customer expected. Before SuperRoot, most customers were directly hitting the roots of individual indexes (see Figure 1). With the introduction of SuperRoot, the responsibility of hitting multiple indexes and merging their results moved to it, meaning that any mistake made in the merging logic would directly manifest itself as a bug or quality regression in one of our products.

The process of getting to feature parity started with understanding our customers’ needs, typically by reading their code and talking to them. We then wrote unit tests, implemented an initial version, and deployed it. To verify the correctness of a new implementation of an existing system, we used a technique we call “tap-compare.” Our tap-compare tool replays a sample of production traffic against the new system and compares the responses to the old system. Using the output of the tap-compare tool, we found and fixed bugs in our implementation without exposing end customers to the bugs. This allowed us to migrate customers one-by-one to SuperRoot without incident.
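In spirit, tap-compare is just replay-and-diff. The sketch below shows the idea with hypothetical request and response types and function-shaped stand-ins for the two systems; the real tool is considerably more sophisticated (sampling, asynchronous replay, response normalization, reporting).

```scala
import com.twitter.util.Future

case class Mismatch[Req, Rep](request: Req, oldResponse: Rep, newResponse: Rep)

// Replay sampled production requests against both systems and keep the disagreements.
def tapCompare[Req, Rep](
    sampledRequests: Seq[Req],
    oldSystem: Req => Future[Rep],
    newSystem: Req => Future[Rep]
): Future[Seq[Mismatch[Req, Rep]]] =
  Future.collect(sampledRequests.map { req =>
    oldSystem(req).join(newSystem(req)).map {
      case (oldRep, newRep) if oldRep != newRep => Some(Mismatch(req, oldRep, newRep))
      case _                                    => None
    }
  }).map(_.flatten)
```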

In one case, in an effort to reduce complexity, we didn’t want to recreate the exact logic in the customer’s system. We suspected that the simplification we had in mind would have a subtle impact on the Home timeline’s filtering algorithm, but we didn’t know if the effect would be positive or negative. Tap-compare techniques don’t help when you’re not looking for exact feature parity, so we instead chose to A/B test the effect on the Home timeline. Given the high-stakes nature of changing the Home timeline, we felt the added time and complexity of running a proper A/B test was prudent. Ultimately, we found that our simplification reduced end-user engagement, and so we abandoned it in favor of a more complex implementation.

#ShipIt

Due to the incremental nature of our development process, there was no single day when we launched the SuperRoot. Instead, we shipped each request type one at a time, ensuring quality and correctness along the way. In our retrospective, the team was particularly proud of how smoothly this project went. There were no incidents, no unhappy customers, we cleared our backlog, and there were plenty of learning opportunities for the team members, many of whom had never built a distributed system from the ground up.

Acknowledgements
The core SuperRoot team was Dumitru Daniliuc (@twdumi), Bogdan Gaza (@hurrycane), and Jane Wang (@jane12345689). Other contributors included Paul Burstein (@pasha407), Hao Wu, Tian Wang (@wangtian), Xiaobing Xue (@xuexb), Vikram Rao Sudarshan (@raosvikram), Wei Li (@alexweili), Patrick Lok (@plok), Yi Zhuang (@yz), Lei Wang (@wonlay), Stephen Bezek (@SteveBezek), Sergey Serebryakov (@megaserg), Yan Zhao (@zhaoyan1117), Joseph Barker (@seph_barker), Maer Melo (@maerdot), Mark Sparhawk (@sparhawk), Dean Hiller (@spack_jarrow), Sean Smith (@seanachai), and Sam Luckenbill (@sam).