When seconds really do matter

Friday, 11 March 2016

Behind unmarked doors in office buildings across the globe, engineers sit in front of walls covered in 60-inch LCD panels displaying graphs of many colors. Fueled mostly by caffeine and candy, and keeping a wary eye on everything, these teams are tasked with a complex job: ensuring all the services supporting Twitter are available, responsive, and functioning as expected — 24 hours a day, 365 days a year. This is the Twitter Command Center (referred to as “the TCC”), and this team is responsible for identifying and triaging any problems with the Twitter platform for our users, partners, and advertisers.

Of course, the definition of a “problem” can be very broad when operating such a dynamic, high-traffic service. From a simple latency spike caused by a failed hard drive in a cluster, to a complex Linux kernel timing bug, to a severed undersea fiber optic cable, the tools used to identify and resolve these and other issues have remained mostly the same: a minutely time series storage and graphing engine. From MRTG/RRD, Graphite, and Ganglia to Twitter’s own internal observability tool, the minutely time series has kept us in the know for 99% of the questions we’ve needed answered. But for Twitter, the 1% of information that minutely metrics miss is a very, very important 1%, and we needed to address this gap.

On Twitter, the real world happens in sub-minute intervals, and activity on the platform follows. Whether it’s a retirement announcement from a member of a boy band, a government coup, a goal in a worldwide soccer tournament, or simply the start of a new year, traffic patterns change quickly and unpredictably as these events unfold. These are the moments when the minutely model becomes insufficient. Let’s imagine that a celebrity couple announces the birth of a child at two seconds after the minute (for this example, let’s say 2:00:02). With this announcement, a spike of traffic puts undue pressure on one of the many microservices that power Twitter, causing errors for some users. With the standard monitoring platforms, it would be another 58 seconds (at 2:01) before we even had data on the problem. At that point, we could turn one of the many knobs at our disposal to attempt to resolve the problem, and then wait another 60 seconds to see if the knob did what we hoped it would. We’re now at 2:02. That’s roughly two minutes to resolve an issue, and that’s if everything goes perfectly.

At Twitter, we need to do better.

The metric used at Twitter to measure the health of a service is its success rate: the percentage of requests to the service that were served successfully. This is simply expressed as (1 - (failed_requests / total_requests)) * 100. To render this metric in real time, we need to aggregate our metrics on several dimensions: the time of the request on secondly boundaries, the name of the cluster servicing the request, the category of the response code (2xx, 3xx, 4xx, 5xx), and the zone from which the request was served. This sounds relatively simple, but when you consider the volume of requests Twitter receives, it becomes a bit trickier. Thankfully, the Twitter Data Platform team has developed tools that make solving this tricky problem a reality.
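As a rough illustration of the bookkeeping this implies (not our actual implementation; the function and field names here are hypothetical, and treating 5xx responses as the failures is an assumption made only for the example), requests could be bucketed and scored like this:

```python
from collections import defaultdict

# Secondly counters keyed by (second_bucket, cluster, status_class, zone).
request_counts = defaultdict(int)

def record(timestamp, cluster, status_code, zone):
    """Bucket one request into its per-second aggregate."""
    second_bucket = int(timestamp)                 # truncate to the second
    status_class = "%dxx" % (status_code // 100)   # 200 -> "2xx", 503 -> "5xx"
    request_counts[(second_bucket, cluster, status_class, zone)] += 1

def success_rate(second_bucket, cluster, zone):
    """(1 - (failed_requests / total_requests)) * 100 for one second/cluster/zone."""
    total = sum(request_counts[(second_bucket, cluster, c, zone)]
                for c in ("2xx", "3xx", "4xx", "5xx"))
    if total == 0:
        return 100.0
    # Assumption for illustration: count 5xx responses as failed requests.
    failed = request_counts[(second_bucket, cluster, "5xx", zone)]
    return (1 - failed / total) * 100
```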

Using Twitter’s own TSAR and Heron services, we were able to create a job that consumes every request Twitter receives and aggregates it on the dimensions mentioned above. With additional tuning, we were able to capture a statistically significant sample of the data at a five-second delay using a surprisingly small amount of compute resources. Leveraging TSAR’s HTTP API, we whipped up a lightweight front end using the simple yet amazing Smoothie Charts library. With this relatively easy implementation, we knocked the best-effort resolution time down from two minutes to ten seconds, a remarkable 12x improvement in problem identification and resolution.
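The real pipeline is expressed as a TSAR job running on Heron, but the shape of the computation is simply a per-second tumbling window over the request stream. A minimal, purely illustrative sketch (none of this is the TSAR or Heron API; the event fields and the five-second emit delay are assumptions based on the description above):

```python
import time
from collections import Counter

EMIT_DELAY_SECONDS = 5  # wait a few seconds for stragglers before emitting

# second_bucket -> Counter keyed by (cluster, status_class, zone)
windows = {}

def on_request(event):
    """Consume one request event from the stream and add it to its window."""
    bucket = int(event["timestamp"])
    key = (event["cluster"], "%dxx" % (event["status"] // 100), event["zone"])
    windows.setdefault(bucket, Counter())[key] += 1

def flush(now=None):
    """Yield windows old enough to be considered complete, then drop them."""
    now = time.time() if now is None else now
    for bucket in sorted(b for b in windows if b <= now - EMIT_DELAY_SECONDS):
        yield bucket, windows.pop(bucket)
```

In production, the emitted aggregates are served over TSAR’s HTTP API and streamed into the Smoothie Charts front end.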

Once we tackled the success rate problem, it made sense to add other data aggregations that allow the team to identify a larger variety of problems much faster. Fields such as source IPs, user agents, user IDs, requested URIs, and application IDs are aggregated every second to form more specific secondly success rates. These metrics allow us to detect which users, networks, and applications are putting pressure on our services and to act accordingly, blocking and/or correcting the behavior before it degrades our users’ experience.
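To make that concrete, here is one hedged sketch of the kind of question those extra dimensions let us answer (the field and function names are hypothetical): given one second’s worth of request events, which values of a dimension account for the most failures?

```python
from collections import Counter

def top_error_sources(events, dimension, n=5):
    """Return the n values of `dimension` with the most 5xx responses.

    `events` is an iterable of dicts with illustrative fields such as
    'status', 'client_ip', 'user_agent', 'user_id', 'uri', and 'app_id'.
    """
    errors = Counter(e[dimension] for e in events if e["status"] >= 500)
    return errors.most_common(n)

# e.g. top_error_sources(last_second_events, "user_agent") might point
# at a misbehaving client before most users ever notice a problem.
```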

Shoutouts to Mansur Ashraf, Anirudh Todi, Allen Chen, Sanjeev Kulkarni, Karthik Ramasamy, and the rest of the Twitter Data Platform team for creating the frameworks that made building these tools possible.

This post was co-authored by Harry Kantas, Staff Reliability Engineer.