The Anatomy of a Whale

Wednesday, 10 February 2010

Sometimes it’s really hard to figure out what’s causing problems in a web site like Twitter. But over time we have learned some techniques that help us to solve the variety of problems that occur in our complex web site.

A few weeks ago, we noticed something unusual: over 100 visitors to Twitter per second saw what is popularly known as “the fail whale”. Normally these whales are rare; 100 per second was cause for alarm. Although even 100 per second is a very small fraction of our overall traffic, it still means that a lot of users had a bad experience when visiting the site. So we mobilized a team to find out the cause of the problem.

What Causes Whales?

What is the thing that has come to be known as “the fail whale”? It is a visual representation of the HTTP “503: Service Unavailable” error. It means that Twitter does not have enough capacity to serve all of its users. To be precise, we show this error message when a request would wait for more than a few seconds before resources become available to process it. So rather than make users wait forever, we “throw away” their requests by displaying an error.
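
The shedding rule described above can be sketched in a few lines. This is only an illustration, not Twitter's actual code: the function name and the 3-second threshold are assumptions, since the post only says "more than a few seconds".

```python
from http import HTTPStatus

# Hypothetical threshold; the post only says "more than a few seconds".
MAX_WAIT_SECONDS = 3.0


def should_shed(expected_wait_seconds, max_wait_seconds=MAX_WAIT_SECONDS):
    """Return True if the request should be rejected instead of queued."""
    return expected_wait_seconds > max_wait_seconds


def status_for(expected_wait_seconds):
    """Pick the HTTP status to answer with before doing any real work."""
    if should_shed(expected_wait_seconds):
        # 503 Service Unavailable: this is what the fail whale represents.
        return HTTPStatus.SERVICE_UNAVAILABLE
    return HTTPStatus.OK
```

A request that would wait 5 seconds is shed with a 503, while one that would wait 0.2 seconds is admitted.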

This can sometimes happen because too many users try to use Twitter at once and we don’t have enough computers to handle all of their requests. A much more likely cause, however, is that some component of Twitter suddenly breaks and starts slowing down.

Discovering the root cause can be very difficult because whales are an indirect symptom, and the underlying fault can lie in any one of many components. In other words, the only concrete fact that we knew at the time was that there was some problem, somewhere. We set out to uncover exactly where in the lifecycle of a Twitter request things were breaking down.

Debugging performance issues is really hard. But it’s not hard due to a lack of data; in fact, the difficulty arises because there is too much data. We measure dozens of metrics per individual site request, which, when multiplied by the overall site traffic, adds up to a massive amount of information about Twitter’s performance at any given moment. Investigating performance problems in this world is more of an art than a science. It’s easy to confuse causes with symptoms, and even the data-recording software itself is untrustworthy.

In the analysis below we used a simple strategy: start from the most aggregate measures of the system as a whole and, at each step, get more fine-grained, looking at smaller and smaller parts.

How is a Web Page Built?

Composing a web page for a Twitter request often involves two phases. First, data is gathered from remote sources called “network services”. For example, the Twitter homepage displays your tweets as well as how many followers you have. These data are pulled, respectively, from our tweet caches and our social graph database, which keeps track of who follows whom on Twitter. The second phase of the page composition process assembles all this data in an attractive way for the user. We call the first phase the IO phase and the second the CPU phase. To discover which phase was causing problems, we checked data that records how much time was spent in each phase when composing Twitter’s web pages.
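
As a rough illustration of this two-phase measurement, the sketch below times a stand-in IO phase and a stand-in CPU phase separately. The function names and the returned data are hypothetical, and wall-clock time is used for both phases as a simplification.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_wall_clock_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def fetch_data(user_id):
    # Stand-in for the IO phase: calls to network services would go here.
    return {"tweets": ["hello"], "followers": 42}


def render_page(data):
    # Stand-in for the CPU phase: assembling the page from the gathered data.
    return f"{data['followers']} followers, {len(data['tweets'])} tweets"


def build_page(user_id):
    """Compose a page and report how long each phase took."""
    data, io_seconds = timed(fetch_data, user_id)
    html, cpu_seconds = timed(render_page, data)
    return html, {"io": io_seconds, "cpu": cpu_seconds}
```

Recording the two timings for every request is what makes it possible to plot them against each other over the course of a day.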

The green line in this graph represents the time spent in the IO phase and the blue line represents the CPU phase. This graph represents about 1 day of data. You can see that the relationships change over the course of the day. During non-peak traffic, CPU time is the dominant portion of our request, with our network services responding relatively quickly. However, during peak traffic, IO latency almost doubles and becomes the primary component of total request latency.

Understanding Performance Degradation

There are two possible interpretations for this ratio changing over the course of the day. One possibility is that the way people use Twitter during one part of the day differs from other parts of the day. The other possibility is that some network service degrades in performance as a function of use. In an ideal world, each network service would have equal performance for equal queries; but in the worst case, the same queries actually get slower as you run more of them simultaneously. Checking various metrics confirmed that users use Twitter the same way during different parts of the day. So we hypothesized that the problem must be a network service that degrades under load. We were still unsure; in any good investigation one must constantly remain skeptical. But we decided that we had enough information to transition from this more general analysis of the system into something more specific, so we looked into IO latency data.

This graph represents the total amount of time waiting for our network services to deliver data. Since the amount of traffic we get changes over the course of the day, we expect any total to vary proportionally. But this graph is actually traffic independent; that is, we divide the measured latency by the amount of traffic at any given time. If any traffic-independent total latency changes over the course of the day, we know the corresponding network service is degrading with traffic. You can see that the purple line in this graph (which represents Memcached) degrades dramatically as traffic increases during peak hours. Furthermore, because it is at the top of the graph it is also the biggest proportion of time waiting for network services. So this correlates with the previous graph and we now have a stronger hypothesis: Memcached performance degrades dramatically during the course of the day, which leads to slower response times, which leads to whales.
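
The normalization described above amounts to dividing total latency by request volume for each time window. A minimal sketch follows; the sample numbers are made up for illustration and are not Twitter's measurements.

```python
def latency_per_request(total_latency_seconds, request_counts):
    """Divide total time spent waiting by traffic, window by window."""
    return [
        total / count if count else 0.0
        for total, count in zip(total_latency_seconds, request_counts)
    ]


# Made-up example: an off-peak window followed by a peak window.
total_wait = [3.0, 24.0]   # seconds spent waiting on one service
requests = [1000, 4000]    # requests served in the same windows

per_request = latency_per_request(total_wait, requests)
```

Here traffic grew 4x but total waiting time grew 8x, so the per-request latency doubled. A service that scaled perfectly would show the same per-request value in both windows.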

This sort of behavior is consistent with insufficient resource capacity. When a service with limited resources, such as Memcached, is taxed to its limits, requests begin contending with each other for Memcached’s computing time. For example, if Memcached can only handle 10 requests at a time but it gets 11 requests at a time, the 11th request needs to wait in line to be served.
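
The "11th request" example can be made concrete with a tiny queueing sketch. It assumes, purely for illustration, that all requests arrive at the same instant and that every request takes the same service time.

```python
import heapq


def completion_times(num_requests, capacity, service_time):
    """Finish time of each request when all arrive at t=0.

    `capacity` is how many requests can be served concurrently; any
    request beyond that waits for the earliest slot to become free.
    """
    free_at = [0.0] * capacity  # time at which each slot is next free
    heapq.heapify(free_at)
    finished = []
    for _ in range(num_requests):
        start = heapq.heappop(free_at)      # earliest available slot
        finish = start + service_time
        heapq.heappush(free_at, finish)
        finished.append(finish)
    return finished


times = completion_times(num_requests=11, capacity=10, service_time=0.003)
```

The first ten requests complete after one service time, while the eleventh has to wait for a slot and completes after two, so its observed latency doubles even though the service itself is no slower.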

Focus on the Biggest Contributor to the Problem

If we can add sufficient Memcached capacity to reduce this sort of resource contention, we could increase the throughput of Twitter.com substantially. If you look at the above graph, you can infer that this optimization could increase Twitter performance by 50%.

There are two ways to add capacity. We could add more computers (Memcached servers). But we can also change the software that talks to Memcached to be as efficient with its requests as possible. Ideally we do both.

We decided to look first at how we query Memcached, to see if there was any easy way to optimize that by reducing the overall number of queries. But there are many types of Memcached queries, and some may take longer than others. We want to spend our time wisely and focus on optimizing the queries that are most expensive in aggregate.

We sampled a live process to record some statistics on which queries take the longest. The following lists each type of Memcached query and how long it takes on average:

get         0.003s
get_multi   0.008s
add         0.003s
delete      0.003s
set         0.003s
incr        0.003s
prepend     0.002s

You can see that get_multi is more expensive than the rest, while everything else takes roughly the same time. But that doesn’t mean it’s the source of the problem. We also need to know what share of all requests each type of query accounts for.

get         71.44%
get_multi    8.98%
set          8.69%
delete       5.26%
incr         3.71%
add          1.62%
prepend      0.30%

If you multiply average latency by the percentage of requests you get a measure of the total contribution to slowness. Here, we found that gets were the biggest contributor to slowness. So, we wanted to see if we could reduce the number of gets.
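
The multiplication described above can be reproduced directly from the two tables. The sketch below uses only the numbers listed in this post; the variable names are our own.

```python
# Average latency per query type, in seconds (first table).
avg_latency_seconds = {
    "get": 0.003, "get_multi": 0.008, "add": 0.003, "delete": 0.003,
    "set": 0.003, "incr": 0.003, "prepend": 0.002,
}

# Share of all Memcached requests per query type (second table).
share_of_requests = {
    "get": 0.7144, "get_multi": 0.0898, "set": 0.0869, "delete": 0.0526,
    "incr": 0.0371, "add": 0.0162, "prepend": 0.0030,
}

# Latency weighted by frequency: each type's contribution to slowness.
contribution = {
    op: avg_latency_seconds[op] * share_of_requests[op]
    for op in avg_latency_seconds
}

total = sum(contribution.values())
ranked = sorted(contribution, key=contribution.get, reverse=True)
get_fraction = contribution["get"] / total
```

With these figures, get ranks first and accounts for roughly 62% of the weighted total, with get_multi a distant second, which is why gets are the natural target.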

Tracing Program Flow

Since we make Memcached queries from all over the Twitter software, it was initially unclear where to start looking for optimization opportunities. Our first step was to begin collecting stack traces, which are logs that represent what the program is doing at any given moment in time. We instrumented one of our computers to sample a small percentage of Memcached get calls and record what sorts of things caused them.
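
A minimal sketch of this kind of sampling, written in Python with the standard traceback module, might look like the following. It is not the instrumentation we actually used: the 1% sample rate, the function names, and the in-process counter are all assumptions made for illustration.

```python
import random
import traceback
from collections import Counter

SAMPLE_RATE = 0.01      # hypothetical: record about 1% of get calls
call_paths = Counter()  # maps a tuple of function names to a sample count


def record_call_path(sample_rate=SAMPLE_RATE):
    """Occasionally record the chain of function names that led here."""
    if random.random() >= sample_rate:
        return
    # Drop the final frame, which is record_call_path itself.
    frames = traceback.extract_stack()[:-1]
    call_paths[tuple(frame.name for frame in frames)] += 1


def cache_get(key, sample_rate=SAMPLE_RATE):
    """Stand-in for a Memcached get that samples its own callers."""
    record_call_path(sample_rate)
    return None  # a real client would return the cached value here
```

Each recorded path ends with cache_get, preceded by whichever functions called it, so the counter shows which code paths issue the most gets.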

Unfortunately, we collected a huge amount of data and it was hard to understand. Following our precedent of using visualizations in order to gain insight into large sets of data, we took some inspiration from the Google perf-tools project and wrote a small program that generated a cloud graph of the various paths through our code that were resulting in Memcached Gets. Here is a simplified picture:

Each circle represents one component/function. The size of the circle represents how big a proportion of Memcached get queries come from that function. The lines between the circles show which function caused the other function to occur. The biggest circle is check_api_rate_limit but it is caused mostly by authenticate_user and attempt_basic_auth. In fact, attempt_basic_auth is the main opportunity for enhancement. It helps us compute who is requesting a given web page so we can serve personalized (and private) information to just the right people.
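
Turning sampled call paths into such a graph is mostly a counting exercise. The sketch below shows one way to compute circle sizes and connecting lines; the sample paths, their counts, and the load_timeline name are invented for illustration and do not reflect the real call structure.

```python
from collections import Counter


def build_call_graph(paths):
    """Aggregate sampled call paths into node and edge weights.

    `paths` maps a tuple of function names (outermost caller first)
    to the number of sampled gets that arrived via that path.
    """
    node_weight = Counter()  # circle size: gets passing through a function
    edge_weight = Counter()  # line weight: gets flowing caller -> callee
    for path, count in paths.items():
        for name in set(path):
            node_weight[name] += count
        for caller, callee in zip(path, path[1:]):
            edge_weight[(caller, callee)] += count
    return node_weight, edge_weight


# Invented sample data, for illustration only.
sampled = {
    ("authenticate_user", "attempt_basic_auth", "check_api_rate_limit"): 5,
    ("authenticate_user", "check_api_rate_limit"): 2,
    ("load_timeline",): 3,
}
nodes, edges = build_call_graph(sampled)
```

The node weights give the relative circle sizes and the edge weights say which caller is responsible for most of a function's gets.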

Any Memcached optimizations that we can make here would have a large effect on the overall performance of Twitter. By counting the number of actual Memcached queries made per request, we found that, on average, a single call to attempt_basic_auth was making 17 calls. The next question is: can any of them be removed?

To figure this out we need to look very closely at all of the queries. Here is a “history” of the most popular web page that calls attempt_basic_auth. This is the API request for http://twitter.com/statuses/friends_timeline.format, the most popular page on Twitter!

get(["User:auth:missionhipster",                       # maps screen name to user id
get(["User:15460619",                                  # gets user object given user id (used to match passwords)
get(["limit:count:login_attempts:...",                 # prevents dictionary attacks
set(["limit:count:login_attempts:...",                 # unnecessary in most cases, bug
set(["limit:timestamp:login_attempts:...",             # unnecessary in most cases, bug
get(["limit:timestamp:login_attempts:...",
get(["limit:count:login_attempts:...",                 # can be memoized
get(["limit:count:login_attempts:...",                 # can also be memoized
get(["user:basicauth:...",                             # an optimization to avoid calling bcrypt
get(["limit:count:api:...",                            # global API rate limit
set(["limit:count:api:...",                            # unnecessary in most cases, bug
set(["limit:timestamp:api:...",                        # unnecessary in most cases, bug
get(["limit:timestamp:api:...",
get(["limit:count:api:...",                            # can be memoized from previous query
get(["home_timeline:15460619",                         # determine which tweets to display
get(["favorites_timeline:15460619",                    # determine which tweets are favorited
get_multi([["Status:fragment:json:7964736693",         # load, in parallel, all of the tweets we're gonna display.

Note that all of the “limit:” queries above come from attempt_basic_auth. We noticed a few other (relatively minor) unnecessary queries as well. It seems like from this data we can eliminate seven out of seventeen total Memcached calls — a 42% improvement for the most popular page on Twitter.

At this point, we need to write some code to make these bad queries go away. Some of them we cache (so we don’t make the exact same query twice), some are just bugs and are easy to fix. Some we might try to parallelize (do more than one query at the same time). But this 42% optimization (especially if combined with new hardware) has the potential to eliminate the performance degradation of our Memcached cluster and also make most page loads that much faster. It is possible we could see a (substantially) greater than 50% increase in the capacity of Twitter with these optimizations.
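
One way to avoid making the exact same query twice is to memoize gets for the lifetime of a single web request. The sketch below illustrates the idea with hypothetical class names and an in-memory stand-in for the Memcached client; it is not the fix we shipped.

```python
class CountingBackend:
    """In-memory stand-in for a Memcached client that counts round trips."""

    def __init__(self):
        self.data = {}
        self.gets = 0

    def get(self, key):
        self.gets += 1
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


class RequestCache:
    """Memoizes get() results for the duration of one web request."""

    def __init__(self, backend):
        self._backend = backend
        self._seen = {}

    def get(self, key):
        if key not in self._seen:
            # Only the first lookup of a key goes over the network.
            self._seen[key] = self._backend.get(key)
        return self._seen[key]

    def set(self, key, value):
        self._backend.set(key, value)
        self._seen[key] = value  # keep the local copy consistent
```

Creating one RequestCache per request means repeated reads of a key such as a rate-limit counter cost a single round trip. The trade-off is that a value changed by another process during the request will not be seen until the next request.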

This story presents a couple of the fundamental principles that we use to debug the performance problems that lead to whales. First, always proceed from the general to the specific. Here, we progressed from looking first at IO and CPU timings to finally focusing on the specific Memcached queries that caused the issue. And second, live by the data, but don’t trust it. Despite the promise of a 50% gain that the data implies, it’s unlikely we’ll see any performance gain anywhere near that. Even so, it’ll hopefully be substantial.

— @asdf and @nk