The Great Migration, the Winter of 2011

Monday, 21 March 2011

If you look back at the history of Twitter, our rate of growth has largely outpaced the capacity of our hardware, software, and the company itself. Indeed, in our first five years, Twitter’s biggest challenge was coping with our unprecedented growth and sightings of the infamous Fail Whale.

These issues came to a head last June when Twitter experienced more than ten hours of downtime. However, unlike past instances of significant failure, we said at the time that we had a long-term plan.

Last September, we began executing on this plan and undertook the most significant engineering challenge in the history of Twitter. We hope it will have a significant impact on the service’s success for many years to come. During this time, the engineering and operations teams moved Twitter’s infrastructure to a new home while making changes to our infrastructure and our organization that will ensure that we can constantly stay abreast of our capacity needs; give users and developers greater reliability; and allow for new product offerings.

This was our season of migration.

Redesigning and Rebuilding the Bird Mid-flight

Under the hood, Twitter is a complex yet elegant distributed network of queues, daemons, caches, and databases. Today, the care and feeding of Twitter requires more than 200 engineers to keep the site growing and running smoothly. What did moving the entirety of Twitter while improving up-time entail? Here’s a simplified version of what we did.

First, our engineers extended many of Twitter’s core systems to replicate Tweets to multiple data centers. Simultaneously, our operations engineers divided into new teams and built new processes and software to allow us to qualify, burn-in, deploy, tear-down and monitor the thousands of servers, routers, and switches that are required to build out and operate Twitter. With hardware at a second data center in place, we moved some of our non-runtime systems there – giving us headroom to stay ahead of tweet growth. This second data center also served as a staging laboratory for our replication and migration strategies. Meanwhile, we prepped a third, larger data center as our final nesting ground.

Next, we set out rewiring the rocket mid-flight by writing Tweets to both our primary data center and the second data center. Once we proved our replication strategy worked, we built out the full Twitter stack and copied all 20 TB of Tweets, from @jack’s first to @honeybadger’s latest, to the second data center. Once all the data was in place, we began serving live traffic from the second data center, both for end-to-end testing and to continue shedding load from our primary data center. Confident that our strategy for replicating Twitter was solid, we moved on to the final leg of the migration: building out the third data center and moving all of Twitter from the first and second data centers to this final nesting ground. This essentially required us to move much of Twitter two times.
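For readers who want a concrete picture, the general pattern described above (write new data to both locations, then copy the historical data across and check that the two sides agree) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not our production code: `InMemoryStore`, `DualWriter`, `backfill`, and `in_sync` are hypothetical names invented for this example, and a real system would also have to deal with ordering, retries, and partial failures.

```python
class InMemoryStore:
    """Hypothetical stand-in for one data center's Tweet storage."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # tweet_id -> text

    def put(self, tweet_id, text):
        self.rows[tweet_id] = text


class DualWriter:
    """Writes each new Tweet to the primary store, then to the secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.failed = []  # IDs whose secondary write failed, kept for later repair

    def write(self, tweet_id, text):
        # The primary remains the source of truth during the migration.
        self.primary.put(tweet_id, text)
        try:
            self.secondary.put(tweet_id, text)
        except Exception:
            # Assumption: a failed secondary write must not fail the request.
            self.failed.append(tweet_id)


def backfill(source, target):
    """Copy rows that exist in source but not in target; return the count copied."""
    copied = 0
    for tweet_id, text in source.rows.items():
        if tweet_id not in target.rows:
            target.put(tweet_id, text)
            copied += 1
    return copied


def in_sync(a, b):
    """True when both stores hold exactly the same Tweets."""
    return a.rows == b.rows


if __name__ == "__main__":
    first_dc = InMemoryStore("primary")
    second_dc = InMemoryStore("second")
    first_dc.put(1, "historical Tweet")              # data that predates dual writes
    DualWriter(first_dc, second_dc).write(2, "new")  # lands in both stores
    backfill(first_dc, second_dc)                    # copies the historical Tweet
    assert in_sync(first_dc, second_dc)
```

In this sketch, dual writes start before the backfill so that no Tweet created during the copy is missed; the backfill only fills in what the second store does not already have.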

What’s more, during the migration we set a new Tweets-per-second record, continued to grow, and launched new products, all while improving the security and up-time of our service.

A Flock

The effort and planning behind this move were huge. Vacations were put off, weekends were worked, and more than a few strategic midnight oil reserves were burned in this two-stage move. The technical accomplishments by the operations and engineering teams that made this move possible were immense. Equally great was the organization and alignment of the engineering and operations teams, and their ability to create lightweight, robust processes where none had existed before. Without this cohesion, this flocking of sorts, none of this would have been possible.

Though spring is here, and this particular season of migration is over, it represents more of a beginning than an ending. This move gives us the capacity to deliver Tweets with greater reliability and speed, and creates more runway to focus on the most interesting operations and engineering problems. It’s an immense opportunity to innovate and build the products and technologies that our users request and our talented engineers love to develop.

—The Twitter Engineering Team

P.S. Twitter is hiring across engineering and operations. If you want to develop novel systems that scale on the order of billions, join the flock.