On Monday, a fault in the database that stores Twitter user records caused problems on both Twitter.com and our API. The short, non-technical explanation is that a mistake led to some problems that we were able to fix without losing any data.
While we were able to resolve these issues by Tuesday morning, we want to talk about what happened and use this an opportunity to discuss the recent progress we’ve made in improving Twitter’s performance and availability. We recently covered these topics in a pair of June posts here and on our company blog).
Making sure Twitter is a stable platform and a reliable service is our number one priority. The bulk of our engineering efforts are currently focused on this effort, and we have moved resources from other important projects to focus on the issue.
As we said last month, keeping pace with record growth in Twitter’s user base and activity presents some unique and complex engineering challenges. We frequently compare the tasks of scaling, maintaining, and tweaking Twitter to building a rocket in mid-flight.
During the World Cup, Twitter set records for usage. While the event was happening, our operations and infrastructure engineers worked to improve the performance and stability of the service. We have made more than 50 optimizations and improvements to the platform, including:
While we’re continuously improving the performance, stability and scalability of our infrastructure and core services, there are still times when we run into problems unrelated to Twitter’s capacity. That’s what happened this week.
On Monday, our users database, where we store millions of user records, got hung up running a long-running query; as a result, most of the table became locked. The locked users table manifested itself in many ways: users were unable to sign-up, sign in, update their profile or background images, and responses from the API were malformed, rendering the response unusable to many of the API clients. In the end, this affected most of the Twitter ecosystem: our mobile, desktop, and web-based clients, the Twitter support and help system, and Twitter.com.
To remedy the locked table, we force-restarted the database server in recovery mode, a process that took more than 12 hours (the database covers records for more than 125 million users — that’s a lot of records). During the recovery, the users table and related tables remained unavailable. Unfortunately, even after the recovery process completed, the table remained in an unusable state. Finally, yesterday morning we replaced the partially-locked user db with a copy that was fully available (in the parlance of database admins everywhere, we promoted a slave to master), fixing the database and all of the related issues.
We have taken steps to ensure we can more quickly detect and respond to similar issues in the future. For example, we are prepared to more quickly promote a slave db to a master db, and we put additional monitoring in place to catch errant queries like the one that caused Monday’s incidents.
As we said last month, we are working on long-term solutions to make Twitter more reliable (news that we are moving into our own data center this fall, which we announced this afternoon, is just one example). This will take time, and while there has been short-term pain, our capacity has improved over the past month.
Finally, despite the rapid growth of our company, we’re still a relatively small crew maintaining a comparatively large (rocket) ship. We’re actively looking for engineering talent, with more than 20 openings currently. If you’re interested in learning more about the problems we’re solving or “joining the flock,” check out our jobs page.
Did someone say … cookies?