Today’s turbulence explained

Thursday, 21 June 2012

Not how we wanted today to go. At approximately 9:00am PDT, we discovered that Twitter was inaccessible for all web users, and mobile clients were not showing new Tweets. We immediately began to investigate the issue and found that there was a cascading bug in one of our infrastructure components. This wasn’t due to a hack or our new office or Euro 2012 or GIF avatars, as some have speculated today.

A “cascading bug” is a bug with an effect that isn’t confined to a particular software element, but rather its effect “cascades” into other elements as well. One of the characteristics of such a bug is that it can have a significant impact on all users, worldwide, which was the case today. As soon as we discovered it, we took corrective actions, which included rolling back to a previous stable version of Twitter.

We began recovery at around 10:10am PDT, dropped again around 10:40am PDT, and then began full recovery at 11:08am PDT.

We are currently conducting a comprehensive review to ensure that we can avoid this chain of events in the future.

For the past six months, we’ve enjoyed our highest marks for site reliability and stability ever: at least 99.96% and often 99.99%. In simpler terms, this means that in an average 24-hour period, twitter.com has been stable and available to everyone for roughly 23 hours, 59 minutes and 40-ish seconds.

Not today though.

We know how critical Twitter has become for you, and for many of us. Every day, we bring people closer to their heroes, causes, political movements, and much more. One user, Arghya Roychowdhury, put it this way:

It’s imperative that we remain available around the world, and today we stumbled. For that we offer our most sincere apologies and hope you’ll be able to breathe easier now.

- Mazen Rawashdeh, VP, Engineering (@mazenra)