A Perfect Storm.....of Whales

By @jeanpaul

Since Saturday, Twitter has experienced several incidents of poor site performance and a high number of errors because one of our internal sub-networks was over capacity. We’re working hard to address the core issues causing these problems (more on that below), but in the interest of the open exchange of information, we wanted to pull back the curtain and give you deeper insight into what happened and how we’re fixing it.

What happened?

In brief, we made three mistakes:
* We put two critical, fast-growing, high-bandwidth components on the same segment of our internal network.
* Our internal network wasn’t being monitored appropriately.
* Our internal network was temporarily misconfigured.

What we’re doing to fix it

* We’ve doubled the capacity of our internal network.
* We’re improving the monitoring of our internal network.
* We’re rebalancing the traffic on our internal network to redistribute the load.

Onward

For much of 2009, Twitter’s biggest challenge was coping with our unprecedented growth (a challenge we happily still face). Our engineering team spent much of the year redesigning Twitter’s runtime for scale, and our operations team worked to improve our monitoring and capacity planning so we can quickly identify problems and find solutions as they occur. Those efforts were well spent; every day, more people use Twitter, yet we serve fewer whales. But as this week’s issues show, there is always room for improvement: we must apply the same diligence and care to the design, planning, and monitoring of our internal network.

Based on our experiences this week, we’re working with our hosting partner to deliver improvements on all three fronts. By bringing the monitoring of our internal network in line with the rest of the systems at Twitter, we’ll be able to grow our capacity well ahead of user growth. Furthermore, by doubling our internal network capacity and rebalancing load across the internal network, we’re better prepared to serve today’s tweets and beyond.

As more people turn to Twitter to see what’s happening in the world (or in the World Cup), you may still see the whale when there are unprecedented spikes in traffic. For instance, during the World Cup tournament, and particularly during big, closely watched matches (such as tomorrow’s match between England and the U.S.A.), we anticipate a significant surge in activity on Twitter. While we are making every effort to prepare for that surge, the whale may surface.

Finally, as we think about new ways to communicate with you about Twitter’s performance and availability status, please continue to read http://status.twitter.com and http://dev.twitter.com/status, and to follow @twitterapi, for the latest updates.

Thanks for your continued patience and enthusiasm.

@jeanpaul