Twitter's New Search Architecture

Wednesday, 6 October 2010

If we have done a good job then most of you shouldn’t have noticed that we launched a new backend for search on twitter.com during the last few weeks! One of our main goals, but also biggest challenges, was a smooth switch from the old architecture to the new one, without any downtime or inconsistencies in search results. Read on to find out what we changed and why.

Twitter’s real-time search engine was, until very recently, based on the technology that Summize originally developed. This is quite amazing, considering the explosive growth that Twitter has experienced since the Summize acquisition. However, scaling the old MySQL-based system had become increasingly challenging.

The new technology

About 6 months ago, we decided to develop a new, modern search architecture that is based on a highly efficient inverted index instead of a relational database. Since we love Open Source here at Twitter we chose Lucene, a search engine library written in Java, as a starting point.

Our demands on the new system are immense: With over 1,000 TPS (Tweets/sec) and 12,000 QPS (queries/sec) = over 1 billion queries per day (!) we already put a very high load on our machines. As we want the new system to last for several years, the goal was to support at least an order of magnitude more load.

Twitter is real-time, so our search engine must be too. In addition to these scalability requirements, we also need to support extremely low indexing latencies (the time it takes between when a Tweet is tweeted and when it becomes searchable) of less than 10 seconds. Since the indexer is only one part of the pipeline a Tweet has to make it through, we needed the indexer itself to have a sub-second latency. Yes, we do like challenges here at Twitter! (btw, if you do too: @JoinTheFlock!)

Modified Lucene

Lucene is great, but in its current form it has several shortcomings for real-time search. That’s why we rewrote big parts of the core in-memory data structures, especially the posting lists, while still supporting Lucene’s standard APIs. This allows us to use Lucene’s search layer almost unmodified. Some of the highlights of our changes include:

significantly improved garbage collection performance
lock-free data structures and algorithms
posting lists, that are traversable in reverse order
efficient early query termination

We believe that the architecture behind these changes involves several interesting topics that pertain to software engineering in general (not only search). We hope to continue to share more on these improvements.

And, before you ask, we’re planning on contributing all these changes back to Lucene; some of which have already made it into Lucene’s trunk and its new realtime branch.

Benefits

Now that the system is up and running, we are very excited about the results. We estimate that we’re only using about 5% of the available backend resources, which means we have a lot of headroom. Our new indexer could also index roughly 50 times more Tweets per second than we currently get! And the new system runs extremely smoothly, without any major problems or instabilities (knock on wood).

But you might wonder: Fine, it’s faster, and you guys can scale it longer, but will there be any benefits for the users? The answer is definitely yes! The first difference you might notice is the bigger index, which is now twice as long — without making searches any slower. And, maybe most importantly, the new system is extremely versatile and extensible, which will allow us to build cool new features faster and better. Stay tuned!

The engineers who implemented the search engine are: Michael Busch, Krishna Gade, Mike Hayes, Abhi Khune, Brian Larson, Patrick Lok, Samuel Luckenbill, Jake Mannix, Jonathan Reichhold.