Today, Twitter launched a personalized search experience to help our users find the most relevant Tweets, images, and videos. To build this product, our infrastructure needed to support two major features: relevance-filtering of search results and the identification of relevant images and photos. Both features leverage a ground-up rewrite of the search infrastructure, with Blender and Earlybird at the core.
Since the acquisition of Summize in 2008, Twitter has invested heavily in search. We’ve grown our search team from three to 15 engineers and scaled our real-time search engine by two orders of magnitude — all this, while we replaced the search infrastructure in flight, with no major service interruptions.
The engineering story behind the evolution of search is compelling. The Summize infrastructure used Ruby on Rails for the front-end and MySQL for the back-end (the same architecture as the one used by Twitter and many other start-ups). At the time, Lucene and other open-source search technology did not support real-time search. As a result, we constructed our reverse indexes in MySQL, leveraging its concurrent transactions and B-tree data structures to support concurrent indexing and searching. We were able to scale our MySQL-based solution surprisingly far by partitioning the index across multiple databases and replicating the Rails front-end. In 2008, Twitter search handled an average of 20 TPS and 200 QPS. By October 2010, when we replaced MySQL with Earlybird, the system was handling 1,000 TPS and 12,000 QPS on average.
Earlybird, a real-time, reverse index based on Lucene, not only gave us an order of magnitude better performance than MySQL for real-time search, it doubled our memory efficiency and provided the flexibility to add relevance filtering. However, we still needed to replace the Ruby on Rails front-end, which was only capable of synchronous calls to Earlybird and had accrued significant technical debt through years of scaling and transition to Earlybird.
In April 2011, we launched a replacement, called Blender, which improved our search latencies by 3x, gave us 10x throughput, and allowed us to remove Ruby on Rails from the search infrastructure. Today, we are indexing an average of 2,200 TPS while serving 18,000 QPS (1.6B queries per day!). More importantly, Blender completed the infrastructure necessary to make the most significant user-facing change to Twitter search since the acquisition of Summize.
When the team launched Earlybird, we were all excited by its potential — it was fast and the code was clean and easy to extend. While on vacation in Germany, Michael Busch, one of our search engineers, implemented a demo of image and video search. A few weeks later, during Twitter’s first Hack Week, the search team, along with some members of other teams, completed the first demo of our new search experience. Feedback from the company was so positive that the demo became part of our product roadmap.
There is a lot of information on Twitter — on average, more than 2,200 new Tweets every second! During large events, for example the #tsunami in Japan, this rate can increase by 3 to 4x. Often, users are interested in only the most memorable Tweets or those that other users engage with. In our new search experience, we show search results that are most relevant to a particular user. So search results are personalized, and we filter out the Tweets that do not resonate with other users.
To support relevance filtering and personalization, we needed three types of signals:
Getting all of these signals into our index required changes to our ingestion pipeline, Earlybird (our reverse index), and Blender (our front-ends). We also created a new updater component that continually pushes resonance signals to Earlybird. In the ingestion pipeline, we added a pipeline stage that annotates Tweets with static information, for example, information about the user and the language of the Tweet’s text. The Tweets are then replicated to the Earlybird indexes (in real time), where we have extended Lucene’s internal data structures to support dynamic updates to arbitrary annotations. Dynamic updates, for example, the users’ interactions with Tweets, arrive over time from the updater. Together, Earlybird and the updater support a high and irregular rate of updates without requiring locks or slowing down searches.
At query time, a Blender server parses the user’s query and passes it along with the user’s social graph to multiple Earlybird servers. These servers use a specialized ranking function that combines relevance signals and the social graph to compute a personalized relevance score for each Tweet. The highest-ranking, most-recent Tweets are returned to the Blender, which merges and re-ranks the results before returning them to the user.
Twitter search architecture with support for relevance
Duplicate and near-duplicate Tweets are often not particularly helpful in Twitter search results. During popular and important events, when search should be most helpful to our users, nearly identical Tweets increase in number. Even when the quality of the duplicates is high, the searcher would benefit from a more diverse set of results. To remove duplicates we use a technique based on MinHashing, where several signatures are computed per Tweet and two Tweets sharing the same set of signatures are considered duplicates. The twist? Like everything at Twitter, brevity is key: We have a very small memory budget to store the signatures. Our algorithm compresses each Tweet to just 4 bytes while still identifying the vast majority of duplicates with very low computational requirements.
Twitter is most powerful when you personalize it by choosing interesting accounts to follow, so why shouldn’t your search results be more personalized too? They are now! Our ranking function accesses the social graph and uses knowledge about the relationship between the searcher and the author of a Tweet during ranking. Although the social graph is very large, we compress the meaningful part for each user into a Bloom filter, which gives us space-efficient constant-time set membership operations. As Earlybird scans candidate search results, it uses the presence of the Tweet’s author in the user’s social graph as a relevance signal in its ranking function.
Even users that follow few or no accounts will benefit from other personalization mechanisms; for example, we now automatically detect the searcher’s preferred language and location.
Images and videos have an amazing ability to describe people, places, and real-time events as they unfold. Take for example @jkrums’ Twitpic of US Airways Flight 1549 Hudson river landing, and @stefmara’s photos and videos of space shuttle Endeavour’s final launch.
There is a fundamental difference between searching for Tweets and searching for entities in Tweets, such as images and videos. In the former case, the decision about whether a Tweet matches a query can be made by looking at the text of the Tweet, with no other outside information. Additionally, per-Tweet relevance signals can be used to rank and compare matching Tweets to find the best ones. The situation is different when searching for images or videos. For example, the same image may be tweeted many times, with each Tweet containing different keywords that all describe the image. Consider the following Tweets:
One possible description of the image is formed from the union of keywords in the Tweets’ text; that is, “dog”, “Australian”, and “shepherd” all describe the image. If an image is repeatedly described by a term in the Tweet’s text, it is likely to be about that term.
So what makes this a difficult problem? Twitter allows you to search Tweets within seconds; images and photos in tweets should be available in realtime too! Earlybird uses inverted indexes for search. While these data structures are extremely efficient, they do not support inline updates, which makes it nearly impossible to append additional keywords to indexed documents.
If timeliness was not important, we could use MapReduce jobs that periodically aggregate keyword unions and produce inverted indexes. In these offline indexes, each link to an image or photo link would be a document, with the aggregated keywords as the document’s text. However, to meet our indexing latency goals, we would have to run these MapReduce jobs every few seconds, an impractical solution.
Instead, we extended Earlybird’s data structures to support efficient lookups of entities contained in Tweets. At query time, we look up the images and videos for matching Tweets and and store them in a custom hash map. The keys of the map are URLs and the values are score counters. Each time the same URL is added to the map, its corresponding score counter is incremented. After this aggregation is complete, the map is sorted and the best images and photos are returned for rendering.
The search team is excited to build innovative search products that drive discovery and help our users. While the new search experience is a huge improvement over pure real-time search, we are just getting started. In the coming months, we will improve quality, scale our infrastructure, expand our indexes, and bring relevance to mobile.
The following people contributed to the launch: Abhi Khune, Abdur Chowdhury, Aneesh Sharma, Ashok Banerjee, Ben Cherry, Brian Larson, Coleen Baik, David Chen, Frost Li, Gilad Mishne, Isaac Hepworth, Jon Boulle, Josh Brewer, Krishna Gade, Michael Busch, Mike Hayes, Nate Agrin, Patrick Lok, Raghavendra Prabu, Sarah Brown, Sam Luckenbill, Stephen Fedele, Tian Wang, Yi Zhuang, Zhenghua Li.
We would also like to thank the original Summize team, former team members, hack-week contributors, and management for their contributions and support.
Did someone say … cookies?