Introducing Omnisearch

Thursday, 5 May 2016

Twitter has more than 310 million monthly active users who send hundreds of millions of Tweets per day, from all over the world. To make sure everyone sees the Tweets that matter most to them, we’ve been working on features that bring the best content to the forefront. We’ve refreshed the Home timeline to highlight the best Tweets first, introduced tailored content via Highlights for Android, and personalized the search results and trends pages.

The performance of these products depends on finding the most relevant Tweets from a large set of candidates, based on a product-specific definition of “relevant.” From an engineering point of view, we treat these as information retrieval problems where the documents are Tweets and the product is defined by a query. For example, to show you the best Tweets first in your Home timeline, we might first find candidate Tweets with a search query restricted to the accounts you follow.

In our search index, we rank the candidate Tweets and select the best few using signals that correlate with the likelihood you’ll engage with them. By building products on search infrastructure, we can speed up development while reducing the risk inherent in developing new custom infrastructure. Furthermore, this approach lets product teams experiment quickly: changing a search query is easy!
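As a rough illustration of the retrieve-then-rank pattern described above, here is a minimal Python sketch. Everything in it is hypothetical: the `Tweet` fields, the `find_candidates` and `best_tweets` helpers, and the scoring weights are assumptions made for the example, not the actual query language, signals, or ranking used in production.

```python
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    author: str
    text: str
    # Hypothetical engagement signals; the real signals are not specified here.
    likes: int = 0
    retweets: int = 0


def find_candidates(index, followed):
    """Retrieval step: keep only Tweets whose author is in `followed`."""
    return [t for t in index if t.author in followed]


def score(tweet):
    """Toy relevance score; the weights are made up for illustration."""
    return 1.0 * tweet.likes + 2.0 * tweet.retweets


def best_tweets(index, followed, k=2):
    """Retrieve candidates, then return the k highest-scoring ones."""
    candidates = find_candidates(index, followed)
    return heapq.nlargest(k, candidates, key=score)


index = [
    Tweet(1, "alice", "hello", likes=3, retweets=0),
    Tweet(2, "bob", "lunch", likes=10, retweets=5),
    Tweet(3, "carol", "news", likes=1, retweets=4),
    Tweet(4, "alice", "launch", likes=2, retweets=6),
]

top = best_tweets(index, followed={"alice", "carol"}, k=2)
print([t.tweet_id for t in top])  # [4, 3]
```

In this framing, changing the product mostly means changing the filter in `find_candidates` or the weights in `score`, which is why experimenting with a query is cheap compared with building new infrastructure.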

Omnisearch: an information retrieval system

To meet the needs of these (and future) products, we’re building a new information retrieval system called Omnisearch. Like databases, information retrieval systems match documents (e.g., Tweets or web pages) to users’ queries. However, unlike databases, the documents are usually ranked and may not exactly match the query. While the most common application of information retrieval systems is web search, their power and flexibility make them useful for many other products.
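The difference between exact matching and ranked, inexact matching can be shown with a toy Python sketch. The term-overlap score used here is a deliberately simple stand-in chosen for the example; it is an assumption, not the scoring that Omnisearch or Earlybird uses.

```python
def exact_match(docs, query):
    """Database-style lookup: return only documents equal to the query."""
    return [d for d in docs if d == query]


def ranked_search(docs, query):
    """IR-style lookup: score documents by shared terms, best first."""
    query_terms = set(query.lower().split())
    scored = []
    for d in docs:
        overlap = len(query_terms & set(d.lower().split()))
        if overlap > 0:
            scored.append((overlap, d))
    # Sort by descending overlap; the stable sort keeps ties in input order.
    scored.sort(key=lambda pair: -pair[0])
    return [d for _, d in scored]


docs = ["cute cat gif", "cat videos", "dog gif", "weather report"]

print(exact_match(docs, "cat gif"))    # []
print(ranked_search(docs, "cat gif"))  # ['cute cat gif', 'cat videos', 'dog gif']
```

No document equals the query, so the exact lookup returns nothing, while the ranked search still returns partial matches ordered by how well they fit.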

Earlybird, our Lucene-based search indexing technology, has been rock-solid for years. We index new Tweets within seconds and have a large archival index of all public Tweets. These systems can handle upwards of 60K queries per second while simultaneously indexing up to 80K Tweets per second. We maintain a service level agreement (SLA) of 99.98% uptime and (for most queries) our latencies are under 100ms. However, Earlybird was primarily built to serve Twitter search and trends landing pages, so it lacks some of the flexibility we need for building the next generation of Twitter’s products.
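To put the 99.98% uptime figure in perspective, the short Python calculation below converts an uptime percentage into a downtime budget. It assumes the SLA is measured over plain calendar time (a 365-day year or a 30-day month); how the SLA is actually measured is not stated here, and the function name is made up for the example.

```python
def allowed_downtime_minutes(uptime_percent, period_days):
    """Minutes of downtime permitted by an uptime SLA over a period."""
    minutes_in_period = period_days * 24 * 60
    return (1 - uptime_percent / 100) * minutes_in_period


per_year = allowed_downtime_minutes(99.98, 365)
per_month = allowed_downtime_minutes(99.98, 30)

print(round(per_year, 2))   # 105.12 minutes per 365-day year
print(round(per_month, 2))  # 8.64 minutes per 30-day month
```

Under these assumptions, the budget is under two hours of downtime per year.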

The new products we want to build will require fields, operators, and ranking signals to be added to our indexes. For example, to build a new media product we might want a new operator to find Tweets with GIFs, or to use content as a ranking signal. When we started Omnisearch, this was difficult or impossible. Our in-memory Earlybird indexes were already using most of their available memory, and our archival indexes required a slow rebuild before new fields and signals became available.

After several months of work, we have removed these hard limitations. When we’re done with Omnisearch, product engineering teams will be able to add new fields, operators, and ranking signals on their own.

Different products also have different reliability and scaling requirements. This is challenging in both directions. For small applications, a technology that can serve 100K QPS may be overly complex. On the other hand, we’re nearing the scaling limits of the current architecture: we can scale what we have by 2x but not by 10x. Twitter’s Home timeline requires higher uptime than some of our other products because when it’s down, Twitter is down.

Ultimately, we’d like Omnisearch to scale by an order of magnitude beyond our current systems in several dimensions: success rate, indexing latency, query latency, indexing throughput, and query throughput. In the next phase of development for Omnisearch, we will be tackling these challenges via a series of architectural projects.

Finally, new products may require completely new indexes. For example, we may want to build indexes of Moments, Vines, and Periscope broadcasts. Currently, the core search infrastructure team only maintains indexes of Tweets and users. While bringing up new Earlybird indexes for other types of documents is possible, it requires custom work and the engineering cost often outweighs the value.

Our ultimate vision for Omnisearch is to provide search as a service, allowing us to build entirely new kinds of products.

Over the coming months, our search infrastructure team will be publishing a series of blog posts detailing the transformation of our existing infrastructure into Omnisearch. Stay tuned!