Millions of people all over the world search on Twitter every day to see what’s happening. During major events such as the recent Euro 2016 final, we observe record traffic spikes as people turn to Twitter to find timely information and perspectives, and overall traffic volume has been steadily increasing over time. The Search Quality team at Twitter works on returning the best quality results for our users.
Compared to traditional information retrieval applications, Twitter search poses a unique set of challenges.
In order to return relevant, high quality search results at this scale with low latency, we need to solve interesting and novel technical challenges in a variety of areas: information retrieval, natural language processing, machine learning, distributed systems, data science, etc.
Over the last few months, we’ve made significant investments in our search relevance infrastructure with the goal of improving ranking capabilities and experimentation efficiency. This post highlights some of this work. Note that this is distinct from our core indexing and retrieval platform components that we query in production to retrieve Tweets (unranked).
Real-Time Signal Ingester
The variety and timeliness of signals used by our ranking models have a huge impact on the ultimate quality of search results. Additionally, many of these signals mutate rapidly after the Tweets have been indexed, so we need to keep them up to date. We wrote a new Heron-based signal ingester to process streams of raw signals and produce features for our ranking components to use in production. We added flexible schemas for encoding and decoding new feature updates dynamically with minimal code changes and operational overhead. As the Twitter app evolves, we can quickly add and test new ranking signals that become available and appear promising in offline experiments.
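To illustrate the idea of schema-driven feature decoding, here is a minimal sketch in Python. All of the names here (`FeatureSchema`, `decode_update`, the feature ids) are illustrative assumptions, not Twitter's actual ingester API; the point is that registering a new signal becomes a one-line schema addition rather than a code change to the processing pipeline.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

@dataclass(frozen=True)
class FeatureSchema:
    """Names a feature and knows how to decode its raw payload."""
    name: str
    decode: Callable[[bytes], Any]  # raw bytes -> typed feature value

# Hypothetical registry: adding a new ranking signal is one entry here.
SCHEMAS: Dict[int, FeatureSchema] = {
    1: FeatureSchema("favorite_count", lambda b: int.from_bytes(b, "big")),
    2: FeatureSchema("retweet_count", lambda b: int.from_bytes(b, "big")),
}

def decode_update(feature_id: int, payload: bytes) -> Tuple[str, Any]:
    """Turn a raw (feature_id, payload) stream event into a named feature."""
    schema = SCHEMAS[feature_id]
    return schema.name, schema.decode(payload)
```

In a real stream processor, the decoded `(name, value)` pairs would be written to a feature store keyed by Tweet id, where ranking components read them at query time.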
Fast, Lightweight Experimentation
The faster and cheaper we can make the ideate → test → iterate loop, the more ideas we can test and the more we can innovate. We make heavy use of traditional A/B testing, but we’ve also built a complementary offline experimentation system to test changes more efficiently. Twitter search results and queries churn rapidly, so to separate signal from noise we built a sandbox environment that freezes the state of the world at a given point in time, letting us generate stable, reproducible results for any change we want to test. To gain better insight, we’ve added tooling to analyze and display differences between results, and to easily obtain judgment labels from in-house human raters based on our Search Quality Judgment Guidelines. One particularly nice benefit is that this allows us to validate expensive index changes, e.g. adding new index fields for retrieval, tokenization updates, etc., and refine them before deploying to production.
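The kind of result diffing described above can be sketched very simply. This is a hypothetical helper, not Twitter's tooling: given the ranked result ids from a control run and an experiment run against the same frozen sandbox, it summarizes what was added, removed, or reordered.

```python
from typing import Any, Dict, List

def diff_results(control: List[Any], experiment: List[Any]) -> Dict[str, List[Any]]:
    """Summarize differences between two ranked result lists.

    Because both runs are produced against the same frozen snapshot,
    every difference is attributable to the change under test.
    """
    control_set, experiment_set = set(control), set(experiment)
    return {
        # results the experiment surfaced that control did not
        "added": [r for r in experiment if r not in control_set],
        # results control returned that the experiment dropped
        "removed": [r for r in control if r not in experiment_set],
        # shared results whose rank position changed
        "moved": [r for r in experiment
                  if r in control_set and experiment.index(r) != control.index(r)],
    }
```

Aggregating these diffs across a fixed query set gives a quick churn estimate before spending human-rater time on side-by-side judgments.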
Training and Deploying Machine Learned Models
Machine learned models are commonly used for search ranking as they provide a principled and automatic way to optimize feature weights and integrate new ranking features. To make them work well, it’s important to identify the right objective functions to optimize that correlate well with ultimate customer satisfaction. We established a pipeline to seamlessly collect training data sets for model training and validation, and deploy trained models to production servers. Scale brings additional challenges, e.g. the first stage of search ranking happens on index shards within a very tight loop where a large number of matching documents for a query are scored under strict CPU, memory and latency constraints. We worked with the Twitter Cortex team to create a lightweight runtime that enables running models under these constraints and deployed ranking models trained using our internal ML platform tools, e.g. Whetlab.
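As a rough illustration of the first-stage constraint, here is a minimal Python sketch of scoring many candidate documents with a lightweight linear model and retaining only the top k. This is an assumption-laden toy, not the Cortex runtime: the real system runs inside index shards under far stricter CPU and latency budgets, but the shape of the hot loop — cheap per-document scoring plus a bounded top-k structure — is the same idea.

```python
import heapq
from typing import Iterable, List, Sequence, Tuple

def score(features: Sequence[float], weights: Sequence[float]) -> float:
    """A simple dot product: cheap enough for a per-shard hot loop."""
    return sum(f * w for f, w in zip(features, weights))

def top_k(candidates: Iterable[Tuple[str, Sequence[float]]],
          weights: Sequence[float],
          k: int) -> List[Tuple[float, str]]:
    """Score every matching document but keep only the k best.

    heapq.nlargest bounds memory to O(k) regardless of how many
    candidates the query matches.
    """
    return heapq.nlargest(
        k, ((score(feats, weights), doc_id) for doc_id, feats in candidates)
    )
```

Later ranking stages can then afford heavier models, since they only see the small candidate set that survives this stage.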
These are critical building blocks that have allowed us to test and ship many relevance gains making search better for our users. In future posts, we’ll dive deeper into specific aspects of search quality and projects we’re currently working on. Stay tuned!
The Search Quality team is Tian Wang, Juan Caicedo, Zhezhe Chen, Jinliang Fan, Lisa Huang, Gianna Badiali, Yan Xia and Yatharth Saraf. We would also like to thank the Search Infrastructure, Heron and Cortex teams for invaluable assistance at various stages.