Improving Twitter search with real-time human computation

One of the magical things about Twitter is that it opens a window to the world in real-time. An event happens, and seconds later, people share it across the planet.

Consider, for example, what happened when Flight 1549 crashed in the Hudson River. Or when Osama bin Laden was killed. Or when Obama was re-elected.

When each of these events happened, people instantly came to Twitter, and in particular searched on Twitter to discover what was happening.

From a search and advertising perspective, however, these sudden events pose several challenges:

  1. The queries people perform have probably never before been seen, so it’s impossible to know without very specific context what they mean. How would you know that #bindersfullofwomen refers to politics, and not office accessories, or that people searching for “horses and bayonets” are interested in the Presidential debates?

  2. Since these spikes in search queries are so short-lived, there’s only a small window of opportunity to learn what they mean.

So an event happens, people instantly come to Twitter to search for the event, and we need to teach our systems what these queries mean as quickly as we can — because in just a few hours, the search spike will be gone.

How do we do this? We’ve built a real-time human computation engine to help us identify search queries as soon as they’re trending, send these queries to real humans to be judged, and then incorporate the human annotations into our back-end models.

Overview

Before we delve into the details, here’s an overview of how the system works.

  1. First, we monitor for which search queries are currently popular.
    Behind the scenes: we run a Storm topology that tracks statistics on search queries.
    For example, the query [Big Bird] may suddenly see a spike in searches from the US.

  2. As soon as we discover a new popular search query, we send it to our human evaluators, who are asked a variety of questions about the query.
    Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon’s Mechanical Turk service, and then polls Mechanical Turk for a response.
    For example: as soon as we notice [Big Bird] spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant Tweets and ads.

  3. Finally, after a response from an evaluator is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our evaluators tell us that [Big Bird] is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.
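
To make the shape of this pipeline concrete, here's a minimal Java sketch of how the three stages might fit together. Every name in it (RealTimeAnnotationPipeline, PopularityMonitor, and so on) is a hypothetical stand-in for illustration; it shows the flow of data between the stages, not our actual code.

```java
// Hypothetical glue code: illustrates the three stages above, not Twitter's actual classes.
class RealTimeAnnotationPipeline {

    /** Stage 1: the Storm topology flags queries that are suddenly popular. */
    interface PopularityMonitor {
        boolean isNewlyPopular(String query);
    }

    /** Stage 2: a Thrift-backed service that asks Mechanical Turk judges about a query. */
    interface HumanEvaluationService {
        QueryAnnotation judge(String query);   // simplified: blocks until a judge answers
    }

    /** Stage 3: storage that the search and ads models read at serving time. */
    interface AnnotationStore {
        void save(String query, QueryAnnotation annotation);
    }

    /** A judge's answers, e.g. category = "politics" for [Big Bird] during a debate. */
    static class QueryAnnotation {
        final String category;
        QueryAnnotation(String category) { this.category = category; }
    }

    private final PopularityMonitor monitor;
    private final HumanEvaluationService judges;
    private final AnnotationStore store;

    RealTimeAnnotationPipeline(PopularityMonitor monitor,
                               HumanEvaluationService judges,
                               AnnotationStore store) {
        this.monitor = monitor;
        this.judges = judges;
        this.store = store;
    }

    /** Called for each query observed in the search logs. */
    void onQuery(String query) {
        if (monitor.isNewlyPopular(query)) {
            store.save(query, judges.judge(query));
        }
    }
}
```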

Monitoring for popular queries

Storm is a distributed system for real-time computation. In contrast to batch systems like Hadoop, which often introduce delays of hours or more, Storm allows us to run online data processing algorithms to discover search spikes as soon as they happen. In brief, running a job on Storm involves creating a Storm topology that describes the processing steps that must occur, and deploying this topology to a Storm cluster. A topology itself consists of three things:

  1. Tuple streams of data. In our case, these may be tuples of (search query, timestamp).

  2. Spouts that produce these tuple streams. In our case, we attach spouts to our search logs, which get written to every time a search occurs.

  3. Bolts that process tuple streams. In our case, we use bolts for operations like updating total query counts, filtering out non-English queries, and checking whether an ad is currently being served up for the query.
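
To give a feel for how these pieces are wired together, here's a rough sketch using Storm's TopologyBuilder API. The spout and bolt classes (SearchLogSpout, EnglishFilterBolt, QueryPopularityBolt) are hypothetical placeholders assumed to be defined elsewhere, and the parallelism hints are illustrative.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class QueryPopularityTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout emitting (query, timestamp, ...) tuples read from the search logs.
        // SearchLogSpout is a hypothetical stand-in for a Kafka-backed spout.
        builder.setSpout("search-queries", new SearchLogSpout(), 4);

        // Bolt that filters out non-English queries (hypothetical class).
        builder.setBolt("english-filter", new EnglishFilterBolt(), 4)
               .shuffleGrouping("search-queries");

        // Bolt that tracks per-query counts and decides when a query is spiking;
        // a sketch of it appears after the walkthrough below. Grouping by the
        // "query" field routes all occurrences of a query to the same task.
        builder.setBolt("popularity", new QueryPopularityBolt(), 8)
               .fieldsGrouping("english-filter", new Fields("query"));

        // Submit locally for testing; production would use StormSubmitter instead.
        new LocalCluster().submitTopology("query-popularity", new Config(), builder.createTopology());
    }
}
```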

Here’s a step-by-step walkthrough of how our query topology works:

  1. Whenever you perform a search on Twitter, the search request gets logged to a Kafka queue.

  2. The Storm topology attaches a spout to this Kafka queue, and the spout emits a tuple containing the query and other metadata (e.g., the time the query was issued and its location) to a bolt for processing.

  3. This bolt updates the count of the number of times we’ve seen this query, checks whether the query is “currently popular” (using various statistics like time-decayed counts, the geographic distribution of the query, and the last time this query was sent for annotations), and dispatches it to our human computation pipeline if so.

One interesting feature of our popularity algorithm is that we often re-judge queries that have been annotated before, since the intent of a search can change. For example, people may normally search for [Clint Eastwood] because they’re interested in his movies, but during the 2012 Republican National Convention users may have wanted to see Tweets related to his speech there.
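
As a rough illustration of what such a bolt might look like, here's a hedged sketch that keeps an exponentially decayed count per query and only dispatches a query for judgment when that count spikes and the query hasn't been judged recently. The field names, half-life, threshold, and re-judge interval are all illustrative guesses, not our production logic.

```java
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

// Hypothetical popularity bolt: decayed counts plus a simple re-judge rule.
public class QueryPopularityBolt extends BaseRichBolt {
    private static final double HALF_LIFE_MS = 10 * 60 * 1000;            // illustrative: 10-minute half-life
    private static final double SPIKE_THRESHOLD = 500.0;                  // illustrative
    private static final long REJUDGE_INTERVAL_MS = 6 * 60 * 60 * 1000;   // illustrative: re-ask after 6 hours

    private Map<String, Double> decayedCounts;
    private Map<String, Long> lastSeen;
    private Map<String, Long> lastJudged;
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.decayedCounts = new HashMap<String, Double>();
        this.lastSeen = new HashMap<String, Long>();
        this.lastJudged = new HashMap<String, Long>();
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String query = tuple.getStringByField("query");
        long now = tuple.getLongByField("timestamp");

        // Decay the previous count toward zero, then add this occurrence.
        double previous = decayedCounts.containsKey(query) ? decayedCounts.get(query) : 0.0;
        long elapsed = lastSeen.containsKey(query) ? now - lastSeen.get(query) : 0L;
        double count = previous * Math.pow(0.5, elapsed / HALF_LIFE_MS) + 1.0;
        decayedCounts.put(query, count);
        lastSeen.put(query, now);

        // Dispatch for human judgment if the query is spiking and wasn't judged
        // recently; re-judging lets us catch changes in intent (e.g. [Clint Eastwood]).
        Long judgedAt = lastJudged.get(query);
        boolean recentlyJudged = judgedAt != null && now - judgedAt < REJUDGE_INTERVAL_MS;
        if (count > SPIKE_THRESHOLD && !recentlyJudged) {
            lastJudged.put(query, now);
            collector.emit(tuple, new Values(query, now));  // a downstream bolt calls the Thrift dispatcher
        }
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("query", "timestamp"));
    }
}
```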

Human evaluation of popular search queries

We use human computation for a variety of tasks. (See also Clockwork Raven, an open-source project we built that makes launching tasks easier.) For example, we often run experiments to measure ad relevance and search quality, and we use it to gather data to train and evaluate our machine learning models. In this section we'll describe how we use it to boost our understanding of popular search queries.

Suppose that our Storm topology has detected that the query [Big Bird] is suddenly spiking. Since the query may remain popular for only a few hours, we send it off to live humans, who can help us quickly understand what it means; this dispatch is performed via a Thrift service that allows us to design our tasks in a web frontend, and later programmatically submit them to Mechanical Turk using any of the different languages we use across Twitter.
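
Here's a hedged sketch of what a client of such a service might look like. QueryAnnotationService and QueryJudgment stand in for types that would be generated from a Thrift IDL; the service name, methods, host, port, and task template are all hypothetical, and only the Thrift transport plumbing is standard.

```java
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

// Hypothetical client for the annotation dispatch service described above.
public class AnnotationDispatchExample {
    public static void main(String[] args) throws Exception {
        TTransport transport = new TFramedTransport(new TSocket("annotation-service.example.com", 9090));
        transport.open();
        QueryAnnotationService.Client client =
            new QueryAnnotationService.Client(new TBinaryProtocol(transport));

        // Submit the spiking query as a Mechanical Turk task built from a template
        // designed in the web frontend (the template name is made up).
        long taskId = client.submitQuery("Big Bird", "query-categorization-template");

        // Poll until a judge responds; the real pipeline would do this asynchronously.
        QueryJudgment judgment = client.getJudgment(taskId);
        while (!judgment.isComplete()) {
            Thread.sleep(5000);
            judgment = client.getJudgment(taskId);
        }
        System.out.println("Category for [Big Bird]: " + judgment.getCategory());

        transport.close();
    }
}
```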

On Mechanical Turk, judges are asked several questions about the query that help us serve better ads. Without going into the exact questions, here are a few representative possibilities:

- What category does the query belong to? For example, [Stanford] may typically be an education-related query, but perhaps there’s a football game between Stanford and Berkeley at the moment, in which case the current search intent would be sports.

- Does the query refer to a person? If so, who? And, what is their Twitter handle if they have one? For example, the query [Happy Birthday Harry] may be trending, but it’s hard to know beforehand which of the numerous celebrities named Harry it’s referring to. Is it One Direction’s Harry Styles, in which case the searcher is likely to be interested in teen pop? Harry Potter, in which case the searcher is likely to be interested in fantasy novels? Or someone else entirely?

Turkers in the machine

Since humans are core to this system, our workforce was designed to give us fast and reliable results.

To complete all our tasks, we use a small custom pool of Mechanical Turk judges to ensure high quality. Other typical possibilities in the crowdsourcing world are to use a static set of in-house judges, to use the standard worker filters that Amazon provides, or to go through an outside company like Crowdflower. We've experimented with these other solutions, and while they have their own benefits, we found that a custom pool fit our needs best for a few reasons:

- A typical industry standard for human evaluation is to use in-house judges. They usually provide high-quality work, as they become experts in the evaluation task domain. Unfortunately, in-house judges are hard to scale, since they require standardized hiring processes to be in place. They also tend to be more expensive, can be harder to communicate with, and their schedules can be difficult to work with.

- Using the standard sets of workers that Mechanical Turk or other crowdsourcing platforms provide makes it easy to scale the workforce, but we've found that their quality doesn't always meet our needs. Two methods of ensuring high quality are to seed gold-standard examples (for which you know the true response) throughout your task, or to use statistical analysis to determine which workers are the good ones. But gold-standard examples can be time-consuming and expensive to create, and we often run tasks with free-response answers or that require background research, for which these solutions don't work. Another problem is that these filters give you a fluid and constantly changing set of workers, which makes them hard to train.

In contrast:

- Our custom pool of judges works virtually all day. For many of them, this is a full-time job, and they're geographically distributed, so our tasks complete quickly at all hours. We can easily ask for thousands of judgments before lunch and have them finished by the time we need them, which makes iterating on our experiments much easier.

- We have several forums, mailing lists, and even live chat rooms set up, all of which makes it easy for judges to ask us questions and to respond to feedback. Our judges will even give us suggestions on how to improve our tasks; for example, when we run categorization tasks, they’ll often report helpful categories that we should add.

- Since we only launch tasks on demand, and Amazon provides a ready source of workers if we ever need more, our judges are never twiddling their thumbs waiting for tasks or completing busywork, and our jobs are rarely backlogged.

- Because our judges are culled from the best of Mechanical Turk, they're experts at the kinds of tasks we send, and can often provide higher quality at a faster rate than even in-house judges. For example, they'll often use the forums and chatrooms to collaborate amongst themselves to give us the best judgments, and they're already familiar with the Firefox and Chrome scripts that help them be most efficient at their work.

All the benefits described above are especially valuable in this real-time search annotation case:

- Having highly trusted workers means we don't need to wait for multiple annotations on a single search query to confirm validity, so we can send responses to our backend as soon as a single judge responds. This entire pipeline is designed for real-time, after all, so the lower the latency on the human evaluation step, the better.

- The static nature of our custom pool means that the judges are already familiar with our questions, and don’t need to be trained again.

- Because our workers aren’t limited to a fixed schedule or location, they can work anywhere, anytime — which is a requirement for this system, since global event spikes on Twitter are not limited to a standard 40-hour work week.

- And with the multiple easy avenues of communication we have set up, it’s easy for us to answer questions that might arise when we add new questions or modify existing ones.

Singing telegram summary

As an example of the kind of high-quality work our workers provide, we crowdsourced a singing telegram to celebrate the project's launch. Here's what they came up with:

This video was created entirely by our workers, from the crowdsourced lyrics to the crowdsourced graphics, and even the piano playing and singing. Special thanks to our amazing Turker, workasaurusrex, the musician and silky smooth crooner who brought the masterpiece together.

Many thanks to the Revenue and Storm teams, as well as our Turkers, for their help in launching this project.

Posted by Edwin Chen and Alpa Jain