Studying rapidly evolving user interests

Monday, 4 June 2012

Twitter is an amazing real-time information dissemination platform. We’ve seen events of historical importance such as the Arab Spring unfold via Tweets. We even know that Twitter is faster than earthquakes! However, can we more scientifically characterize the real-time nature of Twitter?

One way to measure the dynamics of a content system is to test how quickly the distribution of terms and phrases appearing in it changes. A recent study we’ve done does exactly this: looking at terms and phrases in Tweets and in real-time search queries, we see that the most frequent terms in one hour or day tend to be very different from those in the next — significantly more so than in other content on the web. Informally, we call this phenomenon churn.

This week, we are presenting a short paper at the International Conference on Weblogs and Social Media (ICWSM 2012), in which @gilad and I examine this phenomenon. An extended version of the paper, titled “A Study of ‘Churn’ in Tweets and Real-Time Search Queries”, is available here. Some highlights:

Examining all search queries from October 2011, we see that, on average, about 17% of the top 1000 query terms from one hour are no longer in the top 1000 during the next hour. In other words, 17% of the top 1000 query terms “churn over” on an hourly basis.
Repeating this at a granularity of days instead of hours, we still find that about 13% of the top 1000 query terms from one day are no longer in the top 1000 during the next day.
During major events, query frequency spikes dramatically. For example, on October 5, immediately following news of the death of Apple co-founder and CEO Steve Jobs, the query “steve jobs” spiked from a negligible fraction of query volume to 15% of the query stream: almost one in six of all queries issued! Check it out: the query volume is literally off the charts! Notice that related queries such as “apple” and “stay foolish” spiked as well.
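To make the churn numbers above concrete, here is a minimal sketch of how such a figure could be computed from two consecutive intervals of query terms. This is an illustration only, not the code used in the paper: the function names `top_k_terms` and `churn` and the toy data are assumptions, and the paper may define and measure churn in more detail.

```python
from collections import Counter


def top_k_terms(terms, k):
    """Return the set of the k most frequent terms in an iterable of terms."""
    return {term for term, _ in Counter(terms).most_common(k)}


def churn(prev_terms, next_terms, k=1000):
    """Fraction of the earlier interval's top-k terms missing from the later top-k."""
    prev_top = top_k_terms(prev_terms, k)
    next_top = top_k_terms(next_terms, k)
    if not prev_top:
        return 0.0
    return len(prev_top - next_top) / len(prev_top)


# Toy data (hypothetical): one list of query terms per hour.
hour_1 = ["apple"] * 3 + ["nba"] * 2 + ["weather"]
hour_2 = ["steve jobs"] * 3 + ["apple"] * 2 + ["nba"]

# With k=3, "weather" falls out of the top 3 in the second hour,
# so one of three top terms has churned over.
hourly_churn = churn(hour_1, hour_2, k=3)
print(f"{hourly_churn:.2f}")
```

With real data, `k` would be 1000 and each list would hold every query term seen in that hour or day.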

What does this mean? News breaks on Twitter, whether local or global, of narrow or broad interest. When news breaks, Twitter users flock to the service to find out what’s happening. Our goal is to instantly connect people everywhere to what’s most meaningful to them; the speed at which our content (and the relevance signals stemming from it) evolves makes this more technically challenging, and we are hard at work continuously refining our relevance algorithms to address this. Just to give one example: search, boiled down to its basics, is about computing term statistics such as term frequency and inverse document frequency. Most algorithms assume some static notion of the underlying distributions, which surely isn’t the case here!
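As a rough illustration of why a static distribution is a poor fit, the sketch below computes inverse document frequency for the same term over two different time windows. It uses one common smoothed form, log((1 + N) / (1 + df)), which is an assumption on my part and not necessarily what our search system uses; the `idf` helper and the tiny token lists are hypothetical.

```python
import math


def idf(term, documents):
    """Smoothed inverse document frequency of a term over a list of token lists."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    return math.log((1 + n_docs) / (1 + doc_freq))


# Hypothetical tokenized Tweets from two windows around a breaking story.
window_before = [
    ["new", "jobs", "report"],
    ["lunch", "time"],
    ["watching", "the", "game"],
    ["good", "morning"],
]
window_after = [
    ["rip", "steve", "jobs"],
    ["steve", "jobs", "stay", "foolish"],
    ["thank", "you", "steve", "jobs"],
    ["good", "morning"],
]

idf_before = idf("jobs", window_before)
idf_after = idf("jobs", window_after)
print(f"before: {idf_before:.3f}, after: {idf_after:.3f}")
```

The term becomes far less discriminative once the story breaks, so a ranking function that keeps using the earlier statistic would overweight it.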

In addition, we’re presenting a paper at the co-located workshop on Social Media Visualization, where @miguelrios and I share some of our experiences in using data visualization techniques to generate insights from the petabytes of data in our data warehouse. You’ve seen some of these visualizations before, such as those about the 2010 World Cup and the 2011 Japan earthquake. In the paper, we present another visualization, showing seasonal variation in tweeting patterns for users in four different cities (New York City, Tokyo, Istanbul, and Sao Paulo). The gradient from white to yellow to red indicates the amount of activity (light to heavy). Each tile in the heatmap represents five minutes of a given day, and colors are normalized by day. This was developed internally to understand why growth patterns in Tweet production experience seasonal variations.

We see different patterns of activity between the four cities. For example, waking/sleeping times are relatively constant throughout the year in Tokyo, but the other cities exhibit seasonal variations. We see that Japanese users’ activities are concentrated in the evening, whereas in the other cities there is more usage during the day. In Istanbul, nights get shorter during August; Sao Paulo shows a time interval during the afternoon when Tweet volume goes down, and also longer nights during the entire year compared to the other three cities.
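For readers curious about how a heatmap like the one described above might be assembled, here is a minimal sketch of the binning step: Tweets are counted in five-minute tiles (288 per day) and each day's row is then scaled independently. The details are assumptions, not our production pipeline: the `heatmap_rows` name is hypothetical, timestamps are assumed to already be in the city's local time, and "normalized by day" is interpreted here as dividing by that day's busiest tile.

```python
from collections import defaultdict
from datetime import datetime

TILES_PER_DAY = 24 * 60 // 5  # 288 five-minute tiles


def heatmap_rows(timestamps):
    """Map each date to 288 values in [0, 1], scaled by that day's busiest tile."""
    counts = defaultdict(lambda: [0] * TILES_PER_DAY)
    for ts in timestamps:
        tile = (ts.hour * 60 + ts.minute) // 5
        counts[ts.date()][tile] += 1
    rows = {}
    for day, tiles in counts.items():
        peak = max(tiles)  # at least 1, since the day has a Tweet
        rows[day] = [count / peak for count in tiles]
    return rows


# Hypothetical local-time Tweet timestamps for a single day.
sample = [
    datetime(2011, 8, 1, 9, 1),
    datetime(2011, 8, 1, 9, 3),
    datetime(2011, 8, 1, 21, 30),
]
rows = heatmap_rows(sample)
row = rows[datetime(2011, 8, 1).date()]
print(row[108], row[258])
```

Each value can then be mapped onto the white-to-yellow-to-red gradient, with one row per day stacked to show the whole year.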

Finally, we’re also giving a keynote at the co-located workshop on Real-Time Analysis and Mining of Social Streams (RAMSS), fitting very much into the theme of our study. We’ll be reviewing many of the challenges of handling real-time data, including many of the issues described above.

Interested in real-time systems that deliver relevant information to users? Interested in data visualization and data science? We’re hiring! Join the flock!

- Jimmy Lin, Research Scientist, Analytics (@lintool)