Twitter is an amazing real-time information dissemination platform. We’ve seen events of historical importance such as the Arab Spring unfold via Tweets. We even know that Twitter is faster than earthquakes! However, can we more scientifically characterize the real-time nature of Twitter?
One way to measure the dynamics of a content system is to test how quickly the distribution of terms and phrases appearing in it changes. A recent study of ours does exactly this. Looking at terms and phrases in Tweets and in real-time search queries, we see that the most frequent terms in one hour or day tend to be very different from those in the next, significantly more so than in other content on the web. Informally, we call this phenomenon churn.
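To make the idea concrete, here is a minimal sketch of one simple way to quantify churn: the fraction of the top-k terms in one interval that drop out of the top-k in the next. This is an illustration under our own assumptions (whitespace tokenization, a tiny made-up sample, and the hypothetical helper names `top_terms` and `churn`), not necessarily the exact measure used in the paper.

```python
from collections import Counter


def top_terms(docs, k):
    """Return the set of the k most frequent whitespace-delimited terms."""
    counts = Counter(term for doc in docs for term in doc.lower().split())
    return {term for term, _ in counts.most_common(k)}


def churn(prev_docs, next_docs, k=3):
    """Fraction of the top-k terms in one interval missing from the next top-k."""
    prev_top = top_terms(prev_docs, k)
    next_top = top_terms(next_docs, k)
    if not prev_top:
        return 0.0
    return len(prev_top - next_top) / len(prev_top)


# Made-up sample data standing in for two consecutive hours of Tweets.
hour_1 = ["earthquake tokyo", "earthquake tokyo news", "earthquake news"]
hour_2 = ["worldcup goal", "worldcup goal news", "worldcup news", "goal worldcup"]

churn_value = churn(hour_1, hour_2, k=3)
```

A value near 0 means the most frequent terms barely changed between intervals, while a value near 1 means they were almost entirely replaced. In this toy sample only "news" survives from one hour to the next.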
This week, we are presenting a short paper at the International Conference on Weblogs and Social Media (ICWSM 2012), in which @gilad and I examine this phenomenon. An extended version of the paper, titled “A Study of ‘Churn’ in Tweets and Real-Time Search Queries”, is available here. Some highlights:
What does this mean? News breaks on Twitter, whether local or global, of narrow or broad interest. When news breaks, Twitter users flock to the service to find out what’s happening. Our goal is to instantly connect people everywhere to what’s most meaningful to them; the speed at which our content (and the relevance signals stemming from it) evolves makes this more technically challenging, and we are hard at work continuously refining our relevance algorithms to address this. Just to give one example: search, boiled down to its basics, is about computing term statistics such as term frequency and inverse document frequency. Most algorithms assume some static notion of underlying distributions, which surely isn’t the case here!
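As a rough illustration of why static term statistics are a poor fit, the sketch below computes inverse document frequency for the same term over two different time windows. The smoothed formula, the hypothetical `idf` helper, and the sample documents are all assumptions chosen for the example; they are not a description of our production ranking.

```python
import math


def idf(term, docs):
    """Smoothed inverse document frequency: log((N + 1) / (df + 1)) + 1."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc.lower().split())
    return math.log((n + 1) / (df + 1)) + 1


# Made-up windows: before and after a story breaks.
window_before = ["good morning", "coffee time", "earthquake drill today", "lunch plans"]
window_after = ["earthquake now", "big earthquake", "earthquake felt here", "lunch plans"]

idf_before = idf("earthquake", window_before)  # term is rare, so IDF is high
idf_after = idf("earthquake", window_after)    # term is common, so IDF is low
```

The same term carries a very different weight depending on which window the statistics come from, so a ranking function that relies on stale statistics would misjudge how distinctive the term currently is.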
In addition, we’re presenting a paper at the co-located workshop on Social Media Visualization, where @miguelrios and I share some of our experiences in using data visualization techniques to generate insights from the petabytes of data in our data warehouse. You’ve seen some of these visualizations before, for example, about the 2010 World Cup and 2011 Japan earthquake. In the paper, we present another visualization, of seasonal variation of tweeting patterns for users in four different cities (New York City, Tokyo, Istanbul, and Sao Paulo). The gradient from white to yellow to red indicates the amount of activity (light to heavy). Each tile in the heatmap represents five minutes of a given day, and colors are normalized by day. This was developed internally to understand why growth patterns in Tweet production vary seasonally.
We see different patterns of activity between the four cities. For example, waking/sleeping times are relatively constant throughout the year in Tokyo, but the other cities exhibit seasonal variations. We see that Japanese users’ activities are concentrated in the evening, whereas in the other cities there is more usage during the day. In Istanbul, nights get shorter during August; Sao Paulo shows a time interval during the afternoon when Tweet volume goes down, and also longer nights during the entire year compared to the other three cities.
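For readers curious how such a heatmap can be assembled, here is a minimal sketch of the binning step: each timestamp is assigned to one of 288 five-minute tiles for its day, and each day's tiles are then scaled independently. We assume here that "normalized by day" means dividing by that day's busiest tile and that timestamps are already in the city's local time; the function name `heatmap_rows` and the sample timestamps are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

TILE_MINUTES = 5
TILES_PER_DAY = 24 * 60 // TILE_MINUTES  # 288 tiles per day


def heatmap_rows(timestamps):
    """Map each date to 288 intensities in [0, 1], scaled by that day's peak."""
    counts = defaultdict(lambda: [0] * TILES_PER_DAY)
    for ts in timestamps:
        tile = (ts.hour * 60 + ts.minute) // TILE_MINUTES
        counts[ts.date()][tile] += 1
    rows = {}
    for day, tiles in counts.items():
        peak = max(tiles)  # at least 1, since a day only appears if it has a Tweet
        rows[day] = [count / peak for count in tiles]
    return rows


# Made-up local-time timestamps standing in for Tweets from one city.
sample = [
    datetime(2012, 1, 1, 8, 0),
    datetime(2012, 1, 1, 8, 3),
    datetime(2012, 1, 1, 21, 7),
    datetime(2012, 1, 2, 0, 0),
]
rows = heatmap_rows(sample)
```

Each resulting row can then be mapped onto a white-to-yellow-to-red color scale with any plotting library. Scaling per day keeps a quiet day and a busy day comparable in shape, which is what makes waking and sleeping times stand out.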
Finally, we’re also giving a keynote at the co-located workshop on Real-Time Analysis and Mining of Social Streams (RAMSS), fitting very much into the theme of our study. We’ll be reviewing many of the challenges of handling real-time data, including many of the issues described above.
Interested in real-time systems that deliver relevant information to users? Interested in data visualization and data science? We’re hiring! Join the flock!
- Jimmy Lin, Research Scientist, Analytics (@lintool)