Building a data stream to assist with COVID-19 research

By Adrian Chen
Tuesday, 16 March 2021

In the weeks surrounding the first wave of COVID-19 closures in March 2020, our teams saw tens of millions of Tweets from around the world discussing this unprecedented crisis. The situation changed daily, and each day the volume of public conversation grew larger than the day before.

We knew we couldn’t help with the lack of toilet paper on the shelves, but our product and engineering teams wanted to contribute in any way we could. So when staff product manager Adam Tornes came forward with the idea to enable researchers to study this public conversation about COVID-19 in real time, our team jumped at the chance to get started.

Our goal was to give researchers and scientists a way to study the public conversation around COVID-19 in real time. With the product we built, researchers went on to study the disease, crisis management, and emergency response, while aiming to understand the spread of misinformation and develop machine learning and data tools to support the scientific community. In this post, we'll focus on the technology behind this API and how we built it.

Identifying topical Tweets 

Our first step was to identify Tweets related to COVID-19 that should be pulled into this real-time stream. Generally, we tag a topic using a combination of techniques: machine learning, hashtag identification, a complex set of keyword matches, and which people or organizations are Tweeting about it. For example, the CDC, global health experts, and government officials are entities whose voices we wanted to include comprehensively. We cast a wide net, preferring false positives over false negatives, so that researchers can get programmatic access to as many of these Tweets as possible and do additional fine-tuning if they desire. Moreover, our API captures Tweets matching our criteria not just in English but in all our supported languages: English, Spanish, Portuguese, Japanese, Arabic, Hindi, Indonesian, and Korean.
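As a rough illustration of how several signals can be combined with a bias toward false positives, here is a minimal Python sketch. It is not the production tagger: the keyword, hashtag, and account lists are placeholders chosen for the example, the Tweet class and is_covid_related function are hypothetical names, and the machine learning signal is left out entirely.

```python
import re
from dataclasses import dataclass

# Illustrative placeholders only; the real criteria are not described in this post.
KEYWORDS = {"covid", "coronavirus", "quarantine"}
HASHTAGS = {"#covid19", "#stayhome"}
TRUSTED_ACCOUNTS = {"CDCgov", "WHO"}


@dataclass
class Tweet:
    author: str
    text: str


def is_covid_related(tweet: Tweet) -> bool:
    """Return True if any single signal matches (recall over precision)."""
    # Simplification: every Tweet from a listed account is included.
    if tweet.author in TRUSTED_ACCOUNTS:
        return True
    # Lowercase the text and split it into words and hashtags.
    tokens = set(re.findall(r"#?\w+", tweet.text.lower()))
    return bool(tokens & KEYWORDS or tokens & HASHTAGS)
```

Because any one signal is enough to tag a Tweet, the sketch over-includes rather than under-includes, which leaves room for researchers to apply their own stricter filters afterwards.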

Tweets are tagged as soon as they are created, and thus can be delivered to researchers in as little as a few seconds after Tweet creation time. These identification systems are also updated periodically in response to the evolution of the public conversation, or as we gain more insight into better ways of tagging specific subjects. For example, we might add “vaccine” now, but remove it at a later date after “vaccine” has stopped being used in a COVID-19 related context.

Existing infrastructure

What are the streaming products?

We build products that allow programmatic access to public Twitter data through the Twitter API. This means that any public Tweets you can find in the Twitter app can also be accessed through the API by integrating with our streaming products, such as the filtered stream or sampled stream endpoints. We deliver these Tweets as they are created over a streaming connection.
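To make the idea of a streaming connection concrete, the following Python sketch shows one way a client might decode such a stream. It assumes the connection delivers one JSON object per line, with blank lines sent as keep-alive signals; the iter_tweets helper, the sample data, and the field names are hypothetical, and a real client would read lines from an open HTTP response rather than from a list.

```python
import json
from typing import Iterable, Iterator


def iter_tweets(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one decoded Tweet per non-empty line of a streaming response."""
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank lines are treated as keep-alive signals and skipped.
            continue
        yield json.loads(line)


# Stand-in for the lines of an open HTTP streaming response.
sample = [
    b'{"id": "1", "text": "hello"}\r\n',
    b"\r\n",
    b'{"id": "2", "text": "world"}\r\n',
]

for tweet in iter_tweets(sample):
    print(tweet["id"], tweet["text"])
```

Writing the decoder as a generator means each Tweet can be handled as soon as its line arrives, without waiting for the connection to close.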

How did we leverage the existing architecture to build this product?

Due to the large volume of information about COVID-19, we decided to leverage our existing data pipelines, which are capable of handling gigabytes of data per second. During the peak of the COVID-19 conversation in March 2020, we saw gigabytes of Tweets per second. We recognized that this was a higher volume than we could expect the average researcher to handle on a single stream, and that the data processing power required would be immense. To solve this, we split the data into several streams so that researchers could more easily consume them in parallel.
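On the consuming side, reading several streams in parallel could look something like the Python sketch below. This is an assumption-laden illustration, not client code for the actual endpoint: each partition is modeled as a finite in-memory iterable, and consume_partition and consume_all are hypothetical helper names.

```python
import queue
import threading
from typing import Iterable, List, Tuple


def consume_partition(partition_id: int, tweets: Iterable[dict], out: queue.Queue) -> None:
    """Read one partition to the end and hand each Tweet to a shared queue."""
    for tweet in tweets:
        out.put((partition_id, tweet))


def consume_all(partitions: List[Iterable[dict]]) -> List[Tuple[int, dict]]:
    """Consume every partition on its own thread and collect the results."""
    out: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=consume_partition, args=(i, part, out))
        for i, part in enumerate(partitions)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All producer threads have finished, so the queue can be drained safely.
    results = []
    while not out.empty():
        results.append(out.get())
    return results
```

A real stream never ends, so a production consumer would process items from the queue continuously instead of waiting for the threads to finish. The point of the sketch is only that each partition gets its own reader, so no single connection has to carry the full volume.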

Our underlying infrastructure relies heavily on Kafka, which we use to pass data between our systems. By using streams to track what data has been processed, we can maintain high confidence that we aren’t dropping Tweets due to any individual service going down or being restarted.
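The general idea of tracking what has been processed can be sketched with a toy in-memory log in Python. This is not Kafka's client API and not our implementation; the Log class, the run_once function, and the group name are hypothetical, and the sketch only illustrates the commit-after-processing pattern that lets a restarted service resume without skipping records.

```python
from typing import Callable, Dict, List, Tuple


class Log:
    """Toy append-only log with one committed offset per consumer group."""

    def __init__(self) -> None:
        self.records: List[str] = []
        self.committed: Dict[str, int] = {}

    def append(self, record: str) -> None:
        self.records.append(record)

    def poll(self, group: str) -> List[Tuple[int, str]]:
        """Return (offset, record) pairs the group has not yet committed."""
        start = self.committed.get(group, 0)
        return list(enumerate(self.records[start:], start=start))

    def commit(self, group: str, offset: int) -> None:
        """Record that everything up to and including `offset` is processed."""
        self.committed[group] = offset + 1


def run_once(log: Log, group: str, handle: Callable[[str], None]) -> None:
    """Process pending records, committing only after each one succeeds."""
    for offset, record in log.poll(group):
        handle(record)  # if this raises, the offset stays uncommitted
        log.commit(group, offset)
```

If the handler fails partway through, the next call to run_once starts again from the first uncommitted record. A record may therefore be handled more than once, but it is never silently dropped.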

This closely follows the way our existing filtered and sampled stream products work, albeit with a different Tweet selection mechanism. 

Some of the most involved changes we made had to do with the way we provisioned access to the COVID-19 stream, because we had no clear-cut way to grant that access without a standard billing and enterprise configuration. To solve this issue, we had to wire together two existing systems: the enterprise API that we use to deliver Tweets at scale, and our new account access service, which would give these researchers the appropriate access to our API. Incidentally, this is something we also needed to solve in our effort to rebuild the Twitter API.

Our product follows these steps to deliver Tweets: 

  1. Identify which Tweets are related to COVID-19 by leveraging an algorithm developed by our semantic analysis team.
  2. Keep only the Tweets that meet two criteria: (1) Is the Tweet publicly available for the world to see? and (2) Is the Tweet marked as a COVID-19-related Tweet?
  3. Get the list of researchers with access from our account access service.
  4. Queue up the delivery of the Tweets into multiple streams for easier consumption by our connected researchers.
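The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the Tweet fields, the function names, and the use of Tweet ID modulo the stream count to choose a stream are all hypothetical, the tagging in step 1 is represented by a precomputed flag, and the account access service in step 3 is represented by a plain set of researcher names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class Tweet:
    id: int
    text: str
    is_public: bool
    is_covid_related: bool  # assumed to be set upstream by the tagging step


def eligible(tweets: Iterable[Tweet]) -> List[Tweet]:
    """Step 2: keep Tweets that are both public and tagged as COVID-19-related."""
    return [t for t in tweets if t.is_public and t.is_covid_related]


def partition(tweets: Iterable[Tweet], num_streams: int) -> Dict[int, List[Tweet]]:
    """Step 4: spread Tweets across streams; ID modulo is an assumed scheme."""
    streams: Dict[int, List[Tweet]] = {i: [] for i in range(num_streams)}
    for t in tweets:
        streams[t.id % num_streams].append(t)
    return streams


def deliver(
    tweets: Iterable[Tweet], researchers: Set[str], num_streams: int
) -> Dict[str, Dict[int, List[Tweet]]]:
    """Steps 2 to 4: filter, split into streams, and queue them per researcher."""
    streams = partition(eligible(tweets), num_streams)
    return {researcher: streams for researcher in researchers}
```

For example, calling deliver with four Tweets, of which one is private and one is untagged, and with num_streams=2 would queue only the two remaining Tweets, each in the stream selected by its ID.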

Since releasing the COVID-19 endpoint in April 2020, we’ve made other significant strides in better supporting academic research using Twitter data. In January 2021, we introduced a new product track tailored specifically for academic researchers. The Academic Research product track includes free access to the full history of public Twitter data, at significantly higher levels of access than were available before. Through this product track, academics will be able to build their own tagging criteria, thereby incorporating the public conversation into any research topic imaginable. Many annotations are available through the filtered stream queries to provide an easy starting point for a variety of topics. This product is the best route for academic researchers to study the public conversation around COVID-19.

Measured and anticipated future impact

Over 100 researchers and scientists from universities and labs around the world are currently using this COVID-19 stream. While many have research projects that are still underway, we have already started to hear about some published research and positive outcomes. Head over to this blog post by the Twitter Academic Research team to learn more about who these researchers are, and some of the work that they have been doing.

Summary

When we realized that COVID-19 was going to be a shaping force on the world, we rallied to make some fundamental adjustments to the way we make Twitter data available, in order to better serve the changing needs of researchers and developers. This was a massive team effort, and we worked in a rapid, agile way to make this stream available. We extended the way Twitter data can be accessed, and the insights from this process informed the way we built our first specialized Academic Research product, introduced in early 2021 to better serve the needs of the academic research community.

Our goal is to provide more access for academics so they can continue to help make the world, and Twitter, a better place through their research. To stay up to date on our upcoming releases, follow us @TwitterDev.

Acknowledgements

Building a product like this would not have been possible without the teamwork of those working on it: Matthew Dickinson, Sameer Brenn, Brent Halsey, Eric Gonzalez, Shane Hirsekorn, and Jinfull Jeng. Special thanks to Nathalia Oliveira for reviewing this blog.

Adrian Chen

@adrianbchen

Software Engineer, Developer & Enterprise Solutions