We recently launched a new trends experience on mobile and web. Certain users will now see additional context with their trends: a short description and at times an accompanying image. While building this new product experience, we also switched the entire trends backend system to a brand new platform. This is the largest engineering undertaking of the trends system since 2008.
Every second on Twitter, thousands of Tweets are sent. Since 2008, trends have been the one-stop window into the Twitter universe, providing users with a great tool to keep up with breaking news, entertainment and social movements.
In last week’s release, we built new mechanisms that enrich trends with additional information to simplify their consumption. In particular, we introduced an online news clustering service that detects and groups breaking news on Twitter. We also replaced the legacy system for detecting text terms that are irregular in volume with a generic, distributed system for identifying anomalies in categorized streams of content. The new system can detect trending terms, URLs, users, videos, images or any application-specific entity of interest.
Adding context to trends
Until recently, trends have only been presented as a simple list of phrases or hashtags, occasionally leaving our users puzzled as to why something like “Alps,” “Top Gear” or “East Village” is trending. Learning more required clicking on the trend and going through related Tweets. With this update, Twitter now algorithmically provides trend context for certain tailored trends users. Not only does this improve usability, but according to our experimental data, it also motivates users to engage even more.
Context for the new trends experience may include a short description of why something is trending and, at times, an accompanying image.
When users click on a trend, the Tweets that provided the image or description source are shown at the top of the corresponding timeline. This credits the original source of the context and ensures a consistent browsing experience for the user. The trend context is recomputed and refreshed every few minutes, providing users with a summary of the current state of affairs.
News clustering system
Surfacing high-quality, relevant and real-time contextual information can be challenging, given the massive amount of Tweets we process. Currently, one major source for descriptions and images is news URLs shared on Twitter.
The news clustering system detects breaking news in real time, based on engagement signals from users. It groups similar news stories into clusters along different news verticals and surfaces real-time conversations and content like top images and Tweets for each cluster. The following diagram provides details about components of this system.
The news clusterer consumes streams of Tweets containing URLs from a set of news domains. Twitter’s URL crawler is then used to fetch the content from the links embedded in each Tweet. The clusterer then builds feature vectors out of crawled article content. Using these feature vectors, an online clustering algorithm is employed to cluster related stories together. The resulting clusters are ranked using various criteria including recency and size. Each cluster maintains a ranked list of metadata, such as top images, top Tweets, top articles and keywords. The ranked clusters are persisted to Manhattan periodically for serving purposes.
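The greedy online-clustering step above can be sketched as follows. This is a minimal illustration, not the production algorithm: it assumes articles arrive as bags of terms, uses plain term-frequency vectors with cosine similarity, and a single similarity threshold to decide whether a story joins an existing cluster or starts a new one.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class OnlineClusterer:
    """Greedy online clustering: assign each incoming article to the most
    similar existing cluster, or start a new cluster if no cluster is
    similar enough."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.clusters = []  # each: {"centroid": Counter, "articles": [...]}

    def add(self, article_terms):
        """Cluster one article (a list of terms); return its cluster id."""
        vec = Counter(article_terms)
        best_idx, best_sim = -1, 0.0
        for i, c in enumerate(self.clusters):
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx >= 0 and best_sim >= self.threshold:
            cluster = self.clusters[best_idx]
            cluster["centroid"].update(vec)  # fold article into the centroid
            cluster["articles"].append(article_terms)
            return best_idx
        self.clusters.append({"centroid": vec, "articles": [article_terms]})
        return len(self.clusters) - 1
```

In practice the feature vectors would be weighted (e.g., TF-IDF over crawled article text), and clusters would also carry the ranked metadata described above.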
The news clustering service polls for updated clusters from Manhattan. An inverted index is built from keywords computed by the clusterer to corresponding clusters. Given a query, this service returns matching clusters, or the top-n ranked clusters available, along with their metadata, which is used to add context to trends.
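A keyword-to-cluster inverted index of this kind can be sketched as below. The cluster layout and the `rank` field are illustrative assumptions; the production service serves richer metadata per cluster.

```python
from collections import defaultdict

def build_inverted_index(clusters):
    """Map each keyword to the ids of the clusters that contain it."""
    index = defaultdict(set)
    for cid, info in clusters.items():
        for kw in info["keywords"]:
            index[kw].add(cid)
    return index

def match(index, clusters, query_terms, n=3):
    """Return up to n cluster ids matching any query term, best-ranked first."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term, set())
    return sorted(hits, key=lambda cid: clusters[cid]["rank"])[:n]
```

With no query, the same ranking can be used to return the top-n clusters overall.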
In addition to making trends more self-explanatory, we have also replaced the entire underlying trends computation system with a new real-time distributed solution. Since the old system ran on a single JVM, it could only process a small window of Tweets at a time, and it lacked a stable recovery mechanism. The new system is built on a scalable, distributed Storm architecture for streaming MapReduce. It maintains state in both memcached and Manhattan, and Tweet input passes through a durable Kafka queue for proper recovery on restarts.
The new system consists of two major components. The first is trends detection: built on top of Summingbird, it is responsible for processing Firehose data, detecting anomalies and surfacing trend candidates. The other is trends postprocessing, which selects the best trends and decorates them with relevant context data.
Based on Firehose Tweets input, a trends detection job computes trending entities in domains related to languages, geo locations and interesting topics. As shown in the diagram below, it has the following main phases:
Data preparation includes filtering and throttling. Basic filtering removes replies and Tweets with low text quality or sensitive content. Anti-spam filtering takes advantage of real-time spam signals available from BotMaker. Throttling removes similar Tweets and ensures that the contribution to a trend from any single user is limited.
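The throttling step can be sketched as follows. This is a simplified illustration: near-duplicate detection here is approximated by an exact match on normalized text, whereas a production system would use similarity-based deduplication, and the BotMaker spam signals and text-quality scoring are omitted entirely.

```python
from collections import defaultdict

class Throttler:
    """Cap each user's contribution per trending entity and drop
    near-duplicate Tweets (approximated by exact normalized-text match)."""

    def __init__(self, max_per_user=1):
        self.max_per_user = max_per_user
        self.per_user = defaultdict(int)  # (user_id, entity) -> count
        self.seen_text = set()

    def accept(self, user_id, entity, text):
        """Return True if this Tweet may contribute to the entity's count."""
        key = " ".join(text.lower().split())  # normalize case and whitespace
        if key in self.seen_text:
            return False  # near-duplicate of an earlier Tweet
        if self.per_user[(user_id, entity)] >= self.max_per_user:
            return False  # this user already contributed to this entity
        self.seen_text.add(key)
        self.per_user[(user_id, entity)] += 1
        return True
```

Limiting per-user contribution keeps a single prolific account from pushing an entity into the trending set on its own.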
After filtering and throttling, the trending algorithm decides which domain-entity pairs are trending. For this, domain-entity pairs are extracted along with related metadata, and then aggregated into counter objects. Additional pair attributes like entity co-occurrence and top URLs are collected and persisted separately; these are later used for scoring and postprocessing.

The scorer computes a score for each entity-domain pair based on the main counter objects and their associated attribute counters. The tracker then ranks these pairs and saves the top-ranked results with scores onto Manhattan. These results are trend candidates ready for postprocessing and human evaluation.
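One simple way to score an entity-domain pair against its history is sketched below. The exact scoring function is not described in this post; this sketch assumes a z-score-like comparison of the current window's count against a historical baseline, with smoothing so that small baselines do not explode the score. The production scorer also folds in the attribute counters mentioned above.

```python
import math

def anomaly_score(current_count, baseline_counts):
    """How unusual is the current window's count versus recent history?
    A z-score-like ratio; +1 in the denominator smooths tiny baselines."""
    if not baseline_counts:
        return 0.0
    mean = sum(baseline_counts) / len(baseline_counts)
    var = sum((c - mean) ** 2 for c in baseline_counts) / len(baseline_counts)
    std = math.sqrt(var)
    return (current_count - mean) / (std + 1.0)

def top_candidates(pair_counts, baselines, k=10):
    """Rank (entity, domain) pairs by anomaly score; keep the top k."""
    scored = [
        (pair, anomaly_score(count, baselines.get(pair, [])))
        for pair, count in pair_counts.items()
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]
```

A pair whose count spikes far above its baseline ("alps" after a breaking story) outranks one that merely holds steady at a high volume.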
The trends postprocessor has the following main functions: scanning available domains, expanding and backfilling sparse domains, folding similar trending entities together, and decorating entities with contextual metadata.
The following diagram shows how the postprocessor works:
The scanner periodically loads all available domains and initiates the post-processing operation for each domain.
Depending on the granularity, a domain may be expanded to form a proper sequence in ascending order of scope. For example, a city level domain [San Francisco] will be expanded to List[San Francisco, California, USA, en] that contains the full domain hierarchy, with language en as the most general one.
For domains without sufficient organic trending entities, a backfill process supplements them with data from their ancestor domains after domain expansion.
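The expansion and backfill steps can be sketched together as below. The parent links shown are a hypothetical fragment of the hierarchy; the real domain tree is much richer.

```python
# Hypothetical parent links; the production hierarchy is far larger.
PARENT = {"San Francisco": "California", "California": "USA", "USA": "en"}

def expand(domain):
    """Walk from a domain up to its most general ancestor (the language)."""
    chain = [domain]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def backfill(domain, trends_by_domain, min_trends=5):
    """Top up a sparse domain with trends from its ancestors, in order
    of ascending scope, until it has at least min_trends entries."""
    result = list(trends_by_domain.get(domain, []))
    for ancestor in expand(domain)[1:]:
        if len(result) >= min_trends:
            break
        for t in trends_by_domain.get(ancestor, []):
            if t not in result:
                result.append(t)
                if len(result) >= min_trends:
                    break
    return result[:min_trends]
```

Because ancestors are visited nearest-first, a sparse city still shows its own organic trends ahead of state- or country-level fillers.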
The folding process is responsible for combining textually different, semantically similar trending entities into a single cluster and selecting the best representative to display to the end user.
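A minimal version of folding might look like the following. It assumes equivalence can be detected by a simple normalization (strip the leading '#', lowercase, drop spaces) and picks the highest-volume surface form as the representative; production folding also has to handle misspellings, synonyms and other semantic near-duplicates.

```python
from collections import defaultdict

def normalize(entity):
    """Canonical key for textually different but equivalent entities."""
    return entity.lstrip("#").lower().replace(" ", "")

def fold(entity_volumes):
    """Group equivalent trending entities and pick the highest-volume
    surface form as the representative shown to the user."""
    groups = defaultdict(list)
    for entity, volume in entity_volumes.items():
        groups[normalize(entity)].append((entity, volume))
    folded = {}
    for members in groups.values():
        rep = max(members, key=lambda m: m[1])[0]  # best surface form
        folded[rep] = sum(v for _, v in members)   # combined volume
    return folded
```

Folding both declutters the trends list and makes the combined volume, rather than any single spelling's volume, drive the ranking.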
The metadata fetcher retrieves data from multiple sources, including the search-blender and the news clustering service described earlier, to decorate each trending entity with context information. These decorated entities are then persisted in batch for the trends serving layer to pick up.
Looking ahead, we are working hard to improve the quality of trends in multiple ways. Stay tuned!

The following people contributed to these updates: Alex Cebrian, Amit Shukla, Brian Larson, Chang Su, Dumitru Daniliuc, Eric Rosenberg, Fabio Resende, Fei Ma, Gabor Cselle, Gilad Mishne, Jay Han, Jerry Marino, Jess Myra, Jingwei Wu, Jinsong Lin, Justin Trobec, Keh-Li Sheng, Kevin Zhao, Kris Merrill, Maer Melo, Mike Cvet, Nipoon Malhotra, Royce Cheng-Yue, Stanislav Nikolov, Suchit Agarwal, Tal Stramer, Todd Jackson, Veljko Skarich, Venu Kasturi, Zhenghua Li, Zijiao Liu.