Since we enhanced our timeline more than a year ago to show you the best Tweets first, we have been working to improve the underlying algorithms in order to surface content that is even more relevant to you.
Today we are explaining how our ranking algorithm is powered by deep neural networks, leveraging the modeling capabilities and AI platform built by Cortex, one of our in-house AI teams at Twitter. In a nutshell: this means more relevant timelines now and in the future, as it opens the door for us to use more of the many innovations that the deep learning community has to offer, especially in the areas of natural language processing (NLP), conversation understanding, and media.
Before the introduction of the ranking algorithm, your timeline composition was easy to describe: all the Tweets from the people you follow since your last visit were gathered and shown in reverse-chronological order. Although the concept is simple to grasp, reliably serving this experience to the hundreds of millions of people on Twitter is an enormous infrastructural and operational challenge.
With ranking, we add an extra twist. Right after gathering all Tweets, each is scored by a relevance model. The model’s score predicts how interesting and engaging a Tweet would be specifically to you. The highest-scoring Tweets are then shown at the top of your timeline, with the remainder shown directly below. Depending on the number of candidate Tweets available for you and the amount of time since your last visit, we may also choose to show you a dedicated “In case you missed it” module. This module is meant to contain only a small handful of the very most relevant Tweets ordered by their relevance score, whereas the ranked timeline contains relevant Tweets ordered by time. The intent is to let you see the best Tweets at a glance before delving into the lengthier time-ordered sections.
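As a rough sketch of that flow (the Tweet structure, the model.predict call, and the top_k cutoff below are all invented for illustration, not Twitter’s actual serving code):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Tweet:
    id: int
    timestamp: float      # seconds since epoch
    score: float = 0.0    # filled in by the relevance model

def rank_timeline(candidates: List[Tweet], model, top_k: int = 10) -> List[Tweet]:
    """Score every candidate Tweet, then surface the best ones first."""
    for tweet in candidates:
        tweet.score = model.predict(tweet)  # hypothetical model API

    # The most relevant Tweets go to the top, ordered by score.
    by_score = sorted(candidates, key=lambda t: t.score, reverse=True)
    top = by_score[:top_k]

    # The remainder is shown below, back in reverse-chronological order.
    rest = sorted(by_score[top_k:], key=lambda t: t.timestamp, reverse=True)
    return top + rest
```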
In order to predict whether a particular Tweet would be engaging to you, our models consider characteristics (or features) of:

- The Tweet itself: its recency, presence of media (images or video), and total interactions (for example, the number of Retweets or likes)
- The Tweet’s author: your past interactions with this author, the strength of your connection to them, and the origin of your relationship
- You: the Tweets you found engaging in the past, and how often and how heavily you use Twitter
Our list of considered features and their varied interactions keeps growing, informing our models of ever more nuanced behavior patterns.
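To make the shape of the data concrete, a single training record might bundle features from each of those categories together with an engagement label; the field names below are purely illustrative, not our production schema:

```python
# One illustrative training record; every name and value is hypothetical.
example_record = {
    # The Tweet itself
    "tweet_age_seconds": 542.0,
    "has_media": 1.0,
    "total_interactions": 87.0,
    # The Tweet's author
    "past_engagements_with_author": 12.0,
    "connection_strength": 0.8,
    # You, the viewer
    "viewer_daily_sessions": 6.0,
    "viewer_engagement_rate": 0.15,
    # Label: did the viewer engage with this Tweet?
    "engaged": 1,
}
```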
Every time you open the app or refresh your timeline, we score every Tweet from the people you follow since your last visit in order to determine which Tweets to show at the top. This scoring step imposes an even greater computational demand on our timeline-serving infrastructure, as we now score thousands of Tweets per second to satisfy all of the timeline requests. Though richer models can lead to better Tweet selection, for a real-time company like Twitter, speed is just as important as quality. The unique challenge is to perform scoring quickly enough to serve Tweets back instantly to the people viewing their timelines, while keeping the models powerful enough to deliver the best possible quality and leave room for future improvements.
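A common way to stay within such a latency budget is to score candidates in batches, so the fixed per-call overhead of the model is paid once per request rather than once per Tweet. The sketch below shows the basic pattern with PyTorch and a stand-in two-layer scorer; it illustrates the idea, not Twitter’s serving stack:

```python
import time
import torch

# Stand-in relevance model: 50 input features -> 1 score per Tweet.
model = torch.nn.Sequential(
    torch.nn.Linear(50, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
model.eval()

candidates = torch.randn(500, 50)  # feature vectors for 500 candidate Tweets

start = time.perf_counter()
with torch.no_grad():              # inference only: skip gradient bookkeeping
    scores = model(candidates).squeeze(1)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"scored {candidates.shape[0]} candidates in {elapsed_ms:.1f} ms")
```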
As you can see, choosing the right approach to building and running prediction models can have profound consequences on the experience of every individual using Twitter.
As we concluded in the previous section, prediction models have to meet many requirements before they can be run in production at Twitter’s scale, chief among them prediction quality, scoring speed, and the cost of running and maintaining the model.
We measure a model’s quality in two ways. First, we evaluate the model using a well-defined accuracy metric computed during model training. This measure tells us how well the model performs its task, which is to give engaging Tweets a high score. While final model accuracy is a good early indicator, it cannot by itself reliably predict how people using Twitter will react to seeing the Tweets picked out by that model.
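For a binary engagement label, one standard offline metric of this kind is ROC AUC, which measures how often an engaging Tweet outscores a non-engaging one. A minimal example with scikit-learn, using made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

# 1 = the viewer engaged with the Tweet, 0 = they did not (toy data).
labels = [1, 0, 0, 1, 1, 0, 1, 0]
# Relevance scores the model assigned to the same Tweets.
scores = [0.91, 0.20, 0.35, 0.80, 0.62, 0.41, 0.75, 0.10]

# 1.0 means engaging Tweets always outscore non-engaging ones;
# 0.5 is no better than chance.
print(f"offline AUC: {roc_auc_score(labels, scores):.3f}")
```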
Impact on people using Twitter is typically measured by running one or more A/B tests and comparing the results between experiment buckets. The set of metrics we use here usually relate more directly to usage and enjoyment of Twitter. For instance, we may track the number of engagements per user, or the total amount of time individuals spend on Twitter. Once an A/B test is finished, we can not only conclude whether a new model led to a more engaging and enjoyable experience, but also accurately measure the magnitude of the improvement. In large scale systems such as timeline ranking at Twitter, a small improvement in model quality can have a big impact on how people experience the product.
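As a highly simplified illustration of that comparison, one could test whether per-user engagement differs between the two buckets with a two-sample t-test; the data below is synthetic, and a real analysis involves far more care (multiple metrics, novelty effects, variance reduction, and so on):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-user engagement counts for each experiment bucket.
control = rng.poisson(lam=5.0, size=10_000)    # current model
treatment = rng.poisson(lam=5.2, size=10_000)  # candidate model

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
lift = treatment.mean() / control.mean() - 1.0
print(f"lift: {lift:+.1%}, p-value: {p_value:.4f}")
```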
Finally, even if the desired real-time speed could be achieved for a given model quality, launching that model would be subject to the same trade-off analysis as any new feature. We would want to know the impact of the model and weigh it against any increase in the cost of running the model. The added cost can come from higher hardware utilization or from more complicated operations and support.
Besides the prediction models themselves, a similar set of requirements applies to the machine learning framework, that is, the set of tools that allows one to define, train, evaluate, and launch a prediction model. Specifically, we pay attention to a framework’s ease of use, scalability, and extensibility.
Note that at the beginning of any effort that relies on machine learning, having the best model (regardless of how it was generated) often outweighs all other considerations. After all, we first want to prove that using prediction models at all is useful.
However, as the prediction pipeline matures, the ML framework’s ease of use, scalability, and extensibility become much more important. A brittle and complicated model that only a few engineers understand or can extend is a bad long-term bet, even if it has a slight edge in performance. As more work goes into data mining, feature engineering, and rapid experimentation, a system’s core engineering characteristics become paramount. A stable and flexible framework is more likely to lead to repeatable performance gains. And given the impressive number of new algorithms and model architectures coming out of the AI community, betting on a platform that natively supports deep learning and complex computational graphs is key to capturing the benefits of that work.
Thanks to early results on image and language understanding tasks, deep learning became a must-have for many tech companies. Large research teams were built from the ground up, and many ambitious projects were launched using deep learning in various contexts.
As a consequence, many new model architectures were devised to solve domain-specific problems, and on a growing number of tasks the gap between human and algorithmic performance narrowed. One central reason for this explosion and diversification is that deep learning is intrinsically modular. Deep learning modules can be composed in various ways (stacked, concatenated, etc.) to form a computational graph. The parameters of this graph can then be learned, typically using back-propagation and SGD (Stochastic Gradient Descent) on mini-batches.
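In code, that composability is very direct. Here is a minimal PyTorch sketch (PyTorch is one of the libraries discussed below) of stacking modules into a graph and learning its parameters with back-propagation and mini-batch SGD; the architecture and the data are placeholders:

```python
import torch
from torch import nn

# Compose simple modules into a computational graph.
model = nn.Sequential(
    nn.Linear(20, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
loss_fn = nn.BCEWithLogitsLoss()  # binary "engaged or not" target
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):                     # one mini-batch per step
    x = torch.randn(64, 20)                 # 64 records, 20 features each
    y = (x[:, 0] > 0).float().unsqueeze(1)  # toy target
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                         # back-propagation
    optimizer.step()                        # SGD parameter update
```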
The “low-level” modules used as building blocks here can be anything, as long as the implementation provides a way to compute the output from the input along with the necessary gradients. In fact, the most recent libraries in the field (torch-autograd, PyTorch, TensorFlow) go as far as providing this implementation for the most basic operations one can execute on chunks of data, typically letting users specify the algorithm in their preferred tensor-manipulation package and trusting the library to build the computational graph itself. torch-autograd and PyTorch take this a step further: the computational graph can be dynamic and change from one mini-batch to the next.
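What “dynamic” means in practice is that the forward pass is ordinary code, so the graph is rebuilt on every call and can branch on the data itself. A contrived PyTorch illustration:

```python
import torch
from torch import nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.small = nn.Linear(10, 1)
        self.deep = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))

    def forward(self, x):
        # Plain Python control flow: a different subgraph can run
        # (and be differentiated) for each mini-batch.
        if x.abs().mean() > 1.0:
            return self.deep(x)
        return self.small(x)

net = DynamicNet()
out = net(torch.randn(8, 10))  # the branch taken depends on the batch
```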
Beyond the impressive modeling power of these computational graphs, the techniques used to train them are appealing for multiple reasons, especially because training is scalable: since these models learn from mini-batches of data, the total dataset size can be arbitrarily large.
Let’s get back to the case at hand. Tweet ranking lives in a different domain from the one most deep learning research usually focuses on, mostly because the data is inherently sparse. For multiple reasons, including feature availability and latency requirements, the presence of a feature cannot be guaranteed for every data record going through the model.
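Concretely, a record may arrive with an arbitrary subset of its features present. One simple way to cope, sketched below with invented feature names, is to densify each record with a default value plus a presence indicator, so the model can tell “missing” apart from a genuine zero:

```python
FEATURES = ["tweet_age", "has_media", "author_affinity", "viewer_activity"]

def densify(record: dict) -> list:
    """Turn a sparse record into a fixed-length feature vector."""
    vec = []
    for name in FEATURES:
        vec.append(record.get(name, 0.0))           # default for missing values
        vec.append(1.0 if name in record else 0.0)  # presence indicator
    return vec

# Two records with different subsets of features available.
print(densify({"tweet_age": 542.0, "has_media": 1.0}))
print(densify({"author_affinity": 0.8}))
```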
Such problems are usually solved with other types of algorithms, such as decision trees and logistic regression combined with feature crossing and discretization. This is, in fact, what our timeline ranking effort used originally.
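A toy version of that classical recipe (discretize continuous features, cross them, and feed the result to a logistic regression) might look like the following with scikit-learn; the data and hyperparameters are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))           # 4 continuous features
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # target depends on an interaction

clf = make_pipeline(
    KBinsDiscretizer(n_bins=8, encode="onehot"),          # discretization
    PolynomialFeatures(degree=2, interaction_only=True),  # feature crossing
    LogisticRegression(max_iter=1000),
)
clf.fit(X, y)
print(f"train accuracy: {clf.score(X, y):.2f}")
```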
For all the above reasons, we were convinced that deep learning could do better. However, in order to use deep learning in production, we had to make sure that the results were at least as good, and that the models and the training procedure were comparably fast.
The Cortex team, which builds Twitter’s deep learning platform, made a series of adjustments and improvements to the platform in order to satisfy these constraints.
All the hard work of building out and tuning a complete deep learning platform has begun to pay off. On the timeline ranking task, deep learning models achieved significant gains in offline accuracy evaluations. These gains persisted throughout the regular model lifecycle, including the introduction of new features and the extension of the model to predict new types of engagement. This proved to us that the deep learning approach is stable and that it generalizes well. More importantly, online experiments also showed significant increases in metrics such as Tweet engagement and time spent on the platform. And as we’ve shared in previous earnings reports, the updated timeline has in part driven increases in both audience and engagement on Twitter.
As we’ve mentioned, another important area of impact to evaluate is the end-to-end framework experience. The ultimate vision was a unified, flexible, and fast framework that allows easy composition of the many available deep learning techniques and modules as well as established ML methods. It is precisely this fluidity that allowed rapid experimentation on the timeline ranking task and ultimately led to models of superior quality.
This vision continues to pay off as more and more teams at Twitter are incorporating deep learning into their modeling stack.
Using deep learning as the central modeling component in timeline ranking already gives great results in a production setting. But the other main reason Twitter made this change is to open the door to further improvements. Over the last few years, deep learning and related AI research have produced an unprecedented, and ongoing, burst of new ideas and algorithms. We believe it is critical to let our ML-powered products benefit from the full range of what is out there, and we can do so by using an extensible platform that natively supports deep learning.
In the long run, this should help us better understand every Tweet and every interaction on Twitter, so we can be more relevant to everyone, in real time.
We thank Twitter’s leadership for supporting the agenda of a unified ML/DL platform.
We also thank the Cortex and timelines teams for their joint work to bring the benefits of deep learning to everyone using Twitter. Special thanks to: