What Twitter learned from the RecSys 2020 Challenge


Recommender systems are an important part of modern social networks and e-commerce platforms. They aim to maximize user satisfaction as well as other key business objectives. At the same time, there is a lack of large-scale public social network datasets for the scientific community to use when building and benchmarking new models to tailor content to user interests. In the past year we have worked to address exactly that problem.

Twitter partnered with the RecSys conference to sponsor the 2020 challenge. We released a dataset of Tweets and user engagements collected over a period of two weeks, with 160 million public Tweets for training and 40 million public Tweets for validation and testing.

In this post, we describe the dataset and the three winning entries submitted by the Nvidia, Learner, and Wantely teams. We try to draw general conclusions about the choices that helped the winners achieve their results, notably:

  • most important features 
  • extremely fast experimentation speed for feature selection and model training 
  • adversarial validation for generalization 
  • use of content features 
  • use of decision trees over neural networks 

We hope that these findings will be useful to the wider research community and inspire future research directions in recommender systems.

For the challenge we asked the participants to predict the probability of a user engaging with any of the four interactions: Like, Reply, Retweet, and Quote Tweet. The submissions were evaluated against two metrics: relative cross entropy (RCE) with respect to a simple baseline we provided, and area under the Precision-Recall curve (PR-AUC).
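
To make the first metric concrete, here is a minimal sketch of RCE. It assumes the simple baseline is a model that always predicts the observed positive rate of the evaluated labels, and it reports the percentage improvement in cross entropy over that baseline. The function names and the toy data are illustrative, not the official evaluation code.

```python
import math

def cross_entropy(labels, preds, eps=1e-15):
    """Mean binary cross entropy (log loss) of predictions against 0/1 labels."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def relative_cross_entropy(labels, preds):
    """RCE in percent, relative to a baseline that always predicts the positive rate.

    Assumption: the baseline is the constant positive-rate predictor.
    """
    positive_rate = sum(labels) / len(labels)
    baseline = cross_entropy(labels, [positive_rate] * len(labels))
    return (1.0 - cross_entropy(labels, preds) / baseline) * 100.0

# Toy data (made up for illustration): 1 = the user engaged, 0 = no engagement.
labels = [1, 0, 0, 1, 0, 0, 0, 1]
good = [0.9, 0.1, 0.2, 0.8, 0.1, 0.3, 0.2, 0.7]
naive = [sum(labels) / len(labels)] * len(labels)

rce_good = relative_cross_entropy(labels, good)    # positive: better than the baseline
rce_naive = relative_cross_entropy(labels, naive)  # about zero: same as the baseline
```

Under this definition a positive RCE means the model beats the constant predictor, zero means it matches it, and a negative value means it is worse.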


Special attention is given to keeping the dataset in sync with the Twitter platform. The dataset reflects changes on the platform, e.g. when a Tweet is deleted, or when a user makes their profile private or deletes it altogether. Submissions are then re-evaluated and the leaderboard is updated with the recalculated metrics. For more details about the challenge and the dataset, please refer to our paper¹.


This year’s challenge was particularly competitive, with over 1000 registered users. The participants actively submitted solutions throughout the challenge and modified their team composition during the first phase of the challenge (in line with the submission guidelines). The final phase had 20 contenders, with an average team size of four members. Moreover, the teams developed 127 different methods attempting to win the challenge. The activity was high throughout the challenge and spiked in the last days when the participants refined their submissions. The final results appear in the leaderboard.

The accompanying RecSys Challenge 2020 Workshop received 12 papers, which were reviewed by the program committee. Nine of those papers were accepted. You can find the leaderboard for the final phase at this link.

First place: Nvidia (GPU Accelerated Feature Engineering and Training for Recommender Systems)  

Short summary: Nvidia’s paper² describes training XGBoost models to predict each of the interaction events. The overall focus is on creating useful features for these models. It highlights quick feature extraction and model training as key to the success of the approach. The appendix of the paper lists the 15 most useful features for each of the four models.

Quickly extracting features from the dataset and retraining is the key difference between the winner and the runners-up. Both the feature engineering pipeline and the training pipeline took less than a minute to run. In addition, target encoding (mean encoding with additive smoothing) is applied to different categorical features and to combinations of features, so that each combination is replaced by the mean of the target for that combination. The authors also create categorical features from the content of the Tweets (e.g. the two most popular words and the two least popular words). Adversarial validation for feature importance and selection is used to prevent overfitting by selecting more generalizable features. Ensembles of tree-based models are used to produce the final model.

Second place: Learner (Predicting Twitter Engagement With Deep Language Models)

Short summary: Learner’s approach³ blends deep learning with Gradient Boosted Decision Trees (GBDT). The paper focuses on the creation of different features. The authors engineered 467 features using heuristic methods. They also created text representations of the Tweets using BERT and XLM-R, using both the target Tweet text and the text of recently engaged Tweets.

The key difference between this entry and others is the use of pre-trained NLP models (BERT and XLM-R) and fine-tuning. The first stage of fine-tuning is done in an unsupervised fashion. Next, the language model is combined with other features and fine-tuned in the supervised setting. The model is an MLP with four heads (one for each engagement type). The paper also uses attention to create an embedding of the user’s past ten interactions, combining the embeddings of these interactions with the target Tweet as the key. Additionally, heuristic features are used, such as different representations of the engaging user, Tweet creator, Tweet features, and user-creator interaction features. Like other entries, this paper uses XGBoost for feature engineering and selection, and applies the Yeo-Johnson transformation to categorical features and unnormalized continuous features.
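
The attention step over a user's history can be illustrated with a small sketch. This post does not specify the exact attention form, so the example assumes plain scaled dot-product attention: each past interaction is scored by its similarity to the target Tweet embedding, the scores are turned into weights with a softmax, and the history is summarized as the weighted average. The function name, the two-dimensional embeddings, and the values are made up for illustration.

```python
import math

def attention_pool(history, target):
    """Summarize history embeddings, weighting each by similarity to the target.

    Assumption: scaled dot-product scores followed by a softmax.
    """
    dim = len(target)
    scores = [
        sum(h_j * t_j for h_j, t_j in zip(h, target)) / math.sqrt(dim)
        for h in history
    ]
    max_score = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    norm = sum(exp_scores)
    weights = [e / norm for e in exp_scores]
    pooled = [sum(w * h[j] for w, h in zip(weights, history)) for j in range(dim)]
    return pooled, weights

# Toy embeddings of three past interactions and one target Tweet.
history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
target = [1.0, 0.0]
pooled, weights = attention_pool(history, target)
```

Past interactions that resemble the target Tweet receive larger weights, so the pooled vector describes the part of the user's history that is most relevant to the Tweet being scored.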

Third place: Wantely (A Stacking Ensemble Model for Prediction of Multi-type Tweet Engagements) 

Short summary: Wantely’s submission⁴ proposes a two-stage approach to predicting Tweet engagements. The first-stage classifiers are lightweight and only use features that generalize across the different objectives (Like, Retweet, etc) and have similar training/testing accuracy. The second-stage classifiers use the output of the lightweight classifiers as features along with the objective-specific features. 

In this paper, an upstream generalizable model generates features that a downstream model consumes. By doing so, the paper argues that the downstream model for every engagement type is able to benefit from the data on all the other engagements by consuming the predictions of the common upstream model. In addition to that, the paper identifies which features are generalizable by directly assessing the distribution gap of the features between the training and testing datasets with adversarial validation, as in the Nvidia entry.

Learnings from the Competition

Many insights are shared across the submissions. We highlight the main themes below:

Useful features used in the winning models — target encoding is king.

  • Target encoding, the process of replacing a categorical variable with the mean of the target variable, makes the problem simpler.  It was used for both user and author id, thus encoding a mean engagement rate of a user.
  • Lots of feature crossing. The full list of features, with their importance for different objectives such as Retweet and Reply, is available in the appendix of the Nvidia paper².
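
Target encoding with additive smoothing can be sketched in a few lines. This is a minimal illustration rather than the winning team's GPU implementation: the function names, the `smoothing` value, and the toy data are assumptions. Each category is replaced by a blend of its own mean target and the global mean, so that rarely seen categories are pulled toward the global rate.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=20.0):
    """Return (encoding dict, global mean) using additive smoothing.

    encoded(c) = (sum of targets for c + smoothing * global_mean) / (count of c + smoothing)
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return encoding, global_mean

def apply_target_encoding(categories, encoding, global_mean):
    """Encode categories, falling back to the global mean for unseen values."""
    return [encoding.get(c, global_mean) for c in categories]

# Toy data: hypothetical user ids and whether each impression led to a Like.
user_ids = ["a", "a", "a", "b"]
liked = [1, 1, 1, 0]
encoding, global_mean = fit_target_encoding(user_ids, liked, smoothing=2.0)
encoded = apply_target_encoding(["a", "b", "unseen"], encoding, global_mean)
```

Because the categories only need to be hashable, passing tuples such as `(user_id, language)` encodes a crossed feature in the same way. In practice the encoding is usually fitted on held-out folds, so that a row's own label does not leak into its feature.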

Fast experimentation for feature selection

  • The ability to test many hypotheses rapidly has always been integral to data science competitions, and proves decisive in this challenge once again. The Nvidia team was able to run the entire pipeline on GPU. This allowed them to train a model (including feature engineering) in only two minutes and 18 seconds vs many hours on CPU. 

Cope with overfitting with adversarial validation

  • One common technique used by competitors was to build a discriminator that predicts whether an example comes from the training set or from the test/validation set. The features that this discriminator finds most important are the ones whose distribution differs most between the sets. Removing them, based on the discriminator's feature importance scores, helps the model generalize better and reduces overfitting to the training data.
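
The idea behind adversarial validation can be shown with a deliberately simplified sketch. The teams trained a discriminator model and read off its feature importances. Here that is swapped for a much cruder stand-in: each feature is scored on its own by how well its raw value separates training rows from test rows, measured with a rank-based AUC. The function names, feature names, threshold, and data are all hypothetical.

```python
def auc(scores, labels):
    """Rank-based ROC AUC (Mann-Whitney U); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def drifting_features(train_rows, test_rows, names, threshold=0.7):
    """Flag features whose values alone separate training rows from test rows."""
    is_test = [0] * len(train_rows) + [1] * len(test_rows)
    flagged = {}
    for col, name in enumerate(names):
        values = [row[col] for row in train_rows + test_rows]
        score = auc(values, is_test)
        separability = max(score, 1.0 - score)  # 0.5 means indistinguishable
        if separability >= threshold:
            flagged[name] = separability
    return flagged

# Toy rows: the first feature is stable, the second shifts between the two sets.
names = ["follower_bucket", "tweet_timestamp"]
train_rows = [[1, 10], [2, 11], [3, 12], [4, 13]]
test_rows = [[2, 20], [3, 21], [1, 22], [4, 23]]
flagged = drifting_features(train_rows, test_rows, names)
```

A feature flagged this way is a candidate for removal or re-encoding, since a model can use it to memorize the training period instead of learning patterns that carry over to the test period. A single-feature score only catches monotonic shifts, which is one reason a trained discriminator is the more common choice.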

Use of content features

  • A significant difference between this year’s dataset and those of previous years is the content features we provided. There were sophisticated uses of BERT for content features in two of the three winning papers. Deep learning for NLP has demonstrated its usefulness for recommender systems, although we believe there is more room for improvement in this area.

GBDT vs Deep Learning

  • A significant advantage of GBDT, touched on by all papers, is that there is no need to normalize individual features or figure out their scale in a decision tree model. This contributes to faster iteration speed.


In domains like computer vision and NLP, deep learning models have demonstrated impressive advances by leveraging CNNs and transformers. Based on the result of this challenge, we still do not understand what makes a good architecture for deep learning in recommender systems. We call on the research community to collectively find the best deep learning architecture for recommender systems.

We also note that while we evaluated the submissions only on model performance, our production systems have many other constraints. Latency is a big one for us: models need to score Tweets within milliseconds. The use of ensemble methods needs to be carefully examined in this setting, because the latency added by each step in the ensemble may make them too slow for our purposes.

We are grateful to all the participants and our colleagues that made this challenge possible. We believe that releasing large scale datasets will help unlock new advances in the field of recommender systems. Twitter is now more than ever committed to helping external research efforts and has recently released new API endpoints for academic researchers, to help foster further exploration and collaborations.


This blog post was co-authored by Luca Belli, Apoorv Sharma, Yuanpu Xie, Ying Xiao, Dan Shiebler, Max Hansmire, Michael Bronstein, and Wenzhe Shi from Twitter Cortex. 

The RecSys Challenge was organized by Nazareno Andrade, Walter Anelli, Amra Delic, Jessie Smith, Gabriele Sottocornola with contributions to the dataset from Luca Belli, Michael Bronstein, Alexandre Lung Yut Fong, Sofia Ira Ktena, Frank Portman, Alykhan Tejani, Yuanpu Xie, Xiao Zhu, and Wenzhe Shi.


¹ L. Belli et al. Privacy-Aware Recommender Systems Challenge on Twitter’s Home Timeline (2020) arXiv:2004.13715.

² B. Schifferer et al. GPU Accelerated Feature Engineering and Training for Recommender Systems (2020). Proc. Recommender Systems Challenge 2020. 

³ M. Volkovs et al. Predicting Twitter Engagement With Deep Language Models (2020). Proc. Recommender Systems Challenge 2020. 

⁴ S. Goda et al. A Stacking Ensemble Model for Prediction of Multi-Type Tweet Engagements (2020). Proc. Recommender Systems Challenge 2020.



Wenzhe Shi


ML Engineering Manager, Cortex


Luca Belli


Senior ML Engineer, META
