The 2021 RecSys Challenge Dataset: Fairness is not optional

@article{Belli2021The2R,
  title={The 2021 RecSys Challenge Dataset: Fairness is not optional},
  author={Luca Belli and Alykhan Tejani and Frank Portman and Alexandre Lung-Yut-Fong and Ben Chamberlain and Yuanpu Xie and Kristian Lum and Jonathan J. Hunt and Michael Bronstein and Vito Walter Anelli and Saikishore Kalloori and Bruce Ferwerda and Wenzhe Shi},
  journal={RecSysChallenge '21: Proceedings of the Recommender Systems Challenge 2021},
  year={2021}
}
After the success of the RecSys 2020 Challenge, we describe a novel and larger dataset that was released in conjunction with the ACM RecSys Challenge 2021. This year's dataset is not only bigger (~1B data points, a five-fold increase), but for the first time it takes into consideration fairness aspects of the challenge. Unlike many static datasets, a lot of effort went into making sure that the dataset was synced with the Twitter platform: if a user deleted their content, the same content would…
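The sync mechanism the abstract describes amounts to replaying platform deletion events against a local copy of the data. A minimal sketch of that idea, assuming a hypothetical Parquet dump with a tweet_id column and a newline-delimited feed of deleted IDs (both names are illustrative, not the challenge's actual schema):

import pandas as pd

def apply_deletions(dataset_path: str, deleted_ids_path: str, out_path: str) -> None:
    """Drop rows whose tweet was deleted on the platform.

    Assumption: the dataset is a Parquet file with a `tweet_id` column and
    the deletion feed is a newline-delimited list of IDs. The real challenge
    distributed its own compliance updates; this is only a sketch.
    """
    df = pd.read_parquet(dataset_path)
    with open(deleted_ids_path) as f:
        deleted = {line.strip() for line in f if line.strip()}
    df[~df["tweet_id"].isin(deleted)].to_parquet(out_path, index=False)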

Citations

Lightweight and Scalable Model for Tweet Engagements Predictions in a Resource-constrained Environment
TLDR
This paper provides an overview of the approach used by team Trial&Error for the ACM RecSys Challenge 2021; the final model, an optimized LightGBM, allowed the team to reach the 4th position on the final leaderboard and to rank 1st among the academic teams.
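As a concrete reference point for what such a single-model approach looks like, here is a minimal LightGBM binary classifier on toy data (the features, target, and hyperparameters are illustrative stand-ins, not the team's engineered pipeline):

import lightgbm as lgb
import numpy as np

# Toy stand-in for engineered tweet/user features and one binary
# engagement target (e.g. "like"); the real features are the team's own.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + rng.normal(size=10_000) > 0).astype(int)

train = lgb.Dataset(X, label=y)
params = {
    "objective": "binary",        # one model per engagement type
    "metric": "binary_logloss",
    "learning_rate": 0.1,         # illustrative, not the team's tuned value
    "num_leaves": 63,
}
model = lgb.train(params, train, num_boost_round=200)
probs = model.predict(X[:5])      # predicted engagement probabilities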
Addressing the cold-start problem with a two-branch architecture for fair tweet recommendation
TLDR
The users' popularity, as well as the first and last words of the tweet text, turned out to be the best features for the method; it obtained 5th place in the final ranking and won the 2nd prize in the academic category.
Synerise at RecSys 2021: Twitter user engagement prediction with a fast neural model
In this paper we present our 2nd place solution to the ACM RecSys 2021 Challenge organized by Twitter. The challenge aims to predict user engagement for a set of tweets, offering an exceptionally large…
Team JKU-AIWarriors in the ACM Recommender Systems Challenge 2021: Lightweight XGBoost Recommendation Approach Leveraging User Features
TLDR
The proposed system, while being lightweight and computationally efficient, still performs reasonably on the task of predicting user interactions with microblogs, yielding a mean relative cross entropy of 13.12 and a mean average precision of 28.72% over all four prediction targets.
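Relative cross entropy (RCE) and average precision were the challenge's two official metrics, computed per engagement type. A sketch of how they are typically computed; note that the challenge derived its baseline from the training-set positive rate, while using the evaluation set's own rate here is a simplification:

import numpy as np
from sklearn.metrics import log_loss, average_precision_score

def rce(y_true, y_pred):
    """Relative cross entropy: % improvement in log loss over a
    constant-rate baseline. Here the baseline is the positive rate of
    y_true itself, a simplification of the challenge's definition."""
    ce_model = log_loss(y_true, y_pred)
    ce_naive = log_loss(y_true, np.full(len(y_true), np.mean(y_true)))
    return (1.0 - ce_model / ce_naive) * 100.0

y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1])
y_pred = np.array([0.1, 0.8, 0.3, 0.2, 0.7, 0.9, 0.2, 0.6])
print(rce(y_true, y_pred), average_precision_score(y_true, y_pred))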
User Engagement Modeling with Deep Learning and Language Models
TLDR
This paper uses a hybrid pipeline and leverages gradient boosting, neural network classifiers and multilingual language models to maximize performance in the 2021 ACM RecSys Challenge, achieving strong results and placing 3rd on the public leaderboard.
GPU Accelerated Boosted Trees and Deep Neural Networks for Better Recommender Systems
TLDR
This paper presents the 1st place solution to the ACM RecSys 2021 Challenge, an ensemble of stacked models using in total 5 XGBoost models and 3 neural networks, and analyzes the benefits of a GPU-accelerated production environment.
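Stacking here means feeding base-model predictions into a second-level learner. A minimal sketch with scikit-learn's StackingClassifier and XGBoost, on toy data with two base models rather than the winners' eight (and none of their GPU pipeline):

import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 10))
y = (X[:, 0] - X[:, 1] + rng.normal(size=2_000) > 0).astype(int)

# Out-of-fold base predictions become features for the final estimator.
stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=100, eval_metric="logloss")),
        ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=300)),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=3,
)
stack.fit(X, y)
print(stack.predict_proba(X[:3])[:, 1])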
Lessons from the AdKDD’21 Privacy-Preserving ML Challenge
TLDR
A key finding is that learning models on large, aggregated data in the presence of a small set of unaggregated data points can be surprisingly efficient and cheap; still, the industry needs either alternate designs for private data sharing or a breakthrough in learning with aggregated data only to keep ad relevance at a reasonable level.

References

SHOWING 1-10 OF 15 REFERENCES
Quantifying the Impact of User Attention on Fair Group Representation in Ranked Lists
TLDR
This work introduces a novel metric for auditing group fairness in ranked lists, and shows that determining fairness of a ranked output necessitates knowledge (or a model) of the end-users of the particular service.
Privacy-Aware Recommender Systems Challenge on Twitter's Home Timeline.
TLDR
This paper describes the key aspects of the RecSys 2020 Challenge, which was organized by ACM RecSys in partnership with Twitter using this dataset, and touches on the key challenges faced by researchers and professionals striving to predict user engagements.
How To Break Anonymity of the Netflix Prize Dataset
TLDR
This work presents a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on, and demonstrates that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber's record in the dataset.
Fairness of Exposure in Rankings
TLDR
This work proposes a conceptual and computational framework that allows the formulation of fairness constraints on rankings in terms of exposure allocation, and develops efficient algorithms for finding rankings that maximize the utility for the user while provably satisfying a specifiable notion of fairness.
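The exposure framework this abstract refers to weights each rank position by the attention it receives. A minimal sketch of that bookkeeping under a standard logarithmic position-bias model (both the bias model and the group labels are illustrative assumptions):

import numpy as np

def exposure(rank):
    """Position-bias weight: rank 1 is most visible. The log model is a
    common assumption, not the only one the framework permits."""
    return 1.0 / np.log2(rank + 1)

def group_exposure(groups):
    """Average exposure per group for a ranked list of group labels."""
    totals = {}
    for rank, g in enumerate(groups, start=1):
        totals.setdefault(g, []).append(exposure(rank))
    return {g: float(np.mean(v)) for g, v in totals.items()}

# Group 'a' holds the top positions, so its average exposure dominates.
print(group_exposure(["a", "a", "b", "a", "b", "b"]))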
Equity of Attention: Amortizing Individual Fairness in Rankings
TLDR
The challenge of achieving amortized individual fairness subject to constraints on ranking quality is formulated as an online optimization problem and solved as an integer linear program, and it is demonstrated that the method can improve individual fairness while retaining high ranking quality.
Designing Fair Ranking Schemes
TLDR
This paper develops a system that helps users choose criterion weights that lead to greater fairness, and shows how to efficiently identify regions in the space of such weights that satisfy a broad range of fairness criteria.
Fairness and Abstraction in Sociotechnical Systems
TLDR
This paper outlines this mismatch with five "traps" that fair-ML work can fall into even as it attempts to be more context-aware in comparison to traditional data science and suggests ways in which technical designers can mitigate the traps through a refocusing of design in terms of process rather than solutions.
The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning
TLDR
It is argued that it is often preferable to treat similarly risky people similarly, based on the most statistically accurate estimates of risk that one can produce, rather than requiring that algorithms satisfy popular mathematical formalizations of fairness.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
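As a concrete illustration of the "one additional output layer" idea, here is a minimal fine-tuning sketch using the Hugging Face transformers API (the checkpoint name, label count, and toy inputs are illustrative):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # classification head on top of BERT
)

batch = tokenizer(["great tweet", "spam"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss  # fine-tuning loss; backprop as usual
loss.backward()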
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
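The algorithm itself is compact enough to sketch directly. A plain-NumPy version of a single Adam step with the paper's default hyperparameters:

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba): exponential moving averages of the
    gradient and its square, with bias correction for zero initialization."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)          # bias-corrected first moment
    v_hat = v / (1 - b2**t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, v = adam_step(theta, np.array([0.1, -0.2, 0.3]), m, v, t=1)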