Monte Carlo Estimates of Evaluation Metric Error and Bias

Mucun Tian and Michael D. Ekstrand
Traditional offline evaluations of recommender systems apply metrics from machine learning and information retrieval in settings where their underlying assumptions no longer hold. This results in significant error and bias in measures of top-N recommendation performance, such as precision, recall, and nDCG. Several of the specific causes of these errors, including popularity bias and misclassified decoy items, are well-explored in the existing literature. In this paper we survey a range of work… 
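The metrics named in the abstract can be sketched concretely. The following is a minimal illustration (not the paper's code) of how precision@N, recall@N, and nDCG@N are conventionally computed from held-out ratings; note the standard assumption, questioned by the paper, that any item absent from the held-out relevant set is non-relevant.

```python
import math

def precision_recall_ndcg(recs, relevant, n=10):
    """Compute precision@N, recall@N, and nDCG@N for one user.

    `recs` is a ranked list of recommended items; `relevant` is the set of
    held-out items the user is known to like. Items outside `relevant` are
    treated as non-relevant -- the conventional offline assumption that is
    a source of error and bias, since unrated items may in fact be relevant.
    """
    top_n = recs[:n]
    hits = [1 if item in relevant else 0 for item in top_n]
    precision = sum(hits) / n
    recall = sum(hits) / len(relevant) if relevant else 0.0
    # DCG discounts each hit by the log of its rank (ranks are 0-indexed here).
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    # Ideal DCG: all relevant items packed at the top of the list.
    idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), n)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg
```

For example, recommending ["a", "b", "c"] when {"a", "c"} are the held-out relevant items gives precision@3 = 2/3 and recall@3 = 1.0.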
Doing Data Right: How Lessons Learned Working with Conventional Data should Inform the Future of Synthetic Data for Recommender Systems
It is argued that explicit attention to dataset design and description will help the field avoid past mistakes with dataset bias and evaluation, and explore the full scope of opportunities presented by synthetic data going forward.
Training and testing of recommender systems on data missing not at random
It is shown that the absence of ratings carries useful information for improving the top-k hit rate with respect to all items, a natural accuracy measure for recommendations, and that two performance measures can be estimated from data without bias, under mild assumptions, even when ratings are missing not at random (MNAR).
Statistical biases in Information Retrieval metrics for recommender systems
This paper lays out an experimental configuration framework upon which to identify and analyse specific statistical biases arising in the adaptation of Information Retrieval metrics to recommendation tasks, namely sparsity and popularity biases.
Should I Follow the Crowd?: A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems
A crowdsourced dataset devoid of the usual biases displayed by common publicly available data is built, in which contradictions between the accuracy that would be measured in a common biased offline experimental setting, and the actual accuracy that can be measured with unbiased observations are illustrated.
Precision-oriented evaluation of recommender systems: an algorithmic comparison
In three experiments with three state-of-the-art recommenders, four of the evaluation methodologies are consistent with each other but differ from error metrics in their comparative measurements of recommender performance.
Sturgeon and the Cool Kids: Problems with Random Decoys for Top-N Recommender Evaluation
This work explores the random decoy strategy through both a theoretical treatment and an empirical study, but finds little evidence to guide its tuning and shows that it has complex and deleterious interactions with popularity bias.
Top-N Recommendation with Missing Implicit Feedback
A missing data model for implicit feedback is discussed and a novel evaluation measure oriented towards Top-N recommendation is proposed, which admits unbiased estimation under that model, unlike the popular Normalized Discounted Cumulative Gain (NDCG) measure.
Performance of recommender algorithms on top-n recommendation tasks
An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on top-N recommendation tasks, and new variants of two collaborative filtering algorithms are offered.
Offline A/B Testing for Recommender Systems
This work proposes a new counterfactual estimator and provides a benchmark of the different estimators showing their correlation with business metrics observed by running online A/B tests on a large-scale commercial recommender system.
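Counterfactual estimators of this kind typically reweight logged interactions by importance weights and cap those weights to control variance. The sketch below illustrates the general capped inverse-propensity idea under simplified assumptions (a finite action log with known logging probabilities); the function and parameter names are illustrative, not the paper's API.

```python
def capped_ips_estimate(logs, target_prob, cap=10.0):
    """Capped inverse-propensity estimate of a target policy's reward.

    `logs` is a list of (action, logging_prob, reward) tuples collected
    under the deployed policy; `target_prob(action)` returns the target
    policy's probability of taking the same action in that context.
    Capping the importance weight at `cap` accepts a small bias in
    exchange for a large reduction in variance -- the trade-off behind
    capped counterfactual estimators.
    """
    total = 0.0
    for action, p_log, reward in logs:
        weight = min(target_prob(action) / p_log, cap)
        total += weight * reward
    return total / len(logs)
```

With a uniform logging policy over two actions and a target policy that always picks the rewarded one, the uncapped weight is 2.0; lowering `cap` below 2.0 shrinks the estimate toward zero, showing the bias/variance trade-off directly.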
Recommender system performance evaluation and prediction: an information retrieval perspective
This thesis investigates the definition and formalisation of performance prediction methods for recommender systems, studies adaptations of search performance predictors from the Information Retrieval field, and proposes new predictors based on theories and models from Information Theory and Social Graph Theory.
Improving recommendation lists through topic diversification
This work presents topic diversification, a novel method designed to balance and diversify personalized recommendation lists in order to reflect the user's complete spectrum of interests, and introduces the intra-list similarity metric to assess the topical diversity of recommendation lists.
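The intra-list similarity metric mentioned here is commonly computed as the mean pairwise similarity over all item pairs in a recommendation list. The sketch below shows one common formulation under that assumption; `sim` stands in for any symmetric item similarity (e.g. cosine over item features), and the normalization is a usual convention rather than necessarily the paper's exact definition.

```python
from itertools import combinations

def intra_list_similarity(items, sim):
    """Mean pairwise similarity of the items in a recommendation list.

    Lower values indicate a more topically diverse list, so topic
    diversification aims to reduce this score while preserving accuracy.
    `sim(a, b)` is any symmetric similarity function over items.
    """
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0  # a singleton or empty list has no pairs to compare
    return sum(sim(a, b) for a, b in pairs) / len(pairs)
```

A list of mutually identical items scores 1.0 under a similarity bounded by 1; a list of mutually dissimilar items scores near 0.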