Estimating Error and Bias in Offline Evaluation Results

  • Mucun Tian and Michael D. Ekstrand
  • Published 26 January 2020
  • Computer Science
  • Proceedings of the 2020 Conference on Human Information Interaction and Retrieval
Offline evaluations of recommender systems attempt to estimate users' satisfaction with recommendations using static data from prior user interactions. These evaluations provide researchers and developers with first approximations of the likely performance of a new system and help weed out bad ideas before presenting them to users. However, offline evaluation cannot accurately assess novel, relevant recommendations, because the most novel items were previously unknown to the user, so they are… 
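The setup the abstract describes can be sketched in a few lines: hold out some of each user's interactions, ask the recommender for a top-k list trained only on the remaining data, and score the list with precision@k against the held-out items. This is a minimal illustration, not the paper's method; the toy popularity recommender, the data, and all names here are hypothetical.

```python
def precision_at_k(recommended, held_out, k=3):
    """Fraction of the top-k recommended items found in the user's held-out items."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in held_out) / k

def recommend(user, train, all_items, k=3):
    """Toy popularity recommender: rank items the user hasn't seen in training
    by how many training users interacted with them."""
    seen = train.get(user, set())
    popularity = {i: sum(i in s for s in train.values()) for i in all_items}
    candidates = [i for i in all_items if i not in seen]
    return sorted(candidates, key=lambda i: -popularity[i])[:k]

# Static interaction data, split into observed (train) and held-out (test) parts.
train = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"b", "c", "d"}}
test = {"u1": {"c", "d"}, "u2": {"b"}}
all_items = {"a", "b", "c", "d", "e"}

scores = {u: precision_at_k(recommend(u, train, all_items), test[u]) for u in test}
mean_precision = sum(scores.values()) / len(scores)
```

Note how the bias the abstract points to shows up even in this toy: item "e" is counted as a miss for every user simply because no held-out interaction with it was ever recorded, regardless of whether the user would actually have liked it.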

Figures and Tables from this paper

LensKit for Python: Next-Generation Software for Recommender Systems Experiments

The next generation of the LensKit project is presented, re-envisioning the original tool's objectives as a flexible Python package for supporting recommender systems research and development.

Multiversal Simulacra: Understanding Hypotheticals and Possible Worlds Through Simulation

My research agenda is particularly concerned with understanding the human biases that affect information retrieval and recommender systems, and quantifying their impact on the systems' operation.

Statistical Inference: The Missing Piece of RecSys Experiment Reliability Discourse

It is argued that statistical inference is a key component of the evaluation process that has not been given sufficient attention, and several challenges for inference in recommendation experiments are presented, underscoring the need for empirical studies to aid in appropriately selecting and applying statistical inference techniques.

Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently?

Quality metrics used for recommender systems evaluation are investigated and it is found that Precision is the only metric universally understood among papers and libraries, while other metrics may have different interpretations.

SimuRec: Workshop on Synthetic Data and Simulation Methods for Recommender Systems Research

A workshop to bring together researchers and practitioners interested in simulating recommender systems and their data to discuss the state of the art of such research and the pressing open methodological questions resulted in a report authored by the participants.

Should I Follow the Crowd?: A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems

A crowdsourced dataset devoid of the usual biases displayed by common publicly available data is built, in which contradictions between the accuracy that would be measured in a common biased offline experimental setting, and the actual accuracy that can be measured with unbiased observations are illustrated.

Offline A/B Testing for Recommender Systems

This work proposes a new counterfactual estimator and provides a benchmark of the different estimators showing their correlation with business metrics observed by running online A/B tests on a large-scale commercial recommender system.

Recommender system performance evaluation and prediction: an information retrieval perspective

This thesis investigates the definition and formalisation of performance prediction methods for recommender systems, studies adaptations of search performance predictors from the Information Retrieval field, and proposes new predictors based on theories and models from Information Theory and Social Graph Theory.

Training and testing of recommender systems on data missing not at random

It is shown that the absence of ratings carries useful information for improving the top-k hit rate over all items, a natural accuracy measure for recommendations, and that two performance measures can be estimated from the data without bias, under mild assumptions, even when ratings are missing not at random (MNAR).

Precision-oriented evaluation of recommender systems: an algorithmic comparison

In three experiments with three state-of-the-art recommenders, four of the evaluation methodologies are consistent with each other but differ from error metrics in the comparative performance measurements they produce.

Performance of recommender algorithms on top-n recommendation tasks

An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on top-N recommendation tasks, and new variants of two collaborative filtering algorithms are offered.

Top-N Recommendation with Missing Implicit Feedback

A missing data model for implicit feedback is discussed and a novel evaluation measure oriented towards Top-N recommendation is proposed, which admits unbiased estimation under that model, unlike the popular Normalized Discounted Cumulative Gain (NDCG) measure.

Data Pruning in Recommender Systems Research: Best-Practice or Malpractice?

It is found that removing users with fewer than 20 ratings is equivalent to removing 5% of ratings and 42% of users, and it is concluded that pruning should be avoided if possible, though more discussion in the community is needed.

Sturgeon and the Cool Kids: Problems with Random Decoys for Top-N Recommender Evaluation

This work explores the random decoy strategy through both a theoretical treatment and an empirical study, but finds little evidence to guide its tuning and shows that it has complex and deleterious interactions with popularity bias.

Recommender Systems Notation: Proposed Common Notation for Teaching and Research

The notation the authors have adopted in their work is described, along with its justification and some discussion of considered alternatives, in hope that it will be useful to others writing and teaching about recommender systems.