• Corpus ID: 2931043

Sturgeon and the Cool Kids: Problems with Random Decoys for Top-N Recommender Evaluation

Authors: Michael D. Ekstrand and Vaibhav Mahant
Venue: FLAIRS Conference
Top-N evaluation of recommender systems, typically carried out using metrics from information retrieval or machine learning, has several challenges. Two of these challenges are popularity bias, where the evaluation intrinsically favors algorithms that recommend popular items, and misclassified decoys, where items for which no user relevance is known are actually relevant to the user, but the evaluation is unaware and penalizes the recommender for suggesting them. One strategy for mitigating the… 
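The misclassified-decoy problem described above can be illustrated with a small sketch. It is not taken from the paper: the item names, the `precision_at_n` helper, and the toy data are all hypothetical, and precision@N is used here only as one common information-retrieval metric. It shows how a recommender is penalized when items the user would actually like are missing from the recorded relevance data.

```python
def precision_at_n(recommended, known_relevant, n):
    """Fraction of the top-n recommended items found in the known-relevant set."""
    top_n = recommended[:n]
    hits = sum(1 for item in top_n if item in known_relevant)
    return hits / n


# Hypothetical toy data (assumption, for illustration only).
true_relevant = {"a", "b", "c"}      # items the user would actually like
observed_relevant = {"a"}            # items the test data records as relevant
recommended = ["a", "b", "c", "d"]   # the recommender's top-4 list

# "b" and "c" are treated as decoys by the evaluation although they are
# relevant, so the measured score understates the list's real quality.
measured = precision_at_n(recommended, observed_relevant, 4)
ideal = precision_at_n(recommended, true_relevant, 4)
print(measured, ideal)
```

In this toy case the evaluation sees one hit out of four, while three of the four recommended items are in fact relevant to the user.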

Monte Carlo Estimates of Evaluation Metric Error and Bias
Simulation of the recommender data generation and evaluation processes is used to quantify the extent of evaluation metric errors and assess their sensitivity to various assumptions.
Hands on Data and Algorithmic Bias in Recommender Systems
A range of techniques for evaluating and mitigating the impact of biases on recommended lists, including pre-, in-, and post-processing procedures, is covered.
Transparent, Scrutable and Explainable User Models for Personalized Recommendation
This paper presents a new set-based recommendation technique that permits the user model to be explicitly presented to users in natural language, empowering users to understand recommendations made and improve the recommendations dynamically.
Statistical Inference: The Missing Piece of RecSys Experiment Reliability Discourse
It is argued that statistical inference is a key component of the evaluation process that has not been given sufficient attention, and several challenges for inference in recommendation experiments are presented, which buttress the need for empirical studies to aid in appropriately selecting and applying statistical inference techniques.
Estimating Error and Bias in Offline Evaluation Results
It is found that missing data in the rating or observation process causes the evaluation protocol to systematically mis-estimate metric values, and in some cases erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender.


Performance of recommender algorithms on top-n recommendation tasks
An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on the top-N recommendation task, and new variants of two collaborative filtering algorithms are offered.
Precision-oriented evaluation of recommender systems: an algorithmic comparison
In three experiments with three state-of-the-art recommenders, four of the evaluation methodologies are consistent with each other and differ from error metrics, in terms of the comparative recommenders' performance measurements.
Performance prediction and evaluation in recommender systems: An information retrieval perspective
This thesis investigates the definition and formalisation of performance prediction methods for recommender systems, and evaluates the quality of the proposed solutions in terms of the correlation between the predicted and the observed performance on test data.
Improving recommendation lists through topic diversification
This work presents topic diversification, a novel method designed to balance and diversify personalized recommendation lists in order to reflect the user's complete spectrum of interests, and introduces the intra-list similarity metric to assess the topical diversity of recommendation lists.
Evaluating collaborative filtering recommender systems
The key decisions in evaluating collaborative filtering recommender systems are reviewed: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole.
A Survey of Accuracy Evaluation Metrics of Recommendation Tasks
This paper reviews the proper construction of offline experiments for deciding on the most appropriate algorithm, discusses three important tasks of recommender systems, and classifies a set of appropriate, well-known evaluation metrics for each task.
Factorization meets the neighborhood: a multifaceted collaborative filtering model
The factor and neighborhood models can now be smoothly merged, thereby building a more accurate combined model, and a new evaluation metric is suggested that highlights the differences among methods based on their performance at a top-K recommendation task.
Comparative recommender system evaluation: benchmarking recommendation frameworks
This work compares common recommendation algorithms as implemented in three popular recommendation frameworks and shows the necessity of clear guidelines when reporting evaluation of recommender systems to ensure reproducibility and comparison of results.
Being accurate is not enough: how accuracy metrics have hurt recommender systems
This paper proposes informal arguments that the recommender community should move beyond the conventional accuracy metrics and their associated experimental methodologies, and proposes new user-centric directions for evaluating recommender systems.
Improving regularized singular value decomposition for collaborative filtering
Different efficient collaborative filtering techniques and a framework for combining them to obtain a good prediction are described, predicting users’ preferences for movies with an error rate 7.04% better on the Netflix Prize dataset than the reference algorithm, Netflix Cinematch.