Sturgeon and the Cool Kids: Problems with Random Decoys for Top-N Recommender Evaluation
@inproceedings{Ekstrand2017SturgeonAT,
  title     = {Sturgeon and the Cool Kids: Problems with Random Decoys for Top-N Recommender Evaluation},
  author    = {Michael D. Ekstrand and Vaibhav Mahant},
  booktitle = {FLAIRS Conference},
  year      = {2017}
}
Top-N evaluation of recommender systems, typically carried out using metrics from information retrieval or machine learning, faces several challenges. Two of these are popularity bias, where the evaluation intrinsically favors algorithms that recommend popular items, and misclassified decoys, where items for which no user relevance is known are in fact relevant to the user, so the evaluation unknowingly penalizes the recommender for suggesting them. One strategy for mitigating the…
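To make the misclassified-decoy problem concrete, here is a minimal sketch (not the paper's code) of the random-decoy protocol the abstract critiques: each held-out relevant item is ranked against randomly sampled decoy items that the protocol assumes to be non-relevant. The function names and parameters (score, n_decoys, k) are illustrative assumptions.

```python
import random

def hit_rate_at_k(score, test_pairs, all_items, n_decoys=100, k=10):
    """Fraction of held-out (user, item) pairs that a recommender ranks
    in the top k among n_decoys randomly drawn decoys.

    score(user, item) -> float is any scoring model; decoys are sampled
    uniformly and *assumed* non-relevant, which is exactly the assumption
    the paper questions.
    """
    hits = 0
    for user, rel_item in test_pairs:
        decoys = random.sample([i for i in all_items if i != rel_item], n_decoys)
        candidates = decoys + [rel_item]
        ranked = sorted(candidates, key=lambda item: score(user, item), reverse=True)
        if rel_item in ranked[:k]:
            hits += 1
    return hits / len(test_pairs)
```

If a sampled decoy is actually relevant to the user, a good recommender may rank it above the held-out item and be penalized for a correct suggestion; that is the misclassification the abstract refers to.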
5 Citations
Monte Carlo Estimates of Evaluation Metric Error and Bias
- Computer Science
- 2018
Simulation of the recommender data generation and evaluation processes is used to quantify the extent of evaluation metric errors and assess their sensitivity to various assumptions.
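A hedged sketch of the simulation idea this summary describes: generate a known ground truth, simulate an incomplete observation process, and measure how far a metric computed on the observed data drifts from its true value. The generative assumptions here (Bernoulli relevance, popularity-skewed observation, a random stand-in recommender) are illustrative, not the paper's actual model.

```python
import random

def mean_metric_bias(n_items=1000, n_trials=500, k=10, p_rel=0.05):
    """Monte Carlo estimate of the bias of observed precision@k relative
    to true precision@k when relevance observations are incomplete."""
    errors = []
    for _ in range(n_trials):
        # True relevance of every item, known only to the simulation.
        truth = {i: random.random() < p_rel for i in range(n_items)}
        # Relevant items are observed with probability decaying in the
        # item index, a stand-in for popularity-skewed observation.
        observed = {i: rel and random.random() < 1.0 / (1 + 0.01 * i)
                    for i, rel in truth.items()}
        top_k = random.sample(range(n_items), k)  # stand-in recommender
        true_p = sum(truth[i] for i in top_k) / k
        obs_p = sum(observed[i] for i in top_k) / k
        errors.append(obs_p - true_p)
    return sum(errors) / len(errors)
```

Because the observation process only removes relevant items, the observed metric systematically underestimates the true one; varying the observation model shows how sensitive that bias is to the assumptions.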
Hands on Data and Algorithmic Bias in Recommender Systems
- Computer Science
- UMAP
- 2020
A range of techniques for evaluating and mitigating the impact of biases on the recommended lists, including pre-, in-, and post-processing procedures are covered.
Transparent, Scrutable and Explainable User Models for Personalized Recommendation
- Computer Science
- SIGIR
- 2019
This paper presents a new set-based recommendation technique that permits the user model to be explicitly presented to users in natural language, empowering users to understand recommendations made and improve the recommendations dynamically.
Statistical Inference: The Missing Piece of RecSys Experiment Reliability Discourse
- Computer Science
- Perspectives@RecSys
- 2021
It is argued that statistical inference is a key component of the evaluation process that has not received sufficient attention, and several challenges for inference in recommendation experiments are presented, underscoring the need for empirical studies to aid in appropriately selecting and applying statistical inference techniques.
Estimating Error and Bias in Offline Evaluation Results
- Computer Science
- CHIIR
- 2020
It is found that missing data in the rating or observation process causes the evaluation protocol to systematically mis-estimate metric values, and in some cases erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender.
References
Showing 1-10 of 26 references
Performance of recommender algorithms on top-n recommendation tasks
- Computer Science
- RecSys '10
- 2010
An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on the top-N recommendation task, and new variants of two collaborative filtering algorithms are offered.
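As a point of reference, the metrics commonly derived under this paper's random-decoy protocol (each held-out relevant item is ranked against a large pool of random items, and a "hit" means it lands in the top N) can be sketched as follows; the notation here is ours, not the paper's.

```latex
% T: set of held-out (user, item) test pairs; a hit occurs when the
% held-out item is ranked in the top N among the random decoys.
\[
  \mathrm{recall}(N) = \frac{\#\mathrm{hits}}{|T|},
  \qquad
  \mathrm{precision}(N) = \frac{\#\mathrm{hits}}{N \cdot |T|}
                        = \frac{\mathrm{recall}(N)}{N}
\]
```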
Precision-oriented evaluation of recommender systems: an algorithmic comparison
- Computer Science
- RecSys '11
- 2011
In three experiments with three state-of-the-art recommenders, four of the evaluation methodologies are found to be consistent with each other and to differ from error metrics in the comparative performance measurements they produce.
Performance prediction and evaluation in recommender systems: An information retrieval perspective
- Computer Science
- 2012
This thesis investigates the definition and formalisation of performance prediction methods for recommender systems, and evaluates the quality of the proposed solutions in terms of the correlation between the predicted and the observed performance on test data.
Improving recommendation lists through topic diversification
- Computer Science
- WWW '05
- 2005
This work presents topic diversification, a novel method designed to balance and diversify personalized recommendation lists in order to reflect the user's complete spectrum of interests, and introduces the intra-list similarity metric to assess the topical diversity of recommendation lists.
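A minimal sketch of the intra-list similarity (ILS) metric mentioned above, in one common formulation: the average pairwise similarity of the items in a recommendation list, with lower values indicating greater topical diversity. The cosine similarity and the item feature vectors are illustrative assumptions, not the paper's exact formulation.

```python
from itertools import combinations
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intra_list_similarity(items, features):
    """Average pairwise similarity of a recommendation list.

    items: list of item ids; features: mapping from item id to vector.
    """
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(cosine(features[i], features[j]) for i, j in pairs) / len(pairs)
```

Topic diversification then reranks candidate lists to lower this value, trading a small amount of accuracy for broader coverage of the user's interests.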
Evaluating collaborative filtering recommender systems
- Computer Science
- TOIS
- 2004
The key decisions in evaluating collaborative filtering recommender systems are reviewed: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole.
A Survey of Accuracy Evaluation Metrics of Recommendation Tasks
- Computer Science
- J. Mach. Learn. Res.
- 2009
This paper reviews the proper construction of offline experiments for deciding on the most appropriate algorithm, discusses three important tasks of recommender systems, and classifies a set of appropriate, well-known evaluation metrics for each task.
Factorization meets the neighborhood: a multifaceted collaborative filtering model
- Computer Science
- KDD
- 2008
The factor and neighborhood models can now be smoothly merged, building a more accurate combined model, and a new evaluation metric is suggested that highlights the differences among methods based on their performance at a top-K recommendation task.
Comparative recommender system evaluation: benchmarking recommendation frameworks
- Computer Science
- RecSys '14
- 2014
This work compares common recommendation algorithms as implemented in three popular recommendation frameworks and shows the necessity of clear guidelines for reporting recommender system evaluations to ensure reproducibility and comparability of results.
Being accurate is not enough: how accuracy metrics have hurt recommender systems
- Computer Science
- CHI Extended Abstracts
- 2006
This paper offers informal arguments that the recommender community should move beyond conventional accuracy metrics and their associated experimental methodologies, and proposes new user-centric directions for evaluating recommender systems.
Improving regularized singular value decomposition for collaborative filtering
- Computer Science
- 2007
Different efficient collaborative filtering techniques and a framework for combining them to obtain a good prediction are described, predicting users' preferences for movies on the Netflix Prize dataset with an error rate 7.04% better than the reference algorithm, Netflix's Cinematch.