Sturgeon and the Cool Kids: Problems with Random Decoys for Top-N Recommender Evaluation
@inproceedings{Ekstrand2017SturgeonAT,
  title     = {Sturgeon and the Cool Kids: Problems with Random Decoys for Top-N Recommender Evaluation},
  author    = {Michael D. Ekstrand and Vaibhav Mahant},
  booktitle = {The Florida AI Research Society},
  year      = {2017}
}
Top-N evaluation of recommender systems, typically carried out using metrics from information retrieval or machine learning, has several challenges. Two of these challenges are popularity bias, where the evaluation intrinsically favors algorithms that recommend popular items, and misclassified decoys, where items for which no user relevance is known are actually relevant to the user, but the evaluation is unaware and penalizes the recommender for suggesting them. One strategy for mitigating the…
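The misclassified-decoy problem is easiest to see in the widely used random-decoy protocol: hold out one known-relevant item per user, mix it into a pool of randomly sampled unrated items, and count a hit if the recommender ranks the held-out item in the top k. Below is a minimal sketch of that protocol, assuming a generic `score(user, item)` recommender function; all names are illustrative, not code from the paper.

```python
import random

def random_decoy_hit_rate(score, test_pairs, all_items, rated, n_decoys=100, k=10):
    """Hit rate at k under the random-decoy ("one-plus-random") protocol.

    score(user, item) -> float : any recommender's scoring function
    test_pairs                 : iterable of (user, held_out_relevant_item)
    all_items                  : the full item catalog
    rated[user]                : set of items the user is known to have rated
    """
    hits = trials = 0
    for user, target in test_pairs:
        # Decoys are drawn from items with no observed rating and are all
        # treated as irrelevant. Any decoy the user would actually like is
        # a misclassified decoy: ranking it above the target counts as an
        # error even though the recommendation may be good.
        pool = [i for i in all_items if i not in rated[user] and i != target]
        decoys = random.sample(pool, min(n_decoys, len(pool)))
        ranked = sorted(decoys + [target], key=lambda i: score(user, i), reverse=True)
        hits += target in ranked[:k]
        trials += 1
    return hits / trials if trials else 0.0
```

Note also where the popularity bias enters: popular items are more likely to already be in `rated[user]` and so are rarely sampled as decoys, while held-out targets skew popular, so an algorithm that simply recommends popular items tends to score well under this protocol.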
8 Citations
Monte Carlo Estimates of Evaluation Metric Error and Bias
- Computer Science
- 2018
Simulation of the recommender data generation and evaluation processes is used to quantify the extent of evaluation metric errors and assess their sensitivity to various assumptions.
Hands on Data and Algorithmic Bias in Recommender Systems
- Computer Science · UMAP
- 2020
A range of techniques for evaluating and mitigating the impact of biases on recommended lists, including pre-, in-, and post-processing procedures, is covered.
Estimating Error and Bias in Offline Evaluation Results
- Computer Science · CHIIR
- 2020
It is found that missing data in the rating or observation process causes the evaluation protocol to systematically mis-estimate metric values, and in some cases erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender.
Statistical Inference: The Missing Piece of RecSys Experiment Reliability Discourse
- Computer Science · Perspectives@RecSys
- 2021
It is argued that statistical inference is a key component of the evaluation process that has not been given sufficient attention, and several challenges for inference in recommendation experiments are presented, underscoring the need for empirical studies to aid in appropriately selecting and applying statistical inference techniques.
Evaluating Recommender Systems: Survey and Framework
- Computer Science · ACM Comput. Surv.
- 2023
The FEVR framework provides a structured foundation for adopting evaluation configurations that encompass the required multi-facetedness of recommender evaluation, and a basis for advancing the field.
Best Practices for Top-N Recommendation Evaluation: Candidate Set Sampling and Statistical Inference Techniques
- Computer Science · CIKM
- 2022
The goal of this project is to identify, substantiate, and document best practices for improving recommendation evaluation experiments.
A unifying and general account of fairness measurement in recommender systems
- Computer Science · Inf. Process. Manag.
- 2023
Transparent, Scrutable and Explainable User Models for Personalized Recommendation
- Computer Science · SIGIR
- 2019
This paper presents a new set-based recommendation technique that permits the user model to be explicitly presented to users in natural language, empowering users to understand the recommendations made and to improve them dynamically.