• Corpus ID: 2931043

Sturgeon and the Cool Kids: Problems with Random Decoys for Top-N Recommender Evaluation

Michael D. Ekstrand and Vaibhav Mahant. Published in The Florida AI Research Society.

Top-N evaluation of recommender systems, typically carried out using metrics from information retrieval or machine learning, has several challenges. Two of these challenges are popularity bias, where the evaluation intrinsically favors algorithms that recommend popular items, and misclassified decoys, where items for which no user relevance is known are actually relevant to the user, but the evaluation is unaware and penalizes the recommender for suggesting them. One strategy for mitigating the… 
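
As a rough illustration of the kind of random-decoy protocol the title refers to, the Python sketch below ranks each held-out test item against a sample of randomly chosen items the user has no known rating for, and counts a hit when the test item lands in the top N. This is an assumption about the general shape of such protocols, not the authors' code: the function name, its parameters, and the `score(user, item)` callable are all hypothetical. It also makes the misclassified-decoy problem visible, since every sampled decoy is treated as irrelevant whether or not the user would actually like it.

```python
import random


def hit_rate_with_random_decoys(score, test_items, rated_items, all_items,
                                n_decoys, top_n, rng):
    """Hypothetical sketch: hit rate at top_n when each held-out item is
    ranked among n_decoys randomly sampled items with no known rating.

    score       -- callable (user, item) -> float, the recommender under test
    test_items  -- dict user -> list of held-out relevant items
    rated_items -- dict user -> set of items rated in the training data
    all_items   -- iterable of every item id
    rng         -- random.Random instance, passed in for reproducibility
    """
    hits = 0
    total = 0
    for user, held_out in test_items.items():
        # Decoy pool: items for which no relevance is known for this user.
        # Some of these may in fact be relevant (misclassified decoys).
        known = set(rated_items.get(user, set())) | set(held_out)
        pool = sorted(i for i in all_items if i not in known)
        for target in held_out:
            decoys = rng.sample(pool, min(n_decoys, len(pool)))
            candidates = decoys + [target]
            ranked = sorted(candidates, key=lambda i: score(user, i),
                            reverse=True)
            hits += int(target in ranked[:top_n])
            total += 1
    return hits / total if total else 0.0


# Usage with toy data (all values are illustrative only).
items = list(range(10))
train = {"u1": {0, 1}}
test = {"u1": [9]}
always_first = lambda user, item: 1.0 if item == 9 else 0.0
rate = hit_rate_with_random_decoys(always_first, test, train, items,
                                   n_decoys=5, top_n=3,
                                   rng=random.Random(0))
```

Passing the random generator in explicitly keeps the decoy sample reproducible, which matters when comparing algorithms on the same candidate sets.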


Monte Carlo Estimates of Evaluation Metric Error and Bias

Simulation of the recommender data generation and evaluation processes is used to quantify the extent of evaluation metric errors and assess their sensitivity to various assumptions.

Hands on Data and Algorithmic Bias in Recommender Systems

A range of techniques for evaluating and mitigating the impact of biases on the recommended lists, including pre-, in-, and post-processing procedures, are covered.

Estimating Error and Bias in Offline Evaluation Results

It is found that missing data in the rating or observation process causes the evaluation protocol to systematically mis-estimate metric values, and in some cases erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender.

Statistical Inference: The Missing Piece of RecSys Experiment Reliability Discourse

It is argued that statistical inference is a key component of the evaluation process that has not been given sufficient attention, and several challenges for inference in recommendation experiments are presented, which buttress the need for empirical studies to aid in appropriately selecting and applying statistical inference techniques.

Evaluating Recommender Systems: Survey and Framework

The FEVR framework provides a structured foundation to adopt adequate evaluation configurations that encompass this required multi-facetedness and provides the basis to advance in the field.

Best Practices for Top-N Recommendation Evaluation: Candidate Set Sampling and Statistical Inference Techniques

The goal of this project is to identify, substantiate, and document best practices to improve recommendation evaluation experiments.

Transparent, Scrutable and Explainable User Models for Personalized Recommendation

This paper presents a new set-based recommendation technique that permits the user model to be explicitly presented to users in natural language, empowering users to understand recommendations made and improve the recommendations dynamically.



Performance of recommender algorithms on top-n recommendation tasks

An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on the top-N recommendation task, and new variants of two collaborative filtering algorithms are offered.

Precision-oriented evaluation of recommender systems: an algorithmic comparison

In three experiments with three state-of-the-art recommenders, four of the evaluation methodologies are consistent with each other and differ from error metrics, in terms of the comparative recommenders' performance measurements.

Improving recommendation lists through topic diversification

This work presents topic diversification, a novel method designed to balance and diversify personalized recommendation lists in order to reflect the user's complete spectrum of interests, and introduces the intra-list similarity metric to assess the topical diversity of recommendation lists.

A Survey of Accuracy Evaluation Metrics of Recommendation Tasks

This paper reviews the proper construction of offline experiments for deciding on the most appropriate algorithm, discusses three important tasks of recommender systems, and classifies a set of appropriate, well-known evaluation metrics for each task.

Factorization meets the neighborhood: a multifaceted collaborative filtering model

The factor and neighborhood models can now be smoothly merged, thereby building a more accurate combined model, and a new evaluation metric is suggested, which highlights the differences among methods based on their performance at a top-K recommendation task.

Being accurate is not enough: how accuracy metrics have hurt recommender systems

This paper proposes informal arguments that the recommender community should move beyond the conventional accuracy metrics and their associated experimental methodologies, and proposes new user-centric directions for evaluating recommender systems.

Rethinking the recommender research ecosystem: reproducibility, openness, and LensKit

The utility of LensKit is demonstrated by replicating and extending a set of prior comparative studies of recommender algorithms, and a question recently raised by a leader in the recommender systems community on problems with error-based prediction evaluation is investigated.

User perception of differences in recommender algorithms

It is found that satisfaction is negatively dependent on novelty and positively dependent on diversity in this setting, and that satisfaction predicts the user's final selection of a recommender that they would like to use in the future.

Item-based collaborative filtering recommendation algorithms

This paper analyzes item-based collaborative filtering techniques and suggests that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms.

Contrasting Offline and Online Results when Evaluating Recommendation Algorithms

Empirical evidence is presented that the ranking of algorithms based on offline accuracy measurements clearly contradicts the results from the online study with the same set of users, suggesting the external validity of the most commonly applied evaluation methodology is not guaranteed.