Comparative recommender system evaluation: benchmarking recommendation frameworks

Alan Said and Alejandro Bellogín. ACM Conference on Recommender Systems.
Recommender systems research is often based on comparisons of predictive accuracy: the better the evaluation scores, the better the recommender. […] To provide a fair comparison, we have complete control of the evaluation dimensions being benchmarked: dataset, data splitting, evaluation strategies, and metrics. We also include results using the internal evaluation mechanisms of these frameworks. Our analysis points to large differences in recommendation accuracy across frameworks and strategies, i…

Research Paper Recommender System Evaluation Using Coverage

A range of evaluation metrics and measures, as well as some approaches used for evaluating recommendation systems, are reviewed, showing large differences in recommendation accuracy across frameworks and strategies.

Mix and Rank: A Framework for Benchmarking Recommender Systems

This work proposes a novel benchmarking framework that mixes different evaluation measures in order to rank the recommender systems on each benchmark dataset, separately, and discovers sets of correlated measures as well as sets of evaluation measures that are least correlated.

Reproducibility of Experiments in Recommender Systems Evaluation

This paper compares well-known recommendation algorithms using the same dataset, metrics, and overall settings; the results point to differences across frameworks even with the exact same settings.

Rival: a toolkit to foster reproducibility in recommender system evaluation

Some of RiVal's functionality is presented, along with a step-by-step description of how RiVal can be used to evaluate the results from any recommendation framework and make sure that the results are comparable and reproducible.

Non-transparent recommender system evaluation leads to misleading results

This work investigates the discrepancies between common open source recommender system frameworks and highlights the difference in evaluation protocols – even when the same evaluation metrics are employed, evidencing differences in their implementation.

Replicable Evaluation of Recommender Systems

This tutorial shows how to present evaluation results in a clear and concise manner, while ensuring that the results are comparable, replicable and unbiased.

Comparative Evaluation for Recommender Systems for Book Recommendations

An offline comparative evaluation of commonly used recommendation algorithms of collaborative filtering using the BookCrossing data set containing 1,149,780 user ratings on books shows the disparity of evaluation results between the RS frameworks.

BARS: Towards Open Benchmarking for Recommender Systems

This project presents an initiative for open benchmarking of recommender systems. It sets up a standardized benchmarking pipeline for reproducible research that integrates all the details about datasets, source code, hyper-parameter settings, running logs, and evaluation results.

Offline recommender system evaluation: Challenges and new directions

This work recaps and reflects on the development and current status of recommender system evaluation, providing an updated perspective on the adaptation of IR principles, procedures and metrics, and the implications of those techniques when applied to recommender systems.



A Survey of Accuracy Evaluation Metrics of Recommendation Tasks

This paper reviews the proper construction of offline experiments for deciding on the most appropriate algorithm, discusses three important tasks of recommender systems, and classifies a set of appropriate well-known evaluation metrics for each task.

Evaluating Recommendation Systems

This paper discusses how to compare recommenders based on a set of properties that are relevant for the application, and focuses on comparative studies, where a few algorithms are compared using some evaluation metric, rather than absolute benchmarking of algorithms.

Time-aware recommender systems: a comprehensive survey and analysis of existing evaluation protocols

This paper presents a comprehensive survey and analysis of the state of the art on time-aware recommender systems (TARS), and proposes a methodological description framework aimed at making the evaluation process fair and reproducible.

Performance of recommender algorithms on top-n recommendation tasks

An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected in the top-N recommendation task, and new variants of two collaborative filtering algorithms are offered.

Evaluating collaborative filtering recommender systems

The key decisions in evaluating collaborative filtering recommender systems are reviewed: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole.

Rethinking the recommender research ecosystem: reproducibility, openness, and LensKit

The utility of LensKit is demonstrated by replicating and extending a set of prior comparative studies of recommender algorithms, and a question recently raised by a leader in the recommender systems community on problems with error-based prediction evaluation is investigated.

Users and noise: the magic barrier of recommender systems

This work investigates the inconsistencies of the user ratings and estimates the magic barrier in order to assess the actual quality of the recommender system, and presents a mathematical characterization of the magic barrier based on the assumption that user ratings are afflicted with inconsistencies (noise).

Being accurate is not enough: how accuracy metrics have hurt recommender systems

This paper proposes informal arguments that the recommender community should move beyond the conventional accuracy metrics and their associated experimental methodologies, and proposes new user-centric directions for evaluating recommender systems.

Item-based collaborative filtering recommendation algorithms

This paper analyzes item-based collaborative filtering techniques and suggests that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms.