Comparative recommender system evaluation: benchmarking recommendation frameworks

@inproceedings{Said2014ComparativeRS,
  title={Comparative recommender system evaluation: benchmarking recommendation frameworks},
  author={A. Said and Alejandro Bellog{\'i}n},
  booktitle={RecSys '14},
  year={2014}
}
Recommender systems research is often based on comparisons of predictive accuracy: the better the evaluation scores, the better the recommender. [...] To provide a fair comparison, we have complete control of the evaluation dimensions being benchmarked: dataset, data splitting, evaluation strategies, and metrics. We also include results using the internal evaluation mechanisms of these frameworks. Our analysis points to large differences in recommendation accuracy across frameworks and strategies, i.e. […]
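The four dimensions named in the abstract (dataset, data splitting, evaluation strategy, metric) can be pinned down outside of any individual framework. The sketch below is a minimal illustration of that idea, not code from the paper: recommend(train, user, k) is a hypothetical adapter around whichever framework is being benchmarked, while the split and the metric stay identical across runs.

import random
from collections import defaultdict

def split_ratings(ratings, test_ratio=0.2, seed=42):
    # Deterministic hold-out split so every framework is trained and tested
    # on exactly the same data. ratings: iterable of (user, item, rating).
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def precision_at_k(recommended, relevant, k=10):
    # One shared metric implementation applied to every framework's output.
    return sum(1 for item in recommended[:k] if item in relevant) / k

def evaluate(recommend, ratings, k=10, relevance_threshold=4.0):
    # `recommend(train, user, k)` is a hypothetical adapter around one framework.
    train, test = split_ratings(ratings)
    relevant_by_user = defaultdict(set)
    for user, item, rating in test:
        if rating >= relevance_threshold:
            relevant_by_user[user].add(item)
    scores = [precision_at_k(recommend(train, user, k), relevant, k)
              for user, relevant in relevant_by_user.items()]
    return sum(scores) / len(scores) if scores else 0.0

Because only the call to recommend changes between runs, any difference in the reported number can be attributed to the framework rather than to the evaluation setup.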

Citations of this paper

Research Paper Recommender System Evaluation Using Coverage
Recommendation systems (RS) support users and developers of various computer and software systems to overcome information overload, perform information discovery tasks, and approximate computation […]
Mix and Rank: A Framework for Benchmarking Recommender Systems
This work proposes a novel benchmarking framework that mixes different evaluation measures in order to rank recommender systems on each benchmark dataset separately, and discovers sets of correlated measures as well as sets of evaluation measures that are least correlated.
Reproducibility of Experiments in Recommender Systems Evaluation
This paper compares well-known recommendation algorithms using the same dataset, metrics, and overall settings; the results point to differences across frameworks even with the exact same settings.
Rival: a toolkit to foster reproducibility in recommender system evaluation
Some of the functionality of RiVal is presented, along with a step-by-step account of how RiVal can be used to evaluate the results from any recommendation framework and to make sure that the results are comparable and reproducible.
Non-transparent recommender system evaluation leads to misleading results
This work investigates the discrepancies between common open-source recommender system frameworks and highlights differences in evaluation protocols: even when the same evaluation metrics are employed, their implementations differ (a hypothetical nDCG sketch after this list illustrates how such differences arise).
Replicable Evaluation of Recommender Systems
This tutorial shows how to present evaluation results in a clear and concise manner, while ensuring that the results are comparable, replicable and unbiased.
Comparative Evaluation for Recommender Systems for Book Recommendations
An offline comparative evaluation of commonly used collaborative filtering recommendation algorithms, using the BookCrossing data set of 1,149,780 user ratings on books, shows the disparity of evaluation results between the RS frameworks.
A Framework for Evaluating Personalized Ranking Systems by Fusing Different Evaluation Measures
This work provides a general framework that can handle an arbitrary number of evaluation measures and help end-users rank the systems available to them, and investigates the robustness of the proposed methodology using published results from an experimental study involving multiple large datasets and evaluation measures.
Attribute-based evaluation for recommender systems: incorporating user and item attributes in evaluation metrics
This work exploits item attributes to consider some recommended items as surrogates of those the user interacted with, applying a proper penalization; the results show that this novel evaluation methodology captures different nuances of algorithm performance, inherent biases in the data, and even the fairness of the recommendations.
Exploring Data Splitting Strategies for the Evaluation of Recommendation Models
The results demonstrate that the splitting strategy employed is an important confounding variable that can markedly alter the ranking of recommender systems, making much of the currently published literature non-comparable even when the same datasets and metrics are used (a splitting sketch follows below).
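Several entries above note that the same metric name can hide different implementations. As a hypothetical illustration (not taken from any of the papers listed), here are two nDCG variants that differ only in the gain function; both are legitimate readings of the textbook definition, yet they return different scores for identical input.

import math

def ndcg_linear_gain(recommended, relevance, k=10):
    # Variant A: linear gain, rel / log2(rank + 2).
    dcg = sum(relevance.get(item, 0.0) / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def ndcg_exponential_gain(recommended, relevance, k=10):
    # Variant B: exponential gain, (2**rel - 1) / log2(rank + 2), otherwise identical.
    dcg = sum((2 ** relevance.get(item, 0.0) - 1) / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Identical recommendations and relevance judgments, two different "nDCG" scores.
relevance = {"a": 5.0, "b": 3.0, "c": 1.0}
recommended = ["c", "b", "x", "a"]
print(ndcg_linear_gain(recommended, relevance, k=4))       # approx. 0.68
print(ndcg_exponential_gain(recommended, relevance, k=4))  # approx. 0.52

Cutoff handling, tie breaking, and the treatment of unrated recommended items introduce further, equally silent sources of divergence.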
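The last entry above points to the splitting strategy as a confounding variable. Below is a minimal sketch of two common strategies, assuming each interaction is a (user, item, rating, timestamp) tuple; the field layout is an assumption for illustration, not the protocol of any specific paper.

import random
from collections import defaultdict

def random_holdout(ratings, test_ratio=0.2, seed=0):
    # Random hold-out: every interaction has the same chance of landing in test.
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def leave_last_out(ratings):
    # Temporal split: each user's most recent interaction goes to test,
    # everything earlier goes to train.
    by_user = defaultdict(list)
    for interaction in ratings:
        by_user[interaction[0]].append(interaction)
    train, test = [], []
    for user, interactions in by_user.items():
        interactions.sort(key=lambda r: r[3])   # sort by timestamp
        train.extend(interactions[:-1])
        test.append(interactions[-1])
    return train, test

Evaluating the same algorithms with the same metric on both splits is typically enough to change how the systems rank against each other, which is exactly the confounding effect described in the entry above.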
