Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison

@inproceedings{Sun2020AreWE,
  title={Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison},
  author={Zhu Sun and Di Yu and Hui Fang and Jie Yang and Xinghua Qu and Jie Jennifer Zhang and Cong Geng},
  booktitle={Fourteenth ACM Conference on Recommender Systems},
  year={2020}
}
  • Zhu Sun, Di Yu, +4 authors Cong Geng
  • Published 22 September 2020
  • Computer Science
  • Fourteenth ACM Conference on Recommender Systems
With a tremendous number of recommendation algorithms proposed every year, one critical issue has attracted considerable attention: there are no effective benchmarks for evaluation, which leads to two major concerns, namely unreproducible evaluation and unfair comparison. This paper aims to conduct rigorous (i.e., reproducible and fair) evaluation of implicit-feedback-based top-N recommendation algorithms. We first systematically review 85 recommendation papers published at eight top…

On Offline Evaluation of Recommender Systems
TLDR
It is shown that access to different amounts of future data may improve or degrade a model's recommendation accuracy, and that more historical data in the training set does not necessarily lead to better recommendation accuracy.
A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels
TLDR
It is proved that bounds on specific metrics used to evaluate clustering algorithms and multi-class classifiers can be computed without reference labels, and a procedure is introduced that uses an AGTR to identify inaccurate evaluation results produced from datasets of dubious quality.
A comparative analysis of SOAP and REST Web service composition based on performance in local and remote Cloud environments
Web services are an attractive area that interests many researchers and industrial organizations. Given their convenience and reusability, web services have become a main mode of cloud application. Web…
Discussion Paper
Collaborative filtering recommender systems (CF-RSs) employ user-item feedback, e.g., ratings, purchases, or reviews, to harmonize similarities among customers and produce personalized lists of…
Elliot: A Comprehensive and Rigorous Framework for Reproducible Recommender Systems Evaluation
TLDR
Elliot is a comprehensive recommendation framework that aims to run and reproduce an entire experimental pipeline by processing a simple configuration file and optimizes hyperparameters for several recommendation algorithms.
Explaining recommender systems fairness and accuracy through the lens of data characteristics
TLDR
It is found that variations in performance are more difficult to explain for the fairness dimension than for accuracy, and this work provides a systematic study of the impact of broadly chosen data characteristics of recommender systems.
Fairness and Discrimination in Information Access Systems
TLDR
This monograph presents a taxonomy of the various dimensions of fair information access and surveys the literature to date on this new and rapidly growing topic.
GLIMG: Global and Local Item Graphs for Top-N Recommender Systems
TLDR
This paper provides the first attempt to investigate multiple local item graphs along with a global item graph for graph-based recommendation models, and argues that recommendation on global and local graphs outperforms that on a single global graph or multiple local graphs.
Hierarchical Latent Relation Modeling for Collaborative Metric Learning
TLDR
This paper presents a hierarchical CML model that jointly captures latent user-item and item-item relations from implicit data and empirically shows the relevance of this joint relational modeling by outperforming existing CML models on recommendation tasks on several real-world datasets.

References

SHOWING 1-10 OF 56 REFERENCES
Performance of recommender algorithms on top-n recommendation tasks
TLDR
An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on the top-N recommendation task, and new variants of two collaborative filtering algorithms are offered.
On the Difficulty of Evaluating Baselines: A Study on Recommender Systems
TLDR
It is shown that running baselines properly is difficult and empirical findings in research papers are questionable unless they were obtained on standardized benchmarks where baselines have been tuned extensively by the research community.
Collaborative Denoising Auto-Encoders for Top-N Recommender Systems
TLDR
It is demonstrated that the proposed model is a generalization of several well-known collaborative filtering models but with more flexible components, and that CDAE consistently outperforms state-of-the-art top-N recommendation methods on a variety of common evaluation metrics.
LibRec: A Java Library for Recommender Systems
TLDR
An open-source Java library that implements a suite of state-of-the-art algorithms as well as a series of evaluation metrics is introduced, empirically finding that LibRec performs faster than other such libraries, while achieving competitive evaluative performance.
A Step Toward Quantifying Independently Reproducible Machine Learning Research
TLDR
This work manually attempts to implement 255 papers published from 1984 until 2017, records features of each paper, and performs statistical analysis of the results.
AKUPM: Attention-Enhanced Knowledge-Aware User Preference Model for Recommendation
TLDR
A novel model named Attention-enhanced Knowledge-aware User Preference Model (AKUPM) is proposed for click-through rate (CTR) prediction, which achieves substantial gains in terms of common evaluation metrics over several state-of-the-art baselines.
Are we really making much progress? A worrying analysis of recent neural recommendation approaches
TLDR
A systematic analysis of algorithmic proposals for top-n recommendation tasks that were presented at top-level research conferences in recent years sheds light on a number of potential problems in today's machine learning scholarship and calls for improved scientific practices in this area.
Deep Learning Based Recommender System
TLDR
A taxonomy of deep learning-based recommendation models and a comprehensive summary of the state of the art are provided, along with new perspectives pertaining to this new and exciting development of the field.
DeepRec: An Open-source Toolkit for Deep Learning based Recommendation
TLDR
In this toolkit, a number of deep learning based recommendation algorithms are implemented using Python and the widely used deep learning package TensorFlow, and good modularity and extensibility are maintained so that new models can easily be incorporated into the framework.
KGAT: Knowledge Graph Attention Network for Recommendation
TLDR
This work proposes a new method named Knowledge Graph Attention Network (KGAT), which explicitly models the high-order connectivities in KG in an end-to-end fashion and significantly outperforms state-of-the-art methods like Neural FM and RippleNet.