Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison

  title={Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison},
  author={Zhu Sun and Di Yu and Hui Fang and Jie Yang and Xinghua Qu and Jie Zhang and Cong Geng},
  journal={Proceedings of the 14th ACM Conference on Recommender Systems},
  • Published 22 September 2020
With a tremendous number of recommendation algorithms proposed every year, one critical issue has attracted considerable attention: there are no effective benchmarks for evaluation, which leads to two major concerns, i.e., unreproducible evaluation and unfair comparison. This paper aims to conduct rigorous (i.e., reproducible and fair) evaluation for implicit-feedback based top-N recommendation algorithms. We first systematically review 85 recommendation papers published at eight top… 

Figures and Tables from this paper

Where Do We Go From Here? Guidelines For Offline Recommender Evaluation

This paper examines four larger issues in recommender system research — uncertainty estimation, generalization, hyperparameter optimization, and dataset pre-processing — in more detail to arrive at a set of guidelines, and presents TrainRec, a lightweight and flexible toolkit for offline training and evaluation of recommender systems that implements these guidelines.

A Revisiting Study of Appropriate Offline Evaluation for Top-N Recommendation Algorithms

This work presents a large-scale, systematic study on six important factors from three aspects for evaluating recommender systems, and provides several suggested settings that are especially important for performance comparison.

Top-N Recommendation Algorithms: A Quest for the State-of-the-Art

This work reports the outcomes of an in-depth, systematic, and reproducible comparison of ten collaborative filtering algorithms—covering both traditional and neural models—on several common performance measures on three datasets which are frequently used for evaluation in the recent literature.

On Offline Evaluation of Recommender Systems

It is shown that access to different amounts of future data may improve or deteriorate a model's recommendation accuracy, and that more historical data in the training set does not necessarily lead to better recommendation accuracy.

A Critical Study on Data Leakage in Recommender System Offline Evaluation

A comprehensive and critical analysis of the data leakage issue in recommender system offline evaluation is provided, which shows that data leakage does impact models’ recommendation accuracy and proposes a timeline scheme, which calls for a revisit of the recommendation model design.

A Systematical Evaluation for Next-Basket Recommendation Algorithms

A systematic empirical study in the NBR area that runs the selected NBR algorithms on the same datasets, under the same experimental setting, and evaluates their performance using the same measurements, providing a unified framework to fairly compare different NBR approaches.

Group Validation in Recommender Systems: Framework for Multi-layer Performance Evaluation

This article focuses on the concept of data clustering for evaluation in recommenders and applies a neighborhood assessment method for the datasets of recommender system applications, which aids in better understanding critical performance variations in more compact subsets of the system.

BARS: Towards Open Benchmarking for Recommender Systems

This paper presents an initiative aimed at open benchmarking for recommender systems, which sets up a standardized benchmarking pipeline for reproducible research, integrating all the details about datasets, source code, hyper-parameter settings, running logs, and evaluation results.

Progress in Recommender Systems Research: Crisis? What Crisis?

Scholars in algorithmic recommender systems research have developed a largely standardized scientific method, where progress is claimed by showing that a new algorithm outperforms existing ones.

Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently?

Quality metrics used for recommender systems evaluation are investigated and it is found that Precision is the only metric universally understood among papers and libraries, while other metrics may have different interpretations.
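The inconsistency that paper describes is easy to reproduce. As a minimal sketch (hypothetical function names, toy data): precision@k has one widely agreed definition, while recall@k appears in the literature with at least two denominators — the full relevant set, or the relevant set capped at k — and the two can disagree substantially for the same ranking.

```python
def precision_at_k(recommended, relevant, k):
    """The broadly agreed definition: fraction of the top-k that is relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

def recall_at_k_full(recommended, relevant, k):
    """Recall@k with the full relevant set as denominator."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def recall_at_k_capped(recommended, relevant, k):
    """A variant found in some libraries: denominator capped at k, so a
    user with more than k relevant items can still reach 1.0."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / min(k, len(relevant))

recs = ["a", "b", "c"]
rel = {"a", "b", "x", "y", "z"}          # 5 relevant items, 3 recommended
print(precision_at_k(recs, rel, 3))      # 0.666...
print(recall_at_k_full(recs, rel, 3))    # 0.4
print(recall_at_k_capped(recs, rel, 3))  # 0.666...
```

Comparing reported recall@k numbers across papers is therefore only meaningful if the denominator convention is stated.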

Performance of recommender algorithms on top-n recommendation tasks

An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on the top-N recommendation task, and new variants of two collaborative filtering algorithms are offered.

Item-based collaborative filtering recommendation algorithms

This paper analyzes item-based collaborative filtering techniques and suggests that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms.
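The core of item-based collaborative filtering can be sketched in a few lines: precompute item–item similarities from the interaction matrix, then score each candidate item for a user as a similarity-weighted sum over that user's history. The matrix below is hypothetical toy data, not from any of the papers listed here.

```python
import numpy as np

# Toy implicit-feedback matrix (users x items); 1 = interaction.
R = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
], dtype=float)

def item_similarities(R):
    """Cosine similarity between item column vectors."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0            # guard against items with no interactions
    sim = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)         # an item should not recommend itself
    return sim

def recommend(R, user, k=2):
    """Score items as similarity-weighted sums over the user's history,
    mask already-seen items, and return the top-k item indices."""
    sim = item_similarities(R)
    scores = R[user] @ sim
    scores[R[user] > 0] = -np.inf      # exclude items the user already consumed
    return np.argsort(scores)[::-1][:k]

print(recommend(R, user=1))            # → [1 3]
```

In practice the similarity matrix is sparsified (top-n neighbors per item) so that scoring stays cheap at serving time, which is one reason the paper finds item-based methods attractive for large user bases.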

Practical Bayesian Optimization of Machine Learning Algorithms

This work describes new algorithms that take into account the variable cost of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation and shows that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms.

Are we really making much progress? A worrying analysis of recent neural recommendation approaches

A systematic analysis of algorithmic proposals for top-n recommendation tasks that were presented at top-level research conferences in the last years sheds light on a number of potential problems in today's machine learning scholarship and calls for improved scientific practices in this area.

KGAT: Knowledge Graph Attention Network for Recommendation

This work proposes a new method named Knowledge Graph Attention Network (KGAT), which explicitly models the high-order connectivities in KG in an end-to-end fashion and significantly outperforms state-of-the-art methods like Neural FM and RippleNet.

On the Difficulty of Evaluating Baselines: A Study on Recommender Systems

It is shown that running baselines properly is difficult and empirical findings in research papers are questionable unless they were obtained on standardized benchmarks where baselines have been tuned extensively by the research community.

DeepRec: An Open-source Toolkit for Deep Learning based Recommendation

In this toolkit, a number of deep learning based recommendation algorithms are implemented using Python and the widely used deep learning package TensorFlow, and good modularity and extensibility are maintained to easily incorporate new models into the framework.

Variational Autoencoders for Collaborative Filtering

A generative model with multinomial likelihood using Bayesian inference for parameter estimation is introduced, and the pros and cons of employing a principled Bayesian inference approach are identified, characterizing settings where it provides the most significant improvements.

Neural Factorization Machines for Sparse Predictive Analytics

NFM seamlessly combines the linearity of FM in modelling second-order feature interactions and the non-linearity of neural networks in modelling higher-order feature interactions, and is more expressive than FM since FM can be seen as a special case of NFM without hidden layers.
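The "FM as a special case" claim rests on the Bi-Interaction pooling layer, which is the classic FM identity 0.5·((Σᵢ xᵢvᵢ)² − Σᵢ (xᵢvᵢ)²) kept as a vector instead of summed to a scalar. A minimal sketch (random embeddings, hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

num_features, dim = 6, 4                   # hypothetical sparse-feature count and embedding size
V = rng.normal(size=(num_features, dim))   # feature embedding vectors
x = np.array([1.0, 0, 1.0, 0, 1.0, 0])     # a sparse input vector

def bi_interaction(x, V):
    """Bi-Interaction pooling: 0.5 * ((sum_i x_i v_i)^2 - sum_i (x_i v_i)^2),
    a d-dimensional vector encoding all pairwise feature interactions in O(n*d)."""
    weighted = x[:, None] * V              # x_i * v_i for every feature
    sum_sq = weighted.sum(axis=0) ** 2     # square of the sum
    sq_sum = (weighted ** 2).sum(axis=0)   # sum of the squares
    return 0.5 * (sum_sq - sq_sum)

pooled = bi_interaction(x, V)
# FM's second-order term is just the sum of this vector; NFM instead feeds
# the vector through hidden layers (omitted here) before the final prediction.
fm_second_order = pooled.sum()
```

Summing `pooled` directly recovers FM, which is exactly why NFM with zero hidden layers degenerates to FM.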

Collaborative Denoising Auto-Encoders for Top-N Recommender Systems

It is demonstrated that the proposed model is a generalization of several well-known collaborative filtering models but with more flexible components, and that CDAE consistently outperforms state-of-the-art top-N recommendation methods on a variety of common evaluation metrics.