Corpus ID: 237503098

Statistical Inference: The Missing Piece of RecSys Experiment Reliability Discourse

Ngozi I. Ihemelandu and Michael D. Ekstrand
This paper calls attention to the missing component of the recommender system evaluation process: statistical inference. There is active research in several components of the recommender system evaluation process: selecting baselines, standardizing benchmarks, and target item sampling. However, there has not yet been significant work on the role and use of statistical inference for analyzing recommender system evaluation results. In this paper, we argue that the use of statistical inference is…

Using score distributions to compare statistical significance tests for information retrieval evaluation
It is argued here that current methods for assessing the reliability of statistical tests suffer from some methodological weaknesses, and a novel way to study significance tests for retrieval evaluation is proposed, using Score Distributions.
Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors
Having such a large supply of IR evaluation data with full knowledge of the null hypotheses, this study is finally in a position to evaluate how well statistical significance tests really behave with IR data, and make sound recommendations for practitioners.
Statistical reform in information retrieval?
Suggestions on how to report effect sizes and confidence intervals along with p-values, in the context of comparing IR systems using test collections, will make IR papers more informative and help researchers form more reliable conclusions that "add up".
How do Metric Score Distributions affect the Type I Error Rate of Statistical Significance Tests in Information Retrieval?
Statistical significance tests are the main tool that IR practitioners use to determine the reliability of their experimental evaluation results. The question of which test behaves best with IR…
Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation
A new methodology for assessing the behavior of significance tests in typical ranking tasks is presented, and results conclusively suggest that the Wilcoxon test is the most reliable test and IR practitioners should adopt it as the reference tool to assess differences between IR systems.
Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison
This paper systematically reviews 85 recommendation papers published at eight top-tier conferences, creates benchmarks with standardized procedures, and provides the performance of seven well-tuned state-of-the-art methods across six metrics on six widely used datasets as a reference for later study.
Estimating Error and Bias in Offline Evaluation Results
It is found that missing data in the rating or observation process causes the evaluation protocol to systematically mis-estimate metric values, and in some cases erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender.
On Target Item Sampling in Offline Recommender System Evaluation
It is found that comparative evaluation using reduced target sets contradicts in many cases the corresponding outcome using large targets, and a principled explanation for these disagreements is provided.
Are we really making much progress? A worrying analysis of recent neural recommendation approaches
A systematic analysis of algorithmic proposals for top-n recommendation tasks that were presented at top-level research conferences in recent years sheds light on a number of potential problems in today's machine learning scholarship and calls for improved scientific practices in this area.
On the Difficulty of Evaluating Baselines: A Study on Recommender Systems
It is shown that running baselines properly is difficult and empirical findings in research papers are questionable unless they were obtained on standardized benchmarks where baselines have been tuned extensively by the research community.