Corpus ID: 237503098

Statistical Inference: The Missing Piece of RecSys Experiment Reliability Discourse

Ngozi I. Ihemelandu and Michael D. Ekstrand
This paper calls attention to the missing component of the recommender system evaluation process: Statistical Inference. There is active research in several components of the recommender system evaluation process: selecting baselines, standardizing benchmarks, and target item sampling. However, there has not yet been significant work on the role and use of statistical inference for analyzing recommender system evaluation results. In this paper, we argue that the use of statistical inference is… 
1 Citation


Report on the 1st workshop on the perspectives on the evaluation of recommender systems (PERSPECTIVES 2021) at RecSys 2021
The primary goal of the workshop was to capture the current state of evaluation from different, and perhaps even diverging or contradictory, perspectives.


References

Statistical biases in Information Retrieval metrics for recommender systems
This paper lays out an experimental configuration framework upon which to identify and analyse specific statistical biases arising in the adaptation of Information Retrieval metrics to recommendation tasks, namely sparsity and popularity biases.
Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors
With such a large supply of IR evaluation data and full knowledge of the null hypotheses, this study is finally in a position to evaluate how well statistical significance tests really behave with IR data and to make sound recommendations for practitioners.
How do Metric Score Distributions affect the Type I Error Rate of Statistical Significance Tests in Information Retrieval?
This paper analyzes the robustness of statistical tests to different factors, identifying the conditions under which they behave well or poorly with respect to the Type I error rate, and suggests that differences between the Wilcoxon test and the t-test may be explained by the skewness of score differences.
Multiple testing in statistical analysis of systems-based information retrieval experiments
This work demonstrates, both mathematically and practically, how to model a set of IR experiments for analysis, and shows that accounting for multiple comparisons can cause p-values from statistical hypothesis tests to increase by orders of magnitude.
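The multiple-comparisons effect summarized above can be sketched with a Holm-Bonferroni adjustment; the sketch below uses illustrative p-values, not data from the paper.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Each sorted p-value is scaled by the number of hypotheses not yet
        # rejected; the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted

# Five hypothetical per-comparison p-values from one experiment family:
raw = [0.001, 0.01, 0.02, 0.04, 0.3]
print([round(p, 6) for p in holm_adjust(raw)])  # → [0.005, 0.04, 0.06, 0.08, 0.3]
```

After adjustment, two of the three comparisons that looked significant at the 0.05 level no longer are, which is the inflation the paper warns about.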
Should I Follow the Crowd?: A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems
This paper builds a crowdsourced dataset free of the usual biases displayed by common publicly available data, and uses it to illustrate contradictions between the accuracy that would be measured in a common biased offline experimental setting and the actual accuracy measured from unbiased observations.
Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation
A new methodology for assessing the behavior of significance tests in typical ranking tasks is presented; the results strongly suggest that the Wilcoxon test is the most reliable and that IR practitioners should adopt it as the reference tool for assessing differences between IR systems.
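As a concrete reference point for the kind of test compared in such simulations, here is a minimal exact two-sided Wilcoxon signed-rank test for a small paired sample. It is a sketch under simplifying assumptions (no zero differences, no ties among absolute differences), and the score differences are made up.

```python
from itertools import product

def wilcoxon_exact(diffs):
    """Exact two-sided Wilcoxon signed-rank test for a small paired sample.

    Simplifying assumptions: no zero differences and no ties among the
    absolute differences (real implementations must handle both).
    """
    n = len(diffs)
    # Rank the differences by absolute value (rank 1 = smallest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    # W+ is the sum of the ranks of the positive differences.
    w_plus = sum(rank[i] for i in range(n) if diffs[i] > 0)
    # Under H0, each of the 2^n sign patterns is equally likely; the
    # two-sided p-value counts patterns at least as extreme as observed.
    mu = n * (n + 1) / 4
    extreme = sum(
        abs(sum(r for s, r in zip(signs, rank) if s) - mu) >= abs(w_plus - mu)
        for signs in product([0, 1], repeat=n)
    )
    return w_plus, extreme / 2 ** n

# Per-query metric differences between two hypothetical systems:
print(wilcoxon_exact([0.02, 0.05, -0.01, 0.04, 0.03, 0.06]))  # → (20, 0.0625)
```

The exact enumeration is only feasible for small n; for larger samples a normal approximation or an off-the-shelf implementation would be used instead.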
Statistical reform in information retrieval?
This paper offers suggestions on how to report effect sizes and confidence intervals along with p-values when comparing IR systems using test collections; doing so will make IR papers more informative and help researchers form more reliable conclusions that "add up".
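A minimal sketch of that reporting style for a paired comparison of two systems; the per-topic scores and the hard-coded t critical value for df = 9 are illustrative assumptions, not material from the paper.

```python
import math
import statistics

# Hypothetical per-topic metric scores for two systems (paired by topic).
system_a = [0.31, 0.42, 0.38, 0.45, 0.29, 0.40, 0.36, 0.44, 0.33, 0.41]
system_b = [0.28, 0.39, 0.37, 0.40, 0.30, 0.35, 0.33, 0.41, 0.31, 0.38]
diffs = [a - b for a, b in zip(system_a, system_b)]

n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)   # sample standard deviation of differences
se = sd_diff / math.sqrt(n)

t_crit = 2.262  # two-sided 95% critical value of Student's t, df = 9
ci_low, ci_high = mean_diff - t_crit * se, mean_diff + t_crit * se
effect_size = mean_diff / sd_diff   # Cohen's d for paired differences

print(f"mean diff = {mean_diff:.4f}, "
      f"95% CI = [{ci_low:.4f}, {ci_high:.4f}], d = {effect_size:.2f}")
```

Reporting the interval and effect size alongside any p-value tells the reader not just whether a difference is detectable but how large it plausibly is.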
A Probabilistic Reformulation of Memory-Based Collaborative Filtering: Implications on Popularity Biases
A probabilistic formulation that gives rise to a formal version of heuristic k-nearest-neighbor (kNN) collaborative filtering provides a principled explanation of why kNN is an effective recommendation strategy, and identifies a key condition for this to be the case.
Toward identification and adoption of best practices in algorithmic recommender systems research
This work aims to address a growing concern that the Recommender Systems research community is facing a crisis in which a significant number of research papers lack the rigor and evaluation needed to be properly judged and therefore have little to contribute to collective knowledge.
Sturgeon and the Cool Kids: Problems with Random Decoys for Top-N Recommender Evaluation
This work explores the random decoy strategy through both a theoretical treatment and an empirical study, but finds little evidence to guide its tuning and shows that it has complex and deleterious interactions with popularity bias.