Offline evaluation options for recommender systems

@article{Caamares2020OfflineEO,
  title={Offline evaluation options for recommender systems},
  author={Roc{\'i}o Ca{\~n}amares and Pablo Castells and Alistair Moffat},
  journal={Information Retrieval Journal},
  year={2020},
  volume={23},
  pages={387-410}
}
We undertake a detailed examination of the steps that make up offline experiments for recommender system evaluation, including the manner in which the available ratings are filtered and split into training and test; the selection of a subset of the available users for the evaluation; the choice of strategy to handle the background effects that arise when the system is unable to provide scores for some items or users; the use of either full or condensed output lists for the purposes of scoring…
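The abstract enumerates the moving parts of an offline experiment. A minimal sketch of such a pipeline, assuming a random hold-out split, a rating threshold of 4 to mark test items as relevant, Precision@k as the metric, and a simple skip policy for users the model cannot score; the model interface, names, and defaults are illustrative, not taken from the paper:

```python
import random
from collections import defaultdict

def holdout_split(ratings, test_fraction=0.2, seed=0):
    """Randomly split (user, item, rating) triples into train and test sets."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

def evaluate(model, ratings, k=10, skip_unscorable_users=True):
    """Offline evaluation: split, train, rank, and average Precision@k over users."""
    train, test = holdout_split(ratings)
    model.fit(train)                              # illustrative model interface

    relevant_by_user = defaultdict(set)
    for user, item, rating in test:
        if rating >= 4:                           # threshold defining "relevant"
            relevant_by_user[user].add(item)

    scores = []
    for user, relevant in relevant_by_user.items():
        recommended = model.recommend(user, n=k)  # ranked item ids
        if not recommended and skip_unscorable_users:
            continue                              # one possible policy for cold users
        scores.append(precision_at_k(recommended, relevant, k))
    return sum(scores) / len(scores) if scores else 0.0
```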

Citations

Exploring Data Splitting Strategies for the Evaluation of Recommendation Models
The results demonstrate that the splitting strategy employed is an important confounding variable that can markedly alter the ranking of recommender systems, making much of the currently published literature non-comparable, even when the same datasets and metrics are used.
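To give a concrete sense of how different the splitting choices can be, here are three commonly used strategies sketched in Python, assuming interactions are (user, item, rating, timestamp) tuples; these are generic examples rather than the exact protocols compared in the paper:

```python
import random

def random_split(interactions, test_fraction=0.2, seed=0):
    """Global random hold-out: ignores time and user boundaries."""
    rng = random.Random(seed)
    data = interactions[:]
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

def temporal_split(interactions, test_fraction=0.2):
    """Global temporal hold-out: the most recent interactions form the test set."""
    data = sorted(interactions, key=lambda x: x[3])   # sort by timestamp
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

def leave_one_out_split(interactions):
    """Per-user leave-one-out: each user's most recent interaction is held out."""
    latest_idx = {}
    for i, (user, item, rating, ts) in enumerate(interactions):
        if user not in latest_idx or ts > interactions[latest_idx[user]][3]:
            latest_idx[user] = i
    held_out = set(latest_idx.values())
    train = [row for i, row in enumerate(interactions) if i not in held_out]
    test = [interactions[i] for i in sorted(held_out)]
    return train, test
```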
New Insights into Metric Optimization for Ranking-based Recommendation
The results challenge the assumption behind the current research practice of optimizing and evaluating the same metric, and point to RBP-based optimization instead as a promising alternative when learning to rank in the recommendation context.
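Rank-biased precision (RBP, Moffat & Zobel) is straightforward to compute; a small sketch, with the persistence parameter p chosen arbitrarily here:

```python
def rbp(relevance, p=0.8):
    """Rank-Biased Precision: (1 - p) * sum_i rel_i * p**(i-1).

    `relevance` is a list of gains in [0, 1] in ranked order; `p` is the user
    persistence parameter (probability of inspecting the next item).
    """
    return (1 - p) * sum(rel * p ** i for i, rel in enumerate(relevance))

# Example: a ranking with relevant items at positions 1 and 3.
print(rbp([1, 0, 1, 0, 0], p=0.8))   # 0.2 * (1 + 0.64) = 0.328
```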
Comparison of online and offline evaluation metrics in Recommender Systems
The goal of this work is to explore recommender systems and methods of evaluating them. The focus is on comparing online and offline approaches to evaluation, as their relationship is highly…
On Offline Evaluation of Recommender Systems
It is shown that accessing different amounts of future data may improve or deteriorate a model's recommendation accuracy, and that more historical data in the training set does not necessarily lead to better recommendation accuracy.
The Simpson's Paradox in the Offline Evaluation of Recommendation Systems
It is shown that the typical offline evaluation of recommender systems suffers from the so-called Simpson's paradox, and a novel evaluation methodology is proposed that takes the confounder, i.e. the deployed system's characteristics, into account.
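A toy numeric example of the paradox (the numbers are invented for illustration and not drawn from the paper): recommender B wins within each stratum of the confounder, yet loses once the strata are pooled:

```python
# Hit counts / user counts for two recommenders, stratified by whether the item
# had been exposed to the user by the previously deployed system (hypothetical).
strata = {
    "exposed_by_deployed_system": {"A": (80, 100), "B": (36, 40)},
    "not_exposed":                {"A": (10, 40),  "B": (30, 100)},
}

for name, counts in strata.items():
    a_hits, a_n = counts["A"]
    b_hits, b_n = counts["B"]
    print(f"{name}: A={a_hits/a_n:.2f}  B={b_hits/b_n:.2f}")   # B wins in both strata

# Aggregated over both strata the ordering flips: A = 0.64, B = 0.47.
a_hits = sum(counts["A"][0] for counts in strata.values())
a_n    = sum(counts["A"][1] for counts in strata.values())
b_hits = sum(counts["B"][0] for counts in strata.values())
b_n    = sum(counts["B"][1] for counts in strata.values())
print(f"aggregate: A={a_hits/a_n:.2f}  B={b_hits/b_n:.2f}")
```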
Offline Evaluation Standards for Recommender Systems
An offline evaluation framework is presented that compiles the primary directives, pitfalls, and knowledge raised in the last five years by representative studies in the recommender systems literature, and a Reliability Score is proposed to quantify how close a given offline evaluation setting is to the idealised framework instantiated for a given domain and task.
Towards Unified Metrics for Accuracy and Diversity for Recommender Systems
This work proposes a novel adaptation to recommender systems of a unified metric derived from one commonly used for search system evaluation, and shows that the metric respects the desired theoretical constraints and behaves as expected when performing offline evaluation.
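The summary does not name the base metric; as one well-known example of a search metric that folds accuracy and diversity into a single score, an alpha-DCG-style computation looks roughly like the following (illustrative only, not the paper's specific proposal):

```python
import math

def alpha_dcg(ranked_items, item_genres, relevant_genres, alpha=0.5, k=10):
    """alpha-DCG-style gain: an item earns credit for each relevant genre it covers,
    but credit for a genre decays by (1 - alpha) each time that genre is repeated,
    so one number rewards both accuracy and diversity."""
    seen = {}                                   # genre -> times already covered
    score = 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        gain = 0.0
        for genre in item_genres.get(item, set()) & relevant_genres:
            gain += (1 - alpha) ** seen.get(genre, 0)
            seen[genre] = seen.get(genre, 0) + 1
        score += gain / math.log2(rank + 1)
    return score
```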
Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently?
Quality metrics used for recommender system evaluation are investigated, and it is found that Precision is the only metric universally understood across papers and libraries, while other metrics may have different interpretations.
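For a concrete sense of how such discrepancies arise, two recall@k variants that appear in different libraries give different numbers on the same ranking; recall is used here only as an illustration, not as one of the specific discrepancies the paper reports:

```python
def recall_at_k(recommended, relevant, k=10):
    """Common definition: hits in the top k divided by the total number of relevant items."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def recall_at_k_capped(recommended, relevant, k=10):
    """Alternative found in some libraries: the denominator is capped at k."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / min(k, len(relevant)) if relevant else 0.0

recommended = list(range(10))                       # top-10 item ids
relevant = set(range(5)) | set(range(100, 115))     # 20 relevant items, 5 retrieved
print(recall_at_k(recommended, relevant, k=10))         # 5 / 20 = 0.25
print(recall_at_k_capped(recommended, relevant, k=10))  # 5 / 10 = 0.50
```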
Evaluating Information Retrieval Systems for Kids
Many perspectives that must be considered when evaluating IRS are explored, and problems faced by researchers who work on IRS for children are discussed, including the lack of evaluation frameworks, limitations of data, and a lack of understanding of user judgments.
A qualitative study of large-scale recommendation algorithms for biomedical knowledge bases
Evaluating the recommendation algorithms in a large-scale biomedical knowledge base, with the goal of identifying the relative weaknesses and strengths of each algorithm, provides unique insights into the performance of recommendation algorithms against the needs of modern-day biomedical researchers.

References

Showing 1-10 of 69 references
Evaluating Recommender Systems
This paper discusses how to compare recommenders based on a set of properties that are relevant for the application, and focuses on comparative studies, where a few algorithms are compared using some evaluation metric, rather than absolute benchmarking of algorithms.
Statistical biases in Information Retrieval metrics for recommender systems
This paper lays out an experimental configuration framework upon which to identify and analyse specific statistical biases arising in the adaptation of Information Retrieval metrics to recommendation tasks, namely sparsity and popularity biases.
Evaluating collaborative filtering recommender systems
The key decisions in evaluating collaborative filtering recommender systems are reviewed: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole.
Training and testing of recommender systems on data missing not at random
It is shown that the absence of ratings carries useful information for improving the top-k hit rate concerning all items, a natural accuracy measure for recommendations, and that two performance measures can be estimated from the data without bias, under mild assumptions, even when ratings are missing not at random (MNAR).
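One way to read "top-k hit rate concerning all items" is to rank every item the user has not seen in training and check whether the held-out positive makes the top k. A sketch of that idea, assuming a model exposing a score(user, item) method (the interface and names are illustrative, and this is not the paper's exact estimator):

```python
def hit_rate_at_k_all_items(model, test_positives, all_items, train_seen, k=20):
    """Top-k hit rate over *all* items: for each held-out positive, rank every
    item the user has not interacted with in training and count a hit when the
    held-out item appears in the top k."""
    hits, total = 0, 0
    for user, positive_item in test_positives:
        candidates = [i for i in all_items if i not in train_seen.get(user, set())]
        ranked = sorted(candidates, key=lambda i: model.score(user, i), reverse=True)
        hits += positive_item in ranked[:k]
        total += 1
    return hits / total if total else 0.0
```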
Collaborative Filtering for Implicit Feedback Datasets
This work identifies unique properties of implicit feedback datasets and proposes treating the data as indications of positive and negative preference associated with vastly varying confidence levels, which leads to a factor model especially tailored for implicit feedback recommenders.
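The core idea can be written down compactly: binarised preferences weighted by confidences that grow with observed activity. A sketch of the weighted objective in NumPy (the model is fitted with alternating least squares in the paper; this snippet only evaluates the loss):

```python
import numpy as np

def implicit_wrmse_loss(R, X, Y, alpha=40.0, reg=0.1):
    """Confidence-weighted loss for the implicit-feedback factor model:
    preferences p_ui = 1[r_ui > 0], confidences c_ui = 1 + alpha * r_ui, and
    sum_ui c_ui * (p_ui - x_u . y_i)^2 plus L2 regularisation."""
    P = (R > 0).astype(float)          # binary preference matrix
    C = 1.0 + alpha * R                # confidence grows with observed activity
    predictions = X @ Y.T              # user factors (n_users x f) times item factors
    loss = np.sum(C * (P - predictions) ** 2)
    loss += reg * (np.sum(X ** 2) + np.sum(Y ** 2))
    return loss
```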
Comparative recommender system evaluation: benchmarking recommendation frameworks
This work compares common recommendation algorithms as implemented in three popular recommendation frameworks, and shows the necessity of clear guidelines when reporting the evaluation of recommender systems to ensure reproducibility and comparability of results.
Offline A/B Testing for Recommender Systems
This work proposes a new counterfactual estimator and provides a benchmark of the different estimators, showing their correlation with business metrics observed by running online A/B tests on a large-scale commercial recommender system.
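The baseline counterfactual estimator in this setting is inverse propensity scoring; the paper's contribution is a refined variant, but a sketch of the textbook IPS estimate, with names chosen for illustration, conveys the idea:

```python
def ips_estimate(logs, target_policy, capping=None):
    """Inverse-propensity-scoring estimate of a new policy's reward from logged data:
    the mean of reward * target_prob / logging_prob, optionally with weight capping.

    `logs` is an iterable of (context, action, reward, logging_prob) tuples and
    `target_policy(context, action)` returns the new policy's probability of
    taking `action` in `context` (both names are illustrative).
    """
    total, n = 0.0, 0
    for context, action, reward, logging_prob in logs:
        weight = target_policy(context, action) / logging_prob
        if capping is not None:
            weight = min(weight, capping)   # capped variants trade bias for variance
        total += reward * weight
        n += 1
    return total / n if n else 0.0
```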
Offline Evaluation to Make Decisions About Playlist Recommendation Algorithms
The results show that, contrary to much of the previous work on this topic, properly conducted offline experiments do correlate well with A/B test results, and moreover that an offline evaluation can be expected to identify the best candidate systems for online testing with high probability.
Online Evaluation for Information Retrieval
This survey provides an overview of online evaluation techniques for information retrieval, and shows how online evaluation is used for controlled experiments, segmenting them into experiment designs that allow absolute or relative quality assessments.
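Interleaving is the canonical relative-comparison design in this family; a minimal team-draft interleaving sketch, included here only as an illustration and not taken from the survey:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, depth=10, seed=None):
    """Team-draft interleaving: the two rankers alternately pick their best
    not-yet-used item into a shared list, and user clicks are later credited
    to the ranker that contributed each clicked item."""
    rng = random.Random(seed)

    def next_unused(ranking, used):
        for item in ranking:
            if item not in used:
                return item
        return None

    interleaved, team = [], {}
    picks = {"A": 0, "B": 0}
    while len(interleaved) < depth:
        # The ranker with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] < picks["B"]:
            turn = "A"
        elif picks["B"] < picks["A"]:
            turn = "B"
        else:
            turn = rng.choice(["A", "B"])
        ranking = ranking_a if turn == "A" else ranking_b
        item = next_unused(ranking, team)
        if item is None:                       # this ranker is exhausted; try the other
            turn = "B" if turn == "A" else "A"
            item = next_unused(ranking_b if turn == "B" else ranking_a, team)
            if item is None:
                break
        team[item] = turn
        interleaved.append(item)
        picks[turn] += 1
    return interleaved, team
```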