Offline evaluation options for recommender systems

Rocío Cañamares, Pablo Castells and Alistair Moffat. Information Retrieval Journal.
We undertake a detailed examination of the steps that make up offline experiments for recommender system evaluation, including the manner in which the available ratings are filtered and split into training and test sets; the selection of a subset of the available users for the evaluation; the choice of strategy to handle the background effects that arise when the system is unable to provide scores for some items or users; the use of either full or condensed output lists for the purposes of scoring…

Offline recommender system evaluation: Challenges and new directions

This work recaps and reflects on the development and current status of recommender system evaluation, providing an updated perspective on the adaptation of IR principles, procedures and metrics, and on the implications of those techniques when applied to recommender systems.

Where Do We Go From Here? Guidelines For Offline Recommender Evaluation

This paper examines four larger issues in recommender system research, regarding uncertainty estimation, generalization, hyperparameter optimization and dataset pre-processing, to arrive at a set of guidelines, and presents TrainRec, a lightweight and flexible toolkit for offline training and evaluation of recommender systems that implements these guidelines.

On Target Item Sampling in Offline Recommender System Evaluation

It is found that comparative evaluation using reduced target sets contradicts in many cases the corresponding outcome using large targets, and a principled explanation for these disagreements is provided.
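To illustrate why reduced target sets can disagree with full-catalog evaluation, the following sketch uses entirely hypothetical model scores and a made-up `hit_at_k` helper: a held-out item that misses the top 10 against the full catalog can easily enter the top 10 against a 99-item sampled target set.

```python
def hit_at_k(scores, true_item, negatives, k=10):
    """True if the held-out item ranks in the top k against the negatives."""
    beaten_by = sum(1 for i in negatives if scores[i] > scores[true_item])
    return beaten_by < k

n_items = 1000
# Hypothetical model scores: item i scores n_items - i, so low ids rank high.
scores = {i: float(n_items - i) for i in range(n_items)}
true_item = 999                # the held-out test item
scores[true_item] = 950.5      # the model ranks it ~51st in the full catalog

full_negatives = [i for i in range(n_items) if i != true_item]
sampled_negatives = list(range(10, 1000, 10))  # a 99-item reduced target set

full_hit = hit_at_k(scores, true_item, full_negatives)        # 50 items beat it
sampled_hit = hit_at_k(scores, true_item, sampled_negatives)  # only 4 beat it
print(full_hit, sampled_hit)  # False True
```

The reversal follows directly from the candidate-set size: the sample contains only a thinned-out fraction of the items that outrank the held-out one.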

Evaluating Recommender Systems: Survey and Framework

The FEVR framework provides a structured foundation for adopting adequate evaluation configurations that encompass this required multi-facetedness, and a basis for advancing the field.

Exploring Data Splitting Strategies for the Evaluation of Recommendation Models

The results demonstrate that the splitting strategy employed is an important confounding variable that can markedly alter the ranking of recommender systems, making much of the currently published literature non-comparable, even when the same datasets and metrics are used.
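The leakage mechanism behind this confounding can be seen in a toy sketch. The interaction log below is entirely hypothetical; it contrasts a temporal split (train on the past, test on the future) with a random 80/20 split, under which training data postdates some test interactions.

```python
import random

# Toy interaction log of (user, item, timestamp) tuples; hypothetical data.
random.seed(42)
log = [(u, random.randrange(50), t) for t in range(100) for u in range(5)]

# Temporal split: everything before the cut-off is training, the rest is test.
cut = 80
train_t = [r for r in log if r[2] < cut]
test_t  = [r for r in log if r[2] >= cut]

# Random split: 80/20 regardless of time.
shuffled = log[:]
random.shuffle(shuffled)
n_train = int(0.8 * len(log))
train_r, test_r = shuffled[:n_train], shuffled[n_train:]

# Under the random split, training interactions postdate test interactions,
# so the model effectively peeks at the future.
min_test_t = min(r[2] for r in test_r)
leaks = sum(1 for r in train_r if r[2] >= min_test_t)
print(len(train_t), len(test_t), leaks > 0)
```

A model evaluated under the random split is answering a different (and easier) question than one evaluated under the temporal split, which is one way the ranking of systems can flip between protocols.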

New Insights into Metric Optimization for Ranking-based Recommendation

The results challenge the assumption behind the current research practice of optimizing and evaluating the same metric, and point to RBP-based optimization instead as a promising alternative when learning to rank in the recommendation context.

Comparison of online and offline evaluation metrics in Recommender Systems

A new approach to measuring recall is introduced to better reflect the sequential nature of interactions and their non-random distribution, and to increase correlation with the click-through rate.

On Offline Evaluation of Recommender Systems

It is shown that access to different amounts of future data may improve or deteriorate a model's recommendation accuracy, and that more historical data in the training set does not necessarily lead to better recommendation accuracy.

The Simpson's Paradox in the Offline Evaluation of Recommendation Systems

This article shows that the typical offline evaluation of recommender systems suffers from the so-called Simpson's paradox, and proposes a novel evaluation methodology that takes into account the confounder, i.e., the deployed system’s characteristics.
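The reversal can be reproduced with a toy example. The hit counts below are entirely hypothetical (they follow the classic Simpson's-paradox pattern), with user strata standing in for the confounding exposure induced by the deployed system:

```python
from fractions import Fraction as F

# Hypothetical (hits, trials) per user stratum for two recommenders; the
# stratum (how heavily the deployed system exposed these users) is the
# confounder.
results = {
    "A": {"light": (81, 87),   "heavy": (192, 263)},
    "B": {"light": (234, 270), "heavy": (55, 80)},
}

def rate(hits, trials):
    return F(hits, trials)  # exact rationals avoid float-comparison noise

# Stratified comparison: A wins within every stratum ...
for s in ("light", "heavy"):
    assert rate(*results["A"][s]) > rate(*results["B"][s])

# ... yet the pooled comparison reverses, because B was evaluated mostly on
# the easy ("light") stratum: Simpson's paradox.
pooled = {m: rate(sum(h for h, _ in d.values()), sum(t for _, t in d.values()))
          for m, d in results.items()}
print(pooled["A"] < pooled["B"])  # True
```

Stratifying by the confounder (here, exposure level) rather than pooling is exactly the kind of correction such an evaluation methodology calls for.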

A Revisiting Study of Appropriate Offline Evaluation for Top-N Recommendation Algorithms

This work presents a large-scale, systematic study of six important factors, grouped into three aspects, for evaluating recommender systems, and provides several suggested settings that are especially important for performance comparison.

Evaluating Recommender Systems

This paper discusses how to compare recommenders based on a set of properties that are relevant for the application, and focuses on comparative studies, where a few algorithms are compared using some evaluation metric, rather than absolute benchmarking of algorithms.

Statistical biases in Information Retrieval metrics for recommender systems

This paper lays out an experimental configuration framework upon which to identify and analyse specific statistical biases arising in the adaptation of Information Retrieval metrics to recommendation tasks, namely sparsity and popularity biases.

Evaluating collaborative filtering recommender systems

The key decisions in evaluating collaborative filtering recommender systems are reviewed: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole.

Training and testing of recommender systems on data missing not at random

It is shown that the absence of ratings carries useful information for improving the top-k hit rate over all items, a natural accuracy measure for recommendations, and that two performance measures can be estimated from the data without bias, under mild assumptions, even when ratings are missing not at random (MNAR).

Collaborative Filtering for Implicit Feedback Datasets

This work identifies unique properties of implicit feedback datasets and proposes treating the data as indication of positive and negative preference associated with vastly varying confidence levels, which leads to a factor model which is especially tailored for implicit feedback recommenders.
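The core of that treatment is a mapping from raw implicit counts to a binary preference with an attached confidence weight (in the paper's notation, p_ui and c_ui = 1 + α·r_ui). A minimal sketch of the mapping, with made-up play counts:

```python
alpha = 40.0  # confidence scaling constant; 40 is the value suggested in the paper

def to_preference_confidence(r):
    """Map a raw implicit count r_ui to (p_ui, c_ui)."""
    p = 1.0 if r > 0 else 0.0   # binary preference: observed means "liked"
    c = 1.0 + alpha * r         # confidence grows with observed activity;
                                # unobserved pairs still get minimal weight 1
    return p, c

# Hypothetical play counts for one user over three items; 0 = never observed.
counts = [3, 0, 1]
pairs = [to_preference_confidence(r) for r in counts]
print(pairs)  # [(1.0, 121.0), (0.0, 1.0), (1.0, 41.0)]
```

The factor model then minimizes a confidence-weighted squared error, sum over (u, i) of c_ui·(p_ui − x_u·y_i)², so frequent interactions dominate the fit while zeros contribute only weak negative evidence.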

Comparative recommender system evaluation: benchmarking recommendation frameworks

This work compares common recommendation algorithms as implemented in three popular recommendation frameworks and shows the necessity of clear guidelines when reporting evaluation of recommender systems to ensure reproducibility and comparison of results.

Offline A/B Testing for Recommender Systems

This work proposes a new counterfactual estimator and provides a benchmark of the different estimators showing their correlation with business metrics observed by running online A/B tests on a large-scale commercial recommender system.
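As background for what such counterfactual estimators do, the sketch below implements the standard inverse propensity scoring (IPS) baseline, not the capped estimator the paper proposes; the policies, rewards and log are all hypothetical. It estimates the click rate a target policy would achieve online, using only a log collected under the deployed policy.

```python
import random

random.seed(1)

items = ["a", "b", "c"]
# Hypothetical logging policy (the deployed recommender) and target policy.
pi_log    = {"a": 0.6, "b": 0.3, "c": 0.1}
pi_target = {"a": 0.2, "b": 0.3, "c": 0.5}
true_reward = {"a": 0.1, "b": 0.2, "c": 0.4}  # per-item click probability

# Simulate a click log collected under the deployed policy.
log = []
for _ in range(200_000):
    a = random.choices(items, weights=[pi_log[i] for i in items])[0]
    r = 1.0 if random.random() < true_reward[a] else 0.0
    log.append((a, r))

# IPS: reweight each logged reward by how much more (or less) often the
# target policy would have shown that item than the logging policy did.
ips = sum(r * pi_target[a] / pi_log[a] for a, r in log) / len(log)

true_value = sum(pi_target[i] * true_reward[i] for i in items)
print(true_value, ips)  # ips should be close to true_value = 0.28
```

The large importance weight on rarely-logged item "c" (0.5/0.1 = 5) hints at the variance problem that motivates capped and self-normalized variants of this estimator.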

Offline Evaluation to Make Decisions About Playlist Recommendation Algorithms

The results show that, contrary to much of the previous work on this topic, properly conducted offline experiments do correlate well with A/B test results, and moreover that an offline evaluation can be expected to identify the best candidate systems for online testing with high probability.

Online Evaluation for Information Retrieval

This survey provides an overview of online evaluation techniques for information retrieval, and shows how online evaluation is used for controlled experiments, segmenting them into experiment designs that allow absolute or relative quality assessments.

Characterization of Fair Experiments for Recommender System Evaluation – A Formal Analysis

This paper addresses the question from the standpoint of experimental fairness: the extent to which the outcome of a comparative experiment matches the underlying truth and is not biased a priori toward favoring a particular algorithmic approach.