A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation

@inproceedings{Beel2013ACA,
  title={A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation},
  author={Joeran Beel and Marcel Genzmehr and Stefan Langer and A. N{\"u}rnberger and Bela Gipp},
  booktitle={RepSys '13},
  year={2013}
}
Offline evaluations are the most common evaluation method for research paper recommender systems. However, no thorough discussion on the appropriateness of offline evaluations has taken place, despite some voiced criticism. We conducted a study in which we evaluated various recommendation approaches with both offline and online evaluations. We found that results of offline and online evaluations often contradict each other. We discuss this finding in detail and conclude that offline evaluations… 
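
The paper's core contrast can be illustrated with a minimal sketch (hypothetical data and helper names, not the authors' actual pipeline): offline evaluation scores a recommendation list against a user's held-out history, while online evaluation measures what users actually click, and the two need not agree.

```python
# Sketch: offline precision@k vs. online click-through rate (CTR)
# for the same recommender. All data here is illustrative.

def precision_at_k(recommended, held_out, k=10):
    """Offline metric: fraction of the top-k recommendations that
    appear in the user's held-out (historical) documents."""
    top_k = recommended[:k]
    hits = sum(1 for doc in top_k if doc in held_out)
    return hits / k

def click_through_rate(shown, clicked):
    """Online metric: fraction of displayed recommendations that
    users actually clicked."""
    return len(clicked) / len(shown) if shown else 0.0

# Hypothetical example: both metrics score the same top-10 list,
# but they can (and, per the paper, often do) disagree.
recs = [f"paper_{i}" for i in range(10)]
held_out = {"paper_0", "paper_3"}          # past user library (offline ground truth)
clicked = {"paper_5"}                      # live user clicks (online signal)

print(precision_at_k(recs, held_out))      # 0.2 -> looks decent offline
print(click_through_rate(recs, clicked))   # 0.1 -> a different story online
```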

Citations

A Comparison of Offline Evaluations, Online Evaluations, and User Studies in the Context of Research-Paper Recommender Systems

It is concluded that, in practice, offline evaluations are probably not suitable for evaluating recommender systems, particularly in the domain of research paper recommendation.

Multi-method Evaluation in Scientific Paper Recommender Systems

A scientific paper recommender system (SPRS) prototype that was subject to both offline and user evaluations is presented, and the lessons learnt from the evaluation studies are described.

Comparing Offline and Online Recommender System Evaluations on Long-tail Distributions

By focusing on recommendations of long-tail items, which are usually more interesting for users, it was possible to reduce the bias caused by extremely popular items and to observe better alignment of accuracy results between offline and online evaluations.
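
The long-tail idea can be made concrete with a small sketch; the item names and the 20% head cutoff below are illustrative, not taken from the paper.

```python
from collections import Counter

def long_tail_items(interactions, head_fraction=0.2):
    """Items outside the most popular head_fraction of the catalog,
    ranked by interaction count. The cutoff is illustrative."""
    counts = Counter(item for _, item in interactions)
    ranked = [item for item, _ in counts.most_common()]
    head_size = max(1, int(len(ranked) * head_fraction))
    return set(ranked[head_size:])

def tail_hit_rate(recommendations, test_sets, tail):
    """Hit rate counted only on long-tail relevant items; users whose
    relevant items are all head items are skipped."""
    hits = total = 0
    for user, relevant in test_sets.items():
        tail_relevant = relevant & tail
        if not tail_relevant:
            continue
        total += 1
        if tail_relevant & set(recommendations.get(user, [])):
            hits += 1
    return hits / total if total else 0.0

# Illustrative data: p1 is the popular head item; p2 and p3 form the tail.
events = [("u1", "p1"), ("u2", "p1"), ("u3", "p1"), ("u1", "p2"), ("u2", "p3")]
tail = long_tail_items(events)
print(tail_hit_rate({"u1": ["p2"], "u2": ["p1"]},
                    {"u1": {"p2"}, "u2": {"p3"}}, tail))  # 0.5
```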

Research paper recommender system evaluation: a quantitative literature survey

It is currently not possible to determine which recommendation approaches for academic literature are the most promising, and there is little value in the existence of more than 80 approaches if the best-performing ones remain unknown.

Random Performance Differences Between Online Recommender System Algorithms

The experiments aim to quantify the expected degree of variation in performance that cannot be attributed to differences between systems, and classify and discuss the non-algorithmic causes of performance differences observed.
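
One way to read this result: before attributing a click-through difference to an algorithm, estimate how large a difference two identical systems could show by chance, e.g. via an A/A comparison. A bootstrap sketch under illustrative data (not the paper's exact procedure):

```python
import random

def bootstrap_ctr_diff(clicks_a, clicks_b, n_boot=2000, seed=0):
    """Estimate a ~95% interval for the CTR difference between two arms.
    clicks_a/clicks_b are 0/1 lists (one entry per displayed recommendation).
    In an A/A test both arms run the SAME algorithm, so the spread seen
    here is pure noise, the baseline against which A/B gaps must be judged."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each arm with replacement and recompute the CTR gap.
        sample_a = [rng.choice(clicks_a) for _ in clicks_a]
        sample_b = [rng.choice(clicks_b) for _ in clicks_b]
        diffs.append(sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# Hypothetical A/A data: the same algorithm on both arms, roughly 6% CTR.
arm_a = [1] * 60 + [0] * 940
arm_b = [1] * 55 + [0] * 945
print(bootstrap_ctr_diff(arm_a, arm_b))  # interval straddles 0 -> just noise
```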

The Comparability of Recommender System Evaluations and Characteristics of Docear's Users

This paper shows that reporting demographic and usage-based data is crucial for creating meaningful evaluations of Docear's recommender system; it sets previous evaluations into context and helps others compare their results with the authors'.

Meta-analysis of evaluation methods and metrics used in context-aware scholarly recommender systems

Meta-analyses of the evaluation methods and metrics of 67 studies on context-aware scholarly recommender systems from 2000 to 2014 show that offline evaluation methods are used more commonly than online evaluations and user studies, and report the highest rate of success.

Survey on Evaluation of Recommender Systems

Recommender Systems (RSs) can be found in many modern applications that expose the user to huge collections of items; they help the user decide on appropriate items and ease the task of finding relevant items.

Research Paper Recommender System Evaluation Using Coverage

A range of evaluation metrics and measures, as well as some approaches used for evaluating recommendation systems, are reviewed, showing large differences in recommendation accuracy across frameworks and strategies.
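
In its simplest catalog form, coverage is straightforward to compute; a minimal sketch with illustrative names:

```python
def catalog_coverage(recommendation_lists, catalog):
    """Fraction of the catalog that the system ever recommends.
    recommendation_lists: iterable of per-user recommendation lists.
    Low coverage means the recommender serves a narrow slice of items,
    however accurate those recommendations may be."""
    recommended = set()
    for recs in recommendation_lists:
        recommended.update(recs)
    return len(recommended & set(catalog)) / len(catalog)

# Illustrative: three users, a catalog of five papers, three ever recommended.
lists = [["p1", "p2"], ["p2", "p3"], ["p1", "p2"]]
print(catalog_coverage(lists, ["p1", "p2", "p3", "p4", "p5"]))  # 0.6
```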

Measuring the Business Value of Recommender Systems

A review of existing publications on field tests of recommender systems, and of the business-related performance measures used in such real-world deployments, indicates that various open questions remain regarding the realistic quantification of recommenders' business effects and the performance assessment of recommendation algorithms in academia.
...

References

Showing 1-10 of 27 references

Research paper recommender system evaluation: a quantitative literature survey

It is currently not possible to determine which recommendation approaches for academic literature are the most promising, and there is little value in the existence of more than 80 approaches if the best-performing ones remain unknown.

Evaluating collaborative filtering recommender systems

The key decisions in evaluating collaborative filtering recommender systems are reviewed: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole.

What Recommenders Recommend - An Analysis of Accuracy, Popularity, and Sales Diversity Effects

This first analysis on different data sets shows that some RS algorithms, while able to generate highly accurate predictions, concentrate their top-10 recommendations on a very small fraction of the product catalog or have a stronger bias toward recommending relatively popular items than others.

A Survey of Accuracy Evaluation Metrics of Recommendation Tasks

This paper reviews the proper construction of offline experiments for deciding on the most appropriate algorithm, discusses three important tasks of recommender systems, and classifies a set of appropriate, well-known evaluation metrics for each task.
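
As one concrete instance of a task-specific accuracy metric (an illustrative choice, not necessarily one of the paper's classifications), ranking tasks are often scored with NDCG, which discounts relevant items by their position in the list:

```python
import math

def ndcg_at_k(ranked, relevance, k=10):
    """Normalized Discounted Cumulative Gain for one ranked list.
    relevance maps item -> graded relevance (0 if absent)."""
    gains = [relevance.get(item, 0) for item in ranked[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Illustrative: the single highly relevant item sits at rank 3,
# so the score is well below the ideal ordering's 1.0.
print(ndcg_at_k(["a", "b", "c"], {"c": 3, "b": 1}))  # ~0.59
```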

Beyond accuracy: evaluating recommender systems by coverage and serendipity

It is argued that the new ways of measuring coverage and serendipity reflect the quality impression perceived by the user better than previous metrics, thus leading to enhanced user satisfaction.
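
A common way to operationalize serendipity, one formulation among several and not necessarily the paper's own, is to reward recommendations that are relevant but would not come from an obvious baseline such as a popularity recommender:

```python
def serendipity(recommended, baseline, relevant):
    """Fraction of recommendations that are relevant AND not produced
    by a trivial baseline recommender. One common formulation; the
    paper's exact definition may differ."""
    unexpected_hits = [r for r in recommended
                       if r in relevant and r not in baseline]
    return len(unexpected_hits) / len(recommended) if recommended else 0.0

# Illustrative: the popularity baseline already covers p1 and p2,
# so only the relevant, non-obvious p4 counts toward serendipity.
print(serendipity(["p1", "p4", "p5"], {"p1", "p2"}, {"p1", "p4"}))  # ~0.33
```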

Explaining the user experience of recommender systems

This paper proposes a framework that takes a user-centric approach to recommender system evaluation that links objective system aspects to objective user behavior through a series of perceptual and evaluative constructs (called subjective system aspects and experience, respectively).

Evaluating Recommendation Systems

This paper discusses how to compare recommenders based on a set of properties that are relevant for the application, and focuses on comparative studies, where a few algorithms are compared using some evaluation metric, rather than absolute benchmarking of algorithms.

The Impact of Demographics (Age and Gender) and Other User-Characteristics on Evaluating Recommender Systems

It was found that elderly users clicked on recommendations more often than younger ones, and that future research articles on recommender systems should report detailed data on their users to make results more comparable.

Introducing Docear's research paper recommender system

This demo paper presents Docear's research paper recommender system, part of an academic literature suite for searching, organizing, and creating research articles, which achieves click-through rates of around 6%, in some scenarios even over 10%.

Recommender systems: from algorithms to user experience

It is argued that evaluating the user experience of a recommender requires a broader set of measures than have been commonly used, and additional measures that have proven effective are suggested.