Common Pitfalls in Training and Evaluating Recommender Systems

Hung-Hsuan Chen, Chu-An Chung, Hsin-Chien Huang, Wen Tsui
SIGKDD Explor.
This paper formally presents four common pitfalls in training and evaluating recommendation algorithms for information systems. Specifically, we show that separating server logs into training and test data for model generation and model evaluation can be problematic if the training and test data are selected improperly. In addition, we show that click-through rate -- a common metric to measure and compare the performance of different recommendation algorithms -- may not be a good…

On Offline Evaluation of Recommender Systems

It is shown that access to different amounts of future data may improve or degrade a model's recommendation accuracy, and that more historical data in the training set does not necessarily lead to better recommendation accuracy.

A Critical Study on Data Leakage in Recommender System Offline Evaluation

A comprehensive and critical analysis of the data leakage issue in recommender system offline evaluation is provided, which shows that data leakage does impact models’ recommendation accuracy and proposes a timeline scheme, which calls for a revisit of the recommendation model design.
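
One way to picture the leakage concern is a split on a single global cutoff time, so that no interaction used for training happens after any interaction used for testing. The following is a minimal sketch under that reading, not the scheme proposed in the paper; the `(user_id, item_id, timestamp)` tuple layout and the function name `split_by_global_time` are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout: (user_id, item_id, unix_timestamp)
Interaction = Tuple[str, str, int]


def split_by_global_time(
    interactions: List[Interaction], cutoff: int
) -> Tuple[List[Interaction], List[Interaction]]:
    """Put everything before `cutoff` in train and the rest in test."""
    train = [r for r in interactions if r[2] < cutoff]
    test = [r for r in interactions if r[2] >= cutoff]
    return train, test


logs = [("u1", "a", 10), ("u2", "b", 20), ("u1", "c", 30), ("u2", "d", 40)]
train, test = split_by_global_time(logs, cutoff=25)
```

By contrast, a per-user "leave last item out" split can place one user's late training interactions after another user's test interaction on the global timeline.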

From Counter-intuitive Observations to a Fresh Look at Recommender System

This opinion paper discusses the importance of the global timeline of user-item interactions, and tries to answer why the simplest model, popularity, is often ill-defined in academic research and why the popularity baseline is evaluated in this way.

Differentiating Regularization Weights -- A Simple Mechanism to Alleviate Cold Start in Recommender Systems

The proposed methodology is applied to three baseline models -- SVD, SVD++, and NMF -- and it is found that this technique improves the prediction accuracy for all these baseline models and better predicts the ratings on the long-tail items, i.e., the items that were rated/viewed/purchased by few users.

Behavior2Vec: Generating Distributed Representations of Users' Behaviors on Products for Recommender Systems

By leveraging the cosine distance between the distributed representations of the behaviors on items under different contexts, a user’s next clicked or purchased item is predicted more precisely, compared to several baseline methods.

Empirically Testing Deep and Shallow Ranking Models for Click-Through Rate (CTR) prediction

An error analysis is performed to investigate when the deep learning models perform better than simple models and when they do not, and it is found that recommendations based on a simple neighbor-based model, on average, outperform the results generated by deep learning models based on two datasets from e-commerce websites.

Personalized Travel Product Recommendation Based on Embedding of Multi-Behavior Interaction Network and Product Information Knowledge Graph

  • Li-Pin Xiao, Po-Ruey Lei, W. Peng
  • Computer Science
    2020 International Conference on Technologies and Applications of Artificial Intelligence (TAAI)
  • 2020
A hybrid recommendation model is proposed to tackle two challenges in the recommendation system: the cold product issue and the skewed distribution problem, which takes the product information into consideration by using the metadata of products and extracting more features from the textual contents to form a knowledge graph.

Building effective recommender systems for tourists

A novel RS approach is discussed that copes with the specific application constraints of the domain and produces recommendations that better match the true needs of tourists; some significant limitations of current evaluation approaches are also discussed.

Accelerating Matrix Factorization by Overparameterization

It is found that overparameterization can accelerate the optimization of MF with no change in the expressiveness of the learning model, and modern applications on recommendations based on MF or its variants can largely benefit from this discovery.

Experience: Analyzing Missing Web Page Visits and Unintentional Web Page Visits from the Client-side Web Logs

It is found that web logs in Chrome’s browsing history only record 57% of users’ visited websites, and it is shown that sorting popular website categories based on traditional web logs differs from the rankings obtained when including missing visits or excluding unintentional visits.



Evaluating collaborative filtering recommender systems

The key decisions in evaluating collaborative filtering recommender systems are reviewed: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole.

Solving the apparent diversity-accuracy dilemma of recommender systems

This paper introduces a new algorithm specifically to address the challenge of diversity and shows how it can be used to resolve this apparent dilemma when combined in an elegant hybrid with an accuracy-focused algorithm.

Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms

This paper introduces a replay methodology for contextual bandit algorithm evaluation that is completely data-driven and very easy to adapt to different applications and can provide provably unbiased evaluations.
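
As the replay idea is commonly described, a candidate policy is scored only on logged events where its choice matches the action that was actually shown, and the rewards of those matched events are averaged. The sketch below illustrates that idea under loud assumptions: the logged actions are taken to have been chosen uniformly at random, the policy is stateless (it is not updated during replay), and the event layout `(context, logged_action, reward)` and the name `replay_estimate` are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical logged event: (context, logged_action, reward)
Event = Tuple[str, str, float]


def replay_estimate(
    events: Iterable[Event], policy: Callable[[str], str]
) -> Optional[float]:
    """Average reward over events where the policy agrees with the log."""
    matched = 0
    total_reward = 0.0
    for context, logged_action, reward in events:
        # Events where the policy would have shown something else are skipped,
        # because the log holds no feedback for that alternative.
        if policy(context) == logged_action:
            matched += 1
            total_reward += reward
    return total_reward / matched if matched else None


events = [
    ("sports", "a1", 1.0),
    ("sports", "a2", 0.0),
    ("news", "a1", 0.0),
    ("news", "a2", 1.0),
]


def always_a1(context: str) -> str:
    return "a1"


def by_context(context: str) -> str:
    return "a1" if context == "sports" else "a2"
```

With this toy log, `replay_estimate(events, always_a1)` averages over the two events that logged `a1`, while `replay_estimate(events, by_context)` averages over the two events that match the context-dependent choice.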

Item-based top-N recommendation algorithms

This article presents one class of model-based recommendation algorithms that first determines the similarities between the various items and then uses them to identify the set of items to be recommended, and shows that these item-based algorithms are up to two orders of magnitude faster than the traditional user-neighborhood based recommender systems and provide recommendations with comparable or better quality.
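
The two steps named above (compute item-item similarities, then use them to pick items) can be sketched as follows. This is an illustrative toy, not the article's implementation: it assumes binary interactions, uses cosine similarity over the sets of users who touched each item, and the names `item_similarities` and `recommend` are made up for the example.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, List, Set


def item_similarities(user_items: Dict[str, Set[str]]) -> Dict[str, Dict[str, float]]:
    """Cosine similarity between items, based on which users interacted with them."""
    item_users: Dict[str, Set[str]] = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)
    sims: Dict[str, Dict[str, float]] = defaultdict(dict)
    for i in item_users:
        for j in item_users:
            if i == j:
                continue
            overlap = len(item_users[i] & item_users[j])
            if overlap:
                sims[i][j] = overlap / sqrt(len(item_users[i]) * len(item_users[j]))
    return sims


def recommend(user: str, user_items: Dict[str, Set[str]],
              sims: Dict[str, Dict[str, float]], n: int) -> List[str]:
    """Score unseen items by summed similarity to the user's items; return the top n."""
    seen = user_items[user]
    scores: Dict[str, float] = defaultdict(float)
    for i in seen:
        for j, s in sims.get(i, {}).items():
            if j not in seen:
                scores[j] += s
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda j: (-scores[j], j))[:n]


user_items = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c", "d"},
    "u4": {"a"},
}
sims = item_similarities(user_items)
top = recommend("u4", user_items, sims, n=2)
```

Because the similarity table depends only on items, it can be precomputed once and reused for every user, which is the usual argument for the speed of item-based methods.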

Item popularity and recommendation accuracy

A new accuracy measure is defined that has the desirable property of providing nearly unbiased estimates concerning recommendation accuracy and also motivates a refinement for training collaborative-filtering approaches.

Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study

This paper proposes to address the problem of estimating online metrics that depend on user feedback using causal inference techniques, under the contextual-bandit framework, and obtains very promising results that suggest the wide applicability of these techniques.

RecSys Challenge 2015 and the YOOCHOOSE Dataset

The 2015 ACM Recommender Systems Challenge offered the opportunity to work on a large-scale e-commerce dataset from a big European retailer that uses YOOCHOOSE's recommender system as a service; the challenge attracted 850 teams from 49 countries, which submitted a total of 5,437 solutions.

Diversifying search results

This work proposes an algorithm that well approximates this objective in general, and is provably optimal for a natural special case, and generalizes several classical IR metrics, including NDCG, MRR, and MAP, to explicitly account for the value of diversification.

Recommendations: Item-to-Item Collaborative Filtering

This work compares three common approaches to solving the recommendation problem (traditional collaborative filtering, cluster models, and search-based methods) with the authors' own algorithm, which is called item-to-item collaborative filtering.

Field-aware Factorization Machines for CTR Prediction

This paper establishes FFMs as an effective method for classifying large sparse data, including those from CTR prediction, proposes efficient implementations for training FFMs, and comprehensively analyzes FFMs.