From Evaluating to Forecasting Performance: How to Turn Information Retrieval, Natural Language Processing and Recommender Systems into Predictive Sciences (Dagstuhl Perspectives Workshop 17442)

@article{DagstuhlManifestos17442,
  title={From Evaluating to Forecasting Performance: How to Turn Information Retrieval, Natural Language Processing and Recommender Systems into Predictive Sciences (Dagstuhl Perspectives Workshop 17442)},
  author={N. Ferro and Norbert Fuhr and Gregory Grefenstette and Joseph A. Konstan and Pablo Castells and Elizabeth M. Daly and Thierry Declerck and Michael D. Ekstrand and Werner Geyer and Julio Gonzalo and Tsvi Kuflik and Krister Lind{\'e}n and Bernardo Magnini and Jian-Yun Nie and R. Perego and Bracha Shapira and Ian Soboroff and Nava Tintarev and Karin M. Verspoor and Martijn C. Willemsen and Justin Zobel},
  journal={Dagstuhl Manifestos}
}
We describe the state of the art in performance modeling and prediction for Information Retrieval (IR), Natural Language Processing (NLP), and Recommender Systems (RecSys), along with its strengths and shortcomings. We present a framework for further research, identifying five major problem areas: understanding measures, performance analysis, making underlying assumptions explicit, identifying application features that determine performance, and developing prediction models describing the…


Human-Centered Recommender Systems: Origins, Advances, Challenges, and Opportunities
This article reviews 25 years of recommender systems research from a human-centered perspective, examining the interface and algorithm studies that advanced the authors' understanding of how system designs can be tailored to users' objectives and needs.
Report on GLARE 2018: 1st Workshop on Generalization in Information Retrieval
This is a report on the first edition of the International Workshop on Generalization in Information Retrieval (GLARE 2018), co-located with the 27th ACM International Conference on Information and Knowledge Management (CIKM 2018).
Offline evaluation options for recommender systems
It is shown that varying the split between training and test data, changing the evaluation metric, how target items are selected, or how empty recommendations are dealt with can give rise to comparisons that are vulnerable to misinterpretation, and may lead to different or even opposite outcomes depending on the exact combination of settings used.
Improving Accountability in Recommender Systems Research Through Reproducibility
This work argues that, by facilitating reproducibility of recommender system experimentation, it indirectly addresses the issues of accountability and transparency in recommender systems research from the perspectives of practitioners, designers, and engineers aiming to assess the capabilities of published research works.
Limits to Surprise in Recommender Systems
There is a limit to how much surprise any algorithm can embed in a recommendation, and this limit provides a scale against which the performance of any algorithm can be measured; a surprise metric called "normalised surprise" is designed that employs these limits on potential surprise.
Using Collection Shards to Study Retrieval Performance Effect Sizes
This work uses the general linear mixed model framework and presents a model that encompasses the experimental factors of system, topic, and shard, together with their interaction effects, and discovers that the topic*shard interaction is a large effect across almost all datasets.
A Query Taxonomy Describes Performance of Patient-Level Retrieval from Electronic Health Record Data
Evaluating factors that might affect information retrieval methods, and the interplay between commonly used IR approaches and the characteristics of the cohort definition structure, this study found no strong association between these characteristics and patient retrieval performance, but some characteristics derived from a query taxonomy could lead to improved selection of approaches.
Siamese Meta-Learning and Algorithm Selection with 'Algorithm-Performance Personas' [Proposal]
This work proposes a Siamese Neural Network architecture for automated algorithm selection that focuses more on 'alike performing' instances than on meta-features, and further introduces the concept of 'Algorithm Performance Personas' that describe instances for which individual algorithms perform alike.
‘Algorithm-Performance Personas’ for Siamese Meta-Learning and Automated Algorithm Selection
The proposed method is to train a Siamese Network to learn an embedding of instances, clustering according to both feature similarity and prior algorithm performances, which enables classic neighbourhood methods to be used in generating an algorithm ranking for new instances.
Siamese Algorithm Selection: A Novel Approach to Automated Algorithm Selection
Siamese Algorithm Selection (SAS) is proposed as a new method of per-instance algorithm selection, utilizing a Siamese Neural Network (SNN) to learn Algorithm Performance Personas (APP), which are neighbourhoods of instances that map to similar performances.


Query-performance prediction: setting the expectations straight
Focusing on a specific prediction task, namely ranking queries by presumed effectiveness, a novel learning-to-rank approach that uses Markov Random Fields is presented; the resultant prediction quality substantially exceeds that of state-of-the-art predictors.
A survey of pre-retrieval query performance predictors
In this poster, 22 pre-retrieval predictors are categorized and assessed on three different TREC test collections; such predictors base their predictions solely on the query terms, collection statistics, and possibly external sources such as WordNet.
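To make the idea of a pre-retrieval predictor concrete, here is a minimal sketch of one classic specificity-style predictor, the average inverse document frequency of the query terms, which needs only collection statistics and no retrieval run. The term names and counts below are invented for illustration:

```python
import math

def avg_idf(query_terms, doc_freqs, num_docs):
    """Average IDF of the query terms: a simple pre-retrieval predictor.
    Queries made of terms that are rare in the collection (high IDF)
    are predicted to be more specific, hence easier to satisfy."""
    idfs = [math.log(num_docs / doc_freqs.get(t, 1)) for t in query_terms]
    return sum(idfs) / len(idfs)

# Toy collection statistics: term -> document frequency (invented).
doc_freqs = {"the": 9500, "retrieval": 120, "query": 300}
print(avg_idf(["query", "retrieval"], doc_freqs, 10_000))  # ≈ 3.96
```

A stopword-like query such as `["the"]` scores near zero on this scale, which is the intuition behind treating low average IDF as a signal of a vague, hard-to-predict query.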
Research Frontiers in Information Retrieval: Report from the Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018)
The intent is that this description of open problems will help to inspire researchers and graduate students to address the questions, and will provide funding agencies data to focus and coordinate support for information retrieval research.
Predicting query performance
It is suggested that clarity scores measure the ambiguity of a query with respect to a collection of documents and show that they correlate positively with average precision in a variety of TREC test sets.
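The clarity score described there is the KL divergence between a query language model and the collection language model. A minimal sketch, assuming hand-made toy models (in the actual method the query model is estimated from top-retrieved documents, and the probabilities below are invented for illustration):

```python
import math

def clarity(query_model, collection_model):
    """KL divergence (in bits) between a query language model and the
    collection language model; higher clarity ~ a less ambiguous query."""
    return sum(p * math.log2(p / collection_model[w])
               for w, p in query_model.items() if p > 0)

# Toy P(w|C) values and two toy query models (all numbers invented).
collection = {"jaguar": 0.001, "car": 0.01, "animal": 0.005, "the": 0.05}
focused = {"jaguar": 0.6, "car": 0.4}
diffuse = {"jaguar": 0.3, "car": 0.25, "animal": 0.25, "the": 0.2}
print(clarity(focused, collection) > clarity(diffuse, collection))  # True
```

The focused query model diverges more from the background collection, so it scores a higher clarity, matching the finding that clarity correlates positively with average precision.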
Reproducibility Challenges in Information Retrieval Evaluation
  • N. Ferro, ACM J. Data Inf. Qual., 2017
Experimental evaluation relies on the Cranfield paradigm, which makes use of experimental collections, consisting of documents, sampled from a real domain of interest; topics, representing real user information needs in that domain; and relevance judgements, determining which documents are relevant to which topics.
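The Cranfield-style evaluation the passage describes, scoring a system's ranked run for a topic against the collection's relevance judgements, can be sketched minimally as follows; the document IDs are invented, and average precision stands in for any standard effectiveness measure:

```python
def average_precision(ranked_docs, relevant):
    """Average precision of one ranked result list against the set of
    judged-relevant documents (the 'qrels' of a Cranfield collection)."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

qrels = {"d1", "d4"}               # relevance judgements for one topic
run = ["d1", "d2", "d3", "d4"]     # a system's ranking for that topic
print(average_precision(run, qrels))  # (1/1 + 2/4) / 2 = 0.75
```

Averaging this score over all topics of a collection yields MAP, one of the summary measures whose behaviour the performance-prediction work above tries to model.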
Evaluating collaborative filtering recommender systems
The key decisions in evaluating collaborative filtering recommender systems are reviewed: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole.
Estimating the query difficulty for information retrieval
This tutorial exposes participants to current research on query performance prediction (also known as query difficulty estimation); participants will become familiar with state-of-the-art performance prediction methods and with common evaluation methodologies for prediction quality.
Blind Men and Elephants: Six Approaches to TREC data
The paper reviews six recent efforts to better understand performance measurements on information retrieval (IR) systems within the framework of the Text REtrieval Conferences (TREC): analysis of…
Reliable Information Access Final Workshop Report
For many years the standard approach to question answering, or searching for information, has involved information retrieval systems, and the current statistical approaches to IR have shown themselves to be effective and reliable in both research and commercial settings.
Statistical biases in Information Retrieval metrics for recommender systems
This paper lays out an experimental configuration framework upon which to identify and analyse specific statistical biases arising in the adaptation of Information Retrieval metrics to recommendation tasks, namely sparsity and popularity biases.