Manifesto from Dagstuhl Perspectives Workshop 17442 - From Evaluating to Forecasting Performance: How to Turn Information Retrieval, Natural Language Processing and Recommender Systems into Predictive Sciences

Authors: N. Ferro, Norbert Fuhr, Gregory Grefenstette, Joseph A. Konstan, Pablo Castells, Elizabeth M. Daly, Thierry Declerck, Michael D. Ekstrand, Werner Geyer, Julio Gonzalo, Tsvi Kuflik, Krister Lindén, Ma. Theresa H. Bernardo, Jian-Yun Nie, Raffaele Perego, Bracha Shapira, Ian Soboroff, Nava Tintarev, Karin Verspoor, Martijn C. Willemsen, and Justin Zobel
We describe the state of the art in performance modeling and prediction for Information Retrieval (IR), Natural Language Processing (NLP), and Recommender Systems (RecSys), along with its strengths and shortcomings. We present a framework for further research, identifying five major problem areas: understanding measures, performance analysis, making underlying assumptions explicit, identifying application features that determine performance, and the development of prediction models describing the…


Using Collection Shards to Study Retrieval Performance Effect Sizes
This work uses the general linear mixed model framework to present a model encompassing the experimental factors of system, topic, and shard, together with their interaction effects, and finds that the topic*shard interaction is a large effect almost globally across all datasets.
Report on GLARE 2018: 1st Workshop on Generalization in Information Retrieval
This is a report on the first edition of the International Workshop on Generalization in Information Retrieval (GLARE 2018), co-located with the 27th ACM International Conference on Information and Knowledge Management (CIKM 2018).
CENTRE@CLEF 2019
CENTRE, a joint CLEF/TREC/NTCIR lab that aims to raise attention to the reproducibility of experimental results, focuses on three objectives, i.e., replicability, reproducibility, and generalizability; for each of them a dedicated task is designed.
What Happened in CLEF… For a While?
A summary of the motivations which led to the establishment of CLEF, a description of how it has evolved over the years, its major achievements, and the challenges ahead. 2019 marks the 20th birthday of CLEF, an evaluation campaign activity which has applied the Cranfield evaluation paradigm to the testing of multilingual and multimodal information access systems.
Evaluating Multimedia and Language Tasks
The TRECVID Evaluations of Multimedia Access began in 2001 with the goal of driving content-based search technology for multimedia, just as its progenitor, the Text Retrieval Conference (TREC), did for text and the web.


Query-performance prediction: setting the expectations straight
Focusing on a specific prediction task, namely ranking queries by presumed effectiveness, a novel learning-to-rank approach that uses Markov Random Fields is presented; its prediction quality substantially surpasses that of state-of-the-art predictors.
Research Frontiers in Information Retrieval Report from the Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018)
The intent is that this description of open problems will help inspire researchers and graduate students to address the questions, and will provide funding agencies with data to focus and coordinate support for information retrieval research.
Predicting query performance
It is suggested that clarity scores measure the ambiguity of a query with respect to a collection of documents, and it is shown that they correlate positively with average precision across a variety of TREC test sets.
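As a concrete illustration, the clarity score can be sketched as the KL divergence between a query language model (estimated from top-retrieved documents) and the collection language model; a high divergence suggests a focused, unambiguous query. The sketch below is a minimal reading of that idea, not the paper's exact estimation procedure, and the smoothing weight and toy corpus are illustrative assumptions.

```python
import math
from collections import Counter

def clarity_score(query_terms, retrieved_docs, collection_docs, lam=0.6):
    """KL divergence (in bits) between a query language model, estimated
    from top-retrieved documents, and the collection language model.
    Higher values suggest a less ambiguous query."""
    # Collection language model P(w|C)
    coll = Counter()
    for d in collection_docs:
        coll.update(d)
    coll_total = sum(coll.values())
    p_coll = lambda w: coll[w] / coll_total

    # Query model: mixture of retrieved-document models, each smoothed
    # with the collection model (Jelinek-Mercer smoothing, weight lam)
    # and weighted by the document's query likelihood P(q|d).
    vocab = set(coll)
    p_query = {}
    for w in vocab:
        mass = 0.0
        for d in retrieved_docs:
            tf = Counter(d)
            p_wd = lam * tf[w] / len(d) + (1 - lam) * p_coll(w)
            p_qd = 1.0
            for q in query_terms:
                p_qd *= lam * tf[q] / len(d) + (1 - lam) * p_coll(q)
            mass += p_wd * p_qd
        p_query[w] = mass
    norm = sum(p_query.values())
    # KL divergence D(P_query || P_coll); nonnegative by Gibbs' inequality
    return sum((p / norm) * math.log2((p / norm) / p_coll(w))
               for w, p in p_query.items() if p > 0)
```

Because both distributions are normalized over the same vocabulary, the score is always nonnegative; in practice it would be computed over the top-k documents retrieved for the query.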
Reproducibility Challenges in Information Retrieval Evaluation
N. Ferro. ACM J. Data Inf. Qual., 2017.
Experimental evaluation relies on the Cranfield paradigm, which makes use of experimental collections, consisting of documents, sampled from a real domain of interest; topics, representing real user information needs in that domain; and relevance judgements, determining which documents are relevant to which topics.
Evaluation-as-a-Service: Overview and Outlook
The objectives of this white paper are to summarize and compare the current approaches and to consolidate the experiences gained with them, in order to outline the next steps of EaaS, particularly towards sustainable research infrastructures.
A survey of pre-retrieval query performance predictors
In this poster, 22 pre-retrieval predictors are categorized and assessed on three different TREC test collections; such predictors base their predictions solely on the query terms, collection statistics, and possibly external sources such as WordNet.
Reliable Information Access Final Workshop Report
For many years the standard approach to question answering, or searching for information, has involved information retrieval systems, and the current statistical approaches to IR have shown themselves to be effective and reliable in both research and commercial settings.
Estimating the query difficulty for information retrieval
This tutorial exposes participants to current research on query performance prediction (also known as query difficulty estimation); participants will become familiar with state-of-the-art performance prediction methods and with common evaluation methodologies for prediction quality.
Toward an anatomy of IR system component performances
A methodology based on the General Linear Mixed Model (GLMM) and analysis of variance (ANOVA) is proposed to develop statistical models able to isolate system variance and component effects, as well as their interactions, by relying on a grid of points containing all combinations of the analyzed components.
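To make the flavor of such a decomposition concrete: for a fully crossed system × topic grid of effectiveness scores, the classic two-way ANOVA identity splits each score into a grand mean, a system effect, a topic effect, and an interaction residual. The sketch below shows only that textbook identity on an invented score matrix, not the paper's full GLMM machinery.

```python
# Two-way effects decomposition of a system x topic score matrix,
# as in an ANOVA / linear-model analysis of IR experiments.
def decompose(scores):
    """scores[i][j] = effectiveness of system i on topic j.
    Returns (grand_mean, system_effects, topic_effects, interactions)
    such that scores[i][j] == grand_mean + system_effects[i]
    + topic_effects[j] + interactions[i][j]."""
    ns, nt = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (ns * nt)
    # Effect of each system: its row mean relative to the grand mean
    sys_eff = [sum(row) / nt - grand for row in scores]
    # Effect of each topic: its column mean relative to the grand mean
    top_eff = [sum(scores[i][j] for i in range(ns)) / ns - grand
               for j in range(nt)]
    # Whatever the additive model cannot explain is interaction
    inter = [[scores[i][j] - grand - sys_eff[i] - top_eff[j]
              for j in range(nt)] for i in range(ns)]
    return grand, sys_eff, top_eff, inter
```

Summing squared terms of each component (scaled by the appropriate counts) yields the ANOVA sums of squares, which is how variance is attributed to systems, topics, and their interaction.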
Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness
This paper presents a formal framework to define and study the properties of utility-oriented measurements of retrieval effectiveness, like AP, RBP, ERR and many other popular IR evaluation measures, thus contributing to explicitly link IR evaluation to a broader context.