The Dagstuhl Perspectives Workshop on Performance Modeling and Prediction

@article{Ferro_Dagstuhl_Performance,
  title={The Dagstuhl Perspectives Workshop on Performance Modeling and Prediction},
  author={N. Ferro and Norbert Fuhr and Gregory Grefenstette and Joseph A. Konstan and Pablo Castells and Elizabeth M. Daly and Thierry Declerck and Michael D. Ekstrand and Werner Geyer and Julio Gonzalo and Tsvi Kuflik and Krister Lind{\'e}n and Bernardo Magnini and Jian-Yun Nie and R. Perego and Bracha Shapira and Ian Soboroff and Nava Tintarev and Karin M. Verspoor and Martijn C. Willemsen and Justin Zobel},
  journal={SIGIR Forum},
}
This paper reports the findings of the Dagstuhl Perspectives Workshop 17442 on performance modeling and prediction in the domains of Information Retrieval, Natural Language Processing, and Recommender Systems. We present a framework for further research, which identifies five major problem areas: understanding measures, performance analysis, making underlying assumptions explicit, identifying application features determining performance, and the development of prediction models describing the…


Report on GLARE 2018: 1st Workshop on Generalization in Information Retrieval
This is a report on the first edition of the International Workshop on Generalization in Information Retrieval (GLARE 2018), co-located with the 27th ACM International Conference on Information and Knowledge Management (CIKM 2018).
Causality, prediction and improvements that (don’t) add up
This paper shows how the problem of performance prediction in IR and related fields relates to causal inference, and how methods from that area may help the field.
Using Collection Shards to Study Retrieval Performance Effect Sizes
This work uses the general linear mixed model framework and presents a model that encompasses the experimental factors of system, topic, and shard, along with their interaction effects, and finds that the topic*shard interaction effect is large almost globally across all datasets.
Assessing ranking metrics in top-N recommendation
A principled analysis of the robustness and the discriminative power of different ranking metrics for the offline evaluation of recommender systems is undertaken, drawing from previous studies in the information retrieval field.
Towards Unified Metrics for Accuracy and Diversity for Recommender Systems
This work proposes a novel adaptation of a unified metric, derived from one commonly used for search system evaluation, to Recommender Systems, and shows that the metric respects the desired theoretical constraints and behaves as expected when performing offline evaluation.
The Information Retrieval Group at the University of Duisburg-Essen
This document describes the IR research group at the University of Duisburg-Essen, which works on quantitative models of interactive retrieval, social media analysis, multilingual argument retrieval…
Memorization vs. Generalization : Quantifying Data Leakage in NLP Performance Evaluation
This work identifies leakage of training data into test data in several publicly available datasets used to evaluate NLP tasks, including named entity recognition and relation extraction, and studies it to assess the impact of that leakage on a model's ability to memorize versus generalize.
SIGIR Keynote: Proof By Experimentation? Towards Better IR Research
Most IR experiments lack both internal and external validity. Top performance is mostly an illusion, given the lack of solid statistical evidence. Thus, reviewers should ignore performance numbers in…
How do interval scales help us with better understanding IR evaluation measures?
An extensive evaluation is carried out, based on standard TREC collections, to study how the theoretical findings of this work impact the experimental ones, and a correlation analysis is conducted to study the relationship among the above-mentioned state-of-the-art evaluation measures and their scales.
CLEF 2019 : Overview of the Replicability and Reproducibility Tasks
The aim of CENTRE is to run both a replicability and a reproducibility challenge across all the major IR evaluation campaigns and to provide the IR community with a venue where previous research results can be explored and discussed.


Blind Men and Elephants: Six Approaches to TREC data
The paper reviews six recent efforts to better understand performance measurements on information retrieval (IR) systems within the framework of the Text REtrieval Conferences (TREC): analysis of…
Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science"
This paper discusses, summarizes, and adapts the main findings of the Dagstuhl seminar to the context of IR evaluation -- both system-oriented and user-oriented -- in order to raise awareness in the community and stimulate the field towards an increased reproducibility of its experiments.
Reproducibility Challenges in Information Retrieval Evaluation
N. Ferro. ACM J. Data Inf. Qual., 2017.
Experimental evaluation relies on the Cranfield paradigm, which makes use of experimental collections consisting of documents sampled from a real domain of interest; topics representing real user information needs in that domain; and relevance judgements determining which documents are relevant to which topics.
On per-topic variance in IR evaluation
This work explores the notion, put forward by Cormack & Lynam and by Robertson, that a document collection used for Cranfield-style experiments should be considered as a sample from some larger population of documents, by simulating other samples from that same larger population.
Toward an anatomy of IR system component performances
A methodology based on the General Linear Mixed Model (GLMM) and analysis of variance (ANOVA) is proposed to develop statistical models able to isolate system variance and component effects as well as their interaction, by relying on a grid of points containing all the combinations of the analyzed components.
A Statistical Analysis of the TREC-3 Data
A statistical analysis of the TREC-3 data shows that performance differences across queries are greater than performance differences across participants' runs. Generally, groups of runs which do not…
Using Replicates in Information Retrieval Evaluation
A method is presented for more accurately estimating the main effect of the system in a typical test-collection-based evaluation of information retrieval systems, thus increasing the sensitivity of system comparisons while remaining robust against small changes in the number of partitions used.
Rank-biased precision for measurement of retrieval effectiveness
A new effectiveness metric, rank-biased precision, is introduced that is derived from a simple model of user behavior, is robust if answer rankings are extended to greater depths, and allows accurate quantification of experimental uncertainty, even when only partial relevance judgments are available.
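The snippet above summarizes rank-biased precision (RBP). As a hedged sketch, the standard formulation RBP = (1 - p) * Σ_i rel_i * p^(i-1) can be written as follows; the function name and the default persistence value p = 0.8 are illustrative, not taken from the paper's code:

```python
def rbp(relevances, p=0.8):
    """Rank-biased precision:
    RBP = (1 - p) * sum_i rel_i * p**(i - 1),
    where p is the probability that the user persists to the next rank
    and rel_i is the (possibly graded) relevance in [0, 1] at rank i.
    """
    # enumerate starts i at 0, so p**i corresponds to p**(rank - 1)
    return (1 - p) * sum(rel * p**i for i, rel in enumerate(relevances))
```

Because the geometric weights sum to at most 1/(1 - p), RBP stays bounded even when rankings are extended to greater depths, which is the robustness property the abstract refers to.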
Are IR Evaluation Measures on an Interval Scale?
In this paper, we formally investigate whether or not IR evaluation measures are on an interval scale, which is needed to safely compute the basic statistics, such as mean and variance, that we use daily.
Expected reciprocal rank for graded relevance
This work presents a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents which are shown below very relevant documents and calls it Expected Reciprocal Rank (ERR).
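The snippet above describes Expected Reciprocal Rank (ERR). A minimal sketch of the cascade-model formulation from Chapelle et al. follows; the function name and the default maximum grade are illustrative assumptions:

```python
def err(grades, max_grade=4):
    """Expected reciprocal rank for graded relevance:
    ERR = sum_r (1/r) * R_r * prod_{i<r} (1 - R_i),
    with R_i = (2**g_i - 1) / 2**max_grade mapping the grade g_i at rank i
    to the probability that the document satisfies the user there.
    """
    p_continue = 1.0  # probability the user has not stopped before this rank
    score = 0.0
    for rank, g in enumerate(grades, start=1):
        r = (2**g - 1) / 2**max_grade
        score += p_continue * r / rank
        p_continue *= 1 - r  # user proceeds only if not satisfied here
    return score
```

The running product (1 - R_i) is what implicitly discounts documents shown below very relevant ones: once a highly relevant document appears, later ranks contribute little.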