The Dagstuhl Perspectives Workshop on Performance Modeling and Prediction

@article{Ferro_Dagstuhl_Performance,
  title={The Dagstuhl Perspectives Workshop on Performance Modeling and Prediction},
  author={N. Ferro and Norbert Fuhr and Gregory Grefenstette and Joseph A. Konstan and Pablo Castells and Elizabeth M. Daly and Thierry Declerck and Michael D. Ekstrand and Werner Geyer and Julio Gonzalo and Tsvi Kuflik and Krister Lind{\'e}n and Bernardo Magnini and Jian-Yun Nie and R. Perego and Bracha Shapira and Ian Soboroff and Nava Tintarev and Karin M. Verspoor and Martijn C. Willemsen and Justin Zobel},
  journal={SIGIR Forum},
}
This paper reports the findings of the Dagstuhl Perspectives Workshop 17442 on performance modeling and prediction in the domains of Information Retrieval, Natural Language Processing, and Recommender Systems. We present a framework for further research, which identifies five major problem areas: understanding measures, performance analysis, making underlying assumptions explicit, identifying application features determining performance, and the development of prediction models describing the…


Report on GLARE 2018: 1st Workshop on Generalization in Information Retrieval
This is a report on the first edition of the International Workshop on Generalization in Information Retrieval (GLARE 2018), co-located with the 27th ACM International Conference on Information and Knowledge Management (CIKM 2018).
Causality, prediction and improvements that (don’t) add up
This paper shows how the problem of performance prediction in IR and related fields relates to causal inference, and how methods from that area may help the field.
Using Collection Shards to Study Retrieval Performance Effect Sizes
This work uses the general linear mixed model framework and presents a model that encompasses the experimental factors of system, topic, and shard, along with their interaction effects, and finds that the topic*shard interaction effect is large almost globally across all datasets.
Assessing ranking metrics in top-N recommendation
A principled analysis of the robustness and the discriminative power of different ranking metrics for the offline evaluation of recommender systems is undertaken, drawing from previous studies in the information retrieval field.
Towards Unified Metrics for Accuracy and Diversity for Recommender Systems
This work proposes a novel adaptation of a unified metric, derived from one commonly used for search system evaluation, to Recommender Systems, and shows that the metric respects the desired theoretical constraints and behaves as expected when performing offline evaluation.
The Information Retrieval Group at the University of Duisburg-Essen
This document describes the IR research group at the University of Duisburg-Essen, which works on quantitative models of interactive retrieval, social media analysis, multilingual argument retrieval…
Memorization vs. Generalization : Quantifying Data Leakage in NLP Performance Evaluation
This work identifies leakage of training data into test data in several publicly available datasets used to evaluate NLP tasks, including named entity recognition and relation extraction, and studies it to assess the impact of that leakage on a model's ability to memorize versus generalize.
SIGIR Keynote: Proof By Experimentation? Towards Better IR Research
Most IR experiments lack both internal and external validity. Top performance is mostly an illusion, given the lack of solid statistical evidence. Thus, reviewers should ignore performance numbers in…
How do interval scales help us with better understanding IR evaluation measures?
An extensive evaluation is carried out, based on standard TREC collections, to study how the theoretical findings of this work impact the experimental ones, and a correlation analysis is conducted to study the relationship among the above-mentioned state-of-the-art evaluation measures and their scales.
CLEF 2019 : Overview of the Replicability and Reproducibility Tasks
The aim of CENTRE is to run both a replicability and a reproducibility challenge across all the major IR evaluation campaigns and to provide the IR community with a venue where previous research results can be explored and discussed.


Blind Men and Elephants: Six Approaches to TREC data
The paper reviews six recent efforts to better understand performance measurements on information retrieval (IR) systems within the framework of the Text REtrieval Conferences (TREC): analysis of…
Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science"
This paper discusses, summarizes, and adapts the main findings of the Dagstuhl seminar to the context of IR evaluation -- both system-oriented and user-oriented -- in order to raise awareness in the community and stimulate the field towards an increased reproducibility of its experiments.
Reproducibility Challenges in Information Retrieval Evaluation
N. Ferro. ACM J. Data Inf. Qual., 2017.
Experimental evaluation relies on the Cranfield paradigm, which makes use of experimental collections consisting of documents sampled from a real domain of interest; topics representing real user information needs in that domain; and relevance judgements determining which documents are relevant to which topics.
On per-topic variance in IR evaluation
This work explores the notion, put forward by Cormack & Lynam and by Robertson, that a document collection used for Cranfield-style experiments should be considered as a sample from some larger population of documents, by simulating other samples from that same larger population.
Toward an anatomy of IR system component performances
A methodology based on the General Linear Mixed Model (GLMM) and analysis of variance (ANOVA) is proposed to develop statistical models able to isolate system variance and component effects as well as their interaction, by relying on a grid of points containing all the combinations of the analyzed components.
A Statistical Analysis of the TREC-3 Data
A statistical analysis of the TREC-3 data shows that performance differences across queries are greater than performance differences across participants' runs. Generally, groups of runs which do not…
Using Replicates in Information Retrieval Evaluation
A method is presented for more accurately estimating the main effect of the system in a typical test-collection-based evaluation of information retrieval systems, thus increasing the sensitivity of system comparisons while remaining robust against small changes in the number of partitions used.
Rank-biased precision for measurement of retrieval effectiveness
A new effectiveness metric, rank-biased precision, is introduced that is derived from a simple model of user behavior, is robust if answer rankings are extended to greater depths, and allows accurate quantification of experimental uncertainty, even when only partial relevance judgments are available.
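The snippet above summarizes rank-biased precision (RBP). As a hedged sketch, the standard formulation RBP = (1 - p) * Σ_i rel_i * p^(i-1) can be written as follows; the function name and the default persistence value p = 0.8 are illustrative, not taken from the paper's code:

```python
def rbp(relevances, p=0.8):
    """Rank-biased precision:
    RBP = (1 - p) * sum_i rel_i * p**(i - 1),
    where p is the probability that the user persists to the next rank
    and rel_i is the (possibly graded) relevance in [0, 1] at rank i.
    """
    # enumerate starts i at 0, so p**i corresponds to p**(rank - 1)
    return (1 - p) * sum(rel * p**i for i, rel in enumerate(relevances))
```

Because the geometric weights sum to at most 1/(1 - p), RBP stays bounded even when rankings are extended to greater depths, which is the robustness property the abstract refers to.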
Are IR Evaluation Measures on an Interval Scale?
In this paper, we formally investigate whether or not IR evaluation measures are on an interval scale, which is needed to safely compute the basic statistics, such as mean and variance, that we use daily.
Expected reciprocal rank for graded relevance
This work presents a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents which are shown below very relevant documents and calls it Expected Reciprocal Rank (ERR).
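The snippet above describes Expected Reciprocal Rank (ERR). A minimal sketch of the cascade-model formulation from Chapelle et al. follows; the function name and the default maximum grade are illustrative assumptions:

```python
def err(grades, max_grade=4):
    """Expected reciprocal rank for graded relevance:
    ERR = sum_r (1/r) * R_r * prod_{i<r} (1 - R_i),
    with R_i = (2**g_i - 1) / 2**max_grade mapping the grade g_i at rank i
    to the probability that the document satisfies the user there.
    """
    p_continue = 1.0  # probability the user has not stopped before this rank
    score = 0.0
    for rank, g in enumerate(grades, start=1):
        r = (2**g - 1) / 2**max_grade
        score += p_continue * r / rank
        p_continue *= 1 - r  # user proceeds only if not satisfied here
    return score
```

The running product (1 - R_i) is what implicitly discounts documents shown below very relevant ones: once a highly relevant document appears, later ranks contribute little.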