How reliable are the results of large-scale information retrieval experiments?

  • Justin Zobel
  • Published in SIGIR '98, 1 August 1998
  • Computer Science
Two stages in measurement of techniques for information retrieval are gathering of documents for relevance assessment and use of the assessments to numerically evaluate effectiveness. We consider both of these stages in the context of the TREC experiments, to determine whether they lead to measurements that are trustworthy and fair. Our detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated…
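
One way to see how pooled relevance assessment can overestimate recall is a small sketch like the one below. It is an illustration only, not the paper's experimental procedure: the run names, document identifiers, pool depth, and the helper functions `build_pool` and `recall` are all hypothetical. The pool is taken to be the union of the top-ranked documents from each contributing system, and only pooled documents are judged, so relevant documents outside the pool are silently treated as non-relevant.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents from each ranked run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def recall(ranking, relevant):
    """Fraction of the `relevant` set that appears anywhere in `ranking`."""
    if not relevant:
        return 0.0
    return len(set(ranking) & relevant) / len(relevant)


# Hypothetical toy data: two ranked runs and the (normally unknown)
# complete set of relevant documents for one query.
runs = {
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d2", "d5", "d1", "d6"],
}
true_relevant = {"d1", "d2", "d4", "d6", "d9"}

# Only pooled documents are shown to assessors.
pool = build_pool(runs, depth=2)
judged_relevant = true_relevant & pool

# Recall measured against the judged set versus the complete set.
pooled_recall = recall(runs["sysA"], judged_relevant)
true_recall = recall(runs["sysA"], true_relevant)

print(f"pool = {sorted(pool)}")
print(f"recall against pooled judgments:  {pooled_recall:.2f}")
print(f"recall against complete judgments: {true_recall:.2f}")
```

In this toy case the denominator shrinks from five relevant documents to the two found in the pool, so the measured recall is higher than the recall against the complete set. The relative ordering of systems may still be preserved, which is consistent with the abstract's distinction between relative performance and absolute recall.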

The effect of assessor error on IR system evaluation
This paper examines the robustness of the TREC Million Query track methods when some assessors make significant and systematic errors, and finds that while averages are robust, assessor errors can have a large effect on system rankings.
Retrieval evaluation with incomplete information
It is shown that current evaluation measures are not robust to substantially incomplete relevance judgments, and a new measure is introduced that is both highly correlated with existing measures when complete judgments are available and more robust to incomplete judgment sets.
Research methodology in studies of assessor effort for information retrieval evaluation
It is demonstrated that baseline performance on the standard data sets is quite high, necessitating strong evidence to support claims, and it is argued that the standard of evidence in evaluation studies should be increased to the level required of text retrieval studies.
Robust test collections for retrieval evaluation
This work formally defines what it means for judgments to be reusable: the confidence in an evaluation of new systems can be accurately assessed from an existing set of relevance judgments, and presents a method for augmenting a set of relevant judgments with relevance estimates that require no additional assessor effort.
Rank-biased precision for measurement of retrieval effectiveness
A new effectiveness metric, rank-biased precision, is introduced that is derived from a simple model of user behavior, is robust if answer rankings are extended to greater depths, and allows accurate quantification of experimental uncertainty, even when only partial relevance judgments are available.
Reliable information retrieval evaluation with incomplete and biased judgements
This work compares the performance of this method with other approaches to the problem of incomplete judgements, such as bpref, and shows that the proposed method leads to higher evaluation accuracy, especially if the set of manual judgements is rich in documents, but highly biased against some systems.
On the robustness of relevance measures with incomplete judgments
This work investigates the robustness of three widely used IR relevance measures for large data collections with incomplete judgments and shows that NDCG consistently performs better than both bpref and infAP.
Low-cost and robust evaluation of information retrieval systems
Through adopting a view of evaluation that is more concerned with distributions over performance differences rather than estimates of absolute performance, the expected cost can be minimized so as to reliably differentiate between engines with less than 1% of the human effort that has been used in past experiments.
Score Estimation, Incomplete Judgments, and Significance Testing in IR Evaluation
The design options that must be considered when planning an experimental evaluation of information retrieval systems are explored, with emphasis on how effectiveness scores are inferred from partial information.
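
As an illustration of how a metric can quantify uncertainty under partial judgments, the rank-biased precision entry listed above can be sketched as follows. This is a minimal sketch, assuming the commonly stated form RBP = (1 − p) · Σ rᵢ · pⁱ⁻¹ with persistence parameter p; the function name, the use of `None` for unjudged documents, and the example values are assumptions made here for illustration.

```python
def rank_biased_precision(relevance, p=0.8):
    """Return (base, residual) for a ranked list of judgments.

    `relevance` holds 1 (relevant), 0 (non-relevant), or None (unjudged)
    for each rank, top first. `base` is a lower bound on the score;
    `base + residual` is an upper bound.
    """
    base = 0.0
    # Weight of everything below the evaluated depth.
    residual = p ** len(relevance)
    for i, r in enumerate(relevance):
        weight = (1 - p) * p ** i
        if r is None:
            # An unjudged document might be relevant: add its weight
            # to the uncertainty rather than to the score.
            residual += weight
        else:
            base += r * weight
    return base, residual


# Hypothetical ranking: relevant, unjudged, non-relevant, relevant.
base, residual = rank_biased_precision([1, None, 0, 1], p=0.5)
print(f"RBP lies between {base} and {base + residual}")
```

The residual shrinks as more documents are judged or as the ranking is evaluated to greater depth, which is one way to read the claim that experimental uncertainty can be quantified.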


Relevance Judgments for Assessing Recall
Relevance assessments and retrieval system evaluation
The Cranfield II Relevance Assessments: A Critical Evaluation
It is shown that numerical measures of retrieval effectiveness may be greatly altered by consideration of the "missing" relevant documents and that a ranking of retrieval methods according to order of performance may vary as well.
Some Unexplained Aspects of the Cranfield Tests of Indexing Performance Factors
Statistical reasoning is used to show that very likely there must have been several times as many relevant "document-question matches" as were actually found by the Cranfield searchers in the process of determining "all possible" relevance matches.
A critical investigation of recall and precision as measures of retrieval system performance
This paper systematically investigates the various problems and issues associated with the use of recall and precision as measures of retrieval system performance and provides a comparative analysis of methods available for defining precision in a probabilistic sense to promote a better understanding of the various issues involved in retrieval performance evaluation.
Variations in Relevance Assessments and the Measurement of Retrieval Effectiveness
It is argued that a series of thorough, rigorous, and extensive tests is needed of precisely how, and under what conditions, variations in relevance assessments do, and do not, affect measures of retrieval performance.
Relevance Judgements for Assessing Recall
Two styles of information need are distinguished, high precision and high recall, and a method of forming relevance judgements suitable for each is described, illustrated by comparing two retrieval systems, keyword retrieval and semantic signatures, on different sets of relevance judgements.
The Pragmatics of Information Retrieval Experimentation Revisited
The State of Retrieval System Evaluation
Statistical inference in retrieval effectiveness evaluation
  • J. Savoy
  • Computer Science
  • Inf. Process. Manag.
  • 1997