The Philosophy of Information Retrieval Evaluation

@inproceedings{Voorhees2001ThePO,
  title={The Philosophy of Information Retrieval Evaluation},
  author={Ellen M. Voorhees},
  booktitle={CLEF},
  year={2001}
}
  • E. Voorhees
  • Published in CLEF, 3 September 2001
  • Computer Science
Evaluation conferences such as TREC, CLEF, and NTCIR are modern examples of the Cranfield evaluation paradigm. In Cranfield, researchers perform experiments on test collections to compare the relative effectiveness of different retrieval approaches. The test collections allow the researchers to control the effects of different system parameters, increasing the power and decreasing the cost of retrieval experiments as compared to user-based evaluations. This paper reviews the fundamental… 
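
As an illustration of the Cranfield-style setup described above (a minimal sketch, not material from the paper): with a fixed document set, topic set, and relevance judgments, only the retrieval systems vary between runs, so their relative effectiveness can be compared directly, for example with mean average precision. The data and names below are hypothetical.

# Minimal sketch of a Cranfield-style comparison (illustrative only).
# Two hypothetical systems are scored on the same topics and relevance
# judgments using mean average precision (MAP), so only the systems differ.

def average_precision(ranking, relevant):
    """Average precision of one ranked result list against a set of relevant doc ids."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """Mean of per-topic average precision; `run` maps topic -> ranked doc ids."""
    return sum(average_precision(run[t], qrels[t]) for t in qrels) / len(qrels)

# Hypothetical test collection: relevance judgments (qrels) and two system runs.
qrels = {"t1": {"d1", "d4"}, "t2": {"d2"}}
run_a = {"t1": ["d1", "d2", "d4"], "t2": ["d3", "d2"]}
run_b = {"t1": ["d3", "d1", "d4"], "t2": ["d2", "d1"]}

print("System A MAP:", round(mean_average_precision(run_a, qrels), 3))
print("System B MAP:", round(mean_average_precision(run_b, qrels), 3))

Because the collection and judgments are held fixed, any difference in MAP is attributable to the systems themselves, which is the control the Cranfield paradigm provides.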
Low-cost and robust evaluation of information retrieval systems
By adopting a view of evaluation concerned with distributions over performance differences rather than with estimates of absolute performance, the expected cost can be minimized so as to reliably differentiate between engines with less than 1% of the human effort used in past experiments.
Evaluation for Multilingual Information Retrieval Systems
This chapter discusses IR system evaluation with particular reference to the multilingual context, and presents the most commonly used measures and models. The main focus is on system performance…
Exploration of Term Relevance Sets
IR system evaluation based on Term Relevance Sets (Trels) is presented as an alternative evaluation approach; the technique is still novel and little documentation exists.
A generic approach to component-level evaluation in information retrieval
The thesis focuses on the key components needed to address typical ad-hoc search tasks, such as finding books on a particular topic in a large set of library records, with the aim of moving beyond black-box evaluation of retrieval systems.
Scaling IR-system evaluation using term relevance sets
An evaluation method based on Term Relevance Sets (Trels) is described that measures an IR system's quality by examining the content of the retrieved results rather than by looking for pre-specified relevant pages.
How many performance measures to evaluate information retrieval systems?
The study is based on a large-scale analysis of TREC results and shows that the 130 measures calculated by trec_eval for individual queries can be grouped into homogeneous clusters.
Comparative Evaluation of Multilingual Information Access Systems
The paper discusses the evaluation approach adopted, describes the tracks and tasks offered and the test collections used, and provides an outline of the guidelines given to the participants.
Incremental test collections
An algorithm is presented that intelligently selects documents to be judged and decides when to stop judging, so that a high degree of confidence in the evaluation result can be reached with very little work.
The effect of assessor error on IR system evaluation
This paper examines the robustness of the TREC Million Query track methods when some assessors make significant and systematic errors, and finds that while averages are robust, assessor errors can have a large effect on system rankings.
...
...

References

Showing 1-10 of 39 references
How reliable are the results of large-scale information retrieval experiments?
A detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated: it is likely that many relevant documents have not been found.
INFORMATION RETRIEVAL TEST COLLECTIONS
This short review does not attempt a fully documented survey of all the collections used in the past decade, but representative examples have been studied to throw light on the requirements test collections should meet, and to suggest guidelines for a future ‘ideal’ test collection.
Why batch and user evaluations do not give the same results
Analysis of the TREC Interactive Track showed that, although queries entered by real users into systems that performed better in batch studies gave comparable gains in the ranking of relevant documents for those users, the gains did not translate into better performance on specific tasks.
Efficient construction of large test collections
This work proposes two methods, Interactive Searching and Judging and Move-to-front Pooling, that yield effective test collections while requiring many fewer judgements.
Overview of the Eighth Text REtrieval Conference (TREC-8)
The eighth Text REtrieval Conference (TREC-8) was held at the National Institute of Standards and Technology (NIST) in November 1999; this overview outlines the goals of the series of workshops designed to foster research in text retrieval.
CLEF 2000 - Overview of Results
Details of the various subtasks are presented, and the analysis indicates that the CLEF relevance assessments are of comparable quality to those of the well-known and trusted TREC ad-hoc collections.
The NTCIR Workshop : the First Evaluation Workshop on Japanese Text Retrieval and Cross-Lingual Information Retrieval
This paper outlines the first NTCIR Workshop, the first evaluation workshop designed to enhance research in Japanese text retrieval and cross-lingual information retrieval, and suggests some thoughts on future directions.
Information Retrieval Experiment
The volume's incohesiveness makes this reviewer question its intended audience; nevertheless, it contains a number of quality papers and may be best used as a reference for the included papers on an individual basis.
Cross-Language Information Retrieval and Evaluation
  • C. Peters
  • Lecture Notes in Computer Science
  • 2001
...
...