Efficient construction of large test collections

@inproceedings{Cormack1998EfficientCO,
  title={Efficient construction of large test collections},
  author={Gordon V. Cormack and Christopher R. Palmer and Charles L. A. Clarke},
  booktitle={SIGIR '98},
  year={1998}
}
Test collections with a million or more documents are needed for the evaluation of modern information retrieval systems. Yet their construction requires a great deal of effort. Judgements must be rendered as to whether or not documents are relevant to each of a set of queries. Exhaustive judging, in which every document is examined and a judgement rendered, is infeasible for collections of this size. Current practice is represented by the “pooling method”, as used in the TREC conference series… 
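
As context for the abstract, the pooling method can be summarised in a few lines: each participating system contributes its top-ranked documents for every topic, the union of these contributions forms the pool that assessors judge, and documents outside the pool are treated as non-relevant. The following minimal Python sketch illustrates the idea; the run format, the pool depth, and names such as build_pools are illustrative assumptions, not the paper's own implementation.

from collections import defaultdict

def build_pools(runs, depth=100):
    # `runs` maps a system name to {topic_id: [doc_id, ...]} in rank order.
    # Each topic's pool is the union of the top-`depth` documents contributed
    # by every system; only pooled documents are judged, and everything
    # outside the pool is assumed non-relevant at evaluation time.
    pools = defaultdict(set)
    for ranked_by_topic in runs.values():
        for topic, ranking in ranked_by_topic.items():
            pools[topic].update(ranking[:depth])
    return dict(pools)

# Two toy systems and one topic: the pool for t1 is {d1, d2, d3, d7}.
runs = {
    "sysA": {"t1": ["d3", "d7", "d1", "d9"]},
    "sysB": {"t1": ["d7", "d2", "d3", "d8"]},
}
print(build_pools(runs, depth=3))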

Citations

Active Sampling for Large-scale Information Retrieval Evaluation
TLDR
An active sampling method is devised that avoids the bias of the active selection methods towards good systems, and at the same time reduces the variance of the current sampling approaches by placing a distribution over systems, which varies as judgments become available.
Incremental test collections
TLDR
An algorithm that intelligently selects documents to be judged and decides when to stop in such a way that with very little work there can be a high degree of confidence in the result of the evaluation is presented.
Optimizing the construction of information retrieval test collections
TLDR
A probabilistic model is developed that provides accurate relevance judgments with a smaller number of labels collected per document, and should assist research institutes and commercial search engines to construct test collections where there are large document collections and large query logs, but where economic constraints prohibit gathering comprehensive relevance judgments.
Exploiting Pooling Methods for Building Datasets for Novel Tasks
TLDR
This article proposes the design of a system for building test collections easily and cheaply by implementing state-of-the-art pooling strategies and simulating competition participants with different retrieval models and query variants.
Bias and the limits of pooling for large collections
TLDR
It is shown that the judgment sets produced by traditional pooling when the pools are too small relative to the total document set size can be biased in that they favor relevant documents that contain topic title words.
An Active Learning Approach to Efficiently Ranking Retrieval Engines
TLDR
This paper introduces an active learning algorithm whose goal is to reach the correct rankings using the smallest possible number of relevance judgments, always trying to select the document with the highest information gain.
Using document similarity networks to evaluate retrieval systems
TLDR
This work presents a novel technique for evaluating retrieval systems using minimal human judgments and shows that it can effectively compare different retrieval systems using very few relevance judgments while achieving a high correlation with the true rankings of systems.
A Practical Sampling Strategy for Efficient Retrieval Evaluation
TLDR
A new method for large-scale retrieval evaluation based on random sampling which combines the strengths of each of the above methods is proposed, and can be adapted to incorporate both randomly sampled and fixed relevance judgments, as were available in the most recent TREC Terabyte track.
Ranking retrieval systems without relevance judgments
TLDR
Initial results are proposed for a new evaluation methodology that replaces human relevance judgments with a randomly selected mapping of documents to topics, referred to as pseudo-relevance judgments (a brief sketch of this idea appears at the end of this list).
...
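
The last entry above, pseudo-relevance judgments, lends itself to a short illustration: documents are mapped to topics at random in place of human assessment, and systems are then ranked against these artificial judgments. The sketch below assumes pools of candidate documents per topic and runs given as ranked document lists; the sample size, the precision-at-cutoff measure, and names such as pseudo_qrels are hypothetical choices rather than the cited paper's method.

import random
from statistics import mean

def pseudo_qrels(pools, sample_size=10, seed=0):
    # Randomly designate `sample_size` pooled documents per topic as "relevant",
    # replacing human assessment with a random document-to-topic mapping.
    rng = random.Random(seed)
    return {topic: set(rng.sample(sorted(docs), min(sample_size, len(docs))))
            for topic, docs in pools.items()}

def rank_systems(runs, qrels, cutoff=5):
    # Order systems by mean precision at `cutoff` against the (pseudo) judgments.
    scores = {}
    for system, ranked_by_topic in runs.items():
        per_topic = [len(set(ranking[:cutoff]) & qrels[topic]) / cutoff
                     for topic, ranking in ranked_by_topic.items()]
        scores[system] = mean(per_topic)
    return sorted(scores, key=scores.get, reverse=True)

runs = {
    "sysA": {"t1": ["d3", "d7", "d1", "d9", "d5"]},
    "sysB": {"t1": ["d7", "d2", "d3", "d8", "d4"]},
}
pools = {"t1": {"d1", "d2", "d3", "d4", "d5", "d7", "d8", "d9"}}
print(rank_systems(runs, pseudo_qrels(pools, sample_size=3), cutoff=5))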

References

SHOWING 1-10 OF 34 REFERENCES
Building a Large Multilingual Test Collection from Comparable News Documents
We present a novel approach to constructing a large test collection for evaluation of information retrieval systems. This approach relies on a collection of time-sensitive documents, like news…
INFORMATION RETRIEVAL TEST COLLECTIONS
TLDR
This short review does not attempt a fully documented survey of all the collections used in the past decade, but representative examples have been studied to throw light on the requirements test collections should meet, and to suggest guidelines for a future ‘ideal’ test collection.
Variations in relevance judgments and the measurement of retrieval effectiveness
TLDR
Very high correlations were found among the rankings of systems produced using different relevance judgment sets, indicating that the comparative evaluation of retrieval performance is stable despite substantial differences in relevance judgments, and thus reaffirm the use of the TREC collections as laboratory tools.
The Sixth Text REtrieval Conference (TREC-6)
Statistical bases of relevance assessment for the ideal information retrieval test collection
TLDR
The Report is chiefly devoted to the work done by H. Gilbert on a three-month project supported by BLR&DD Grant SI/G/267, and attempts to provide a self-contained and integrated discussion of the whole question of statistically-adequate assessment for retrieval experiment evaluation.
Overview of the first TREC conference
TLDR
There was a large variety of retrieval techniques reported on, including methods using automatic thesauri, sophisticated term weighting, natural language techniques, relevance feedback, and advanced pattern matching.
Overview of the Third Text REtrieval Conference (TREC-3)
TLDR
This conference became the first in a series of ongoing conferences dedicated to encouraging research in retrieval from large-scale test collections, and to encouraging increased interaction among research groups in industry and academia.
How reliable are the results of large-scale information retrieval experiments?
TLDR
A detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated: it is likely that many relevant documents have not been found.
...