Corpus ID: 17569964

A Practical Sampling Strategy for Efficient Retrieval Evaluation

@inproceedings{Aslam2007APS,
  title={A Practical Sampling Strategy for Efficient Retrieval Evaluation},
  author={J. Aslam and Virgil Pavlu},
  year={2007}
}
We consider the problem of large-scale retrieval evaluation, with a focus on the considerable effort required to judge tens of thousands of documents using traditional test collection construction methodologies. Recently, two methods based on random sampling were proposed to help alleviate this burden: while the first method, proposed by Aslam et al., is very accurate and efficient, it is also very complex, and while the second method, proposed by Yilmaz et al., is relatively simple, its accuracy…
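The truncated abstract does not spell out either estimator, but the common thread in this line of work is to judge only a random sample of the pooled documents and then correct for the sampling when computing evaluation measures. The Python sketch below is a minimal illustration of that general idea under stated assumptions, not the authors' exact method: documents are sampled with known (possibly non-uniform) inclusion probabilities, and precision at k is estimated by inverse-probability weighting. All names (sample_judgments, estimate_precision_at_k, inclusion_prob) are hypothetical.

```python
import random

def sample_judgments(pool, inclusion_prob, judge):
    """Judge only a random subset of the pool.

    Each document is included independently with its own known probability;
    only sampled documents are sent to the assessor.
    Returns {doc_id: (is_relevant, inclusion_probability)}.
    """
    judged = {}
    for doc in pool:
        p = inclusion_prob[doc]
        if random.random() < p:
            judged[doc] = (judge(doc), p)
    return judged

def estimate_precision_at_k(ranked_list, judged, k):
    """Inverse-probability-weighted estimate of precision@k.

    A sampled relevant document counts 1/p instead of 1, which makes the
    estimate unbiased with respect to the random sampling; unsampled
    documents contribute nothing rather than being assumed non-relevant.
    """
    total = 0.0
    for doc in ranked_list[:k]:
        if doc in judged:
            is_relevant, p = judged[doc]
            if is_relevant:
                total += 1.0 / p
    return total / k

if __name__ == "__main__":
    # Toy setup: a 100-document pool with higher inclusion probabilities for
    # documents ranked highly by some system, since errors near the top of a
    # ranking matter most.  The "true" relevance set is invented for the demo.
    pool = [f"d{i}" for i in range(100)]
    inclusion_prob = {doc: 0.8 if i < 20 else 0.2 for i, doc in enumerate(pool)}
    truly_relevant = {f"d{i}" for i in range(0, 100, 5)}
    judged = sample_judgments(pool, inclusion_prob, lambda d: d in truly_relevant)

    system_ranking = pool  # one system's run for the query
    print("estimated P@10:", estimate_precision_at_k(system_ranking, judged, 10))
```

Roughly speaking, the expected value of such an estimate equals the true precision@k, and the judging cost is controlled by the inclusion probabilities; the sampling-based methods discussed here and in the references differ mainly in how they choose those probabilities and in which measures (for example, average precision) they estimate.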
Active Sampling for Large-scale Information Retrieval Evaluation
TLDR: An active sampling method is devised that avoids the bias of active selection methods towards good systems and, at the same time, reduces the variance of current sampling approaches by placing a distribution over systems that varies as judgments become available.
Automatic methods for low-cost evaluation and position-aware models for neural information retrieval
TLDR: Novel neural IR models are developed to incorporate different patterns such as term dependency, query proximity, density of relevance, and query coverage in a single model, inspired by recent advances in deep learning.
Optimizing the construction of information retrieval test collections
TLDR: A probabilistic model is developed that provides accurate relevance judgments with a smaller number of labels collected per document; it should assist research institutes and commercial search engines in constructing test collections where there are large document collections and large query logs, but where economic constraints prohibit gathering comprehensive relevance judgments.
Learning to Effectively Select Topics For Information Retrieval Test Collections
TLDR: A new learning-to-rank topic selection method is proposed which reduces the number of search topics needed for reliable evaluation of IR systems; the results show not only that IR systems can be reliably evaluated using fewer topics, but also that, when topics are intelligently selected, deep judging is often more cost-effective than shallow judging in achieving the same level of evaluation reliability.
Intelligent topic selection for low-cost information retrieval evaluation: a new perspective on deep vs. shallow judging
TLDR: This paper proposes a new intelligent topic selection method which reduces the number of search topics (and thereby costly human relevance judgments) needed for reliable IR evaluation, and challenges conventional wisdom in showing that deep judging is often preferable to shallow judging when topics are selected intelligently.
Pooling-based continuous evaluation of information retrieval systems
TLDR: This paper proposes a new IR evaluation methodology based on pooled test collections and on the continuous use of either crowdsourcing or professional editors to obtain relevance judgements, and proposes two metrics, Fairness Score and opportunistic number of relevant documents, which are used to define new pooling strategies.
Million Query Track 2007 Overview
TLDR: The Million Query track, which ran for the first time in TREC 2007, was an exploration of ad-hoc retrieval on a large collection of documents; it investigated questions of system evaluation, particularly whether it is better to evaluate using many shallow judgments or fewer thorough judgments.
Increasing the Efficiency of High-Recall Information Retrieval
TLDR: It is hypothesized that the total assessment effort needed to achieve high recall can be reduced by using shorter document excerpts in place of full documents for the assessment of relevance, and by using a high-recall retrieval system based on continuous active learning (CAL).
Information Retrieval Evaluation (D. Harman, 2011)
TLDR: This lecture starts with a discussion of the early evaluation of information retrieval systems, starting with the Cranfield testing in the early 1960s, continuing with the Lancaster "user" study for MEDLARS, and presenting the various test collection investigations by the SMART project and by groups in Britain.
A generic approach to component-level evaluation in information retrieval
TLDR: The thesis focuses on the key components needed to address typical ad-hoc search tasks, such as finding books on a particular topic in a large set of library records, with the aim of eliminating black-box retrieval systems.

References

Showing 1-10 of 24 references
Efficient construction of large test collections
TLDR: This work proposes two methods, Interactive Searching and Judging and Move-to-front Pooling, that yield effective test collections while requiring many fewer judgements.
Ranking retrieval systems without relevance judgments
TLDR: Initial results are presented for a new evaluation methodology that replaces human relevance judgments with a randomly selected mapping of documents to topics, referred to as pseudo-relevance judgments.
A statistical method for system evaluation using incomplete judgments
TLDR: This work considers the problem of large-scale retrieval evaluation and proposes a statistical method, based on random sampling, for evaluating retrieval systems using incomplete judgments, which produces unbiased estimates of the standard measures themselves.
Minimal test collections for retrieval evaluation
TLDR: This work links evaluation with test collection construction to gain an understanding of the minimal judging effort that must be done to have high confidence in the outcome of an evaluation.
A unified model for metasearch, pooling, and system evaluation
TLDR: A unified model is presented which simultaneously solves the problems of fusing the ranked lists of documents in order to obtain a high-quality combined list (metasearch); generating document collections likely to contain large fractions of relevant documents (pooling); and accurately evaluating the underlying retrieval systems with small numbers of relevance judgments (efficient system assessment).
Estimating average precision with incomplete and imperfect judgments
TLDR: This work proposes three evaluation measures that are approximations to average precision even when the relevance judgments are incomplete and that are more robust to incomplete or imperfect relevance judgments than bpref, and proposes estimates of average precision that are simple and accurate.
Query Hardness Estimation Using Jensen-Shannon Divergence Among Multiple Scoring Functions
TLDR: It is proposed that the results returned by multiple retrieval engines will be relatively similar for "easy" queries but more diverse for "difficult" queries. (A minimal sketch of the divergence computation appears after this reference list.)
Measure-based metasearch
TLDR: Experimental results indicate that system-oriented measures of overall retrieval performance (such as average precision) yield good metasearch algorithms whose performance equals or exceeds that of benchmark techniques such as CombMNZ and Condorcet.
Retrieval evaluation with incomplete information
TLDR: It is shown that current evaluation measures are not robust to substantially incomplete relevance judgments, and a new measure is introduced that is both highly correlated with existing measures when complete judgments are available and more robust to incomplete judgment sets.
Overview of the Third Text REtrieval Conference (TREC-3)
TLDR: This conference became the first in a series of ongoing conferences dedicated to encouraging research in retrieval from large-scale test collections, and to encouraging increased interaction among research groups in industry and academia.
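As a brief aside on the Jensen-Shannon divergence reference above: once each engine's retrieval scores for a query are normalized into a probability distribution over documents, the divergence is the entropy of the averaged distribution minus the average of the individual entropies, and it grows as the engines disagree. The sketch below illustrates only that computation under these assumptions; how the cited paper actually builds its query-hardness features from the divergence may differ, and all names here are illustrative.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a distribution given as {doc_id: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def jensen_shannon(distributions):
    """Generalized Jensen-Shannon divergence (uniform weights) of several
    distributions over document ids: H(mixture) minus the mean individual entropy."""
    docs = set().union(*distributions)
    n = len(distributions)
    mixture = {d: sum(dist.get(d, 0.0) for dist in distributions) / n for d in docs}
    return entropy(mixture) - sum(entropy(dist) for dist in distributions) / n

def scores_to_distribution(scores):
    """Normalize one engine's non-negative retrieval scores into a distribution."""
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()} if total > 0 else {}

if __name__ == "__main__":
    # Two hypothetical engines for one query; overlapping results give a low
    # divergence, disjoint results a high one (suggesting a "difficult" query).
    a = scores_to_distribution({"d1": 3.0, "d2": 2.0, "d3": 1.0})
    b = scores_to_distribution({"d1": 2.5, "d2": 2.0, "d4": 0.5})
    print("JSD:", jensen_shannon([a, b]))
```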