Bias and the limits of pooling for large collections

Chris Buckley, Darrin L. Dimmick, Ian Soboroff, Ellen M. Voorhees. Information Retrieval.
Modern retrieval test collections are built through a process called pooling in which only a sample of the entire document set is judged for each topic. The idea behind pooling is to find enough relevant documents such that when unjudged documents are assumed to be nonrelevant the resulting judgment set is sufficiently complete and unbiased. Yet a constant-size pool represents an increasingly small percentage of the document set as document sets grow larger, and at some point the assumption of… 
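The pooling process described above can be sketched in a few lines: the judged set is the union of the top-k documents from each participating run, and anything outside that pool is assumed nonrelevant at evaluation time. This is a minimal illustration with made-up run names and an arbitrary depth, not any specific TREC configuration.

```python
# Depth-k pooling sketch: the pool is the union of each run's top-k documents.
# Documents outside the pool are treated as nonrelevant during evaluation.

def build_pool(runs, k):
    """runs: dict mapping run name -> ranked list of doc ids."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

runs = {
    "run_a": ["d1", "d2", "d3", "d4"],
    "run_b": ["d3", "d5", "d1", "d6"],
}
pool = build_pool(runs, k=2)   # union of each run's top 2 documents
```

As the collection grows while k stays fixed, this pool covers an ever-smaller fraction of the documents, which is exactly the scaling problem the paper examines.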

Score adjustment for correction of pooling bias

This paper proposes to estimate the degree of bias against an unpooled system, and to adjust the system's score accordingly, and demonstrates using resampling experiments on TREC test sets that this method leads to a marked reduction in error.
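One common way to estimate pooling bias is leave-one-out: re-score each pooled run against judgments that could have been found without that run's own contributions, and take the average score drop as the penalty an unpooled system suffers. The sketch below uses P@k and toy data purely for illustration; it is an assumed simplification, not the paper's exact adjustment procedure.

```python
# Leave-one-out pooling-bias sketch: for each run, rebuild the relevant set
# using only documents the *other* runs could have contributed, re-score,
# and average the drops. That mean drop approximates the bias against a
# system that did not contribute to the pool.

def p_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / k

def estimate_bias(runs, relevant, k):
    drops = []
    for name, ranking in runs.items():
        others = set()
        for other, r in runs.items():
            if other != name:
                others.update(r[:k])
        reduced = relevant & others        # relevant docs findable without this run
        drops.append(p_at_k(ranking, relevant, k) - p_at_k(ranking, reduced, k))
    return sum(drops) / len(drops)
```

The estimated bias can then be added back onto an unpooled system's measured score.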

The wisdom of the rankers: a cost-effective method for building pooled test collections without participant systems

A simple method for building pooled collections when such restrictions exist is presented, and it is shown that researchers may use it to produce high-quality collections in the absence of participant systems.

On the Robustness of Information Retrieval Metrics to Biased Relevance Assessments

T. Sakai. J. Inf. Process., 2009.
It is shown that the condensed-list versions of Average Precision, Q-measure and normalised Discounted Cumulative Gain, denoted AP', Q' and nDCG', are not necessarily superior to the original metrics for handling biases, but are generally superior to bpref, Rank-Biased Precision and its condensed-list version even in the presence of biases.
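The condensed-list idea is simple enough to show directly: remove every unjudged document from the ranking, then compute the metric as usual over what remains. The sketch below does this for AP'; the document ids are invented.

```python
# Condensed-list AP (AP') sketch: unjudged documents are deleted from the
# ranking rather than being assumed nonrelevant, then ordinary average
# precision is computed over the condensed ranking.

def ap_condensed(ranking, relevant, judged):
    condensed = [d for d in ranking if d in judged]
    hits, total = 0, 0.0
    for i, d in enumerate(condensed, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0
```

With the unjudged document removed rather than counted as nonrelevant, a run is not penalised for retrieving documents the assessors simply never saw.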

Modeling Relevance as a Function of Retrieval Rank

This work investigates the relationship between relevance likelihood and retrieval rank, seeking to identify plausible methods for estimating document relevance and hence computing an inferred gain.
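The inferred-gain idea can be sketched as follows: assume relevance likelihood decays with retrieval rank, and credit unjudged documents with their expected relevance instead of zero. The inverse-rank decay and its scale parameter below are arbitrary illustrative choices, not the model the paper fits.

```python
# Inferred-gain sketch: judged-relevant documents contribute full gain,
# judged-nonrelevant contribute nothing, and unjudged documents contribute
# an assumed rank-based relevance likelihood.

def relevance_likelihood(rank, scale=0.8):
    return scale / rank          # assumed decay model, purely illustrative

def inferred_gain(ranking, judged_relevant, judged_nonrelevant):
    gain = 0.0
    for r, doc in enumerate(ranking, start=1):
        if doc in judged_relevant:
            gain += 1.0
        elif doc not in judged_nonrelevant:   # unjudged: use expected relevance
            gain += relevance_likelihood(r)
    return gain
```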

Revisiting the relationship between document length and relevance

A deeper analysis of document length and relevance, taking into account that test collections are incomplete, indicates that retrieval models should not be tuned to favour longer documents, and that designers of new test collections should take measures against length bias during the pooling process in order to create more reliable and robust test collections.

Feeling lucky?: multi-armed bandits for ordering judgements in pooling-based evaluation

It is shown that simple instantiations of multi-armed bandit models are superior to all previous adjudication strategies, and this leads to theoretically grounded adjudication strategies that improve over the state of the art.
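One simple bandit instantiation in this spirit is Thompson sampling: treat each run as an arm whose reward is "its next document is judged relevant", keep a Beta posterior per run, and sample to decide which run supplies the next document for judging. This is an assumed, simplified variant with toy data, not the paper's exact algorithm.

```python
import random

# Thompson-sampling adjudication sketch: each run holds Beta(alpha, beta)
# statistics; at every step the run with the highest sampled value
# contributes its next unjudged document, and its posterior is updated
# with the judgment outcome.

def thompson_adjudicate(runs, qrels, budget, seed=0):
    rng = random.Random(seed)
    cursors = {name: 0 for name in runs}
    stats = {name: [1, 1] for name in runs}       # Beta(1, 1) priors
    judged = []
    for _ in range(budget):
        live = [n for n in runs if cursors[n] < len(runs[n])]
        if not live:
            break
        pick = max(live, key=lambda n: rng.betavariate(*stats[n]))
        doc = runs[pick][cursors[pick]]
        cursors[pick] += 1
        if doc not in judged:
            judged.append(doc)
            if qrels.get(doc, 0):
                stats[pick][0] += 1               # relevant: bump alpha
            else:
                stats[pick][1] += 1               # nonrelevant: bump beta
    return judged
```

Runs that keep surfacing relevant documents are sampled more often, so the judging budget concentrates where relevance is being found.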

Using Topic Models to Assess Document Relevance in Exploratory Search User Studies

This paper proposes an approach based on topic modeling that can greatly accelerate document relevance judgment of an entire document collection with an expert assessor needing to mark only a small subset of documents from a given collection.

Extending test collection pools without manual runs

By combining a simple voting approach with machine learning from documents retrieved by automatic runs, this work is able to identify a large portion of relevant documents that would normally only be found through manual runs.
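The voting component of this approach can be sketched directly: documents retrieved near the top by many automatic runs collect more votes, and high-vote documents become the candidates for judging. The pool depth here is an arbitrary illustrative choice.

```python
from collections import Counter

# Voting sketch: each automatic run casts one vote per document in its
# top-depth results; candidates are returned in descending vote order.

def vote_candidates(runs, depth=10):
    votes = Counter()
    for ranking in runs.values():
        for doc in ranking[:depth]:
            votes[doc] += 1
    return [doc for doc, _ in votes.most_common()]
```

In the paper this signal is combined with a learned classifier over document features; the voting alone is only the starting point.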

Exploiting Pooling Methods for Building Datasets for Novel Tasks

This article proposes the design of a system for building test collections easily and cheaply by implementing state-of-the-art pooling strategies and simulating competition participants with different retrieval models and query variants.

Efficient construction of large test collections

This work proposes two methods, Interactive Searching and Judging and Move-to-front Pooling, that yield effective test collections while requiring many fewer judgements.
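The Move-to-front idea can be sketched as follows: keep the runs in a priority order, judge documents from the current front run while it keeps producing relevant documents, and demote it when it yields a nonrelevant one. The one-strike demotion rule and toy data below are simplifications for illustration.

```python
from collections import deque

# Move-to-front pooling sketch: the front run supplies documents until it
# returns a nonrelevant one, at which point it rotates to the back of the
# queue; exhausted runs are dropped.

def mtf_pool(runs, qrels, budget):
    order = deque(runs.keys())
    cursors = {name: 0 for name in runs}
    judged = []
    while order and len(judged) < budget:
        name = order[0]
        if cursors[name] >= len(runs[name]):
            order.popleft()                  # run exhausted
            continue
        doc = runs[name][cursors[name]]
        cursors[name] += 1
        if doc in judged:
            continue                         # already judged via another run
        judged.append(doc)
        if not qrels.get(doc, 0):
            order.rotate(-1)                 # nonrelevant: demote this run
    return judged
```

Compared with fixed depth-k pooling, the judging budget is steered toward runs that are currently productive.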

Extreme value theory applied to document retrieval from large collections

An analysis of text retrieval applications that assign query-specific relevance scores to documents drawn from particular collections shows that while P(K) typically will increase with collection size, the phenomenon is not universal and depends on the score distributions and relative proportions of relevant and irrelevant documents in the collection.

Minimal test collections for retrieval evaluation

This work links evaluation with test collection construction to gain an understanding of the minimal judging effort that must be done to have high confidence in the outcome of an evaluation.

On Collection Size and Retrieval Effectiveness

It is empirically confirmed that P@n should decline when moving to a sample collection and that average precision and R-precision should remain constant, and SD theory suggests the use of recall-fallout plots as operating characteristic (OC) curves.

Building a filtering test collection for TREC 2002

This work constructed an entirely new set of search topics for the Reuters Corpus for measuring filtering systems, and found that systems performed very differently on the category topics than on the assessor-built topics.

Retrieval evaluation with incomplete information

It is shown that current evaluation measures are not robust to substantially incomplete relevance judgments, and a new measure is introduced that is both highly correlated with existing measures when complete judgments are available and more robust to incomplete judgment sets.
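The measure introduced in this paper is bpref, which uses only judged documents: each retrieved relevant document is penalised by the number of judged-nonrelevant documents ranked above it. The sketch below follows the common formulation bpref = (1/R) Σ_r (1 − min(n_above_r, R)/min(R, N)); treat the exact normalisation as an assumption rather than a definitive reimplementation.

```python
# bpref sketch: only judged documents influence the score, so unjudged
# documents neither help nor hurt a run. R = |judged relevant|,
# N = |judged nonrelevant|.

def bpref(ranking, relevant, nonrelevant):
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    score, n_above = 0.0, 0
    for doc in ranking:
        if doc in nonrelevant:
            n_above += 1
        elif doc in relevant:
            penalty = min(n_above, R) / min(R, N) if N else 0.0
            score += 1.0 - penalty
    return score / R
```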

HARD Track Overview in TREC 2003: High Accuracy Retrieval from Documents

The High Accuracy Retrieval from Documents (HARD) track explores methods for improving the accuracy of document retrieval systems. It does so by considering three questions. Can additional…

Forming test collections with no system pooling

Three different ways of building test collections where no system pooling is used are explored: a collection formation technique combining manual feedback and multiple systems, an existing method based on pooling the output of multiple manual searches, and a new approach where the ranked output of a single automatic search on a single retrieval system is assessed for relevance.

A statistical method for system evaluation using incomplete judgments

This work considers the problem of large-scale retrieval evaluation, and proposes a statistical method for evaluating retrieval systems using incomplete judgments based on random sampling, which produces unbiased estimates of the standard measures themselves.
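The unbiasedness claim rests on standard sampling theory: if judgments come from a uniform random sample of the pool, scaling each judged-relevant document by the inverse of its inclusion probability gives an unbiased estimate of the pool's relevant count. The sketch below shows that Horvitz-Thompson-style scaling in its simplest uniform form; the paper's estimators for full measures like MAP are more involved.

```python
import random

# Uniform-sampling estimator sketch: judge a random sample of the pool and
# scale the relevant count by 1 / inclusion probability.

def estimate_relevant(pool, qrels, sample_size, seed=0):
    sample = random.Random(seed).sample(sorted(pool), sample_size)
    inclusion_prob = sample_size / len(pool)
    return sum(1 for d in sample if qrels.get(d, 0)) / inclusion_prob
```

When the sample is the whole pool the estimate is exact, and smaller samples trade judging effort for variance rather than bias.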

The effect of topic set size on retrieval experiment error

Using TREC results to empirically derive error rates based on the number of topics used in a test and the observed difference in the average scores indicates researchers need to take care when concluding one method is better than another, especially if few topics are used.