Variations in relevance judgments and the measurement of retrieval effectiveness

@article{Voorhees1998VariationsIR,
  title={Variations in relevance judgments and the measurement of retrieval effectiveness},
  author={Ellen M. Voorhees},
  journal={Information Processing \& Management},
  year={1998},
  volume={36},
  pages={697--716}
}
  • E. Voorhees
  • Published 1 August 1998
  • Inf. Process. Manag.
Abstract

Test collections have traditionally been used by information retrieval researchers to improve their retrieval strategies. To be viable as a laboratory tool, a collection must reliably rank different retrieval variants according to their true effectiveness. In particular, the relative effectiveness of two retrieval strategies should be insensitive to modest changes in the relevant document set since individual relevance assessments are known to vary widely. The test collections…
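The stability question the abstract raises (does a change in the relevant set reorder the systems?) can be made concrete with a small sketch. Everything below is hypothetical (the runs and the two assessors' relevant sets): systems are scored by average precision under each judgment set, and the two resulting system orderings are compared with Kendall's tau.

```python
from itertools import combinations

def average_precision(ranking, relevant):
    """Average precision of one ranked list against one relevant set."""
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same system names."""
    pos = {name: i for i, name in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):  # x precedes y in order_a
        if pos[x] < pos[y]:
            concordant += 1
        else:
            discordant += 1
    pairs = len(order_a) * (len(order_a) - 1) / 2
    return (concordant - discordant) / pairs

# Hypothetical ranked runs from three systems on one topic
runs = {
    "sysA": ["d1", "d3", "d2", "d5"],
    "sysB": ["d4", "d2", "d1", "d3"],
    "sysC": ["d5", "d4", "d3", "d1"],
}
qrels_1 = {"d1", "d2"}        # assessor 1's relevant set
qrels_2 = {"d1", "d2", "d4"}  # assessor 2 also judges d4 relevant

def rank_systems(qrels):
    return sorted(runs, key=lambda s: average_precision(runs[s], qrels),
                  reverse=True)

tau = kendall_tau(rank_systems(qrels_1), rank_systems(qrels_2))
```

In this toy example the one extra relevant document swaps the top two systems, so tau falls below 1; the paper's empirical question is how often real assessor disagreement produces such swaps.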

Citing Papers

Ranking retrieval systems without relevance judgments

The initial results are presented of a new evaluation methodology that replaces human relevance judgments with a randomly selected mapping of documents to topics, referred to as pseudo-relevance judgments.
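The random mapping of documents to topics described above can be sketched in a few lines; the pool, the cutoff `k`, and the helper names below are illustrative assumptions, not that paper's exact procedure.

```python
import random

def pseudo_qrels(pool, k, seed=0):
    """Draw k documents from the judging pool at random and treat them
    as the 'relevant' set for a topic (pseudo-relevance judgments)."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(pool), k))

def precision_at(ranking, relevant, n):
    """Precision at cutoff n against a (pseudo-)relevant set."""
    return sum(1 for d in ranking[:n] if d in relevant) / n

# Hypothetical judging pool: the union of all submitted runs for one topic
pool = {"d1", "d2", "d3", "d4", "d5", "d6"}
fake_relevant = pseudo_qrels(pool, k=2)
```

Systems can then be scored against `fake_relevant` exactly as they would be against human qrels; the surprising finding of this line of work is how correlated the resulting system ranking can be with the official one.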

Incremental test collections

An algorithm is presented that intelligently selects documents to be judged and decides when to stop, so that with very little judging work there is a high degree of confidence in the result of the evaluation.

Building reliable test and training collections in information retrieval

In the process of building reliable and efficient test and training collections, methods of selecting the appropriate documents and queries to be judged are investigated and evaluation metrics that can better capture the overall effectiveness of the retrieval systems under study are proposed.

Measurement in information retrieval evaluation

This thesis introduces statistical power analysis to the field of retrieval evaluation, finds that most test collections cannot reliably detect incremental improvements in performance, and proposes the standardization of scores based on the observed results of a set of reference systems for each query.

Understanding and Predicting Characteristics of Test Collections in Information Retrieval

It is shown that the reusability of a test collection can be predicted with high accuracy when the same document collection is used for successive years in an evaluation campaign, as is common in TREC.

Quantifying test collection quality based on the consistency of relevance judgements

It is concluded that there is a clear value in examining, even inserting, ground truth data in test collections, and proposed ways to help minimise the sources of inconsistency when creating future test collections.

Minimal test collections for retrieval evaluation

This work links evaluation with test collection construction to gain an understanding of the minimal judging effort that must be done to have high confidence in the outcome of an evaluation.

Using Global Statistics to Rank Retrieval Systems without Relevance Judgments

A novel method is proposed that uses global statistics to rank retrieval systems without relevance judgments; it outperforms earlier attempts and is adjustable to different effectiveness measures, e.g. MAP and precision at n.
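As a rough illustration of ranking systems without judgments, the sketch below scores each system by the average overlap of its top-n documents with every other system's top-n, a consensus signal. The runs and the particular overlap statistic are assumptions chosen for illustration, not the specific global statistics that paper proposes.

```python
def consensus_ranking(runs, n=10):
    """Rank systems with no relevance judgments: score each system by the
    average overlap of its top-n list with every other system's top-n."""
    tops = {name: set(ranked[:n]) for name, ranked in runs.items()}
    scores = {}
    for name, top in tops.items():
        others = [t for other, t in tops.items() if other != name]
        scores[name] = sum(len(top & t) for t in others) / (len(others) * n)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-ranked lists from three systems
runs_demo = {
    "a": ["d1", "d2"],
    "b": ["d1", "d3"],
    "c": ["d4", "d5"],
}
```

The known weakness of consensus-style methods is that they reward popularity rather than effectiveness, which is why later work (like the global-statistics method above) looks for signals less tied to inter-system agreement.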

Information Retrieval Evaluation with Partial Relevance Judgment

This investigation shows that when only partial relevance judgment is available, mean average precision suffers from several drawbacks: inaccurate values, no explicit explanation, and being subject to the evaluation environment.

Ranking Retrieval Systems with Partial Relevance Judgements

Four system-oriented measures (mean average precision, recall-level precision, normalized discounted cumulative gain, and normalized average precision over all documents) are discussed, and it is shown that averaging values over a set of queries may not be the most reliable approach to ranking a group of retrieval systems.
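Two of the measures named above can be stated compactly. These are the standard textbook definitions (log2 rank discounts for nDCG); the document ids and gain values are hypothetical.

```python
import math

def r_precision(ranking, relevant):
    """Recall-level precision: precision at rank R,
    where R is the number of relevant documents."""
    R = len(relevant)
    return sum(1 for d in ranking[:R] if d in relevant) / R

def ndcg(ranking, gains, k=None):
    """Normalized discounted cumulative gain with log2 rank discounts."""
    k = k or len(ranking)
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranking[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

Unlike the binary measures, nDCG accepts graded gains, which is one reason these papers treat the choice of measure as part of the evaluation design rather than a fixed given.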
...

References

Showing 1-10 of 38 references

Variations in Relevance Judgments and the Evaluation of Retrieval Performance

Relevance Judgments for Assessing Recall

Variations in Relevance Assessments and the Measurement of Retrieval Effectiveness

It is argued that a series of thorough, rigorous, and extensive tests is needed of precisely how, and under what conditions, variations in relevance assessments do, and do not, affect measures of retrieval performance.

Opening the Black Box of ‘Relevance’

The purpose of this project was to identify variables thought to affect relevance judgments and to conduct a series of laboratory studies to determine the effects of these variables on relevance judgments.

Automatic indexing

The various indexing theories are covered and analytical as well as experimental results are given to demonstrate their effectiveness.

Relevance and Information Behavior.

The author's treatment of relevance is restricted to relevance as it relates to the behavior of humans in seeking and using information.

Automatic indexing

The question of whether online retrieval systems can rely entirely on automatic indexing to achieve adequate retrieval effectiveness or whether some manual pre-indexing is still necessary is considered.

Passage-Based Refinement (MultiText Experiments for TREC-6)

The MultiText system retrieves passages, rather than entire documents, that are likely to be relevant to a particular topic; it estimates the probability of relevance from passage length and uses this estimate to construct a compound query for ranking the new data.

Aslib Cranfield research project - Factors determining the performance of indexing systems; Volume 1, Design; Part 2, Appendices

An essential requirement of the project was the cooperation of a large number of research scientists; the response to the request was most satisfactory, and I acknowledge with thanks the generous assistance of some two hundred scientists.

TREC-5 English and Chinese Retrieval Experiments using PIRCS

Two English automatic ad-hoc runs were submitted: pircsAAS uses short topics and pircsAAL employs long topics. Our new avtf*ildf term weighting was used for short queries. 2-stage retrieval was…