Cumulated gain-based evaluation of IR techniques

@article{Jrvelin2002CumulatedGE,
  title={Cumulated gain-based evaluation of IR techniques},
  author={Kalervo J{\"a}rvelin and Jaana Kek{\"a}l{\"a}inen},
  journal={ACM Trans. Inf. Syst.},
  year={2002},
  volume={20},
  pages={422--446}
}
Modern large retrieval environments tend to overwhelm their users by their large output. Since all documents are not of equal relevance to their users, highly relevant documents should be identified and ranked first for presentation. In order to develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. This can be done by extending traditional evaluation methods, that is… 
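
As a rough illustration of the cumulated gain family of measures that the title and abstract refer to, the sketch below computes a discounted cumulated gain (DCG) vector for a ranked list of graded relevance scores and normalises it by the DCG of the ideal ordering. This is only a minimal sketch: the function names, the default log base b = 2, and the example gain vector are illustrative choices, not code or data taken from the paper.

import math

def dcg(gains, b=2):
    """Discounted cumulated gain vector: gains at ranks below b are added
    undiscounted, and the gain at rank i >= b is divided by log_b(i)."""
    total, out = 0.0, []
    for rank, g in enumerate(gains, start=1):
        total += g if rank < b else g / math.log(rank, b)
        out.append(total)
    return out

def ndcg(gains, b=2):
    """Normalised DCG: the DCG vector divided position-wise by the DCG of the
    ideal (descending-gain) ordering of the same judgements."""
    ideal = dcg(sorted(gains, reverse=True), b)
    return [d / i if i > 0 else 0.0 for d, i in zip(dcg(gains, b), ideal)]

# Example: graded relevance scores (0-3) of the first six retrieved documents.
print(ndcg([3, 2, 3, 0, 1, 2]))

Normalising against the ideal ordering keeps every position's value in [0, 1], which is what makes results comparable across topics that have different numbers of highly relevant documents.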

Citations

Test Collections and Evaluation Metrics Based on Graded Relevance
In modern large information retrieval (IR) environments, the number of documents relevant to a request may easily exceed the number of documents a user is willing to examine. Therefore it is…
Evaluating information retrieval system performance based on user preference
TLDR: It is shown that the notion of user preference is general and flexible for formally defining and interpreting multi-grade relevance, and that it gives higher credit to systems for their ability to retrieve highly relevant documents.
Modeling Relevance as a Function of Retrieval Rank
TLDR: This work investigates the relationship between relevance likelihood and retrieval rank, seeking to identify plausible methods for estimating document relevance and hence computing an inferred gain.
On identifying representative relevant documents
TLDR: This work estimates the extent to which a relevant document can effectively help in finding (other) relevant documents using some relevance-feedback method employed over the corpus, and presents various representativeness estimates.
Visual Comparison of Ranked Result Cumulated Gains
TLDR: A Visual Analytics (VA) environment is presented that allows for visually exploring the ranked retrieval results, pointing out search failures and providing useful insights for improving the underlying IR system's ranking algorithm.
Evaluating Information Retrieval System Performance Based on Multi-grade Relevance
TLDR: Ten existing evaluation methods based on multi-grade relevance are reviewed, and the normalized distance performance measure is found to be the best choice in terms of sensitivity to document rank order and of giving higher credit to systems for their ability to retrieve highly relevant documents.
Cheap IR Evaluation: Fewer Topics, No Relevance Judgements, and Crowdsourced Assessments
TLDR: The aim of this work is to develop and improve state-of-the-art approaches to evaluating IR effectiveness while saving resources, and to propose a novel, more principled and engineered overall approach to test-collection-based effectiveness evaluation.
Pooling-based continuous evaluation of information retrieval systems
TLDR: This paper proposes a new IR evaluation methodology based on pooled test collections and on the continuous use of either crowdsourcing or professional editors to obtain relevance judgements, and proposes two metrics, Fairness Score and opportunistic number of relevant documents, which are used to define new pooling strategies.
Metric and Relevance Mismatch in Retrieval Evaluation
TLDR: This paper investigates relevance mismatch, classifying users based on relevance profiles (the likelihood with which they will judge documents of different relevance levels to be useful), and finds that this classification scheme can offer further insight into the transferability of batch results to real user search tasks.
...

References

Showing 1-10 of 53 references
IR evaluation methods for retrieving highly relevant documents
TLDR: The novel evaluation methods and the case demonstrate that non-dichotomous relevance assessments are applicable in IR experiments, may reveal interesting phenomena, and allow harder testing of IR methods.
Using graded relevance assessments in IR evaluation
TLDR: It is argued that evaluation methods should credit IR methods for their ability to retrieve highly relevant documents, and a novel application of P-R curves and average precision computations based on separate recall bases for documents of different degrees of relevance is proposed.
Evaluation by highly relevant documents
TLDR: To explore the role highly relevant documents play in retrieval system evaluation, assessors for the TREC-9 web track used a three-point relevance scale and also selected best pages for each topic, confirming the hypothesis that different retrieval techniques work better for retrieving highly relevant documents.
Measures of relative relevance and ranked half-life: performance indicators for interactive IR
TLDR: The RR measure describes the degree of agreement between the types of relevance applied in evaluation of information retrieval (IR) systems in a non-binary assessment context and has potential to bridge the gap between subjective and objective relevance.
Liberal relevance criteria of TREC: counting on negligible documents?
TLDR: A four-point relevance scale is introduced and the findings of a project in which TREC-7 and TREC-8 document pools on 38 topics were reassessed are reported; about 50% of the documents assessed as relevant were regarded as marginal.
Evaluating Information Retrieval Systems under the Challenges of Interaction and Multidimensional Dynamic Relevance
TLDR: The rationale for evaluating IR algorithms, the status of traditional evaluation, and the applicability of the proposed novel evaluation methods and measures are examined.
Expected search length: A single measure of retrieval effectiveness based on the weak ordering action of retrieval systems
A measure of document retrieval system performance called the "expected search length reduction factor" is defined and compared with indicators, such as precision and recall, that have been suggested… (the expected-search-length computation itself is sketched after this reference list).
Ranking in Principle
TLDR: This paper explores the possibility of combining the two ranking principles, but concludes that neither is adequate alone, nor can any single all-embracing ranking principle be constructed to replace the two.
How reliable are the results of large-scale information retrieval experiments?
TLDR: A detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated: it is likely that many relevant documents have not been found.
...
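
The expected search length entry above only names Cooper's measure. As a hedged sketch, following the standard formulation of expected search length rather than anything quoted on this page, the code below computes the expected number of non-relevant documents a user examines before finding q relevant ones when the output is a weak ordering, i.e. a sequence of tied levels whose members appear in random order. The function name and the example values are illustrative.

def expected_search_length(levels, q):
    """Expected number of non-relevant documents examined before finding q
    relevant ones. `levels` is a weak ordering: a list of levels, each a list
    of binary relevance values (1 = relevant), with ties inside a level
    assumed to be presented in random order."""
    examined_nonrel = 0   # non-relevant documents in fully traversed levels
    found = 0             # relevant documents found in those levels
    for level in levels:
        rel = sum(level)
        nonrel = len(level) - rel
        if found + rel >= q:
            needed = q - found
            # The rel relevant documents split the nonrel non-relevant ones into
            # rel + 1 gaps of expected size nonrel / (rel + 1); reaching the
            # needed-th relevant document crosses `needed` of those gaps.
            return examined_nonrel + needed * nonrel / (rel + 1)
        found += rel
        examined_nonrel += nonrel
    return float("inf")   # the output contains fewer than q relevant documents

# Example: three tied levels; the user wants two relevant documents.
print(expected_search_length([[1, 0, 0], [0, 1, 1, 0], [1]], q=2))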