Cumulated gain-based evaluation of IR techniques

Kalervo Järvelin and Jaana Kekäläinen. ACM Trans. Inf. Syst.
Modern large retrieval environments tend to overwhelm their users with their large output. Since not all documents are equally relevant to their users, highly relevant documents should be identified and ranked first for presentation. To develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. This can be done by extending traditional evaluation methods, that is…
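The abstract above introduces the cumulated gain family of measures (CG, DCG, nDCG). The following is a minimal sketch of the idea, assuming the original log-discount formulation in which ranks at or below the log base are not discounted; the function names and interface are illustrative, not the paper's own code.

```python
import math

def dcg(gains, log_base=2):
    """Discounted cumulated gain: sum graded relevance gains down the
    ranking, discounting each gain by the log of its rank."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        # Ranks up to the log base are assumed undiscounted (discount 1).
        discount = math.log(rank, log_base) if rank > log_base else 1.0
        total += gain / discount
    return total

def ndcg(gains, log_base=2):
    """Normalise DCG by the DCG of the ideal (descending-gain) ranking,
    so a perfect ranking scores 1.0."""
    ideal = dcg(sorted(gains, reverse=True), log_base)
    return dcg(gains, log_base) / ideal if ideal > 0 else 0.0
```

For example, a run that returns documents with graded gains [3, 2, 1, 0] in that order is already ideally ordered, so its nDCG is 1.0, while any inversion lowers the score.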


Evaluating information retrieval system performance based on user preference
It is shown that the notion of user preference is general and flexible enough to formally define and interpret multi-grade relevance, and that the resulting measure gives higher credit to systems for their ability to retrieve highly relevant documents.
Modeling Relevance as a Function of Retrieval Rank
This work investigates the relationship between relevance likelihood and retrieval rank, seeking to identify plausible methods for estimating document relevance and hence computing an inferred gain.
On identifying representative relevant documents
This work estimates the extent to which a relevant document can effectively help in finding (other) relevant documents using some relevance-feedback method employed over the corpus, and presents various representativeness estimates.
Visual Comparison of Ranked Result Cumulated Gains
A Visual Analytics (VA) environment is presented that allows for visually exploring the ranked retrieval results, pointing out the search failures and providing useful insights for improving the underlying IRS ranking algorithm.
Evaluating Information Retrieval System Performance Based on Multi-grade Relevance
Ten existing evaluation methods based on multi-grade relevance are reviewed; the normalized distance performance measure is found to be the best choice in terms of sensitivity to document rank order and of crediting systems for their ability to retrieve highly relevant documents.
Cheap IR Evaluation: Fewer Topics, No Relevance Judgements, and Crowdsourced Assessments
The aim of this work is to develop and improve state-of-the-art approaches to evaluating IR effectiveness while saving resources, and to propose a novel, more principled and engineered overall approach to test-collection-based effectiveness evaluation.
Pooling-based continuous evaluation of information retrieval systems
This paper proposes a new IR evaluation methodology based on pooled test collections and on the continuous use of either crowdsourcing or professional editors to obtain relevance judgements, and introduces two metrics, Fairness Score and the opportunistic number of relevant documents, which are used to define new pooling strategies.
On information retrieval metrics designed for evaluation with incomplete relevance assessments
This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs: the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the cross-lingual task.
On the Properties of Evaluation Metrics for Finding One Highly Relevant Document
It is concluded that P(+)-measure and O-measure, each modelling a different user behaviour, are the most useful evaluation metrics for the task of finding one highly relevant document.


IR evaluation methods for retrieving highly relevant documents
The novel evaluation methods and the case demonstrate that non-dichotomous relevance assessments are applicable in IR experiments, may reveal interesting phenomena, and allow harder testing of IR methods.
Evaluation by highly relevant documents
To explore the role highly relevant documents play in retrieval system evaluation, assessors for the TREC-9 web track used a three-point relevance scale and also selected best pages for each topic, confirming the hypothesis that different retrieval techniques work better for retrieving highly relevant documents.
Measures of relative relevance and ranked half-life: performance indicators for interactive IR
The RR measure describes the degree of agreement between the types of relevance applied in the evaluation of information retrieval (IR) systems in a non-binary assessment context, and has the potential to bridge the gap between subjective and objective relevance.
Liberal relevance criteria of TREC: counting on negligible documents?
A four-point relevance scale is introduced and the findings of a project in which TREC-7 and TREC-8 document pools on 38 topics were reassessed are reported, finding that about 50% of documents assessed as relevant were regarded as marginal.
The rationale of evaluating the IR algorithms, the status of the traditional evaluation, and the applicability of the proposed novel evaluation methods and measures are examined.
Expected search length: A single measure of retrieval effectiveness based on the weak ordering action of retrieval systems
A measure of document retrieval system performance called the “expected search length reduction factor” is defined and compared with previously suggested indicators such as precision and recall…
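The entry above refers to Cooper's expected search length: the expected number of non-relevant documents a user must examine, under a weak ordering (tied preference levels), before finding the number of relevant documents they need. Below is a minimal sketch assuming that weak-ordering formulation; the function name and the (relevant, non-relevant) per-level interface are illustrative.

```python
def expected_search_length(levels, wanted):
    """Expected number of non-relevant documents examined before `wanted`
    relevant ones are found, under a weak ordering of the output.

    levels: list of (relevant, nonrelevant) document counts per preference
            level, strongest preference first.
    """
    skipped = 0   # non-relevant docs in levels examined in full
    need = wanted
    for rel, nonrel in levels:
        if rel >= need:
            # Final level reached: documents within a level are assumed to be
            # in random order, so the expected number of non-relevant docs
            # seen before the need-th relevant one is nonrel * need / (rel + 1).
            return skipped + nonrel * need / (rel + 1)
        need -= rel
        skipped += nonrel
    raise ValueError("not enough relevant documents in the ranking")
```

For example, a single level holding 2 relevant and 3 non-relevant documents gives an expected search length of 1.0 when the user wants one relevant document: on average one non-relevant document is seen first.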
How reliable are the results of large-scale information retrieval experiments?
A detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated: it is likely that many relevant documents have not been found.
Generic summaries for indexing in information retrieval
This paper examines the use of generic summaries for indexing in information retrieval. Our main observations are that: (1) With or without pseudo-relevance feedback, a summary index may be as…