Estimating average precision with incomplete and imperfect judgments

@inproceedings{Yilmaz2006EstimatingAP,
  title={Estimating average precision with incomplete and imperfect judgments},
  author={Emine Yilmaz and J. Aslam},
  booktitle={CIKM '06},
  year={2006}
}
We consider the problem of evaluating retrieval systems using incomplete judgment information. Buckley and Voorhees recently demonstrated that retrieval systems can be efficiently and effectively evaluated using incomplete judgments via the bpref measure [6]. When relevance judgments are complete, the value of bpref is an approximation to the value of average precision using complete judgments. However, when relevance judgments are incomplete, the value of bpref deviates from this value, though…
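To make the AP-versus-bpref comparison in the abstract concrete, here is a minimal Python sketch under simplifying assumptions: the ranked list and judgment sets are plain Python collections, the function names are illustrative, and bpref is given in one commonly used form with a min(R, N) denominator rather than the exact definition from [6].

```python
def average_precision(ranking, relevant):
    """Average precision with complete judgments: the mean of
    precision@k over the ranks k that hold relevant documents."""
    hits, precision_sum = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0


def bpref(ranking, relevant, nonrelevant):
    """One common formulation of bpref: each retrieved judged-relevant
    document is penalised by the (capped) fraction of judged-nonrelevant
    documents ranked above it; unjudged documents are ignored entirely."""
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N) or 1  # guard against having no judged nonrelevant docs
    nonrel_seen, score = 0, 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            score += 1.0 - min(nonrel_seen, denom) / denom
    return score / R
```

With complete judgments the two values track each other closely; as the judgment set shrinks, bpref simply skips unjudged documents, which is the source of the deviation the abstract refers to.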
Estimating average precision when judgments are incomplete
TLDR
Three new evaluation measures, induced AP, subcollection AP, and inferred AP, are proposed that are equivalent to average precision when the relevance judgments are complete and that are statistical estimates of average precision when relevance judgments are a random subset of complete judgments.
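As a rough illustration of the inferred AP idea, the sketch below estimates the expected precision at the rank of each sampled relevant document, assuming judgments are a uniform random sample drawn from a judgment pool, documents outside the pool are treated as nonrelevant, and a small Lidstone-style smoothing constant is used; the variable names and the constant are assumptions of this sketch, not the paper's reference implementation.

```python
EPSILON = 1e-5  # illustrative smoothing constant

def inferred_ap(ranking, pool, rel_sample, nonrel_sample):
    """Sketch of inferred AP: average, over the sampled relevant documents,
    of an estimate of the expected precision at each one's rank."""
    if not rel_sample:
        return 0.0
    total = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc not in rel_sample:
            continue
        if k == 1:
            total += 1.0  # the document at rank 1 is itself relevant
            continue
        above = ranking[:k - 1]
        in_pool = sum(1 for d in above if d in pool)
        rel_above = sum(1 for d in above if d in rel_sample)
        nonrel_above = sum(1 for d in above if d in nonrel_sample)
        # Estimated precision among pooled documents above rank k,
        # smoothed so that zero-count cases stay well defined.
        prec_in_pool = (rel_above + EPSILON) / (rel_above + nonrel_above + 2 * EPSILON)
        total += 1.0 / k + ((k - 1) / k) * (in_pool / (k - 1)) * prec_in_pool
    return total / len(rel_sample)
```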
Inferring document relevance from incomplete information
TLDR
This work shows that given estimates of average precision, one can accurately infer the relevances of the remaining unjudged documents, thus obtaining a fully judged pool that can be used in standard ways for system evaluation of all kinds.
Reducing Reliance on Relevance Judgments for System Comparison by Using Expectation-Maximization
TLDR
Experiments using TREC Ad Hoc collections demonstrate strong correlations with system rankings using pooled human judgments, and comparison with existing baselines indicates that the new method achieves the same comparison reliability with fewer human judgments.
On the robustness of relevance measures with incomplete judgments
TLDR
This work investigates the robustness of three widely used IR relevance measures for large data collections with incomplete judgments and shows that NDCG consistently performs better than both bpref and infAP.
Reliable information retrieval evaluation with incomplete and biased judgements
TLDR
This work compares the performance of this method with other approaches to the problem of incomplete judgements, such as bpref, and shows that the proposed method leads to higher evaluation accuracy, especially if the set of manual judgements is rich in documents, but highly biased against some systems.
Evaluation over thousands of queries
TLDR
Investigating tradeoffs between the number of queries and number of judgments shows that, up to a point, evaluation over more queries with fewer judgments is more cost-effective and as reliable as fewer queries with more judgments.
Strategic system comparisons via targeted relevance judgments
TLDR
Using rank-biased precision, a recently proposed effectiveness measure, it is shown that judging around 200 documents for each of 50 queries in a TREC-scale system evaluation containing over 100 runs is sufficient to identify the best systems.
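Rank-biased precision itself reduces to a short geometric sum; the sketch below assumes binary relevance values down the ranking and an illustrative persistence parameter p = 0.95, with unjudged documents scored as 0 to give a lower bound.

```python
def rank_biased_precision(relevances, p=0.95):
    """Rank-biased precision: the user views rank i+1 with probability p
    after viewing rank i, so RBP = (1 - p) * sum_i r_i * p**(i-1)."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))
```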
Reproduce. Generalize. Extend. On Information Retrieval Evaluation without Relevance Judgments
TLDR
This article reproduces the main results on evaluating information retrieval systems without relevance judgments and generalizes such previous work to analyze the effect of test collections, evaluation metrics, and pool depth, showing that previous work is overall reproducible and that semi-automatic evaluation is an effective methodology.
Retrieval sensitivity under training using different measures
TLDR
Experimental results show that training by bpref, infAP and nDCG provides significantly better retrieval performance than training by MAP when relevance judgement completeness is extremely low, and that when relevance judgement completeness increases, the measures behave more similarly.
On information retrieval metrics designed for evaluation with incomplete relevance assessments
TLDR
This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs: the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task.

References

Showing 1-10 of 19 references
On the effectiveness of evaluating retrieval systems in the absence of relevance judgments
TLDR
It is demonstrated that evaluating retrieval systems according to average similarity yields results quite similar to the methodology proposed by Soboroff et al.
Ranking retrieval systems without relevance judgments
TLDR
The initial results of a new evaluation methodology are presented, in which human relevance judgments are replaced with a randomly selected mapping of documents to topics, referred to as pseudo-relevance judgments.
Retrieval evaluation with incomplete information
TLDR
It is shown that current evaluation measures are not robust to substantially incomplete relevance judgments, and a new measure is introduced that is both highly correlated with existing measures when complete judgments are available and more robust to incomplete judgment sets.
Automatic ranking of retrieval systems in imperfect environments
TLDR
The method of simulating imperfect environments can be used for Web search engine assessment and in estimating the effects of network conditions (e.g., network unreliability) on IR system performance.
How reliable are the results of large-scale information retrieval experiments?
TLDR
A detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated: it is likely that many relevant documents have not been found.
A unified model for metasearch, pooling, and system evaluation
TLDR
A unified model is presented which simultaneously solves the problems of fusing the ranked lists of documents in order to obtain a high-quality combined list (metasearch); generating document collections likely to contain large fractions of relevant documents (pooling); and accurately evaluating the underlying retrieval systems with small numbers of relevance judgments (efficient system assessment).
On Collection Size and Retrieval Effectiveness
TLDR
It is empirically confirmed that P@n should decline when moving to a sample collection and that average precision and R-precision should remain constant, and SD theory suggests the use of recall-fallout plots as operating characteristic (OC) curves.
Efficient construction of large test collections
TLDR
This work proposes two methods, Interactive Searching and Judging and Move-to-front Pooling, that yield effective test collections while requiring many fewer judgements.
The maximum entropy method for analyzing retrieval measures
TLDR
For good measures of overall performance, the corresponding maximum entropy distributions can be used to accurately infer precision-recall curves and the values of other measures of performance, and the quality of these inferences far exceeds that predicted by simple retrieval measure correlation, as demonstrated through TREC data.
Evaluation by highly relevant documents
TLDR
To explore the role highly relevant documents play in retrieval system evaluation, assessors for the TREC-9 web track used a three-point relevance scale and also selected best pages for each topic, confirming the hypothesis that different retrieval techniques work better for retrieving highly relevant documents.