Expected reciprocal rank for graded relevance

  title={Expected reciprocal rank for graded relevance},
  author={Olivier Chapelle and Donald Metzler and Ya Zhang and Pierre Grinspan},
  booktitle={Proceedings of the 18th ACM Conference on Information and Knowledge Management},
  • Published 2 November 2009
  • Computer Science
While numerous metrics for information retrieval are available in the case of binary relevance, there is only one commonly used metric for graded relevance, namely the Discounted Cumulative Gain (DCG). […] More precisely, this new metric is defined as the expected reciprocal length of time that the user will take to find a relevant document. This can be seen as an extension of the classical reciprocal rank to the graded relevance case, and we call this metric Expected Reciprocal Rank (ERR). We…
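The ERR definition above can be sketched directly: the user scans the ranking top-down, stops at rank r with a probability derived from that document's grade, and the metric is the expected value of 1/r. A minimal sketch, assuming the paper's standard gain mapping R(g) = (2^g − 1)/2^g_max and a 0..4 grade scale (the scale is an assumption here):

```python
def err(grades, max_grade=4):
    """Expected Reciprocal Rank for a top-ranked-first list of graded labels."""
    p_reach = 1.0   # probability the user reaches this rank without stopping
    score = 0.0
    for rank, g in enumerate(grades, start=1):
        r = (2 ** g - 1) / (2 ** max_grade)  # stop probability at this rank
        score += p_reach * r / rank          # contributes 1/rank if user stops here
        p_reach *= 1 - r
    return score
```

Note how a perfect document at rank 1 nearly exhausts the probability mass, so lower ranks contribute little; this is the cascade-style discounting that distinguishes ERR from DCG's position-only discount.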

Reciprocal Rank Using Web Page Popularity

A new evaluation metric called Reciprocal Rank using Webpage Popularity (RRP) is presented which takes into account not only the document’s relevance judgment, but also its popularity, and as a result correlates better with click metrics than the other evaluation metrics do.

Using preference judgments for novel document retrieval

It is argued that while the user study shows the subtopic model is good, there are many other factors apart from novelty and redundancy that may be influencing user preferences and a new framework is introduced to construct an ideal diversity ranking using only preference judgments, with no explicit subtopic judgments whatsoever.

GAPfm: optimal top-n recommendations for graded relevance domains

This work proposes GAPfm, the Graded Average Precision factor model, which is a latent factor model for top-N recommendation in domains with graded relevance data, and optimizes the Graded Average Precision metric that has been proposed recently for assessing the quality of ranked result lists for graded relevance.

A framework for evaluation and optimization of relevance and novelty-based retrieval

A nugget-based model of utility with a probabilistic model of user behavior leads to a flexible metric that generalizes existing evaluation measures that allows accurate evaluation and optimization of retrieval systems under realistic conditions, and hence allows rapid development and tuning of new algorithms for novelty-based retrieval without the need for user-centric evaluations.

Batch Evaluation Metrics in Information Retrieval: Measures, Scales, and Meaning

It is argued that most current IR metrics are well-founded, and, moreover, that those metrics are more meaningful in their current form than in the proposed “intervalized” versions.

Offline Evaluation by Maximum Similarity to an Ideal Ranking

This work proposes a radical simplification of NDCG to replace it, and proposes rank biased overlap (RBO) to compute this rank similarity, since it was specifically created to address the requirements of rank similarity between search results.
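Rank-biased overlap, mentioned above, scores the agreement between two rankings by averaging prefix overlaps under a geometric top-weighting. A minimal sketch of the simple truncated form (not the extrapolated RBO_ext from Webber et al.; parameter names are illustrative):

```python
def rbo(a, b, p=0.9):
    """Truncated rank-biased overlap of two rankings, top-ranked first.

    p is the persistence parameter: higher p weights deeper prefixes more.
    """
    depth = min(len(a), len(b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(depth):
        seen_a.add(a[d])
        seen_b.add(b[d])
        agreement = len(seen_a & seen_b) / (d + 1)  # overlap of depth-(d+1) prefixes
        score += p ** d * agreement
    return (1 - p) * score
```

Because the sum is truncated at the list length, identical finite lists score 1 − p^depth rather than exactly 1; the extrapolated variant in the original paper corrects for this.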

Evaluating Stochastic Rankings with Expected Exposure

A general evaluation methodology based on expected exposure is proposed, allowing a system, in response to a query, to produce a distribution over rankings instead of a single fixed ranking.

Active Evaluation of Ranking Functions Based on Graded Relevance (Extended Abstract)

This work addresses the problem of estimating ranking performance as accurately as possible on a fixed labeling budget by deriving cost-optimal sampling distributions for the commonly used performance measures Discounted Cumulative Gain (DCG) and Expected Reciprocal Rank (ERR).

Preference based evaluation measures for novelty and diversity

An evaluation framework that not only can consider implicit factors but also handles differences in user preference due to varying underlying information need is proposed and its measures are validated by comparing it to existing measures.

Rank and relevance in novelty and diversity metrics for recommender systems

A formal framework for the definition of novelty and diversity metrics is presented that unifies and generalizes several state-of-the-art metrics and identifies three essential ground concepts at the roots of novelty and diversity: choice, discovery and relevance, upon which the framework is built.

Cumulated gain-based evaluation of IR techniques

This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position, and test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences.
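The cumulated-gain measures described above accumulate each document's gain, discounted by rank, and are usually normalized against an ideal reordering of the same judgments. A minimal sketch of one common instantiation (the 2^g − 1 gain with a log2(rank + 1) discount; Järvelin and Kekäläinen's original formulation allows other gains and discount bases):

```python
import math

def dcg(grades):
    """Discounted cumulative gain of a top-ranked-first list of graded labels."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades, start=1))

def ndcg(grades):
    """DCG normalized by the ideal (descending-grade) ordering of the same labels."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0
```

The normalization makes scores comparable across queries with different numbers of relevant documents, which is what lets nDCG be averaged over a topic set.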

Modelling A User Population for Designing Information Retrieval Metrics

This paper generalises NCP further and demonstrates that AP and its graded-relevance version Q-measure are in fact reasonable metrics despite their uniform probability assumption, and emphasises long-tail users who tend to dig deep into the ranked list, thereby achieving high reliability.

Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks

A model that leverages the millions of clicks received by web search engines to predict document relevance can predict the relevance score of documents that have not been judged and is general enough to be applicable to algorithmic web search results.

Rank-biased precision for measurement of retrieval effectiveness

A new effectiveness metric, rank-biased precision, is introduced that is derived from a simple model of user behavior, is robust if answer rankings are extended to greater depths, and allows accurate quantification of experimental uncertainty, even when only partial relevance judgments are available.
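The user model behind rank-biased precision is a coin flip at each rank: with persistence probability p the user continues to the next result, otherwise they stop. A minimal sketch, taking relevance values in [0, 1] (the binary case from Moffat and Zobel is the special case of 0/1 values):

```python
def rbp(relevances, p=0.8):
    """Rank-biased precision: geometrically discounted sum of relevance values.

    relevances: top-ranked first; p is the user's persistence probability.
    """
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))
```

Because the weights p^i sum to at most 1/(1 − p), RBP is bounded in [0, 1] and the residual weight of unjudged tail documents directly quantifies the measurement uncertainty the summary above mentions.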

Binary and graded relevance in IR evaluations--Comparison of the effects on ranking of IR systems

Alternatives to Bpref

It is shown that the application of Q-measure, normalised Discounted Cumulative Gain or Average Precision to condensed lists, obtained by filtering out all unjudged documents from the original ranked lists, is actually a better solution to the incompleteness problem than bpref.

Minimal test collections for retrieval evaluation

This work links evaluation with test collection construction to gain an understanding of the minimal judging effort that must be done to have high confidence in the outcome of an evaluation.

An experimental comparison of click position-bias models

A cascade model, where users view results from top to bottom and leave as soon as they see a worthwhile document, is the best explanation for position bias in early ranks.
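Under the cascade model sketched above, the probability of a click at rank r is the product of not being satisfied by any earlier result and being satisfied at r. A minimal sketch, with per-document attractiveness values as hypothetical inputs:

```python
def cascade_click_probs(attractiveness):
    """Per-rank click probabilities under the cascade model.

    The user scans top-down and clicks (then leaves) at the first
    satisfying result; attractiveness[i] is that result's stop probability.
    """
    probs = []
    reach = 1.0  # probability the user reaches this rank
    for a in attractiveness:
        probs.append(reach * a)
        reach *= 1 - a
    return probs
```

This is the same examination process that ERR builds on, except that ERR derives the stop probabilities from relevance grades rather than fitting them to click logs.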

Expected search length: A single measure of retrieval effectiveness based on the weak ordering action of retrieval systems

A measure of document retrieval system performance called the "expected search length reduction factor" is defined and compared with previously suggested indicators such as precision and recall.

Novelty and diversity in information retrieval evaluation

This paper develops a framework for evaluation that systematically rewards novelty and diversity into a specific evaluation measure, based on cumulative gain, and demonstrates the feasibility of this approach using a test collection based on the TREC question answering track.