Time-based calibration of effectiveness measures

@inproceedings{Smucker2012TimebasedCO,
  title={Time-based calibration of effectiveness measures},
  author={Mark D. Smucker and Charles L. A. Clarke},
  booktitle={SIGIR '12},
  year={2012}
}
Many current effectiveness measures incorporate simplifying assumptions about user behavior. These assumptions prevent the measures from reflecting aspects of the search process that directly impact the quality of retrieval results as experienced by the user. In particular, these measures implicitly model users as working down a list of retrieval results, spending equal time assessing each document. In reality, even a careful user, intending to identify as much relevant material as possible… 
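Although the abstract is truncated above, the measure it introduces, time-biased gain (TBG), discounts each unit of gain by the expected time at which the user reaches it rather than by rank alone. The sketch below (Python) shows the general shape of such a computation; the per-document time model, the unit gain, and the half-life used here are illustrative placeholders, not the calibrated constants reported in the paper.

import math

def time_biased_gain(ranked_docs, half_life=224.0):
    # Sketch of a time-biased-gain style computation.
    # ranked_docs: list of dicts with keys
    #   'relevant' (bool) - binary relevance of the document
    #   'length'   (int)  - document length in words
    # half_life: decay half-life in seconds (illustrative value).
    summary_time = 4.0       # assumed seconds spent on each result summary
    seconds_per_word = 0.02  # assumed reading speed for clicked documents
    doc_overhead = 8.0       # assumed fixed cost per clicked document
    tbg = 0.0
    elapsed = 0.0            # expected time to reach the current point
    for doc in ranked_docs:
        elapsed += summary_time
        if doc['relevant']:
            # In this simplification only relevant documents are read;
            # the paper also models clicks on non-relevant documents.
            elapsed += doc_overhead + seconds_per_word * doc['length']
            decay = math.exp(-math.log(2) * elapsed / half_life)
            tbg += 1.0 * decay  # one unit of gain per relevant document
    return tbg

# Example: three results, the first and third relevant.
ranking = [
    {'relevant': True,  'length': 600},
    {'relevant': False, 'length': 300},
    {'relevant': True,  'length': 900},
]
print(time_biased_gain(ranking))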

Citations

Stochastic simulation of time-biased gain
TLDR
Stochastic simulation is used to numerically approximate time-biased gain, a unifying framework for information retrieval evaluation that generalizes many traditional effectiveness measures while accommodating aspects of user behavior not captured by these measures.
Time well spent
TLDR
A new instantiation of time-biased gain is explored, applicable to systems where the user judges the quality of their experience by the amount of time well spent, and which models user variability and produces a distribution of gain on a per-query basis.
Users versus models: what observation tells us about effectiveness metrics
TLDR
The results show that user behavior is influenced by a blend of many factors, including the extent to which relevant documents are encountered, the stage of the search process, and task difficulty, which can be used to guide development of batch effectiveness metrics.
Models and metrics: IR evaluation as a user process
TLDR
This work explores the linkage between models and metrics, considering a range of effectiveness metrics, and the user search behavior that each of them implies, and examines more complex user models, as a guide to the development of new effectiveness metrics.
Evaluating Contextual Suggestion
TLDR
Building on the time-biased gain framework of Smucker and Clarke, which recognizes time as a critical element in user modeling for evaluation, a new evaluation measure is proposed that directly accommodates these factors.
A Flexible Framework for Offline Effectiveness Metrics
TLDR
This work introduces a user behavior framework that extends the C/W/L family, and carries out experiments comparing the patterns of metric scores generated, and showing that those metrics vary quite markedly in terms of their ability to predict user satisfaction.
User Variability and IR System Evaluation
TLDR
This work explores two aspects of user variability with regard to evaluating the relative performance of IR systems, assessing effectiveness in the context of a subset of topics from three TREC collections, with the embodied information needs categorized against three levels of increasing task complexity.
Assessing efficiency–effectiveness tradeoffs in multi-stage retrieval systems without using relevance judgments
TLDR
Methods for measuring the quality of filtering and preliminary ranking stages are examined, and it is shown that this quality score directly correlates with actual differences in measured effectiveness when relevance judgments are available.
Relevance and Effort: An Analysis of Document Utility
TLDR
It is proposed that if the goal is to evaluate the likelihood of utility to the user, effort as well as relevance should be taken into consideration, and possibly characterized independently, when judgments are obtained.
The twist measure for IR evaluation: Taking user's effort into account
TLDR
A novel measure for ranking evaluation, called Twist (τ), is shown to capture different aspects of system performance, not to require extensive and costly assessments, and to be a robust tool for detecting differences between systems.

References

Showing 1-10 of 47 references
A comparative analysis of cascade measures for novelty and diversity
TLDR
The properties and performance of cascade measures are examined with the goal of validating them as tools for measuring effectiveness, and it is indicated that these measures reward systems that achieve a balance between novelty and overall precision in their result lists, as intended.
A user behavior model for average precision and its generalization to graded judgments
TLDR
A more realistic version of AP is proposed where users click non-deterministically on relevant documents and where the number of relevant documents in the collection need not be known in advance.
An Analysis of User Strategies for Examining and Processing Ranked Lists of Documents
TLDR
A cluster analysis of the search-result processing behavior of 48 user study participants found that participants employed two approaches to the evaluation of summaries: fast and liberal, or slow and neutral.
Rank-biased precision for measurement of retrieval effectiveness
TLDR
A new effectiveness metric, rank-biased precision, is introduced that is derived from a simple model of user behavior, is robust if answer rankings are extended to greater depths, and allows accurate quantification of experimental uncertainty, even when only partial relevance judgments are available.
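For contrast with the time-based view, rank-biased precision discounts by rank position alone: RBP = (1 - p) * sum_i r_i * p^(i-1), where p is the probability that the user continues from one result to the next. A minimal sketch, with p chosen arbitrarily here:

def rank_biased_precision(relevances, p=0.8):
    # relevances: binary (or fractional) relevance of each ranked result.
    # p: persistence parameter - probability of moving to the next result.
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))

# Example: binary judgments for the top five results.
print(rank_biased_precision([1, 0, 1, 1, 0], p=0.8))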
Simulating simple user behavior for system effectiveness evaluation
TLDR
This work proposes that measures that include a parameterized user model offer an opportunity to more accurately simulate the variance due to user behavior, and thus to analyze system effectiveness with respect to a simulated user population.
Modelling A User Population for Designing Information Retrieval Metrics
TLDR
This paper generalises NCP further, demonstrates that AP and its graded-relevance version Q-measure are in fact reasonable metrics despite their uniform probability assumption, and emphasises long-tail users who tend to dig deep into the ranked list, thereby achieving high reliability.
Click-based evidence for decaying weight distributions in search effectiveness metrics
TLDR
A process for extrapolating user observations from query log clickthroughs is described, and this user model is employed to measure the quality of effectiveness weighting distributions, showing that for measures with static distributions, the geometric weighting model employed in the rank-biased precision effectiveness metric offers the closest fit to the user observation model.
Evaluating implicit feedback models using searcher simulations
TLDR
Six different models that base their decisions on searcher interactions and take different approaches to ranking query-modification terms are introduced, to determine which of these models should be used to assist searchers in the systems the authors develop.
Including summaries in system evaluation
TLDR
The popular IR metrics MAP and P@10 are modified to incorporate the summary reading step of the search process, and the effects on system rankings using TREC data are studied.
Cumulated gain-based evaluation of IR techniques
TLDR
This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position, and test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences.
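As a reference point for the cumulated-gain family, a common formulation of discounted cumulated gain divides the gain at rank i by log2(i + 1) and normalizes by the ideal ordering (the original paper applies the discount only beyond a chosen rank b); a minimal sketch:

import math

def dcg(gains):
    # gains: graded relevance values in ranked order.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Example: graded judgments (0-3) for the top six results.
print(ndcg([3, 2, 3, 0, 1, 2]))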