User performance versus precision measures for simple search tasks

@inproceedings{Turpin2006UserPV,
  title={User performance versus precision measures for simple search tasks},
  author={Andrew Turpin and Falk Scholer},
  booktitle={Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2006}
}
  • A. Turpin, F. Scholer
  • Published 6 August 2006
  • Computer Science
Several recent studies have demonstrated that the kinds of improvements in information retrieval system effectiveness reported in forums such as SIGIR and TREC do not translate into a benefit for users. Two of the studies used an instance recall task, and a third used a question answering task, so it is perhaps unsurprising that precision-based measures of IR system effectiveness on one-shot query evaluation do not correlate with user performance on these tasks. In this study, we evaluate…
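The precision-based, one-shot query measures that the paper contrasts with user task performance are standard batch metrics such as precision at a cutoff and average precision. The sketch below shows how these are typically computed; it assumes binary relevance judgments, and the function names and example data are illustrative rather than taken from the paper.

```python
# Illustrative sketch of standard precision-based batch measures
# (precision at cutoff k and average precision), assuming binary
# relevance judgments. Document IDs and values are made up.

def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_docs[:k]
    return sum(1 for d in top_k if d in relevant) / k

def average_precision(ranked_docs, relevant):
    """Mean of precision values at the rank of each relevant document retrieved."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# One-shot evaluation of a single ranked answer list against judged documents.
ranking = ["d3", "d1", "d7", "d2", "d9"]
qrels = {"d1", "d2", "d4"}
print(precision_at_k(ranking, qrels, 5))  # 0.4
print(average_precision(ranking, qrels))  # ((1/2) + (2/4)) / 3 = 0.333...
```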
A comparison of evaluation measures given how users perform on search tasks
TLDR
This paper investigates the relationship between various retrieval metrics and how they reflect user search performance, and suggests that there are two distinct categories of measures: those that focus on high precision in an answer list, and those that attempt to capture a broader summary, for example by including a recall component.
Predicting query performance for user-based search tasks
TLDR
The preliminary results show that the performance of the predictors differs strongly when using system-based compared to user-based performance measures: predictors that are significantly correlated with one measurement are often not correlated with the other.
Comparing the sensitivity of information retrieval metrics
TLDR
This work studies interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity and agreement, and presents some new forms of analysis, including an approach to enhance the sensitivity of interleaving.
Including summaries in system evaluation
TLDR
The popular IR metrics MAP and P@10 are modified to incorporate the summary reading step of the search process, and the effects on system rankings using TREC data are studied.
On the Properties of Evaluation Metrics for Finding One Highly Relevant Document
TLDR
It is concluded that P+-measure and O-measure, each modelling a different user behaviour, are the most useful evaluation metrics for the task of finding one highly relevant document.
Comparing System Evaluation with User Experiments for Japanese Web Navigational Retrieval
We conducted a search experiment targeting 31 users to investigate whether the performance evaluation metrics of IR systems used in test collections, such as TREC and NTCIR, are comparable to the…
Models and metrics: IR evaluation as a user process
TLDR
This work explores the linkage between models and metrics, considering a range of effectiveness metrics, and the user search behavior that each of them implies, and examines more complex user models, as a guide to the development of new effectiveness metrics.
Measurement in information retrieval evaluation
TLDR
This thesis introduces the use of statistical power analysis to the field of retrieval evaluation, finding that most test collections cannot reliably detect incremental improvements in performance, and proposes the standardization of scores based on the observed results of a set of reference systems for each query.
A comparison of user and system query performance predictions
TLDR
Comparing the performance ratings users assign to queries with the performance scores estimated by a range of pre-retrieval and post-retrieval query performance predictors suggests that such methods are not representative of how users actually rate query suggestions and topics.
Metric and Relevance Mismatch in Retrieval Evaluation
TLDR
This paper investigates relevance mismatch, classifying users based on relevance profiles, the likelihood with which they will judge documents of different relevance levels to be useful, and finds that this classification scheme can offer further insight into the transferability of batch results to real user search tasks.

References

When will information retrieval be "good enough"?
TLDR
It is found that as system accuracy improves, subject time on task and error rate decrease, and the rate of finding new correct answers increases, suggesting that there is some threshold of accuracy for this task beyond which user utility improves rapidly.
Why batch and user evaluations do not give the same results
TLDR
Assessment of the TREC Interactive Track showed that while the queries entered by real users into systems that yielded better results in batch studies gave comparable gains in the ranking of relevant documents for those users, these gains did not translate into better performance on specific tasks.
Engineering a multi-purpose test collection for Web retrieval experiments
TLDR
A site (homepage) finding experiment confirms that WT10g contains exploitable link information, and the results show that, on this task, Okapi BM25 works better on propagated link anchor text than on full text.
Advantages of query biased summaries in information retrieval
TLDR
Investigation into the utility of document summarisation in the context of information retrieval, and the application of so-called query-biased (or user-directed) summaries, indicates that the use of query-biased summaries significantly improves both the accuracy and speed of user relevance judgements.
Variations in relevance judgments and the measurement of retrieval effectiveness
TLDR
Very high correlations were found among the rankings of systems produced using different relevance judgment sets, indicating that the comparative evaluation of retrieval performance is stable despite substantial differences in relevance judgments, and thus reaffirm the use of the TREC collections as laboratory tools.
The TREC-9 Interactive Track Report
TLDR
This report summarizes the shared experimental framework, which for TREC-9 was designed to support analysis and comparison of system performance only within sites.
Overview of the TREC 2003 Web Track
TLDR
Studies conducted by the two participating groups compared a search engine using automatic topic distillation features with the same engine with those features disabled, in order to determine whether the automatic topic distillation features assisted the users in performing their tasks and whether humans could achieve better results than the automatic system.
Do batch and user evaluations give the same results?
TLDR
The results showed that the weighting scheme giving beneficial results in batch studies did not do so with real users; other factors predictive of instance recall were identified, including the number of documents saved by the user, document recall, and the number of papers seen by the user.
Order effects: A study of the possible influence of presentation order on user judgments of document relevance
TLDR
This article describes an effort to study whether the order of document presentation to judges influences the relevance scores assigned to those documents, and finds that the judgments were indeed influenced by the order of document presentation.