Why batch and user evaluations do not give the same results

@inproceedings{Turpin2001WhyBA,
  title={Why batch and user evaluations do not give the same results},
  author={A. Turpin and W. Hersh},
  booktitle={SIGIR '01},
  year={2001}
}
Much system-oriented evaluation of information retrieval systems has used the Cranfield approach based upon queries run against test collections in a batch mode. Some researchers have questioned whether this approach can be applied to the real world, but little data exists for or against that assertion. We have studied this question in the context of the TREC Interactive Track. Previous results demonstrated that improved performance as measured by relevance-based metrics in batch studies did…
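
The batch half of this comparison comes down to scoring a ranked run against fixed relevance judgments. The following sketch (plain Python; the function names, toy run, and qrels are illustrative, not taken from the paper) shows two such relevance-based metrics, precision at k and average precision, as they are typically computed in Cranfield-style experiments.

def precision_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the top-k retrieved documents that are judged relevant.
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    # Sum of precision values at each rank where a relevant document appears,
    # normalised by the total number of judged-relevant documents.
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

# Toy example: one query, one system's ranking, and its judged-relevant set.
run = ["d3", "d7", "d1", "d9", "d4"]
qrels = {"d1", "d3", "d8"}
print(precision_at_k(run, qrels, 5))   # 0.4
print(average_precision(run, qrels))   # about 0.556

Averaging such per-topic average precision values over all topics gives MAP, the kind of batch score whose improvements the paper tests against real-user outcomes.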

Citations

User interface effects in past batch versus user experiments
TLDR
The Okapi-based system clearly outperforms the basic system on the standard precision and recall measures commonly used to compare IR systems in forums such as TREC and SIGIR, a result that calls into question the appropriateness of relying on measurements of IR system performance obtained in a batch setting using the Cranfield-style methodology.
Metric and Relevance Mismatch in Retrieval Evaluation
TLDR
This paper investigates relevance mismatch, classifying users based on relevance profiles (the likelihood with which they will judge documents of different relevance levels to be useful), and finds that this classification scheme can offer further insight into the transferability of batch results to real user search tasks.
User performance versus precision measures for simple search tasks
TLDR
This study evaluates two different information retrieval tasks on TREC Web-track data: a precision-based user task, measured by the length of time that users need to find a single document that is relevant to a TREC topic; and a simple recall-based task, represented by the total number of relevant documents that users can identify within five minutes.
Including summaries in system evaluation
TLDR
The popular IR metrics MAP and P@10 are modified to incorporate the summary reading step of the search process, and the effects on system rankings using TREC data are studied.
Comparing System Evaluation with User Experiments for Japanese Web Navigational Retrieval
We conducted a search experiment targeting 31 users to investigate whether the performance evaluation metrics of IR systems used in test collections, such as TREC and NTCIR, are comparable to the…
On Obtaining Effort Based Judgements for Information Retrieval
TLDR
This work shows that it is possible to get judgements of effort from assessors, and that, given documents of the same relevance grade, the effort needed to find the portion of the document relevant to the query is a significant factor in determining user satisfaction as well as user preference between these documents.
Studies on Relevance, Ranking and Results Display
TLDR
It is posited that users make such judgments in limited time, and that time optimization per task might help explain some of the findings.
Test Collection-Based IR Evaluation Needs Extension toward Sessions - A Case of Extremely Short Queries
TLDR
The experimental results show that, surprisingly, web-like very short queries typically lead to good enough results even in a TREC-type test collection, which motivates the observed real user behavior.
Simulating Search Sessions in Interactive Information Retrieval Evaluation
TLDR
The findings indicate that this approach of applying relevance feedback (RF) is significantly more effective than pseudo-relevance feedback (PRF) with short (title) queries and long (title and description) queries.
The good and the bad system: does the test collection predict users' effectiveness?
TLDR
It is shown that users behave differently and discern differences between pairs of systems that have a very small absolute difference in test collection effectiveness, confirming that users' effectiveness can be predicted successfully.

References

Showing 1-10 of 20 references
Do batch and user evaluations give the same results?
TLDR
The results showed that the weighting scheme giving beneficial results in batch studies did not do so with real users, and identified other factors predictive of instance recall, including the number of documents saved by the user, document recall, and the number of papers seen by the users.
Further Analysis of Whether Batch and User Evaluations Give the Same Results with a Question-Answering Task
TLDR
This year's experiments using the new question-answering task adopted in the TREC-9 Interactive Track show that better performance in batch searching evaluation does not translate into gains for real users.
Evaluating Evaluation Measure Stability
TLDR
A novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments is presented, which validates several of the rules of thumb experimenters use and challenges other beliefs, such as the assumption that the common evaluation measures are equally reliable.
The TREC-9 Interactive Track Report
TLDR
This report summarizes the shared experimental framework, which for TREC-9 was designed to support analysis and comparison of system performance only within sites. Expand
Aslib Cranfield research project - Factors determining the performance of indexing systems; Volume 2
The test results are presented for a number of different index languages using various devices which affect recall or precision. Within the environment of this test, it is shown that…
Aslib Cranfield research project - Factors determining the performance of indexing systems; Volume 1, Design; Part 2, Appendices
TLDR
An essential requirement of the project involved the cooperation of a large number of research scientists; the response to the request was most satisfactory, and I acknowledge with thanks the generous assistance of some two hundred scientists.
Overview of the first TREC conference
TLDR
There was a large variety of retrieval techniques reported on, including methods using automatic thesauri, sophisticated term weighting, natural language techniques, relevance feedback, and advanced pattern matching.
Information Retrieval as a Trial-And-Error Process
TLDR
This paper examines three important and well-known information retrieval experiments, with a focus on certain internal inconsistencies and on the high variability of search results.
Relevance and Retrieval Evaluation: Perspectives from Medicine
  W. Hersh, J. Am. Soc. Inf. Sci., 1994
TLDR
An iterative model of retrieval evaluation is proposed, starting first with the use of topical relevance to ensure documents on the subject can be retrieved, followed by the use of situational relevance to show the user can interact positively with the system.
Term-Weighting Approaches in Automatic Text Retrieval
TLDR
This paper summarizes the insights gained in automatic term weighting, and provides baseline single-term indexing models with which other more elaborate content analysis procedures can be compared.