Why batch and user evaluations do not give the same results

Andrew Turpin and William R. Hersh
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
1 September 2001
Much system-oriented evaluation of information retrieval systems has used the Cranfield approach based upon queries run against test collections in a batch mode. Some researchers have questioned whether this approach can be applied to the real world, but little data exists for or against that assertion. We have studied this question in the context of the TREC Interactive Track. Previous results demonstrated that improved performance as measured by relevance-based metrics in batch studies did… 

Tables from this paper

User interface effects in past batch versus user experiments

The Okapi-based system clearly outperforms the basic system on the standard precision and recall measures commonly used to compare IR systems in forums such as TREC and SIGIR; this calls into question the appropriateness of relying on measurements of IR system performance obtained in a batch setting using the Cranfield-style methodology.
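Precision and recall, the set-based measures referred to above, can be sketched in a few lines; the document IDs below are invented for illustration, not data from the paper.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

# Toy example (hypothetical document IDs):
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
print(precision(retrieved, relevant))  # prints 0.5
print(recall(retrieved, relevant))
```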

Metric and Relevance Mismatch in Retrieval Evaluation

This paper investigates relevance mismatch by classifying users according to relevance profiles, the likelihood with which they judge documents of different relevance levels to be useful, and finds that this classification scheme offers further insight into the transferability of batch results to real user search tasks.

User performance versus precision measures for simple search tasks

This study evaluates two different information retrieval tasks on TREC Web-track data: a precision-based user task, measured by the length of time users need to find a single document relevant to a TREC topic, and a simple recall-based task, represented by the total number of relevant documents users can identify within five minutes.

Including summaries in system evaluation

The popular IR metrics MAP and P@10 are modified to incorporate the summary reading step of the search process, and the effects on system rankings using TREC data are studied.
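P@10 and (non-interpolated) MAP, the two metrics named above, can be sketched as follows; the helper names are mine, and in practice the rankings and relevance sets would come from a system run and qrels rather than the toy values shown here.

```python
def precision_at_k(ranking, relevant, k=10):
    """P@k: fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Non-interpolated AP: mean of precision at each relevant rank."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a list of (ranking, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Incorporating a summary-reading step, as the paper does, amounts to changing which documents count as "seen" at each rank before these formulas are applied.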

Comparing System Evaluation with User Experiments for Japanese Web Navigational Retrieval

A search experiment with 31 users investigated whether the performance evaluation metrics of IR systems used in test collections such as TREC and NTCIR are comparable to user performance and subjective evaluation, and showed no significant differences across the systems and topics studied.

On Obtaining Effort Based Judgements for Information Retrieval

This work shows that it is possible to obtain judgements of effort from assessors, and that, for documents of the same relevance grade, the effort needed to find the portion of a document relevant to the query is a significant factor in determining user satisfaction as well as user preference between those documents.

Studies on Relevance, Ranking and Results Display

It is posited that users make such judgments in limited time, and that time optimization per task might help explain some of the findings.

Test Collection-Based IR Evaluation Needs Extension toward Sessions - A Case of Extremely Short Queries

The experimental results show that, surprisingly, web-style very short queries typically lead to good-enough results even in a TREC-type test collection, which helps explain the observed behavior of real users.

Simulating Search Sessions in Interactive Information Retrieval Evaluation

The findings indicate that the proposed approach of applying relevance feedback (RF) is significantly more effective than pseudo-relevance feedback (PRF) with both short (title) queries and long (title and description) queries.
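The summary above does not fix a specific RF formula, so as a hedged illustration, here is Rocchio-style feedback, one classic way of applying RF; the weights alpha/beta/gamma are conventional defaults, not values from the paper, and vectors are plain term-to-weight dicts.

```python
def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant document vectors and away
    from non-relevant ones (all vectors are term -> weight dicts)."""
    new = {t: alpha * w for t, w in query.items()}
    for docs, weight in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for t, w in doc.items():
                new[t] = new.get(t, 0.0) + weight * w / len(docs)
    return new
```

PRF is the same update with the top-ranked documents assumed relevant instead of user-judged ones.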

The good and the bad system: does the test collection predict users' effectiveness?

It is shown that users behave differently and discern differences between pairs of systems that have a very small absolute difference in test collection effectiveness, confirming that users' effectiveness can be predicted successfully.



Do batch and user evaluations give the same results?

The results showed that the weighting scheme giving beneficial results in batch studies did not do so with real users; other factors predictive of instance recall were identified, including the number of documents saved by the user, document recall, and the number of papers seen by the users.

Further Analysis of Whether Batch and User Evaluations Give the Same Results with a Question-Answering Task

This year's experiments, using the new question-answering task adopted in the TREC-9 Interactive Track, show that better performance in batch searching evaluation does not translate into gains for real users.

The TREC-9 Interactive Track Report

This report summarizes the shared experimental framework, which for TREC-9 was designed to support analysis and comparison of system performance only within sites.


The detailed analysis of the reasons for failure to retrieve relevant documents or for the retrieval of non-relevant documents was an important part of Cranfield II.

Aslib Cranfield research project - Factors determining the performance of indexing systems; Volume 1, Design; Part 2, Appendices

An essential requirement of the project involved cooperation of a large number of research scientists, and the response to the request was most satisfactory, and I acknowledge with thanks the generous assistance of some two hundred scientists.

Overview of the first TREC conference

There was a large variety of retrieval techniques reported on, including methods using automatic thesauri, sophisticated term weighting, natural language techniques, relevance feedback, and advanced pattern matching.

Information Retrieval as a Trial-And-Error Process

This paper examines three important and well-known information retrieval experiments, with a focus on certain internal inconsistencies and on the high variability of search results.

Relevance and Retrieval Evaluation: Perspectives from Medicine

W. Hersh
J. Am. Soc. Inf. Sci., 1994
An iterative model of retrieval evaluation is proposed, starting with the use of topical relevance to ensure that documents on the subject can be retrieved, followed by the use of situational relevance to show that the user can interact positively with the system.

Term-Weighting Approaches in Automatic Text Retrieval
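As a hedged illustration of the tf-idf family of schemes surveyed in that paper, a minimal sketch; the toy corpus and function name are mine, and production variants add length normalization and tf damping.

```python
import math

# Toy corpus, invented for illustration.
docs = [
    ["batch", "evaluation", "retrieval"],
    ["user", "evaluation", "study"],
    ["batch", "retrieval"],
]

def tf_idf(term, doc, corpus):
    """Raw term frequency times log inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0
```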

Pivoted document length normalization

Pivoted normalization is presented, a technique that can be used to modify any normalization function, thereby reducing the gap between the relevance and retrieval probabilities, and two new normalization functions are presented: pivoted unique normalization and pivoted byte size normalization.
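As a sketch of the pivoted normalization idea, assuming the common linear form (the slope default here is illustrative, not a value prescribed by the paper):

```python
def pivoted_norm(old_norm, pivot, slope=0.2):
    """Tilt the old normalization factor around the pivot, so that
    documents longer than the pivot are penalized less steeply."""
    return (1.0 - slope) * pivot + slope * old_norm
```

The pivot is typically set to the average old normalization factor over the collection, so documents of average length keep roughly the same factor while the penalty on long documents is dampened.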