Corpus ID: 246285387

Can Old TREC Collections Reliably Evaluate Modern Neural Retrieval Models?

Ellen M. Voorhees, Ian Soboroff, Jimmy J. Lin
Neural retrieval models are generally regarded as fundamentally different from the retrieval techniques used in the late 1990s, when the TREC ad hoc test collections were constructed. They thus provide the opportunity to empirically test the claim that pooling-built test collections can reliably evaluate retrieval systems that did not contribute to the construction of the collection (in other words, that such collections can be reusable). To test the reusability claim, we asked TREC assessors…

Multi-element protocol on IR experiments stability: Application to the TREC-COVID test collection

This work systematically explores the impact of the similarity of test collections on the comparability of experiments, characterizing the minimal changes between the collections upon which the performance of the evaluated IR systems can be compared.

TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval

This paper presents TripJudge, a novel relevance judgement test collection for TripClick health retrieval, and finds that click-based and judgement-based evaluation can lead to substantially different system rankings.

mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark

Robust 2004 is an information retrieval benchmark whose large number of judgments per query makes it a reliable evaluation dataset. In this paper, we present mRobust04, a multilingual version of

On Survivorship Bias in MS MARCO

Survivorship bias is the tendency to concentrate on the positive outcomes of a selection process and overlook the results that generate negative outcomes. We observe that this bias could be present

Exposure Gerrymandering: Search Engine Manipulation Flying under Fairness’ Radar

This paper introduces the notion of Exposure Gerrymandering, to illustrate how nefarious actors could create a system that appears unbiased to common fairness assessments, while substantially influencing the election at hand.



On the Reliability of Test Collections for Evaluating Systems of Different Types

Simulated pooling is used to test the fairness and reusability of test collections, showing that especially when shallow pools are used, pooling based on traditional systems only may lead to biased evaluation of deep learning systems.

Comparing Score Aggregation Approaches for Document Retrieval with Pretrained Transformers

This work reproduces three passage score aggregation approaches proposed by Dai and Callan for overcoming the maximum input length limitation of BERT and finds that these BERT variants are not more effective for document retrieval in isolation, but can lead to increased effectiveness when combined with "pre–fine-tuning" on the MS MARCO passage dataset.

Variations in relevance judgments and the measurement of retrieval effectiveness

Very high correlations were found among the rankings of systems produced using different relevance judgment sets, indicating that the comparative evaluation of retrieval performance is stable despite substantial differences in relevance judgments, and thus reaffirming the use of the TREC collections as laboratory tools.

UMass at TREC 2004: Novelty and HARD

The primary findings for passage retrieval are that document retrieval methods performed better than passage retrieval methods on the passage evaluation metric of binary preference at 12,000 characters, and that clarification forms improved passage retrieval for every retrieval method explored.

Bias and the limits of pooling for large collections

It is shown that the judgment sets produced by traditional pooling, when the pools are too small relative to the total document set size, can be biased in that they favor relevant documents that contain topic title words.

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

This work extensively analyzes different retrieval models and provides several suggestions that it believes may be useful for future work, finding that performing well consistently across all datasets is challenging.

On Building Fair and Reusable Test Collections using Bandit Techniques

Analysis demonstrates that the greedy approach common to most bandit methods can be unfair even to the runs participating in the collection-building process when the judgment budget is small relative to the (unknown) number of relevant documents.

Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval

This paper is able to leverage passage-level relevance judgments fortuitously available in other domains to fine-tune BERT models that are able to capture cross-domain notions of relevance, and can be directly used for ranking news articles.

Pretrained Transformers for Text Ranking: BERT and Beyond

This tutorial provides an overview of text ranking with neural network architectures known as transformers, of which BERT (Bidirectional Encoder Representations from Transformers) is the best-known example, and covers a wide range of techniques.

The Philosophy of Information Retrieval Evaluation

The fundamental assumptions and appropriate uses of the Cranfield paradigm, especially as they apply in the context of the evaluation conferences, are reviewed.