Shallow pooling for sparse labels

Negar Arabzadeh, Alexandra Vtyurina, Xinyi Yan, and Charles L. A. Clarke. Information Retrieval Journal, pages 365–385.

Recent years have seen enormous gains in core information retrieval tasks, including document and passage ranking. Datasets and leaderboards, and in particular the MS MARCO datasets, illustrate the dramatic improvements achieved by modern neural rankers. When compared with traditional information retrieval test collections, such as those developed by TREC, the MS MARCO datasets employ substantially more queries (thousands vs. dozens) with substantially fewer known relevant items per query, often…

DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine

The experiments demonstrate that DuReader_retrieval is challenging and that a number of problems remain unsolved, such as salient-phrase mismatch and syntactic mismatch between queries and paragraphs.

Human Preferences as Dueling Bandits

This work frames the problem of finding the best items as a dueling bandits problem and simulates selected algorithms on representative test cases to provide insight into their practical utility, suggesting modifications to further improve their performance.
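As a concrete illustration of the dueling-bandits framing, the sketch below simulates a naive explore-everything baseline: every pair of items duels a fixed number of times and the item with the most wins is declared best. This is an assumed toy setup, not one of the algorithms the paper evaluates; the preference matrix and round count are placeholders.

```python
import random

def simulate_duels(pref, rounds=200, rng=random):
    """Round-robin dueling simulation.

    pref[i][j] is the probability that item i beats item j in one duel.
    Every pair duels `rounds` times; returns the index of the item
    with the most wins (a Copeland/Borda-style winner).
    """
    n = len(pref)
    wins = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            for _ in range(rounds):
                if rng.random() < pref[i][j]:
                    wins[i] += 1
                else:
                    wins[j] += 1
    return max(range(n), key=wins.__getitem__)
```

Real dueling-bandits algorithms improve on this baseline by adaptively allocating duels to promising pairs rather than comparing every pair uniformly.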

Dense Text Retrieval based on Pretrained Language Models: A Survey

This survey aims to provide a comprehensive, practical reference on the major progress in dense text retrieval. It takes a new perspective, organizing the related work by four major aspects (architecture, training, indexing, and integration) and summarizing the mainstream techniques for each.

Accelerating Learned Sparse Indexes Via Term Impact Decomposition

This paper introduces a technique called postings clipping, which improves the query efficiency of learned sparse representations by accounting for changes in the term-importance distributions of learned ranking models.

Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models

A novel Coupled Estimation technique is proposed that simultaneously learns a relevance model and a selection model to correct the pooling bias when training NRMs; NRMs trained with this technique achieve significant gains in ranking effectiveness over baseline strategies.

Adaptive Re-Ranking with a Corpus Graph

The Graph-based Adaptive Re-ranking (GAR) approach significantly improves the performance of re-ranking pipelines in terms of precision- and recall-oriented measures, is complementary to a variety of existing techniques, is robust to its hyperparameters, and contributes minimally to computational and storage costs.
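The core loop GAR describes can be sketched as alternating between scoring a batch from the initial ranking and a batch of corpus-graph neighbours of the best document scored so far. The sketch below is a loose illustration of that alternation; the neighbour graph, the `score` callable (e.g. a cross-encoder), and the budget/batch parameters are all assumptions, not the paper's implementation.

```python
import heapq

def gar_rerank(initial, neighbors, score, budget=100, batch=10):
    """Graph-based adaptive re-ranking sketch.

    initial:   doc ids from a first-stage ranker, best first.
    neighbors: dict mapping a doc id to its corpus-graph neighbours.
    score:     callable doc id -> re-ranker score (placeholder).
    Alternates between batches from the initial ranking and batches of
    neighbours of high-scoring docs, until `budget` docs are scored.
    """
    scored = {}
    frontier = []  # max-heap of (-score, doc): scored docs whose neighbours may be expanded
    pool = list(initial)
    use_graph = False
    while len(scored) < budget and (pool or frontier):
        if use_graph and frontier:
            _, doc = heapq.heappop(frontier)
            batch_docs = [d for d in neighbors.get(doc, []) if d not in scored][:batch]
        else:
            batch_docs = []
            while pool and len(batch_docs) < batch:
                d = pool.pop(0)
                if d not in scored:
                    batch_docs.append(d)
        for d in batch_docs:
            scored[d] = score(d)
            heapq.heappush(frontier, (-scored[d], d))
        use_graph = not use_graph
    return sorted(scored, key=scored.get, reverse=True)
```

The point of the graph step is recall: a relevant document missed by the first stage can still be reached if it neighbours a document the re-ranker scores highly.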

Noise-Reduction for Automatically Transferred Relevance Judgments

This work compares the predicted relevance probabilities of monoT5 for the two versions of the judged documents and finds substantial differences, and shows that training a retrieval model on the "wrong" version can reduce the nDCG@10 by up to 75%.

Batch Evaluation Metrics in Information Retrieval: Measures, Scales, and Meaning

It is argued that most current IR metrics are well-founded, and, moreover, that those metrics are more meaningful in their current form than in the proposed “intervalized” versions.

Too Many Relevants: Whither Cranfield Test Collections?

For continued use of the Cranfield paradigm on ever-larger corpora, collection builders will need new strategies and tools for building reliable test collections; ensuring that the definition of 'relevant' truly reflects the desired system rankings is a provisional strategy for continued collection building.

MS MARCO: Benchmarking Ranking Models in the Large-Data Regime

This paper uses the MS MARCO and TREC Deep Learning Track as a case study, comparing it to the case of TREC ad hoc ranking in the 1990s and showing how the design of the evaluation effort can encourage or discourage certain outcomes, and raising questions about internal and external validity of results.

Good Evaluation Measures based on Document Preferences

It is shown that the best of these measures perform at least as well as an average assessor in terms of agreement with users' SERP preferences, and that implicit document preferences play a much more important role than explicit preferences.

Passage Re-ranking with BERT

A simple re-implementation of BERT for query-based passage re-ranking achieves state-of-the-art results on the TREC-CAR dataset and the top entry in the leaderboard of the MS MARCO passage retrieval task, outperforming the previous state of the art by 27% in MRR@10.
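MRR@10, the metric quoted here and used throughout the MS MARCO leaderboards, is the reciprocal rank of the first relevant passage in the top 10, averaged over queries. A minimal sketch, assuming binary relevance labels in rank order:

```python
def mrr_at_k(rankings, k=10):
    """Mean Reciprocal Rank at depth k.

    rankings: list of lists; each inner list holds 0/1 relevance labels
    for one query's results, in rank order. Only the first relevant
    document within the top k contributes, as 1/rank.
    """
    total = 0.0
    for labels in rankings:
        for rank, rel in enumerate(labels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(rankings)

# First relevant at rank 1, at rank 3, and absent from the top 10:
mrr_at_k([[1, 0, 0], [0, 0, 1], [0] * 10])  # (1 + 1/3 + 0) / 3
```

Because only the first relevant hit counts, MRR@10 is insensitive to how many relevant passages a system retrieves beyond it, which is part of the sparse-labels critique raised elsewhere in this list.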


Overview of the TREC 2019 Deep Learning Track

The Deep Learning Track is a new track for TREC 2019, with the goal of studying ad hoc ranking in a large-data regime. It introduces two sets corresponding to two tasks, each with rigorous TREC-style blind evaluation and reusable test sets.

Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval

RepCONC is a novel retrieval model that learns discrete Representations via CONstrained Clustering and substantially outperforms a wide range of existing retrieval models in terms of retrieval effectiveness, memory efficiency, and time efficiency.

Adaptive Batch Scheduling for Open-Domain Question Answering

The evaluation results show that the proposed adaptive batch scheduling significantly improves the retrieval performance of dual-encoder document retrieval systems.

Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

Recent research demonstrates the effectiveness of using fine-tuned language models (LMs) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered…

Evaluation Measures Based on Preference Graphs

This work proposes an evaluation measure that computes the similarity between a directed multigraph of preferences and an actual ranking generated by a ranker, and employs Rank Biased Overlap which was explicitly created to match the requirements of search and related applications.
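Rank-Biased Overlap, mentioned above, compares two rankings with geometrically decaying weights, so agreement near the top counts most. A minimal truncated sketch between two plain rankings (the paper's measure additionally handles preference multigraphs; `p` is RBO's persistence parameter):

```python
def rbo_at_depth(s, t, p=0.9, depth=None):
    """Truncated Rank-Biased Overlap between rankings s and t.

    Sums the weighted agreement (1-p) * p**(d-1) * |s[:d] & t[:d]| / d
    up to `depth`. Ignoring the unseen tail makes this a lower bound
    on the full (infinite-depth) RBO.
    """
    if depth is None:
        depth = min(len(s), len(t))
    score = 0.0
    seen_s, seen_t = set(), set()
    for d in range(1, depth + 1):
        seen_s.add(s[d - 1])
        seen_t.add(t[d - 1])
        overlap = len(seen_s & seen_t)
        score += (1 - p) * p ** (d - 1) * overlap / d
    return score
```

For identical rankings the truncated score is 1 - p**depth, approaching 1 as depth grows; for disjoint rankings it is 0.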

Assessing Top-k Preferences

The assessment process for partial preference judgments is explored, with the aim of identifying and ordering the top items in the pool rather than fully ordering the entire pool, so that the performance of a ranker can be measured by applying a rank similarity measure.

Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard

Empirical analysis of SOTA runs from the MS MARCO document ranking leaderboard reveals insights about how one run can be "significantly better" than another that are obscured by the current official evaluation metric (MRR@100).