UnifieR: A Unified Retriever for Large-Scale Retrieval

Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Kai Zhang, and Daxin Jiang. UnifieR: A Unified Retriever for Large-Scale Retrieval.
Large-scale retrieval aims to recall relevant documents from a huge collection given a query. It relies on representation learning to embed documents and queries into a common semantic encoding space. According to the encoding space, recent retrieval methods based on pre-trained language models (PLMs) can be coarsely categorized into either dense-vector or lexicon-based paradigms. These two paradigms unveil the PLMs' representation capability at different granularities, i.e., global sequence…
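The two paradigms named in the abstract can be contrasted with a minimal sketch. This is not UnifieR's implementation; the vectors and term weights below are toy values standing in for encoder outputs (a pooled sequence embedding for the dense paradigm, per-term importance weights for the lexicon paradigm).

```python
def dense_score(q_vec, d_vec):
    """Dense-vector paradigm: similarity (dot product) of two
    global sequence embeddings."""
    return sum(q * d for q, d in zip(q_vec, d_vec))

def lexicon_score(q_weights, d_weights):
    """Lexicon-based paradigm: score accumulates only over terms
    shared by query and document, weighted by their importance."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

# Toy query/document representations.
q_vec, d_vec = [0.1, 0.9, 0.3], [0.2, 0.8, 0.1]
q_lex = {"unified": 1.2, "retriever": 0.8}
d_lex = {"unified": 0.9, "retrieval": 0.5}

print(dense_score(q_vec, d_vec))    # ≈ 0.77
print(lexicon_score(q_lex, d_lex))  # only "unified" overlaps: ≈ 1.08
```

The dense score compares sequences at a global granularity, while the lexicon score matches at the term level — the complementary behaviors a unified retriever seeks to combine.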

Aggretriever: A Simple Approach to Aggregate Textual Representation for Robust Dense Passage Retrieval
This work demonstrates that MLM pre-trained transformers can be used to effectively encode text information into a single vector for dense retrieval.

SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval
The pooling mechanism is modified, a model solely based on document expansion is benchmarked, and models trained with distillation are introduced, leading to state-of-the-art results on the BEIR benchmark.

PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval
This work proposes a novel approach that leverages both query-centric and PAssage-centric sImilarity Relations (called PAIR) for dense passage retrieval, and significantly outperforms previous state-of-the-art models on both the MS MARCO and Natural Questions datasets.

Semantic Models for the First-Stage Retrieval: A Comprehensive Review
The current landscape of first-stage retrieval models is described under a unified framework to clarify the connections between classical term-based retrieval methods, early semantic retrieval methods, and neural semantic retrieval methods.

Adversarial Retriever-Ranker for dense text retrieval
Experimental results show that AR2 consistently and significantly outperforms existing dense retrieval methods and achieves new state-of-the-art results on the benchmarks evaluated.

Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach
This paper performs an empirical study, using a publicly available TREC collection, that demonstrates the effectiveness of the proposed hybrid approach and sheds light on the different characteristics of the semantic approach, the lexical approach, and their combination.

Towards Unsupervised Dense Information Retrieval with Contrastive Learning
This work explores the limits of contrastive learning as a way to train unsupervised dense retrievers, and shows that it leads to strong retrieval performance on the BEIR benchmark.

Large Dual Encoders Are Generalizable Retrievers
Experimental results show that the dual encoders, Generalizable T5-based dense Retrievers (GTR), significantly outperform existing sparse and dense retrievers on the BEIR benchmark (Thakur et al., 2021) and are very data efficient, needing only 10% of the MS MARCO supervised data to achieve the best out-of-domain performance.

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
This work extensively analyzes different retrieval models and provides several suggestions that may be useful for future work, finding that performing well consistently across all datasets is challenging.

A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques
This work presents a novel technique dubbed "uniCOIL", a simple extension of COIL that achieves the current state-of-the-art in sparse retrieval on the popular MS MARCO passage ranking dataset.

ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction
ColBERTv2 is introduced, a retriever that couples an aggressive residual compression mechanism with a denoised supervision strategy to simultaneously improve the quality and space footprint of late interaction, establishing state-of-the-art quality within and outside the training domain.