• Corpus ID: 220250241

# RepBERT: Contextualized Text Embeddings for First-Stage Retrieval

@article{Zhan2020RepBERTCT,
  title   = {RepBERT: Contextualized Text Embeddings for First-Stage Retrieval},
  author  = {Jingtao Zhan and Jiaxin Mao and Yiqun Liu and Min Zhang and Shaoping Ma},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2006.15498}
}
• Published 28 June 2020
• Computer Science
• ArXiv
Although exact term match between queries and documents is the dominant method for performing first-stage retrieval, we propose a different approach, called RepBERT, that represents documents and queries with fixed-length contextualized embeddings. The inner products of query and document embeddings are regarded as relevance scores. On the MS MARCO Passage Ranking task, RepBERT achieves state-of-the-art results among all initial retrieval techniques, and its efficiency is comparable to bag-of-words…
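To make the scoring scheme concrete, here is a minimal sketch of inner-product relevance scoring with fixed-length contextualized embeddings, in the spirit of the abstract. The checkpoint name and mean pooling are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# Minimal dual-encoder scoring sketch: encode texts into fixed-length
# vectors and use inner products as relevance scores.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pool token states into one fixed-length vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, L, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, H)

queries = embed(["what is dense retrieval"])
passages = embed(["Dense retrieval scores passages by embedding similarity.",
                  "An unrelated passage about cooking."])
print(queries @ passages.T)   # inner products as relevance scores
```

In a full system the passage embeddings would be precomputed at index time, so first-stage retrieval reduces to a maximum inner product search.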
## 56 Citations


### Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models

• Computer Science
• 2022
A novel Coupled Estimation Technique is proposed that learns a relevance model and a selection model simultaneously to correct the pooling bias when training NRMs; experiments show that NRMs trained with this technique achieve significant gains in ranking effectiveness over baseline strategies.

### RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking

• Computer Science
EMNLP
• 2021
A novel joint training approach for dense passage retrieval and passage re-ranking is proposed, introducing dynamic listwise distillation: a unified listwise training approach designed for both the retriever and the re-ranker.
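As a rough illustration of listwise distillation, the sketch below pushes a retriever's score distribution over a shared candidate list toward a re-ranker's. Treating the re-ranker as a fixed teacher is a simplification (the paper trains both jointly), and random tensors stand in for real model scores.

```python
# Listwise distillation sketch: KL divergence between the two models'
# softmax distributions over the same candidate list.
import torch
import torch.nn.functional as F

retriever_scores = torch.randn(4, 16, requires_grad=True)  # (batch, candidates)
reranker_scores = torch.randn(4, 16)                       # cross-encoder scores

loss = F.kl_div(
    F.log_softmax(retriever_scores, dim=-1),   # student: retriever
    F.softmax(reranker_scores, dim=-1),        # teacher: re-ranker (soft labels)
    reduction="batchmean",
)
loss.backward()   # gradients flow into the retriever in a real training loop
```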

### UHD-BERT: Bucketed Ultra-High Dimensional Sparse Representations for Full Ranking

• Computer Science
ArXiv
• 2021
This model, UHD-BERT, maximizes the benefits of ultra-high dimensional (UHD) sparse representations based on BERT language modeling by adopting a bucketing method, enabling a powerful and efficient neuro-symbolic information retrieval system.

### Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval

• Computer Science
EMNLP
• 2021
An ultra-high dimensional (UHD) representation scheme equipped with directly controllable sparsity, together with a bucketing method in which embeddings from multiple layers of BERT are selected and merged to represent diverse linguistic aspects; the resulting model outperforms other sparse models.
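The two UHD entries above share one core idea: project dense BERT outputs into a very high dimensional space and keep only the strongest activations. Below is a hypothetical winner-take-all sketch of that idea; the projection width and k are arbitrary, and the papers' bucketing details are omitted.

```python
# Winner-take-all sparsification sketch: top-k activations of a wide
# projection give a controllably sparse, high-dimensional representation.
import torch

proj = torch.nn.Linear(768, 30_000)     # ultra-high dimensional projection

def sparsify(x, k=100):
    """Zero out all but the k largest activations per row."""
    h = torch.relu(proj(x))
    topk = torch.topk(h, k, dim=-1)
    return torch.zeros_like(h).scatter(-1, topk.indices, topk.values)

dense = torch.randn(2, 768)             # stand-in for BERT embeddings
sparse = sparsify(dense)
print((sparse != 0).sum(dim=-1))        # at most k nonzeros per row
```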

### Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval (Extended Abstract)

• Computer Science
• 2022
RepCONC is a novel retrieval model that learns discrete representations by modeling quantization as a constrained clustering process, which requires document embeddings to be uniformly clustered around the quantization centroids; it is theoretically demonstrated that this uniform clustering constraint facilitates representation distinguishability.

### Isotropic Representation Can Improve Dense Retrieval

• Computer Science
• 2022
This work first shows that BERT-based DR embeddings also follow an anisotropic distribution, then introduces the unsupervised post-processing methods of normalizing flow and whitening, and develops a token-wise method in addition to the sequence-wise method for applying this post-processing to the representations of dense retrieval models, effectively making the representations isotropic.
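A whitening post-process of the kind described can be written in a few lines: center the embeddings, then rotate and scale by the inverse square root of their covariance so the distribution becomes isotropic. This is a generic whitening sketch on random stand-in data, not the paper's full method.

```python
# Whitening sketch: after the transform, the embedding set has (near-)identity
# covariance, i.e. it is isotropic.
import numpy as np

emb = np.random.randn(1000, 768)        # stand-in for dense retrieval embeddings
mu = emb.mean(axis=0)
cov = np.cov(emb - mu, rowvar=False)
U, S, _ = np.linalg.svd(cov)
W = U @ np.diag(S ** -0.5)              # inverse square root of the covariance
white = (emb - mu) @ W
print(np.allclose(np.cov(white, rowvar=False), np.eye(768), atol=1e-6))
```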

### Ultron: An Ultimate Retriever on Corpus with a Model-based Indexer

• Computer Science
• 2022
This work proposes Ultron, which encodes the knowledge of all documents into the model and aims to retrieve relevant documents end-to-end, along with a three-stage training workflow to capture more of the knowledge contained in the corpus and the associations between queries and docids.

### An Encoder Attribution Analysis for Dense Passage Retriever in Open-Domain Question Answering

• Computer Science
TRUSTNLP
• 2022
It is found that the passage encoder contributes more than the question encoder to in-domain retrieval accuracy, and a probabilistic framework called encoder marginalization is formulated, where the contribution of a single encoder is quantified by marginalizing other variables.

### MS MARCO Chameleons: Challenging the MS MARCO Leaderboard with Extremely Obstinate Queries

• Computer Science
CIKM
• 2021
It is proposed that a well-rounded evaluation strategy for any new ranker needs to include performance measures on both the overall MS MARCO dataset and the proposed MS MARCO Chameleon datasets.

### Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection

• Computer Science
CIKM
• 2021
This work proposes a classifier to select a suitable retrieval strategy (i.e., sparse vs. dense vs. hybrid) for individual queries, and conducts experiments demonstrating an improved range of efficiency/effectiveness trade-offs between purely sparse, purely dense or hybrid retrieval strategies.
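A hypothetical sketch of such a per-query selector: a simple classifier over cheap pre-retrieval features predicts which strategy to run. The feature set and labels here are invented purely for illustration.

```python
# Per-query retrieval strategy selection sketch with a linear classifier.
from sklearn.linear_model import LogisticRegression

# Assumed features per query: [length, mean term IDF, fraction of rare terms].
X_train = [[3, 9.1, 0.6], [12, 4.2, 0.1], [5, 7.8, 0.4], [9, 3.5, 0.0]]
y_train = ["sparse", "dense", "hybrid", "dense"]   # best strategy per query

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.predict([[4, 8.5, 0.5]]))    # choose a strategy for a new query
```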

## References

Showing 1-10 of 22 references

### Passage Re-ranking with BERT

• Computer Science
ArXiv
• 2019
A simple re-implementation of BERT for query-based passage re-ranking that is evaluated on the TREC-CAR dataset and is the top entry on the leaderboard of the MS MARCO passage retrieval task, outperforming the previous state of the art by 27% in MRR@10.
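For concreteness, a minimal cross-encoder re-ranking sketch in this style: each (query, passage) pair is scored jointly by a sequence classifier. The checkpoint is a generic stand-in with an untrained head; in practice a model fine-tuned on MS MARCO would be loaded.

```python
# Cross-encoder re-ranking sketch: score each (query, passage) pair jointly.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"   # stand-in; a fine-tuned re-ranker is assumed in practice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

query = "what is passage re-ranking"
passages = ["Re-ranking rescores a candidate list with a stronger model.",
            "A recipe for sourdough bread."]
batch = tokenizer([query] * len(passages), passages,
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(logits.softmax(-1)[:, 1])   # P(relevant), used to sort the candidates
```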

### MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

• Computer Science
CoCo@NIPS
• 2016
This new dataset is designed to overcome a number of well-known weaknesses of previous publicly available datasets for the same task of reading comprehension and question answering, and it is the most comprehensive real-world dataset of its kind in both quantity and quality.

### Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval

• Computer Science
ArXiv
• 2019
A Deep Contextualized Term Weighting framework that learns to map BERT's contextualized text representations to context-aware term weights for sentences and passages to improve the accuracy of first-stage retrieval algorithms.
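The mapping this framework learns can be sketched as a small regression head over BERT's token representations; the head below is untrained, so the weights it emits are illustrative only.

```python
# Term-weighting sketch: regress a per-token importance score from
# contextualized token representations (head untrained here).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
weight_head = torch.nn.Linear(encoder.config.hidden_size, 1)

batch = tokenizer("repbert embeds passages for retrieval", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state          # (1, L, H)
weights = weight_head(hidden).squeeze(-1)[0]             # one weight per token
for tok, w in zip(tokenizer.convert_ids_to_tokens(batch["input_ids"][0]),
                  weights.tolist()):
    print(f"{tok:>12s} {w:+.3f}")
```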

### Document Expansion by Query Prediction

• Computer Science
ArXiv
• 2019
A simple method is proposed that predicts which queries will be issued for a given document and then expands the document with those predictions, using a vanilla sequence-to-sequence model trained on datasets consisting of pairs of queries and relevant documents.

### Adam: A Method for Stochastic Optimization

• Computer Science
ICLR
• 2015
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
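The update rule behind that summary fits in a few lines; this toy sketch applies Adam with its standard defaults to a one-dimensional quadratic.

```python
# Adam sketch: adaptive estimates of the first and second moments of the
# gradient, with bias correction, applied to f(theta) = theta^2.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)              # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, m, v = np.array([5.0]), 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * theta                       # gradient of theta^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)   # approaches the minimum at 0
```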

### Dense Passage Retrieval for Open-Domain Question Answering

• Computer Science
EMNLP
• 2020
This work shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.
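The dual-encoder objective typically used in this setting can be sketched with in-batch negatives: each question's positive passage serves as a negative for every other question. Random tensors stand in for encoder outputs.

```python
# In-batch negatives sketch: the similarity matrix's diagonal holds the
# positive (question, passage) pairs; off-diagonal entries are negatives.
import torch
import torch.nn.functional as F

q_emb = torch.randn(8, 768, requires_grad=True)   # question embeddings
p_emb = torch.randn(8, 768, requires_grad=True)   # their positive passages

sim = q_emb @ p_emb.T                   # (8, 8) inner-product similarities
labels = torch.arange(8)                # diagonal indices are the positives
loss = F.cross_entropy(sim, labels)     # negative log-likelihood per question
loss.backward()
```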

### REALM: Retrieval-Augmented Language Model Pre-Training

• Computer Science
ArXiv
• 2020
The effectiveness of Retrieval-Augmented Language Model pre-training (REALM) is demonstrated by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA) and is found to outperform all previous methods by a significant margin, while also providing qualitative benefits such as interpretability and modularity.

### From doc2query to docTTTTTquery

The setup in this work follows doc2query, but with T5 as the expansion model, and it is found that the top-k sampling decoder produces more effective queries than beam search.
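A hedged sketch of this expansion step with Hugging Face transformers: generate queries for a document using top-k sampling, then append them to the document before indexing. The checkpoint name is an assumption; any T5 model fine-tuned on (document, query) pairs would play this role.

```python
# Document expansion sketch: sample likely queries for a document with T5.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "castorini/doc2query-t5-base-msmarco"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

doc = "RepBERT represents documents with fixed-length contextualized embeddings."
inputs = tokenizer(doc, return_tensors="pt", truncation=True)
out = model.generate(**inputs, max_length=32, do_sample=True, top_k=10,
                     num_return_sequences=3)   # top-k sampling, per the note above
for seq in out:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```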

### Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

• Computer Science
J. Mach. Learn. Res.
• 2020
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

### HuggingFace's Transformers: State-of-the-art Natural Language Processing

• Computer Science
ArXiv
• 2019
The *Transformers* library is an open-source library of carefully engineered, state-of-the-art Transformer architectures under a unified API, together with a curated collection of pretrained models made by and available for the community.
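The unified API that summary refers to reduces to a pair of Auto* calls; a minimal usage example:

```python
# Load any pretrained tokenizer/model pair through the unified Auto API.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
outputs = model(**tokenizer("hello retrieval", return_tensors="pt"))
print(outputs.last_hidden_state.shape)   # (1, sequence_length, hidden_size)
```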