Corpus ID: 220250241

RepBERT: Contextualized Text Embeddings for First-Stage Retrieval

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma
Although exact term matching between queries and documents is the dominant method for first-stage retrieval, we propose a different approach, called RepBERT, which represents documents and queries with fixed-length contextualized embeddings. The inner products of query and document embeddings are used as relevance scores. On the MS MARCO Passage Ranking task, RepBERT achieves state-of-the-art results among all initial retrieval techniques. And its efficiency is comparable to bag-of-words…
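
The scoring rule described in the abstract (relevance as the inner product of a query embedding and a document embedding) can be illustrated with a minimal sketch. This is not the authors' implementation: the encoder that produces the embeddings is not shown, the vectors are made-up toy values, the function names `inner_product` and `rank_documents` are hypothetical, and the exhaustive scan over all documents is an assumption made only for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Relevance score: inner product of two fixed-length embeddings."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    return sum(a * b for a, b in zip(u, v))


def rank_documents(
    query_emb: Sequence[float],
    doc_embs: Dict[str, Sequence[float]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every document against the query and return the top_k by score."""
    scored = [
        (doc_id, inner_product(query_emb, emb))
        for doc_id, emb in doc_embs.items()
    ]
    # Highest inner product first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings (assumed values, not model output).
doc_embs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.8, 0.3],
    "d3": [0.4, 0.4, 0.4],
}
query_emb = [1.0, 0.0, 0.5]

ranking = rank_documents(query_emb, doc_embs, top_k=2)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Because document embeddings do not depend on the query, they can be computed once offline, so only the query has to be encoded at search time.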

Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models

A novel Coupled Estimation technique is proposed that learns both a relevance model and a selection model simultaneously to correct the pooling bias for training NRMs and shows that NRMs trained with this technique can achieve significant gains on ranking effectiveness against other baseline strategies.

RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking

A novel joint training approach for dense passage retrieval and passage re-ranking is proposed, in which dynamic listwise distillation is introduced and a unified listwise training approach is designed for both the retriever and the re-ranker.

UHD-BERT: Bucketed Ultra-High Dimensional Sparse Representations for Full Ranking

This model, UHD-BERT, maximizes the benefits of ultra-high dimensional (UHD) sparse representations based on BERT language modeling by adopting a bucketing method, enabling a powerful and efficient neuro-symbolic information retrieval system.

Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval

An ultra-high dimensional (UHD) representation scheme equipped with directly controllable sparsity and a bucketing method, where the embeddings from multiple layers of BERT are selected/merged to represent diverse linguistic aspects, which outperforms other sparse models.

Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval (Extended Abstract)

RepCONC is a novel retrieval model that learns discrete Representations via CONstrained Clustering, which requires the document embeddings to be uniformly clustered around the quantization centroids. The work theoretically demonstrates that the uniform clustering constraint facilitates representation distinguishability.

Isotropic Representation Can Improve Dense Retrieval

This work first shows that BERT-based DR also follows an anisotropic distribution, then introduces the unsupervised post-processing methods of Normalizing Flow and whitening, and develops a token-wise method in addition to the sequence-wise method for applying them to the representations of dense retrieval models, making those representations more isotropic.

Ultron: An Ultimate Retriever on Corpus with a Model-based Indexer

This work proposes Ultron, which encodes the knowledge of all documents into the model and aims to retrieve relevant documents end-to-end. It also proposes a three-stage training workflow to capture more of the knowledge contained in the corpus and the associations between queries and docids.

An Encoder Attribution Analysis for Dense Passage Retriever in Open-Domain Question Answering

It is found that the passage encoder contributes more than the question encoder to in-domain retrieval accuracy, and a probabilistic framework called encoder marginalization is formulated, where the contribution of a single encoder is quantified by marginalizing other variables.

MS MARCO Chameleons: Challenging the MS MARCO Leaderboard with Extremely Obstinate Queries

It is proposed that a well-rounded evaluation strategy for any new ranker would need to include performance measures on both the overall MS MARCO dataset and the proposed MS MARCO Chameleon datasets.

Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection

This work proposes a classifier to select a suitable retrieval strategy (i.e., sparse vs. dense vs. hybrid) for individual queries, and conducts experiments demonstrating an improved range of efficiency/effectiveness trade-offs between purely sparse, purely dense or hybrid retrieval strategies.



Passage Re-ranking with BERT

A simple re-implementation of BERT for query-based passage re-ranking that achieves state-of-the-art results on the TREC-CAR dataset and is the top entry in the leaderboard of the MS MARCO passage retrieval task, outperforming the previous state of the art by 27% in MRR@10.

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

This new dataset is aimed to overcome a number of well-known weaknesses of previous publicly available datasets for the same task of reading comprehension and question answering, and is the most comprehensive real-world dataset of its kind in both quantity and quality.

Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval

A Deep Contextualized Term Weighting framework that learns to map BERT's contextualized text representations to context-aware term weights for sentences and passages to improve the accuracy of first-stage retrieval algorithms.

Document Expansion by Query Prediction

A simple method is proposed that predicts which queries will be issued for a given document and then expands the document with those predictions, using a vanilla sequence-to-sequence model trained on datasets consisting of pairs of queries and relevant documents.

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Dense Passage Retrieval for Open-Domain Question Answering

This work shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.

REALM: Retrieval-Augmented Language Model Pre-Training

The effectiveness of Retrieval-Augmented Language Model pre-training (REALM) is demonstrated by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA); it is found to outperform all previous methods by a significant margin, while also providing qualitative benefits such as interpretability and modularity.

From doc2query to docTTTTTquery

The setup in this work follows doc2query, but with T5 as the expansion model, and it is found that the top-k sampling decoder produces more effective queries than beam search.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

HuggingFace's Transformers: State-of-the-art Natural Language Processing

The Transformers library is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.