Questions Are All You Need to Train a Dense Passage Retriever

Devendra Singh Sachan, Mike Lewis, Dani Yogatama, Luke Zettlemoyer, Joëlle Pineau, Manzil Zaheer
We introduce ART, a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. ART, in contrast, only requires access to unpaired inputs and outputs (e.g. questions and potential answer passages). It uses a… 
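The abstract is truncated, but the "autoencoding" idea it describes can be sketched roughly as follows: a frozen language model scores each retrieved passage by how well it reconstructs the question, and the retriever is trained to match that soft distribution. This is a minimal stand-in, not the paper's exact objective; the function name, temperature parameter, and KL formulation are assumptions for illustration.

```python
import numpy as np

def art_kl_loss(retriever_scores, question_loglikes, temperature=1.0):
    """Sketch of an ART-style training signal (assumed from the abstract):
    question_loglikes[i] is a frozen LM's log-likelihood of reconstructing
    the question from retrieved passage i; the retriever is pushed to agree
    with the resulting soft label distribution via a KL divergence."""
    def softmax(x):
        x = x - x.max()          # stabilize against overflow
        e = np.exp(x)
        return e / e.sum()

    target = softmax(np.asarray(question_loglikes, dtype=float) / temperature)
    pred = softmax(np.asarray(retriever_scores, dtype=float))
    # KL(target || pred): zero when the retriever already agrees with the LM.
    return float(np.sum(target * (np.log(target) - np.log(pred))))
```

When the retriever's scores induce the same distribution as the LM's reconstruction likelihoods, the loss is zero; disagreement is penalized smoothly, with no need for labeled question-passage pairs.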


Generate rather than Retrieve: Large Language Models are Strong Context Generators

The proposed method is evaluated on three different knowledge-intensive tasks and its effectiveness on both zero-shot and supervised settings is demonstrated.

The Probabilistic Relevance Framework: BM25 and Beyond

This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F.
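As a concrete companion to the summary above, here is a minimal sketch of the classic BM25 scoring function (not BM25F). The toy corpus statistics passed in as arguments are an assumption for illustration; a real system precomputes them from an inverted index.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query with BM25.

    doc_freqs maps each term to the number of documents containing it;
    num_docs and avg_doc_len describe the whole corpus."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freqs.get(term, 0)
        # Smoothed IDF in the Robertson-Sparck Jones style.
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
        # Term-frequency saturation, normalized by document length.
        score += idf * (tf[term] * (k1 + 1)) / (
            tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```

The `k1` parameter controls how quickly repeated term occurrences saturate, and `b` controls how strongly long documents are penalized; the defaults here are the commonly cited ones, not values from this paper.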

Dense Passage Retrieval for Open-Domain Question Answering

This work shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

The Power of Scale for Parameter-Efficient Prompt Tuning

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.

Improving Passage Retrieval with Zero-Shot Question Generation

A simple and effective re-ranking method for improving passage retrieval in open question answering that improves strong unsupervised retrieval models by 6%-18% absolute and strong supervised models by up to 12% in terms of top-20 passage retrieval accuracy.
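The re-ranking recipe summarized above can be sketched generically: score each retrieved passage by the likelihood a generative model assigns to the question given the passage, then sort. The `log_prob_fn` callable here is a hypothetical stand-in for a real LM call (e.g. averaging token log-probs of the question conditioned on the passage); only the re-ranking scaffold is shown.

```python
def rerank_by_question_likelihood(question, passages, log_prob_fn):
    """Zero-shot re-ranking sketch: order passages by how plausible the
    question looks given each passage, best first.

    log_prob_fn(question, passage) -> float is assumed to wrap a frozen
    generative language model; no retriever fine-tuning is needed."""
    scored = [(log_prob_fn(question, p), p) for p in passages]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored]
```

Because the scorer only reads the top-k candidates from an existing retriever, it can sit on top of either unsupervised (e.g. BM25) or supervised dense retrievers, which is how the paper reports gains for both.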

End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering

An end-to-end differentiable training method for retrieval-augmented open-domain question answering systems that combine information from multiple retrieved documents when generating answers; it demonstrates the feasibility of learning to retrieve, and thereby improving answer generation, without explicit supervision of retrieval decisions.

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

It is shown that, compared to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, exhibits considerable syntactic and lexical variability between questions and the corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers.

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

Approximate nearest neighbor Negative Contrastive Estimation (ANCE) is presented, a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is updated in parallel with training so as to select more realistic negative training instances.
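The negative-mining step described above can be sketched in a few lines: instead of random negatives, take the passages the *current* encoder ranks highest for a question, excluding the gold passage. Brute-force dot products stand in here for the asynchronously refreshed ANN index a real ANCE setup would use (e.g. FAISS); the function name and shapes are illustrative.

```python
import numpy as np

def mine_hard_negatives(q_emb, passage_embs, positive_idx, k=2):
    """ANCE-style hard-negative mining sketch.

    q_emb: (dim,) query embedding from the current encoder.
    passage_embs: (num_passages, dim) corpus embeddings.
    Returns the indices of the k top-scoring non-positive passages."""
    scores = passage_embs @ q_emb          # retrieve with the current model
    order = np.argsort(-scores)            # best-first ranking
    negatives = [int(i) for i in order if i != positive_idx]
    return negatives[:k]
```

The key design point the paper argues for is that these negatives track the model as it trains: passages the encoder currently confuses with the positive are far more informative than random or purely lexical (BM25) negatives.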

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

End-to-End Training of Neural Retrievers for Open-Domain Question Answering

An approach of unsupervised pre-training with the Inverse Cloze Task and masked salient spans, followed by supervised fine-tuning using question-context pairs, leads to absolute gains over the previous best top-20 retrieval accuracy on the Natural Questions and TriviaQA datasets.