Corpus ID: 233219855

A Replication Study of Dense Passage Retriever

Xueguang Ma, Kai Sun, Ronak Pradeep, and Jimmy J. Lin. A Replication Study of Dense Passage Retriever.
Text retrieval using learned dense representations has recently emerged as a promising alternative to “traditional” text retrieval using sparse bag-of-words representations. One recent work that has garnered much attention is the dense passage retriever (DPR) technique proposed by Karpukhin et al. (2020) for end-to-end open-domain question answering. We present a replication study of this work, starting with model checkpoints provided by the authors, but otherwise from an independent… 
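
As a rough illustration of the contrast the abstract draws, dense retrieval scores passages by similarity between learned vectors rather than by term overlap. The sketch below ranks toy passage vectors by inner product with a query vector; the 4-dimensional vectors are illustrative stand-ins for learned BERT embeddings, and real systems use approximate nearest-neighbor indexes rather than brute-force scoring.

```python
import numpy as np

def dense_rank(query_vec, passage_vecs):
    """Rank passages by inner-product similarity to the query,
    DPR-style (sketch only; real systems use learned encoders
    and an ANN index such as Faiss)."""
    scores = passage_vecs @ query_vec      # one dot product per passage
    return np.argsort(-scores), scores     # best-scoring passage first

# Toy 4-dim "embeddings" standing in for learned representations.
q = np.array([1.0, 0.0, 1.0, 0.0])
P = np.array([
    [0.9, 0.1, 0.8, 0.0],   # topically close to the query
    [0.0, 1.0, 0.0, 1.0],   # unrelated
    [0.5, 0.0, 0.4, 0.1],   # somewhat related
])
order, scores = dense_rank(q, P)
```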

Improving Passage Retrieval with Zero-Shot Question Generation

A simple and effective re-ranking method for passage retrieval in open-domain question answering that improves strong unsupervised retrieval models by 6%–18% absolute and strong supervised models by up to 12% in terms of top-20 passage retrieval accuracy.

Learning to Retrieve Passages without Supervision

The resulting model, named Spider, performs surprisingly well without any labeled training examples on a wide range of ODQA datasets, significantly outperforms all other pretrained baselines in a zero-shot setting, and is competitive with BM25, a strong sparse baseline.

Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval

Experiments show that although the effectiveness of mDPR is much lower than BM25, dense representations nevertheless appear to provide valuable relevance signals, improving BM25 results in sparse–dense hybrids.
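
The sparse–dense hybrids mentioned above can be built in several ways; one common recipe (an illustrative choice, not necessarily the exact fusion used in Mr. TyDi) is a weighted sum of min–max normalized score lists:

```python
def hybrid_scores(sparse, dense, alpha=0.5):
    """Fuse sparse (e.g. BM25) and dense scores per document by
    min-max normalizing each score list and taking a weighted sum.
    Both alpha and the normalization are illustrative choices."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    s, d = norm(sparse), norm(dense)
    return [alpha * si + (1 - alpha) * di for si, di in zip(s, d)]

bm25 = [12.3, 4.1, 9.8]      # hypothetical BM25 scores for 3 docs
dpr  = [0.62, 0.71, 0.40]    # hypothetical dense dot-product scores
fused = hybrid_scores(bm25, dpr, alpha=0.6)
```

Here the document that scores well under both signals ends up ranked first, even though neither score list alone is on a comparable scale.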

Multi-Task Dense Retrieval via Model Uncertainty Fusion for Open-Domain Question Answering

This work proposes to train individual dense passage retrievers (DPR) for different tasks and aggregate their predictions at test time, using uncertainty estimates as weights to indicate how likely a specific query is to belong to each expert’s expertise.

Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?

The Salient Phrase Aware Retriever (SPAR) is introduced, a dense retriever with the lexical matching capacity of a sparse model; it sets a new state of the art for dense and sparse retrievers and can match or exceed the performance of more complicated dense–sparse hybrid systems.

Towards Unsupervised Dense Information Retrieval with Contrastive Learning

This work explores the limits of contrastive learning as a way to train unsupervised dense retrievers, and shows that it leads to strong retrieval performance on the BEIR benchmark.

Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations

An overview of toolkit features is provided, along with empirical results that illustrate its effectiveness on two popular ranking tasks and on hybrid retrieval that integrates both approaches.

A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques

This work presents a novel technique dubbed “uniCOIL”, a simple extension of COIL that achieves the current state-of-the-art in sparse retrieval on the popular MS MARCO passage ranking dataset.

Unsupervised Dense Information Retrieval with Contrastive Learning

This work explores the limits of contrastive learning as a way to train unsupervised dense retrievers and shows that it leads to strong performance in various retrieval settings, including cross-lingual retrieval between different scripts, which would not be possible with term-matching methods.

A proposed conceptual framework for a representational approach to information retrieval

A representational approach is proposed that breaks the core text retrieval problem into a logical scoring model and a physical retrieval model, establishing connections to sentence similarity tasks in natural language processing and to information access "technologies" that predate the dawn of computing.



Generation-Augmented Retrieval for Open-Domain Question Answering

It is shown that generating diverse contexts for a query is beneficial as fusing their results consistently yields better retrieval accuracy, and as sparse and dense representations are often complementary, GAR can be easily combined with DPR to achieve even better performance.

Dense Passage Retrieval for Open-Domain Question Answering

This work shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.
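
The dual-encoder framework mentioned here is typically trained with a contrastive objective over in-batch negatives: each question's gold passage is its positive, and the other passages in the batch serve as negatives. A numpy sketch of that objective on toy vectors (the identity matrices stand in for learned question and passage embeddings; real DPR trains BERT encoders):

```python
import numpy as np

def in_batch_loss(Q, P):
    """Softmax cross-entropy over in-batch negatives: P[i] is
    question i's positive passage, and the remaining rows of P
    act as negatives for question i."""
    S = Q @ P.T                              # [batch, batch] similarities
    S = S - S.max(axis=1, keepdims=True)     # numerical stability
    log_probs = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    idx = np.arange(len(Q))
    return float(-log_probs[idx, idx].mean())

Q = np.eye(3)                                # toy question embeddings
good = in_batch_loss(Q, 4 * np.eye(3))       # positives aligned with queries
bad = in_batch_loss(Q, 4 * np.roll(np.eye(3), 1, axis=0))  # mismatched pairs
```

When each question's positive passage is the most similar row, the loss is near zero; shuffling the positives drives it up, which is exactly the signal that shapes the embedding space.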

Distant Supervision for Multi-Stage Fine-Tuning in Retrieval-Based Question Answering

This architecture tackles question answering directly on a large document collection, combining simple “bag of words” passage retrieval with a BERT-based reader for extracting answer spans, and achieves large gains in effectiveness on two English and two Chinese QA datasets.

Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations

An overview of toolkit features and empirical results illustrating its effectiveness on two popular ranking tasks are presented, along with a description of how the group has built a culture of replicability through shared norms and tools that enable rigorous automated testing.

Distilling Dense Representations for Ranking using Tightly-Coupled Teachers

This work distills the knowledge from ColBERT's expressive MaxSim operator for computing relevance scores into a simple dot product, enabling single-step ANN search; this improves query latency and greatly reduces the onerous storage requirements of ColBERT.
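
For context, ColBERT's MaxSim scores a query against a passage at the token level: each query token takes its best-matching document token, and the per-token maxima are summed. A minimal sketch with toy 2-dimensional token embeddings (not ColBERT's actual encoder):

```python
import numpy as np

def maxsim(query_toks, doc_toks):
    """ColBERT-style MaxSim (sketch): for each query-token vector,
    take its maximum dot product over all document-token vectors,
    then sum over query tokens. Distilling this into one
    query-passage dot product is what enables single-step ANN search."""
    return float((query_toks @ doc_toks.T).max(axis=1).sum())

Q = np.array([[1.0, 0.0], [0.0, 1.0]])                 # 2 query tokens
D = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])     # 3 document tokens
score = maxsim(Q, D)
```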

Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation

This work proposes a cross-architecture training procedure with a margin focused loss (Margin-MSE), that adapts knowledge distillation to the varying score output distributions of different BERT and non-BERT ranking architectures, and shows that across evaluated architectures it significantly improves their effectiveness without compromising their efficiency.
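
The Margin-MSE loss described above has a compact form: the student is trained to reproduce the teacher's score *margin* between a positive and a negative passage, rather than the teacher's raw scores. A minimal sketch with made-up scores:

```python
def margin_mse(t_pos, t_neg, s_pos, s_neg):
    """Margin-MSE distillation loss (sketch): mean squared error
    between teacher and student score margins over a batch of
    (positive, negative) passage pairs. Matching margins, not raw
    scores, is what lets it bridge differently-scaled architectures."""
    t_margin = [tp - tn for tp, tn in zip(t_pos, t_neg)]
    s_margin = [sp - sn for sp, sn in zip(s_pos, s_neg)]
    return sum((t - s) ** 2 for t, s in zip(t_margin, s_margin)) / len(t_margin)

# Hypothetical teacher/student scores for two (positive, negative) pairs.
# The student's scales differ from the teacher's, but its margins match.
perfect = margin_mse([5.0, 3.0], [1.0, 2.0], [4.0, 2.5], [0.0, 1.5])
```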

Reading Wikipedia to Answer Open-Domain Questions

This approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs, indicating that both modules are highly competitive with respect to existing counterparts.

The TREC-8 Question Answering Track Evaluation

The TREC-8 Question Answering track was the first large-scale evaluation of systems that return answers, as opposed to lists of documents, in response to a question, and the examination uncovered no serious flaws in the methodology, supporting its continued use for question answering evaluation.

End-to-End Open-Domain Question Answering with BERTserini

An end-to-end question answering system that integrates BERT with the open-source Anserini information retrieval toolkit is demonstrated, showing that fine-tuning pretrained BERT with SQuAD is sufficient to achieve high accuracy in identifying answer spans.

Anserini: Enabling the Use of Lucene for Information Retrieval Research

Anserini provides wrappers and extensions on top of core Lucene libraries that allow researchers to use more intuitive APIs to accomplish common research tasks, and aims to provide the best of both worlds to better align information retrieval practice and research.