Efficient Passage Retrieval with Hashing for Open-domain Question Answering

Ikuya Yamada, Akari Asai, Hannaneh Hajishirzi
Most state-of-the-art open-domain question answering systems use a neural retrieval model to encode passages into continuous vectors and extract them from a knowledge source. However, such retrieval models often require large memory to run because of the massive size of their passage index. In this paper, we introduce Binary Passage Retriever (BPR), a memory-efficient neural retrieval model that integrates a learning-to-hash technique into the state-of-the-art Dense Passage Retriever (DPR) to… 
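
The memory argument above can be illustrated with a generic hashing sketch. The abstract is truncated here, so this is not presented as the authors' exact method. It assumes a simple sign-based binarization of dense vectors and a Hamming-distance search; the names `binarize` and `hamming_search` and the random data are hypothetical.

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """Assumed hashing step: one bit per dimension (1 where the value is > 0), packed into bytes."""
    bits = (vectors > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_search(query_code: np.ndarray, passage_codes: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k passage codes closest to the query in Hamming distance."""
    xor = np.bitwise_xor(passage_codes, query_code)  # differing bits, row by row
    distances = np.unpackbits(xor, axis=1).sum(axis=1)
    return np.argsort(distances, kind="stable")[:k]

# Synthetic stand-ins for passage and question embeddings (not real DPR outputs).
rng = np.random.default_rng(0)
passages = rng.standard_normal((1000, 768)).astype(np.float32)
query = passages[42] + 0.1 * rng.standard_normal(768).astype(np.float32)

codes = binarize(passages)
query_code = binarize(query[None, :])[0]
top = hamming_search(query_code, codes, k=5)

float_bytes = passages.nbytes  # 4 bytes per dimension
binary_bytes = codes.nbytes    # 1 bit per dimension
```

In this sketch a 768-dimensional float32 vector shrinks from 3072 bytes to 96 bytes, which is where the index-size saving of a binary representation comes from.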

Encoder Adaptation of Dense Passage Retrieval for Open-Domain Question Answering

Different combinations of DPR’s question and passage encoders, learned from five benchmark QA datasets, are inspected on both in-domain and out-of-domain questions to determine how an in-distribution question/passage encoder generalizes when paired with an OOD passage/question encoder from another domain.

Two-Step Question Retrieval for Open-Domain QA

A two-step question retrieval model, SQuID (Sequential Question-Indexed Dense retrieval), is proposed along with distant supervision for training; results show that SQuID significantly increases the performance of existing question retrieval models with a negligible loss in inference speed.

LIDER: An Efficient High-dimensional Learned Index for Large-scale Dense Passage Retrieval

Experiments show that LIDER achieves higher search speed with high retrieval quality compared to state-of-the-art ANN indexes on passage retrieval tasks; e.g., on large-scale data it achieves 1.2x the search speed and significantly higher retrieval quality than the fastest baseline in the authors' evaluation.

An Encoder Attribution Analysis for Dense Passage Retriever in Open-Domain Question Answering

It is found that the passage encoder contributes more than the question encoder to in-domain retrieval accuracy, and a probabilistic framework called encoder marginalization is formulated, where the contribution of a single encoder is quantified by marginalizing other variables.

Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

This work addresses the problem of massive-scale embedding-based retrieval with Bi-Granular Document Representation, where lightweight sparse embeddings are indexed and kept in memory for coarse-grained candidate search, and heavyweight dense embeddings are hosted on disk for fine-grained post verification.

Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval

Baleen, a system that improves the accuracy of multi-hop retrieval while learning robustly from weak training signals in the many-hop setting, is evaluated on retrieval for two-hop question answering and many-hop claim verification, establishing state-of-the-art performance.

Domain Adaptation for Memory-Efficient Dense Retrieval

It is shown that binary embedding models like BPR and JPQ can perform significantly worse than baselines once a domain shift is involved, and a modification to the training procedure is proposed and combined with a corpus-specific generative procedure, which allows the adaptation of BPR and JPQ to any corpus without requiring labeled training data.

Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval

A novel multi-domain Chinese dataset for passage retrieval (Multi-CPR) is presented, collected from three different domains, including E-commerce, Entertainment video and Medical, which demonstrates the necessity of domain labeled data for further optimization.

Knowledge Base Index Compression via Dimensionality and Precision Reduction

This work systematically investigates reducing the size of the KB index by means of dimensionality (sparse random projections, PCA, autoencoders) and numerical precision reduction and shows that PCA is an easy solution that requires very little data and is only slightly worse than autoencoders, which are less stable.
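
As a rough illustration of the two reductions named above, the following sketch applies PCA followed by a cast to lower numerical precision. It assumes NumPy, synthetic data in place of a real KB index, and a float16 target precision; the function names `fit_pca` and `compress` are hypothetical and not taken from the paper.

```python
import numpy as np

def fit_pca(data: np.ndarray, n_components: int):
    """Fit PCA via SVD; returns the data mean and the top principal directions (as rows)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def compress(data: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project onto the principal directions, then reduce precision to float16 (assumed target)."""
    return ((data - mean) @ components.T).astype(np.float16)

# Synthetic stand-in for an index of 64-dimensional float32 vectors.
rng = np.random.default_rng(0)
index = rng.standard_normal((500, 64)).astype(np.float32)

mean, components = fit_pca(index, n_components=16)
small = compress(index, mean, components)
```

Here the dimensionality drops by a factor of 4 and the bytes per value by a factor of 2, so the compressed array takes one eighth of the original storage. How much retrieval quality survives depends on the data and is not measured by this sketch.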

Quality and Cost Trade-offs in Passage Re-ranking Task

This paper investigates several late-interaction models, such as ColBERT and Poly-encoder architectures, along with their modifications. It also considers the memory footprint of the search index, attempting to apply a learning-to-hash method to binarize the output vectors from the transformer encoders.



Dense Passage Retrieval for Open-Domain Question Answering

This work shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.

A Memory Efficient Baseline for Open Domain Question Answering

This paper considers three strategies to reduce the index size of dense retriever-reader systems: dimension reduction, vector quantization and passage filtering, and shows that it is possible to get competitive systems using less than 6Gb of memory.

Hashing based Answer Selection

A novel method called hashing based answer selection (HAS) is proposed, which adopts a hashing strategy to learn a binary matrix representation for each answer and can dramatically reduce the memory cost of storing the matrix representations of answers.

Reading Wikipedia to Answer Open-Domain Questions

This approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs, indicating that both modules are highly competitive with respect to existing counterparts.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

A general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation, and finds that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.

Natural Questions: A Benchmark for Question Answering Research

The Natural Questions corpus, a question answering data set, is presented, introducing robust metrics for the purposes of evaluating question answering systems; demonstrating high human upper bounds on these metrics; and establishing baseline results using competitive methods drawn from related literature.

Deep Hashing Network for Efficient Similarity Retrieval

A novel Deep Hashing Network (DHN) architecture for supervised hashing is proposed, which jointly learns a good image representation tailored to hash coding and formally controls the quantization error.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

HashNet: Deep Learning to Hash by Continuation

HashNet is presented, a novel deep architecture for deep learning to hash by continuation method with convergence guarantees, which learns exactly binary hash codes from imbalanced similarity data.

HashGAN: Deep Learning to Hash with Pair Conditional Wasserstein GAN

HashGAN is presented, a novel architecture for deep learning to hash, which learns compact binary hash codes from both real images and diverse images synthesized by generative models, conditioned on the pairwise similarity information.