Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, Jong Chun Park
Dense retrieval models, which aim at retrieving the most relevant document for an input query on a dense representation space, have gained considerable attention for their remarkable success. Yet, dense models require a vast amount of labeled training data for notable performance, whereas it is often challenging to acquire query-document pairs annotated by humans. To tackle this problem, we propose a simple but effective Document Augmentation for dense Retrieval (DAR) framework, which augments… 
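The truncated abstract names interpolation and perturbation as the two augmentation operations. As a rough illustration only (the paper's exact formulation is not shown here), such augmentation applied to document embeddings might look like the following sketch; the helper names, mixing ratio, and noise scale are hypothetical:

```python
import numpy as np

def interpolate(doc_a, doc_b, lam=0.5):
    """Mixup-style interpolation between two document embeddings."""
    return lam * doc_a + (1.0 - lam) * doc_b

def perturb(doc, sigma=0.01, rng=None):
    """Stochastic perturbation: add small Gaussian noise to an embedding."""
    rng = rng or np.random.default_rng(0)
    return doc + rng.normal(scale=sigma, size=doc.shape)

# Toy 4-dim embeddings standing in for encoder outputs.
a = np.array([1.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0])
augmented = [interpolate(a, b, lam=0.7), perturb(a)]
```

Both operations produce new training documents without human labeling, which is the point of the framework: more query-document pairs from the same annotated set.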




Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

Approximate nearest neighbor Negative Contrastive Estimation (ANCE) is presented: a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is updated in parallel with the learning process to select more realistic negative training instances.
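The negative mining described above can be sketched in miniature. In this sketch a brute-force dot-product ranking stands in for the actual ANN index (a real ANCE setup would use a periodically refreshed index, e.g. FAISS); the function name is hypothetical:

```python
import numpy as np

def mine_hard_negatives(query_vec, corpus_vecs, positive_idx, k=2):
    """Rank the corpus by similarity to the query and take the top
    non-positive documents as hard negatives."""
    scores = corpus_vecs @ query_vec          # dot-product relevance scores
    ranked = np.argsort(-scores)              # most similar first
    return [int(i) for i in ranked if i != positive_idx][:k]
```

The key idea is that negatives drawn from the model's own top-ranked documents are much harder, and therefore more informative, than random or BM25-sampled negatives.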

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions with considerable syntactic and lexical variability between questions and their corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers.

Embedding-based Zero-shot Retrieval through Query Generation

This work considers the embedding-based two-tower architecture as the neural retrieval model and proposes a novel method for generating synthetic training data for retrieval; the approach significantly outperforms BM25 on 5 of the 6 datasets tested.

Dense Passage Retrieval for Open-Domain Question Answering

This work shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

BatchMixup: Improving Training by Interpolating Hidden States of the Entire Mini-batch

This work proposes BATCHMIXUP—improving the model learning by interpolating hidden states of the entire mini-batch and shows superior performance than competitive baselines in improving the performance of NLP tasks while using different ratios of training data.

Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation

Empirically, it is shown that this is an effective strategy for building neural passage retrieval models in the absence of large training corpora and depending on the domain, this technique can even approach the accuracy of supervised models.

RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering

This work proposes an optimized training approach, called RocketQA, to improving dense passage retrieval, which significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions and demonstrates that the performance of end-to-end QA can be improved based on theRocketQA retriever.

MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

By mixing labeled, unlabeled and augmented data, MixText significantly outperformed current pre-trained and fined-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks.

An Axiomatic Approach to Regularizing Neural Ranking Models

This work explores the use of IR axioms to augment the direct supervision from labeled data for training neural ranking models and shows that the neural ranking model achieves faster convergence and better generalization with axiomatic regularization.