Corpus ID: 245131402

GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval

@article{Wang2021GPLGP,
  title={GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval},
  author={Kexin Wang and Nandan Thakur and Nils Reimers and Iryna Gurevych},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.07577}
}
Dense retrieval approaches can overcome the lexical gap and lead to significantly improved search results. However, they require large amounts of training data which is not available for most domains. As shown in previous work (Thakur et al., 2021b), the performance of dense retrievers severely degrades under a domain shift. This limits the usage of dense retrieval approaches to only a few domains with large training datasets. In this paper, we propose the novel unsupervised domain adaptation… 
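The GPL recipe itself (generate synthetic queries for target-domain passages, mine hard negatives, label the triples with a cross-encoder, and train the dense retriever with MarginMSE) can be sketched with off-the-shelf sentence-transformers and transformers components. The snippet below is a minimal illustration under those assumptions, not the authors' reference implementation; the model names, the random negative mining, and the tiny toy corpus are placeholders.

```python
# Minimal sketch of a GPL-style pseudo-labeling loop: query generation,
# negative mining, cross-encoder margin labels, MarginMSE training.
# Model names, sampling settings, and data are illustrative placeholders.
import random

from sentence_transformers import SentenceTransformer, CrossEncoder, InputExample, losses
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from torch.utils.data import DataLoader

passages = [
    "Dense retrievers map queries and passages into a shared vector space.",
    "BM25 is a lexical ranking function based on term frequencies.",
    "Pseudo labeling uses a teacher model to score unlabeled text pairs.",
]

# 1) Generate one synthetic query per passage with a seq2seq query generator.
qgen_tok = AutoTokenizer.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")
qgen = AutoModelForSeq2SeqLM.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")
queries = []
for p in passages:
    ids = qgen_tok(p, return_tensors="pt", truncation=True).input_ids
    out = qgen.generate(ids, max_length=32, do_sample=True, top_p=0.95)
    queries.append(qgen_tok.decode(out[0], skip_special_tokens=True))

# 2) Mine a hard negative per query (random other passage here for brevity;
#    GPL retrieves negatives from the target corpus with a dense retriever).
negatives = [random.choice([x for x in passages if x != p]) for p in passages]

# 3) Pseudo-label (query, positive, negative) triples with a cross-encoder:
#    the training label is the score margin between positive and negative.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pos_scores = ce.predict(list(zip(queries, passages)))
neg_scores = ce.predict(list(zip(queries, negatives)))
train_examples = [
    InputExample(texts=[q, p, n], label=float(sp - sn))
    for q, p, n, sp, sn in zip(queries, passages, negatives, pos_scores, neg_scores)
]

# 4) Train the dense retriever on the pseudo-labeled triples with MarginMSE.
student = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v3")
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
student.fit(train_objectives=[(loader, losses.MarginMSELoss(student))], epochs=1)
```

Citations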
Query Generation with External Knowledge for Dense Retrieval
TLDR
This work proposes a novel method for generating queries with external information related to the corresponding document: the query is converted into a triplet-based template form so that external knowledge can be accommodated and passed to a pre-trained language model (PLM).
InPars: Data Augmentation for Information Retrieval using Large Language Models
TLDR
This work harnesses the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks and shows that models fine-tuned solely on the authors' unsupervised dataset outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods.
A Thorough Examination on Zero-shot Dense Retrieval
TLDR
This paper presents the first thorough examination of the zero-shot capability of DR models, discusses the effect of several key factors related to the source training set, analyzes the potential bias from the target dataset, and reviews and compares existing zero-shot DR models.
DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine
TLDR
The experiment results demonstrate that DuReader_retrieval is challenging and there is still plenty of room for the community to improve, e.g. generalization across domains, salient-phrase and syntax mismatch between query and paragraph, and robustness.
Evaluating Extrapolation Performance of Dense Retrieval
TLDR
This work proposes two resampling strategies for existing retrieval benchmarks and comprehensively investigates how dense retrieval models perform in both the interpolation and extrapolation regimes, showing that DR models may interpolate as well as complex interaction-based models but extrapolate substantially worse.
ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self On-the-fly Distillation for Dense Passage Retrieval
TLDR
This paper introduces a self on-the-fly distillation method that can effectively distill late interaction and in-corporates a cascade distillation process to further improve the performance with a cross-encoder teacher.
Domain Adaptation for Memory-Efficient Dense Retrieval
Dense retrievers encode documents into fixed dimensional embeddings. However, storing all the document embeddings within an index produces bulky indexes which are expensive to serve. Recently, BPR

References

SHOWING 1-10 OF 63 REFERENCES
Embedding-based Zero-shot Retrieval through Query Generation
TLDR
This work considers the embedding-based two-tower architecture as the neural retrieval model and proposes a novel method for generating synthetic training data for retrieval, which produces remarkable results, significantly outperforming BM25 on 5 out of 6 datasets tested.
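As a rough illustration of this recipe (not the paper's exact setup), synthetic (query, passage) pairs from a query generator can train a two-tower model with in-batch negatives. In the sketch below the pairs, the base encoder name, and the hyperparameters are placeholders; only the loss choice reflects the general idea.

```python
# Sketch: train a bi-encoder on synthetic (query, passage) pairs with
# in-batch negatives. The pairs would come from a query generator; the
# data and model name here are placeholders.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

synthetic_pairs = [
    ("what is dense retrieval",
     "Dense retrievers map queries and passages into a shared vector space."),
    ("how does bm25 score documents",
     "BM25 is a lexical ranking function based on term frequencies."),
]
train_examples = [InputExample(texts=[q, p]) for q, p in synthetic_pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

model = SentenceTransformer("distilbert-base-uncased")   # shared two-tower encoder
loss = losses.MultipleNegativesRankingLoss(model)        # in-batch negatives
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```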
TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning
TLDR
This work presents a new state-of-the-art unsupervised method based on pre-trained Transformers and Sequential Denoising Auto-Encoder (TSDAE) which outperforms previous approaches by up to 6.4 points and shows that TSDAE is a strong domain adaptation and pre-training method for sentence embeddings, outperforming other approaches like Masked Language Model.
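sentence-transformers ships a TSDAE implementation, so unsupervised adaptation on raw target-domain sentences can be sketched roughly as below; the base model name, the toy sentences, and the training settings are placeholders.

```python
# Sketch: TSDAE-style unsupervised training on raw target-domain sentences.
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset
from torch.utils.data import DataLoader

sentences = [
    "Dense retrievers map queries and passages into a shared vector space.",
    "BM25 is a lexical ranking function based on term frequencies.",
]

model = SentenceTransformer("distilbert-base-uncased")
# Each sentence is wrapped as (corrupted_input, original) by deleting tokens.
dataset = DenoisingAutoEncoderDataset(sentences)
loader = DataLoader(dataset, batch_size=2, shuffle=True)
# The tied decoder must reconstruct the original sentence from the embedding
# of its corrupted version, which trains the sentence encoder.
loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)
model.fit(train_objectives=[(loader, loss)], epochs=1,
          weight_decay=0, scheduler="constantlr")
```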
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
TLDR
This work extensively analyzes different retrieval models and provides several suggestions that it believes may be useful for future work, finding that performing well consistently across all datasets is challenging.
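A typical zero-shot evaluation with the beir package looks roughly as follows; the dataset (SciFact) and retriever checkpoint are arbitrary placeholder choices, and the exact API may differ across beir versions.

```python
# Sketch: zero-shot evaluation of a dense retriever on one BEIR dataset.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

retriever = EvaluateRetrieval(
    DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=64),
    score_function="cos_sim",
)
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # nDCG@k of the out-of-domain (zero-shot) model on this dataset
```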
Zero-Shot Dense Retrieval with Momentum Adversarial Domain Invariant Representations
TLDR
Momentum adversarial Domain Invariant Representation learning (MoDIR) is proposed, which introduces a momentum method to train a domain classifier that distinguishes source versus target domains, and then adversarially updates the DR encoder to learn domain invariant representations.
Domain-matched Pre-training Tasks for Dense Retrieval
TLDR
This work demonstrates that, with the right pre-training setup, dense retrieval can benefit substantially from pre-training large bi-encoder models on a recently released set of 65 million synthetically generated questions and on 200 million post-comment pairs from a preexisting dataset of Reddit conversations made available by pushshift.io.
RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking
TLDR
A novel joint training approach for dense passage retrieval and passage re-ranking is proposed, introducing dynamic listwise distillation, in which a unified listwise training approach is designed for both the retriever and the re-ranker.
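The listwise distillation idea can be illustrated with a small PyTorch snippet: both models score the same candidate list for a query, and the retriever's score distribution is pushed toward the re-ranker's via KL divergence. This is a simplified, one-directional sketch of the idea, not the paper's full joint training; the temperature and toy scores are placeholders.

```python
# Sketch: listwise distillation of a re-ranker's scores into a retriever.
import torch
import torch.nn.functional as F

def listwise_distill_loss(retriever_scores: torch.Tensor,
                          reranker_scores: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    # Both score tensors have shape (batch, num_candidates).
    student_log_probs = F.log_softmax(retriever_scores / temperature, dim=-1)
    teacher_probs = F.softmax(reranker_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy example: one query with four candidate passages.
retriever_scores = torch.tensor([[0.2, 0.1, 0.4, 0.3]], requires_grad=True)
reranker_scores = torch.tensor([[2.0, -1.0, 5.0, 0.5]])
loss = listwise_distill_loss(retriever_scores, reranker_scores)
loss.backward()  # in this sketch, gradients flow only into the retriever scores
```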
Latent Retrieval for Weakly Supervised Open Domain Question Answering
TLDR
It is shown for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs without any IR system, outperforming BM25 by up to 19 points in exact match.
UDALM: Unsupervised Domain Adaptation through Language Modeling
TLDR
UDALM is introduced, a fine-tuning procedure using a mixed classification and Masked Language Model loss that can adapt to the target domain distribution in a robust and sample-efficient manner and can be effectively used as a stopping criterion during UDA training.
Domain-Adversarial Training of Neural Networks
TLDR
A new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions; the approach can be implemented in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer.
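The gradient reversal layer is small enough to sketch directly in PyTorch: the forward pass is the identity, while the backward pass flips (and optionally scales) the gradient, so the feature extractor is updated to confuse the domain classifier. This is a generic sketch of the mechanism, not the authors' original implementation; the feature dimensions and lambda value are placeholders.

```python
# Sketch: gradient reversal layer (identity forward, negated gradient backward).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing back into the feature extractor.
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Usage: features -> grad_reverse -> domain classifier. The classifier learns
# to predict the domain, while the reversed gradient pushes the feature
# extractor toward domain-invariant representations.
features = torch.randn(8, 128, requires_grad=True)
domain_classifier = torch.nn.Linear(128, 2)
domain_logits = domain_classifier(grad_reverse(features, lambd=0.5))
loss = torch.nn.functional.cross_entropy(domain_logits, torch.randint(0, 2, (8,)))
loss.backward()
```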
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
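The triple loss mentioned here combines a masked-language-modeling term, a soft-target distillation term on temperature-scaled logits, and a cosine term aligning student and teacher hidden states. The PyTorch sketch below shows one plausible way the three terms can be combined; the loss weights, temperature, and toy tensor shapes are placeholders, not the released training configuration.

```python
# Sketch: DistilBERT-style triple loss = MLM loss + soft-target distillation
# (KL on temperature-scaled logits) + cosine alignment of hidden states.
import torch
import torch.nn.functional as F

def distillation_triple_loss(student_logits, teacher_logits,
                             student_hidden, teacher_hidden,
                             labels, temperature=2.0,
                             alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    # 1) Soft-target distillation: KL divergence between softened distributions.
    ce_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # 2) Standard masked-language-modeling loss against the true tokens.
    mlm_loss = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                               labels.view(-1), ignore_index=-100)
    # 3) Cosine loss pulling student hidden states toward the teacher's.
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1))
    cos_loss = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )
    return alpha_ce * ce_loss + alpha_mlm * mlm_loss + alpha_cos * cos_loss

# Toy shapes: batch=2, seq_len=4, vocab=30522, hidden=768.
s_logits = torch.randn(2, 4, 30522, requires_grad=True)
t_logits = torch.randn(2, 4, 30522)
s_hidden = torch.randn(2, 4, 768, requires_grad=True)
t_hidden = torch.randn(2, 4, 768)
labels = torch.randint(0, 30522, (2, 4))
loss = distillation_triple_loss(s_logits, t_logits, s_hidden, t_hidden, labels)
loss.backward()
```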