The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes

@inproceedings{Reimers2021TheCO,
  title={The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes},
  author={Nils Reimers and Iryna Gurevych},
  booktitle={ACL},
  year={2021}
}
Information retrieval using dense low-dimensional representations has recently become popular and has been shown to outperform traditional sparse representations like BM25. However, no previous work has investigated how dense representations perform with large index sizes. We show theoretically and empirically that the performance of dense representations decreases more quickly than that of sparse representations as the index size grows. In extreme cases, this can even lead to a tipping point where, at a certain index size, sparse representations outperform dense representations.
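The scaling claim in the abstract can be illustrated with a toy simulation; this is only a sketch, not the paper's experimental setup, and the embedding dimension, noise level, query count, and index sizes below are arbitrary assumptions. As random distractor vectors are added to a low-dimensional dense index, the chance that some distractor outscores a query's gold document grows, so top-1 retrieval accuracy falls as the index grows.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128          # illustrative embedding size; real dense retrievers often use 768
n_queries = 500    # number of simulated queries (arbitrary)
noise = 0.2        # how far each query drifts from its gold document (arbitrary)

# Gold documents and noisy queries that should retrieve them (unit-normalized).
gold = rng.standard_normal((n_queries, dim))
gold /= np.linalg.norm(gold, axis=1, keepdims=True)
queries = gold + noise * rng.standard_normal((n_queries, dim))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

gold_scores = np.sum(queries * gold, axis=1)   # cosine similarity to the gold document

for index_size in (1_000, 10_000, 100_000, 1_000_000):
    # Fill the index with random distractor documents, scored in chunks to bound memory.
    best_distractor = np.full(n_queries, -np.inf)
    for start in range(0, index_size, 20_000):
        chunk = rng.standard_normal((min(20_000, index_size - start), dim))
        chunk /= np.linalg.norm(chunk, axis=1, keepdims=True)
        best_distractor = np.maximum(best_distractor, (queries @ chunk.T).max(axis=1))
    # A query succeeds at top-1 only if its gold document outscores every distractor.
    top1 = np.mean(gold_scores > best_distractor)
    print(f"index size {index_size:>9,}: top-1 accuracy {top1:.3f}")
```

Running the sketch prints top-1 accuracy dropping monotonically as the index grows; BM25 is not simulated here, so this only illustrates the dense-retrieval side of the comparison.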

Citations

A Thorough Examination on Zero-shot Dense Retrieval
TLDR
This paper presents the first thorough examination of the zero-shot capability of DR models, discusses the effect of several key factors related to the source training set, analyzes the potential bias from the target dataset, and reviews and compares existing zero-shot DR models.
Artefact Retrieval: Overview of NLP Models with Knowledge Base Access
TLDR
This paper systematically describes the typology of artefacts, retrieval mechanisms and the way these artefacts are fused into the model to uncover combinations of design decisions that had not yet been tried in NLP systems.
DS-TOD: Efficient Domain Specialization for Task-Oriented Dialog
TLDR
This work investigates the effects of domain specialization of pretrained language models (PLMs) for TOD and proposes a resource-efficient and modular domain specialization by means of domain adapters – additional parameter-light layers in which to encode the domain knowledge.
Dimension Reduction for Efficient Dense Retrieval via Conditional Autoencoder
TLDR
A Conditional Autoencoder (ConAE) is proposed to compress the high-dimensional embeddings of dense retrieval to maintain the same embedding distribution and better recover the ranking features.
ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self On-the-fly Distillation for Dense Passage Retrieval
TLDR
This paper introduces a self on-the-fly distillation method that can effectively distill late interaction and incorporates a cascade distillation process to further improve the performance with a cross-encoder teacher.
InPars: Data Augmentation for Information Retrieval using Large Language Models
TLDR
This work harnesses the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks and shows that models fine-tuned solely on the authors' unsupervised dataset outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods.
LIDER: An Efficient High-dimensional Learned Index for Large-scale Dense Passage Retrieval
TLDR
Experiments show that LIDER has a higher search speed with high retrieval quality compared to the state-of-the-art ANN indexes commonly used in dense passage retrieval, and offers a better speed-quality trade-off.
Modeling Exemplification in Long-form Question Answering via Retrieval
TLDR
This paper proposes to treat exemplification as a retrieval problem in which a partially written answer is used to query a large set of human-written examples extracted from a corpus, which allows reliable ranking-type automatic metrics that correlate well with human evaluation.
Multi-Stage Prompting for Knowledgeable Dialogue Generation
TLDR
This paper proposes a multi-stage prompting approach to generate knowledgeable responses from a single pretrained language model (LM) and shows that its knowledge generator outperforms the state-of-the-art retrieval-based model by 5.8% when combining knowledge relevance and correctness.
Unsupervised Ranking and Aggregation of Label Descriptions for Zero-Shot Classifiers
TLDR
This work looks at how probabilistic models of repeated rating analysis can be used for selecting the best label descriptions in an unsupervised fashion and shows that multiple, noisy label descriptions can be aggregated to boost the performance.

References

SHOWING 1-10 OF 17 REFERENCES
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering Models
TLDR
This dataset paper presents MultiReQA, a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets, and explores systematic retrieval-based evaluation and transfer learning across domains over these datasets using a number of strong baselines.
Sparse, Dense, and Attentional Representations for Text Retrieval
TLDR
A simple neural model is proposed that combines the efficiency of dual encoders with some of the expressiveness of more costly attentional architectures, and sparse-dense hybrids are explored to capitalize on the precision of sparse retrieval.
Complementing Lexical Retrieval with Semantic Residual Embedding
TLDR
CLEAR is presented, a deep retrieval model that seeks to complement lexical retrieval with semantic embedding retrieval, and uses a residual-based embedding learning framework, which focuses the embedding on the deep language structures and semantics that lexical retrieval fails to capture.
Dense Passage Retrieval for Open-Domain Question Answering
TLDR
This work shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.
On the Sentence Embeddings from Pre-trained Language Models
TLDR
This paper proposes to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective and achieves significant performance gains over the state-of-the-art sentence embeddings on a variety of semantic textual similarity tasks.
Overview of the TREC 2020 Deep Learning Track
The Deep Learning Track is a new track for TREC 2019, with the goal of studying ad hoc ranking in a large-data regime. It is the first track with large human-labeled training sets, introducing two sets corresponding to two tasks, each with rigorous TREC-style blind evaluation and reusable test sets.
REALM: Retrieval-Augmented Language Model Pre-Training
TLDR
The effectiveness of Retrieval-Augmented Language Model pre-training (REALM) is demonstrated by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA) and is found to outperform all previous methods by a significant margin, while also providing qualitative benefits such as interpretability and modularity.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings
TLDR
It is found that in all layers of ELMo, BERT, and GPT-2, on average, less than 5% of the variance in a word's contextualized representations can be explained by a static embedding for that word, providing some justification for the success of contextualized representations.