Moving Stuff Around: A study on efficiency of moving documents into memory for Neural IR models

A. Câmara and Claudia Hauff
When training neural rankers using Large Language Models, a practitioner is expected to use multiple GPUs to reduce training time. With more devices, deep learning frameworks such as PyTorch let the user drastically increase the available VRAM pool, which makes larger training batches possible and therefore shrinks training time. At the same time, one of the most critical processes, generally overlooked when running data-hungry models, is how data is…

PyTerrier: Declarative Experimentation in Python from BM25 to Dense Retrieval
The PyTerrier framework is expanded to include additional support for state-of-the-art BERT-based text re-rankers and dense retrieval implementations (such as ANCE and ColBERT), and is highlighted as a framework for information retrieval researchers and educators.
SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking
This work presents a new first-stage ranker based on explicit sparsity regularization and a log-saturation effect on term weights, leading to highly sparse representations and competitive results with respect to state-of-the-art dense and sparse methods.
Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling
This work introduces an efficient topic-aware query and balanced margin sampling technique, called TAS-Balanced, and produces the first dense retriever that outperforms every other method on recall at any cutoff on TREC-DL and allows more resource intensive re-ranking models to operate on fewer passages to improve results further.
Simplified Data Wrangling with ir_datasets
A new robust and lightweight tool for acquiring, managing, and performing typical operations over datasets used in IR, primarily focused on textual datasets used for ad-hoc search.
Weakly Supervised Label Smoothing
Inspired by the investigation of LS in the context of neural L2R models, a novel technique called Weakly Supervised Label Smoothing (WSLS) is proposed that uses the retrieval scores of the negative sampled documents as a weak supervision signal when modifying the ground-truth labels.
Array programming with NumPy
How a few fundamental array concepts lead to a simple and powerful programming paradigm for organizing, exploring and analysing scientific data is reviewed.
Local Self-Attention over Long Text for Efficient Document Retrieval
A local self-attention is proposed that considers a moving window over the document terms, where each term attends only to other terms in the same window, resulting in increased retrieval of longer documents at a moderate increase in compute and memory costs.
Overview of the TREC 2020 Deep Learning Track
The Deep Learning Track is a new track for TREC 2019, with the goal of studying ad hoc ranking in a large data regime. It is the first track with large human-labeled training sets, introducing two
Transformers: State-of-the-Art Natural Language Processing
Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.