Corpus ID: 244954723

Improving language models by retrieving from trillions of tokens

@article{Borgeaud2021ImprovingLM,
  title={Improving language models by retrieving from trillions of tokens},
  author={Sebastian Borgeaud and Arthur Mensch and Jordan Hoffmann and Trevor Cai and Eliza Rutherford and Katie Millican and George van den Driessche and Jean-Baptiste Lespiau and Bogdan Damoc and Aidan Clark and Diego de Las Casas and Aurelia Guy and Jacob Menick and Roman Ring and T. W. Hennigan and Saffron Huang and Lorenzo Maggiore and Chris Jones and Albin Cassirer and Andy Brock and Michela Paganini and Geoffrey Irving and Oriol Vinyals and Simon Osindero and Karen Simonyan and Jack W. Rae and Erich Elsen and L. Sifre},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.04426}
}
Language modelling (LM) is an unsupervised task that consists of modelling the probability of text, usually by factorising it into conditional next-token predictions $p(x_1, \ldots, x_n) = \prod_i p(x_i \mid x_{<i})$. Neural networks have proven to be powerful language models, first in the form of recurrent architectures (Graves, 2013; Jozefowicz et al., 2016; Mikolov et al., 2010) and more recently in the form of Transformers (Vaswani et al., 2017), that use attention to contextualise the past. Large… 
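To make the factorisation concrete, here is a minimal sketch (not from the paper) that scores a sequence as a sum of conditional log-probabilities, with a toy add-one-smoothed bigram model standing in for the neural conditional $p(x_i \mid x_{<i})$; the corpus and function names are purely illustrative.

```python
# Minimal sketch of the autoregressive factorisation
# log p(x_1, ..., x_n) = sum_i log p(x_i | x_<i),
# using a toy bigram model as the conditional distribution (illustrative only).
import math
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))

# Count bigrams to estimate p(next | prev) with add-one smoothing.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def cond_prob(nxt, prev):
    counts = bigrams[prev]
    return (counts[nxt] + 1) / (sum(counts.values()) + len(vocab))

def log_prob(tokens):
    # The bigram model truncates the context x_<i to the single previous
    # token; the probability of the first token is ignored for simplicity.
    return sum(math.log(cond_prob(t, p)) for p, t in zip(tokens, tokens[1:]))

print(log_prob("the cat sat on the mat .".split()))
```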
Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval
TLDR
This paper presents RETOMATON (retrieval automaton), which approximates the datastore search based on (1) clustering of entries into "states" and (2) state transitions from previous entries, which effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list.
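As a rough illustration of the mechanism described in the summary above (a sketch of mine based only on that one-line description, not the RETOMATON implementation), the snippet below clusters a toy datastore's key vectors into states and counts transitions between consecutive entries to obtain a small weighted automaton; the datastore, cluster count, and normalisation are all assumptions made for the example.

```python
# Hedged sketch: turn a flat (key vector, entry) datastore, stored in corpus
# order, into a weighted automaton by (1) clustering keys into states and
# (2) counting state-to-state transitions between consecutive entries.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_entries, dim, n_states = 1000, 32, 8   # illustrative sizes

# Toy datastore: one random key vector per token position.
keys = rng.normal(size=(n_entries, dim))

# (1) Cluster entries into states.
states = KMeans(n_clusters=n_states, n_init=10, random_state=0).fit_predict(keys)

# (2) Count transitions from the state of entry i to the state of entry i+1.
transitions = Counter(zip(states[:-1], states[1:]))

# Normalise counts into edge weights of a finite automaton over the states.
totals = Counter(states[:-1])
automaton = {(src, dst): n / totals[src] for (src, dst), n in transitions.items()}
print(len(automaton), "weighted edges over", n_states, "states")
```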
PaLM: Scaling Language Modeling with Pathways
TLDR
A 540-billion parameter, densely activated Transformer language model, called PaLM, achieves breakthrough performance, outperforming the state-of-the-art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark.
Learning To Retrieve Prompts for In-Context Learning
TLDR
An efficient method for retrieving prompts for in-context learning using annotated data and an LM is proposed, which substantially outperforms prior work and multiple baselines across the board.
Training Compute-Optimal Large Language Models
TLDR
This paper trains a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data, and reaches a state-of-the-art average accuracy on the MMLU benchmark.
ChapterBreak: A Challenge Dataset for Long-Range Language Models
TLDR
This work introduces ChapterBreak, a challenge dataset that provides an LRLM with a long segment from a narrative that ends at a chapter boundary and asks it to distinguish the beginning of the ground-truth next chapter from a set of negative segments sampled from the same narrative.
TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models
TLDR
This work introduces TemporalWiki, a lifelong benchmark for ever-evolving LMs that utilizes the difference between consecutive snapshots of English Wikipedia and English Wikidata for training and evaluation, respectively, and proves that factual knowledge in LMs can be safely updated with minimal training data via continual learning.
LaMDA: Language Models for Dialog Applications
TLDR
It is demonstrated that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding.
METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals
TLDR
This work conducts a comprehensive empirical study, and proposes a recipe, namely “Model generated dEnoising TRaining Objective” (METRO), which incorporates some of the best modeling techniques developed recently to speed up, stabilize, and enhance pretrained language models without compromising model effectiveness.
Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering
TLDR
A new QA system which augments a text-to-text model with a large memory of question-answer pairs, and a new pre-training task for the latent step of question retrieval, which greatly improves performance on smaller QA benchmarks.
Prompt-based model editing
TLDR
It is shown that the prompt-based editor rivals the performance of SERAC on natural language inference and question-answering editing tasks and shows promise in being able to generalize to unseen base model architectures and to reliably modulate language model outputs.

References

Showing 1-10 of 60 references
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Exploring the Limits of Language Modeling
TLDR
This work explores recent advances in Recurrent Neural Networks for large scale Language Modeling, and extends current models to deal with two key challenges present in this task: corpora and vocabulary sizes, and complex, long term structure of language.
A Neural Knowledge Language Model
TLDR
A Neural Knowledge Language Model (NKLM) which combines symbolic knowledge provided by a knowledge graph with the RNN language model, and shows that the NKLM significantly improves the perplexity while generating a much smaller number of unknown words.
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
TLDR
This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
TLDR
This work presents the Pile, an 825 GiB English text corpus targeted at training large-scale language models, constructed from 22 diverse high-quality subsets (both existing and newly constructed), many of which derive from academic or professional sources.
Adaptive Input Representations for Neural Language Modeling
TLDR
Adaptive input representations for neural language modeling, which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity, are introduced, and a systematic comparison of popular choices for a self-attentional architecture is performed.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.
The LAMBADA dataset: Word prediction requiring a broad discourse context
TLDR
It is shown that LAMBADA exemplifies a wide range of linguistic phenomena, and that none of several state-of-the-art language models reaches accuracy above 1% on this novel benchmark.
Pointer Sentinel Mixture Models
TLDR
The pointer sentinel-LSTM model achieves state-of-the-art language modeling performance on the Penn Treebank while using far fewer parameters than a standard softmax LSTM, and the freely available WikiText corpus is introduced.