Adaptive Semiparametric Language Models

@article{Yogatama2021AdaptiveSL,
  title={Adaptive Semiparametric Language Models},
  author={Dani Yogatama and Cyprien de Masson d'Autume and Lingpeng Kong},
  journal={Transactions of the Association for Computational Linguistics},
  year={2021},
  volume={9},
  pages={362-373}
}
Abstract

We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component in an integrated architecture. Our model uses extended short-term context by caching local hidden states (similar to Transformer-XL) and global long-term memory by retrieving a set of nearest neighbor tokens at each timestep. We design a gating function to adaptively combine multiple information sources to make a prediction. This mechanism…
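The abstract names three information sources (the parametric transformer, a short-term cache of local hidden states, and long-term nearest-neighbor retrieval) combined through a learned gate. The exact gate parameterization is not given in the abstract, so the sketch below only illustrates the general idea: the single linear gate, toy dimensions, and all variable names are assumptions rather than details from the paper.

```python
# Illustrative sketch (not the paper's implementation): a gate maps the current
# hidden state to mixture weights over several next-token distributions, one per
# information source, and the prediction is their weighted sum.
import torch
import torch.nn.functional as F

hidden_dim, vocab_size, num_sources = 512, 1000, 3   # toy sizes (assumed)
gate = torch.nn.Linear(hidden_dim, num_sources)      # assumed gate parameterization

def combine(hidden, source_dists):
    """hidden: (batch, hidden_dim); source_dists: (batch, num_sources, vocab_size)."""
    weights = F.softmax(gate(hidden), dim=-1)                  # (batch, num_sources)
    return (weights.unsqueeze(-1) * source_dists).sum(dim=1)   # (batch, vocab_size)

# Toy usage: distributions from the parametric LM, the local cache, and kNN memory.
h = torch.randn(2, hidden_dim)
dists = F.softmax(torch.randn(2, num_sources, vocab_size), dim=-1)
p_next = combine(h, dists)                                     # mixture over the vocabulary
```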
Relational Memory-Augmented Language Models
TLDR
A memory-augmented approach is presented that conditions an autoregressive language model on a knowledge graph, represented as a collection of relation triples, and retrieves relevant relations for a given context to improve text generation.
∞-former: Infinite Memory Transformer
TLDR
The ∞-former is proposed, which extends the vanilla transformer with an unbounded long-term memory and is able to model arbitrarily long contexts and maintain “sticky memories” while keeping a fixed computation budget.
Memorizing Transformers
TLDR
It is demonstrated that an approximate kNN lookup into a non-differentiable memory of recent (key, value) pairs improves language modeling across various benchmarks and tasks, including generic webtext, math papers, books, code, as well as formal theorems (Isabelle).
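To make the (key, value) lookup concrete, here is a minimal sketch of the retrieval step alone: exact dot-product search in NumPy stands in for the approximate kNN index mentioned above, and the shapes, names, and toy data are assumptions for illustration.

```python
import numpy as np

def knn_lookup(query, keys, values, k=4):
    """Return the k stored value vectors whose keys score highest against the query."""
    scores = keys @ query                 # dot-product similarity, shape (num_entries,)
    top = np.argsort(-scores)[:k]         # indices of the k best-matching keys
    return values[top], scores[top]

# Toy memory of cached (key, value) pairs from earlier segments.
keys = np.random.randn(1000, 64)
values = np.random.randn(1000, 64)
retrieved_values, similarities = knn_lookup(np.random.randn(64), keys, values, k=8)
```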
Mind the Gap: Assessing Temporal Generalization in Neural Language Models
TLDR
It is argued that now is the right time to rethink the static way in which language models are currently trained and evaluated, and to develop adaptive language models that can remain up-to-date with respect to the ever-changing and non-stationary world.
Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval
TLDR
This paper presents RETOMATON (retrieval automaton), which approximates the datastore search based on (1) clustering of entries into “states” and (2) state transitions from previous entries; this effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list.
Improving language models by retrieving from trillions of tokens
TLDR
Transformers have been scaled from 100-million-parameter models in seminal work to over a hundred billion parameters in the last two years, which has led to models that do very well on a wide array of tasks in a zero- or few-shot formulation.
Pitfalls of Static Language Modelling
TLDR
It is argued that now is the right time to rethink the static language modelling evaluation protocol, and develop adaptive language models that can remain up-to-date with respect to the ever-changing and non-stationary world.
You Only Need One Model for Open-domain Question Answering
TLDR
This work proposes casting the retriever and the reranker as hard-attention mechanisms applied sequentially within the transformer architecture and feeding the resulting computed representations to the reader, which leads to better gradient flow when the architecture is trained in an end-to-end manner.
ChapterBreak: A Challenge Dataset for Long-Range Language Models
TLDR
This work introduces ChapterBreak, a challenge dataset that provides an LRLM with a long segment from a narrative that ends at a chapter boundary and asks it to distinguish the beginning of the ground-truth next chapter from a set of negative segments sampled from the same narrative.
A Contrastive Framework for Neural Text Generation
TLDR
This work shows that an underlying reason for model degeneration is the anisotropic distribution of token representations, and presents a contrastive solution: SimCTG, together with a decoding method, contrastive search, to encourage diversity while maintaining coherence in the generated text.

References

Showing 1-10 of 47 references
Unbounded cache model for online language modeling with open vocabulary
TLDR
This paper uses a large-scale non-parametric memory component that stores all the hidden activations seen in the past and leverages recent advances in approximate nearest neighbor search and quantization algorithms to store millions of representations while searching them efficiently.
Episodic Memory in Lifelong Language Learning
TLDR
This work proposes an episodic memory model that performs sparse experience replay and local adaptation to mitigate catastrophic forgetting in a lifelong language learning setup where a model needs to learn from a stream of text examples without any dataset identifier.
Improving Neural Language Models with a Continuous Cache
TLDR
A simplified version of memory-augmented networks is proposed that stores past hidden activations as memory and accesses them through a dot product with the current hidden activation; it is very efficient and scales to very large memory sizes.
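A rough sketch of that mechanism follows: past hidden states are scored against the current hidden state with a dot product, the softmax over the cache puts probability mass on the words that followed them, and the result is interpolated with the model's own distribution. The scaling factor theta and interpolation weight lam are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cache_distribution(h_t, cache_hiddens, cache_word_ids, vocab_size, theta=0.3):
    """Softmax over dot products between cached hidden states and the current one,
    with the probability of each entry assigned to the word that followed it."""
    scores = theta * (cache_hiddens @ h_t)     # dot product with current hidden state
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # softmax over cache entries
    dist = np.zeros(vocab_size)
    np.add.at(dist, cache_word_ids, probs)     # accumulate mass per cached word id
    return dist

def interpolate(p_model, p_cache, lam=0.2):
    """Linear interpolation of the parametric model and cache distributions."""
    return (1.0 - lam) * p_model + lam * p_cache
```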
Adaptive Input Representations for Neural Language Modeling
TLDR
Adaptive input representations for neural language modeling, which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity, are introduced, and a systematic comparison of popular choices for a self-attentional architecture is performed.
Pointer Sentinel Mixture Models
TLDR
The pointer sentinel-LSTM model achieves state-of-the-art language modeling performance on the Penn Treebank while using far fewer parameters than a standard softmax LSTM, and the freely available WikiText corpus is introduced.
Non-Parametric Adaptation for Neural Machine Translation
TLDR
This work proposes a novel n-gram level retrieval approach that relies on local phrase level similarities, allowing us to retrieve neighbors that are useful for translation even when overall sentence similarity is low, and combines this with an expressive neural network, allowing the model to extract information from the noisy retrieved context.
Generalizing and Hybridizing Count-based and Neural Language Models
TLDR
This work demonstrates how both varieties of models for language modeling can be unified in a single modeling framework that defines a set of probability distributions over the vocabulary of words, and then dynamically calculates mixture weights over these distributions.
REALM: Retrieval-Augmented Language Model Pre-Training
TLDR
The effectiveness of Retrieval-Augmented Language Model pre-training (REALM) is demonstrated by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA) and is found to outperform all previous methods by a significant margin, while also providing qualitative benefits such as interpretability and modularity.
Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
TLDR
This work introduces a novel theoretical framework that facilitates better learning in language modeling, and shows that this framework leads to tying together the input embedding and the output projection matrices, greatly reducing the number of trainable variables.
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
TLDR
This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.