Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
@article{Dai2019TransformerXLAL, title={Transformer-XL: Attentive Language Models beyond a Fixed-Length Context}, author={Zihang Dai and Zhilin Yang and Yiming Yang and Jaime G. Carbonell and Quoc V. Le and Ruslan Salakhutdinov}, journal={ArXiv}, year={2019}, volume={abs/1901.02860} }
Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling. […] Key Method: Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers…
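For intuition, the sketch below illustrates the segment-level recurrence idea behind the longer effective context described above: hidden states computed for the previous segment are cached and reused as extra attention context for the current segment, so dependencies can cross segment boundaries. This is a minimal illustration, not the authors' implementation; the class name SegmentRecurrentAttention is made up here, and the relative positional encoding and causal masking that the full model relies on are omitted.

```python
from typing import Optional

import torch
import torch.nn as nn


class SegmentRecurrentAttention(nn.Module):
    """Illustrative attention layer that also attends over cached memory."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, memory: Optional[torch.Tensor]) -> torch.Tensor:
        # x:      (batch, seg_len, d_model) -- current segment
        # memory: (batch, mem_len, d_model) -- cached hidden states of the
        #         previous segment, treated as constants (gradients are stopped)
        context = x if memory is None else torch.cat([memory.detach(), x], dim=1)
        # Queries come from the current segment only; keys and values also
        # cover the cached memory, extending the effective context length.
        out, _ = self.attn(x, context, context, need_weights=False)
        return out


# Usage: process a long sequence segment by segment, carrying each segment's
# output forward as the next segment's memory.
layer = SegmentRecurrentAttention(d_model=64, n_heads=4)
segments = torch.randn(4, 2, 16, 64)  # 4 segments, batch 2, seg_len 16, d_model 64
memory = None
for seg in segments:
    out = layer(seg, memory)
    memory = out  # becomes the cached memory for the next segment
```

Because the memory is detached, the training cost per segment stays bounded while information can still propagate forward across many segments.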
2,251 Citations
Efficient Language Modeling of Long-Term Dependencies
- Computer Science
- 2020
This work studies and improves properties of the Reformer as a language model, introducing k-means clustering for attention and connection tying in reversible layers to improve the Reformer's complexity and representational power.
Shortformer: Better Language Modeling using Shorter Inputs
- Computer Science, ACL
- 2021
This work identifies conditions under which shorter inputs are not harmful, achieves perplexity and efficiency improvements through two new methods that decrease input length, and shows how to improve the efficiency of recurrence methods in transformers.
Segatron: Segment-Aware Transformer for Language Modeling and Understanding
- Computer Science, AAAI
- 2021
A segment-aware Transformer (Segatron) is proposed that replaces the original token position encoding with a combined position encoding of paragraph, sentence, and token, on the hypothesis that a Transformer with richer positional information can generate better contextual representations.
Do Long-Range Language Models Actually Use Long-Range Context?
- Computer Science, EMNLP
- 2021
This paper performs a fine-grained analysis of two long-range Transformer language models (including the Routing Transformer, which achieves state-of-the-art perplexity on the PG-19 long-sequence LM benchmark dataset) that accept input sequences of up to 8K tokens, and discovers that long-range context helps most for literary novels.
ERNIE-Doc: A Retrospective Long-Document Modeling Transformer
- Computer Science, ACL
- 2021
Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, enable ERNIE-Doc, which has a much longer effective context length, to capture the contextual information of a complete document.
On Efficient Training, Controllability and Compositional Generalization of Insertion-based Language Generators
- Computer Science, ArXiv
- 2021
The proposed InsNet is an insertion-based sequence model that can be trained as efficiently as traditional transformer decoders while maintaining the same performance as a model with a bi-directional context encoder, and is evaluated on story generation and CLEVR-CoGenT captioning.
Sentence Simplification with Transformer-XL and Paraphrase Rules
- Computer Science
- 2019
This project is the first known application of Transformer-XL to a non-LM task, as well as the first known sentence simplification model that can use character embeddings, and proposes and investigates two original approaches to incorporate the Simple Paraphrase Database (PPDB), a large database of reduction rules for words and phrases.
I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths
- Computer Science, ArXiv
- 2020
I-BERT is proposed, a bi-directional Transformer that replaces positional encodings with a recurrent layer that inductively generalizes on a variety of algorithmic tasks where state-of-the-art Transformer models fail to do so.
Big Bird: Transformers for Longer Sequences
- Computer Science, NeurIPS
- 2020
It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.
LaMemo: Language Modeling with Look-Ahead Memory
- Computer Science, NAACL
- 2022
Look-Ahead Memory (LaMemo) is proposed that enhances the recurrence memory by incrementally attending to the right-side tokens and interpolating with the old memory states to maintain long-term information in the history.
References
Showing 1–10 of 69 references
Language Modeling with Gated Convolutional Networks
- Computer Science, ICML
- 2017
A finite-context approach through stacked convolutions, which can be more efficient since it allows parallelization over sequential tokens, is developed; this is the first time a non-recurrent approach has been competitive with strong recurrent models on these large-scale language tasks.
An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation
- Computer Science, ArXiv
- 2018
In experiments on symbolic music, relative self-attention substantially improves sample quality for unconditioned generation and is able to generate sequences longer than those in the training set, making it possible to train on much longer sequences and achieve faster convergence.
Character-Level Language Modeling with Deeper Self-Attention
- Computer Science, AAAI
- 2019
This paper shows that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8.
Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context
- Computer Science, ACL
- 2018
This paper investigates the role of context in an LSTM LM, through ablation studies, and analyzes the increase in perplexity when prior context words are shuffled, replaced, or dropped to provide a better understanding of how neural LMs use their context.
Recurrent Highway Networks
- Computer Science, ICML
- 2017
A novel theoretical analysis of recurrent networks based on Geršgorin's circle theorem is introduced that illuminates several modeling and optimization issues and improves the understanding of the LSTM cell.
Larger-Context Language Modelling
- Computer Science, ArXiv
- 2015
It is found that content words, including nouns, adjectives and verbs, benefit most from an increasing number of context sentences, and this analysis suggests that the larger-context language model improves over the unconditional language model by capturing the theme of a document better and more easily.
An Analysis of Neural Language Modeling at Multiple Scales
- Computer Science, ArXiv
- 2018
This work takes existing state-of-the-art word-level language models based on LSTMs and QRNNs and extends them to both larger vocabularies and character-level granularity, achieving state-of-the-art results on character-level and word-level datasets.
Attention is All you Need
- Computer Science, NIPS
- 2017
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, being applied successfully to English constituency parsing with both large and limited training data.
TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency
- Computer Science, ICLR
- 2017
In this paper, we propose TopicRNN, a recurrent neural network (RNN)-based language model designed to directly capture the global semantic meaning relating words in a document via latent topics.…
Multiplicative LSTM for sequence modelling
- Computer Science, ICLR
- 2017
It is demonstrated empirically that mLSTM outperforms standard LSTM and its deep variants for a range of character-level language modelling tasks, and it is argued that this architecture makes it more expressive for autoregressive density estimation.