Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

Authors: Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, Ruslan Salakhutdinov
Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. [...] Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers…

Efficient Language Modeling of Long-Term Dependencies

This work studies and improves properties of the Reformer as a language model, introducing k-means clustering for attention and connection tying in reversible layers to improve the Reformer's complexity and representational power.

Shortformer: Better Language Modeling using Shorter Inputs

This work identifies conditions where shorter inputs are not harmful, and achieves perplexity and efficiency improvements through two new methods that decrease input length, and shows how to improve the efficiency of recurrence methods in transformers.

Segatron: Segment-Aware Transformer for Language Modeling and Understanding

A segment-aware Transformer (Segatron) is proposed, by replacing the original token position encoding with a combined position encoding of paragraph, sentence, and token, and it is hypothesized that better contextual representations can be generated from the Transformer with richer positional information.

Do Long-Range Language Models Actually Use Long-Range Context?

This paper performs a fine-grained analysis of two long-range Transformer language models (including the Routing Transformer, which achieves state-of-the-art perplexity on the PG-19 long-sequence LM benchmark dataset) that accept input sequences of up to 8K tokens, and discovers that long-range context helps most for literary novels.

ERNIE-Doc: A Retrospective Long-Document Modeling Transformer

Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, enable ERNIE-Doc, which has a much longer effective context length, to capture the contextual information of a complete document.

On Efficient Training, Controllability and Compositional Generalization of Insertion-based Language Generators

The proposed InsNet is an insertion-based sequence model that can be trained as efficiently as traditional transformer decoders while maintaining the same performance as that with a bi-directional context encoder, and is evaluated on story generation and CleVR-CoGENT captioning.

Sentence Simplification with Transformer-XL and Paraphrase Rules

This project is the first known application of Transformer-XL to a non-LM task, as well as the first known sentence simplification model that can use character embeddings, and proposes and investigates two original approaches to incorporate the Simple Paraphrase Database (PPDB), a large database of reduction rules for words and phrases.

I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths

I-BERT is proposed, a bi-directional Transformer that replaces positional encodings with a recurrent layer that inductively generalizes on a variety of algorithmic tasks where state-of-the-art Transformer models fail to do so.

Big Bird: Transformers for Longer Sequences

It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.

LaMemo: Language Modeling with Look-Ahead Memory

Look-Ahead Memory (LaMemo) is proposed that enhances the recurrence memory by incrementally attending to the right-side tokens and interpolating with the old memory states to maintain long-term information in the history.



Language Modeling with Gated Convolutional Networks

A finite-context approach through stacked convolutions is developed, which can be more efficient since it allows parallelization over sequential tokens; this is the first time a non-recurrent approach is competitive with strong recurrent models on these large-scale language tasks.

An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation

In experiments on symbolic music, relative self-attention substantially improves sample quality for unconditioned generation and is able to generate sequences of lengths longer than those from the training set, making it possible to train much longer sequences and achieve faster convergence.

Character-Level Language Modeling with Deeper Self-Attention

This paper shows that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8.

Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context

This paper investigates the role of context in an LSTM LM, through ablation studies, and analyzes the increase in perplexity when prior context words are shuffled, replaced, or dropped to provide a better understanding of how neural LMs use their context.

Recurrent Highway Networks

A novel theoretical analysis of recurrent networks based on Geršgorin's circle theorem is introduced that illuminates several modeling and optimization issues and improves the understanding of the LSTM cell.

Larger-Context Language Modelling

It is found that content words, including nouns, adjectives and verbs, benefit most from an increasing number of context sentences, and this analysis suggests that a larger-context language model improves on the unconditional language model by capturing the theme of a document better and more easily.

An Analysis of Neural Language Modeling at Multiple Scales

This work takes existing state-of-the-art word-level language models based on LSTMs and QRNNs and extends them to both larger vocabularies and character-level granularity, achieving state-of-the-art results on character-level and word-level datasets.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency

In this paper, we propose TopicRNN, a recurrent neural network (RNN)-based language model designed to directly capture the global semantic meaning relating words in a document via latent topics.

Multiplicative LSTM for sequence modelling

It is demonstrated empirically that mLSTM outperforms standard LSTM and its deep variants for a range of character-level language modelling tasks, and it is argued that the architecture is more expressive for autoregressive density estimation.