Corpus ID: 207930593

Compressive Transformers for Long-Range Sequence Modelling

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap
We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results on the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range… 
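The memory update the abstract describes can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: function and variable names are hypothetical, and mean-pooling is only one of the compression functions the paper considers (alongside max-pooling, 1D convolution, and learned compression).

```python
import numpy as np

def update_memories(memory, comp_memory, new_activations, comp_rate=3):
    """One memory-update step in the spirit of the Compressive Transformer.

    memory:          (n_mem, d)  most recent past activations (FIFO).
    comp_memory:     (n_cmem, d) older activations in compressed form.
    new_activations: (seq_len, d) activations from the current segment.

    The oldest seq_len rows of `memory` are evicted; instead of being
    discarded (as in Transformer-XL), each group of `comp_rate` evicted
    rows is mean-pooled into one row of compressed memory.
    """
    seq_len = len(new_activations)
    evicted = memory[:seq_len]                       # oldest slots
    n_groups = len(evicted) // comp_rate
    # mean-pool each group of comp_rate rows into a single compressed row
    compressed = evicted[:n_groups * comp_rate].reshape(
        n_groups, comp_rate, -1).mean(axis=1)
    # shift the FIFO memory and append the new segment
    memory = np.concatenate([memory[seq_len:], new_activations])
    # shift the compressed memory and append the new compressed rows
    comp_memory = np.concatenate([comp_memory[n_groups:], compressed])
    return memory, comp_memory
```

Attention in each layer would then be computed over the concatenation of compressed memory, memory, and the current segment, which is how the model extends its effective context at a fixed memory cost.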
DCT: Dynamic Compressive Transformer for Modeling Unbounded Sequence
The proposed Dynamic Compressive Transformer uses a learned policy that decides, during training, whether each sequence should be kept in memory in a compressed state or discarded, with the benefit of retaining semantically meaningful sentence information in the memory system.
Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory
This work proposes a novel augmented-memory self-attention, which attends over a short segment of the input sequence and a bank of memories that stores the embedding information for all processed segments.
Compressive Performers in Language Modelling
This work introduces the Compressive Performer, a hybrid Transformer variant based on two existing model architectures: the Performer, which reduces the memory requirement and processing time of the Transformer, and the Compressive Transformer.
Memformer: The Memory-Augmented Transformer
Results show that Memformer outperforms previous long-range sequence models on WikiText-103, including Transformer-XL and the Compressive Transformer, and is also compatible with other self-supervised tasks to further improve performance on language modelling.
TransformerFTC: Scaling Low-Dimensional Transformers for Higher Performance
This work presents TransformerFTC, a Transformer architecture that utilizes subsampling methods to train deep Transformers at a lower computational cost, and suggests that compressing the upper layers of a Transformer is a promising strategy for model efficiency.
Exploring Transformers for Large-Scale Speech Recognition
It is shown that Transformers can achieve around 6% relative word error rate (WER) reduction compared to the BLSTM baseline in the offline setting, while in the streaming setting, Transformer-XL is comparable to LC-BLSTM under an 800-millisecond latency constraint.
Efficient Transformers: A Survey
This paper characterizes a large and thoughtful selection of recent efficiency-flavored “X-former” models, providing an organized and comprehensive overview of existing work and models across multiple domains.
Longformer: The Long-Document Transformer
Following prior work on long-sequence Transformers, Longformer is evaluated on character-level language modelling and achieves state-of-the-art results on text8 and enwik8; the authors also pretrain Longformer and finetune it on a variety of downstream tasks.
Linearizing Transformer with Key-Value Memory Bank
It is demonstrated that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla Transformer and other linear variants in language modelling and machine translation tasks, revealing a viable direction towards further inference-efficiency improvement.
Memory Transformer
This work proposes and studies two extensions of the Transformer baseline: adding memory tokens to store non-local representations, and creating a memory bottleneck for the global information. These memory-augmented Transformers are evaluated on a machine translation task, demonstrating that memory size positively correlates with model performance.


Generating Long Sequences with Sparse Transformers
This paper introduces sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
Character-Level Language Modeling with Deeper Self-Attention
This paper shows that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Multiplicative LSTM for sequence modelling
It is demonstrated empirically that mLSTM outperforms standard LSTM and its deep variants on a range of character-level language modelling tasks; it is argued that this makes it more expressive for autoregressive density estimation.
Trellis Networks for Sequence Modeling
Trellis networks are presented, a new architecture for sequence modeling that outperforms the current state-of-the-art methods on a variety of challenging benchmarks, including word-level and character-level language modelling tasks, as well as stress tests designed to evaluate long-term memory retention.
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
Language Modeling with Gated Convolutional Networks
A finite-context approach through stacked convolutions is developed, which can be more efficient since it allows parallelization over sequential tokens; this is the first time a non-recurrent approach is competitive with strong recurrent models on these large-scale language tasks.
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The recently developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applied successfully to English constituency parsing with both large and limited training data.
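The core operation shared by the Transformer and all the memory-augmented variants above is scaled dot-product attention, $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$. A minimal NumPy sketch (illustrative only, without the multi-head projections or masking of the full model):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays of vectors."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ v                              # weighted average of values
```

The Compressive Transformer changes only where the keys and values come from: they are drawn from the current segment plus the memory and compressed-memory banks, while the queries come from the current segment alone.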
XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and overcomes the limitations of BERT thanks to its autoregressive formulation.