Corpus ID: 211252737

Accessing Higher-level Representations in Sequential Transformers with Feedback Memory

Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, Sainbayar Sukhbaatar
Transformers are feedforward networks that can process input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input: the representation at a given layer can only access representations from lower layers, not the higher-level representations already built in previous time steps. In this work, we propose the Feedback Transformer architecture that exposes all previous…
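The core idea described above can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not the paper's implementation: every layer at the current step attends to one shared feedback memory that holds a weighted average over all layer outputs of past steps, so lower layers see the higher-level representations built earlier. The function name, single-head attention, and tanh stand-in for the feed-forward sublayer are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, mem):
    # q: (d,), mem: (t, d) -> single-head scaled dot-product attention
    scores = mem @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ mem

def feedback_transformer_step(x_t, memory, weights, n_layers=3):
    """One decoding step of a feedback-memory sketch: every layer attends
    to the SAME memory, a list of fused vectors from past time steps."""
    states = [x_t]
    h = x_t
    for _ in range(n_layers):
        ctx = np.vstack(memory + [h])  # past fused memories + current state
        h = h + attend(h, ctx)         # residual attention into shared memory
        h = np.tanh(h)                 # stand-in for the feed-forward sublayer
        states.append(h)
    # fuse all layer representations of this step into one memory vector,
    # so future (even lower) layers can read this step's top-level features
    w = softmax(np.array(weights[: n_layers + 1]))
    fused = sum(wi * si for wi, si in zip(w, states))
    memory.append(fused)
    return h, memory
```

In a standard Transformer, layer k at step t could only attend to layer k-1 outputs of earlier steps; here the appended `fused` vector mixes every layer, which is what "exposes higher-level representations" amounts to in this sketch.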

Paper Mentions

Do Transformers Need Deep Long-Range Memory?
This work performs a set of interventions to show that comparable performance can be obtained with 6X fewer long-range memories, and that better performance can be obtained by limiting the range of attention in lower layers of the network.
E.T.: Entity-Transformers. Coreference augmented Neural Language Model for richer mention representations via Entity-Transformer blocks
This model extends the Transformer-layer architecture of GPT-2 to Entity-Transformers, an architecture designed to handle coreference information when present, achieving richer representations for entity mentions at insignificant training cost.
Language Model Using Neural Turing Machine Based on Localized Content-Based Addressing
The performance of a long short-term memory (LSTM) recurrent neural network (RNN)-based language model has been improved on language model benchmarks. Although a recurrent layer has been widely used, …
Hierarchical Memory Decoder for Visual Narrating
A novel memory decoder for visual narrating is devised, consisting of multiple memory layers that alleviate dilution of long-term information and leverage the latent information of each layer, which helps generate accurate descriptions.
Dispatcher: A Message-Passing Approach To Language Modelling
A new layer type is introduced that aims to substitute for self-attention in unidirectional sequence generation tasks; it achieves perplexity comparable to prior results while being more efficient.
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute
SRU++, a recurrent unit with optional built-in attention, is presented; it exhibits state-of-the-art modeling capacity and training efficiency, reaffirming that attention alone is not sufficient and can be complementary to other sequential modeling modules.


Augmenting Self-attention with Persistent Memory
A new model consisting solely of attention layers is proposed, which augments the self-attention layers with persistent memory vectors that play a role similar to the feed-forward layer.
Universal Transformers
The Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses issues of parallelizability and global receptive field, is proposed.
Modeling Recurrence for Transformer
This work proposes to directly model recurrence for Transformer with an additional recurrence encoder, and introduces a novel attentive recurrent network to leverage the strengths of both attention models and recurrent networks.
Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets
The limitations of standard deep learning approaches are discussed, and it is shown that some of these limitations can be overcome by learning how to grow the complexity of a model in a structured way.
R-Transformer: Recurrent Neural Network Enhanced Transformer
The R-Transformer is proposed, which enjoys the advantages of both RNNs and the multi-head attention mechanism while avoiding their respective drawbacks, and can effectively capture both local structures and global long-term dependencies in sequences without any use of position embeddings.
End-To-End Memory Networks
A neural network with a recurrent attention model over a possibly large external memory is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Generating Long Sequences with Sparse Transformers
This paper introduces sparse factorizations of the attention matrix which reduce its cost to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
Reducing Transformer Depth on Demand with Structured Dropout
LayerDrop, a form of structured dropout, is explored, which has a regularization effect during training and allows for efficient pruning at inference time, and shows that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance.
Trellis Networks for Sequence Modeling
Trellis networks are presented, a new architecture for sequence modeling that outperforms current state-of-the-art methods on a variety of challenging benchmarks, including word-level and character-level language modeling tasks, as well as stress tests designed to evaluate long-term memory retention.