Corpus ID: 219966916

Memory Transformer

Mikhail S. Burtsev, Grigory V. Sapunov
Transformer-based models have achieved state-of-the-art results in many natural language processing (NLP) tasks. The self-attention architecture combines information from all elements of a sequence into context-aware representations. However, all-to-all attention severely limits the model's scaling to long sequences. Another limitation is that information about the global context is stored in the same element-wise representations. This makes the processing of properties related to the…
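The scaling cost the abstract mentions comes from the attention score matrix, which is quadratic in sequence length. A minimal NumPy sketch of single-head self-attention below illustrates this, together with the memory-token idea the paper's title suggests: a small set of extra vectors is simply prepended to the sequence and attended to like ordinary tokens. All names, shapes, and the random initialization here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (n, d) token representations.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # scores is (n, n): every position attends to every other,
    # hence the O(n^2 * d) all-to-all cost.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d, n, m = 8, 5, 2
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

X = rng.normal(size=(n, d))   # ordinary input tokens
M = rng.normal(size=(m, d))   # hypothetical learned memory-token embeddings
out = self_attention(np.vstack([M, X]), Wq, Wk, Wv)
print(out.shape)  # (n + m, d): memory slots participate in attention like tokens
```

Prepending `m` memory slots only grows the score matrix to (n + m)²; it does not change the quadratic dependence on sequence length, which is why the blurbs below explore sparse, global-local, and recurrent attention patterns instead.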


Current Limitations of Language Models: What You Need is Retrieval
It is argued that improving the performance-compute trade-off of language models can reduce the amount of supervision required, efficiently extend the context over the entire training dataset and the entire past of the current sample, and resolve many of these limitations.
Linearizing Transformer with Key-Value Memory Bank
It is demonstrated that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer and other linear variants in language modeling and machine translation tasks, revealing a viable direction towards further inference efficiency improvement.
Linearizing Transformer with Key-Value Memory
It is demonstrated that MemSizer provides an improved balance between efficiency and accuracy over the vanilla transformer and other efficient transformer variants in three typical sequence generation tasks, including machine translation, abstractive text summarization, and language modeling.
Memory transformer with hierarchical attention for long document processing
  • Arij Al Adel, M. Burtsev
  • Computer Science
    2021 International Conference Engineering and Telecommunication (En&T)
  • 2021
A new version of the transformer is introduced, a sentence-level transformer with global memory pooling and hierarchical attention to cope with long text, hypothesizing that attaching memory slots to each sequence improves translation quality.
Memory-Augmented Transformer for Remote Sensing Image Semantic Segmentation
This paper proposes a memory-augmented transformer (MAT) to effectively model both the local and global information and demonstrates that the method can perform competitively with state-of-the-art methods.
Classification of Recent Language Model Approaches
  • 2020

Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
End-To-End Memory Networks
A neural network with a recurrent attention model over a possibly large external memory that is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
ETC: Encoding Long and Structured Inputs in Transformers
A new Transformer architecture, Extended Transformer Construction (ETC), is presented that addresses two key challenges of standard Transformer architectures, namely scaling input length and encoding structured inputs.
ETC: Encoding Long and Structured Data in Transformers
A new family of Transformer models is presented, called the Extended Transformer Construction (ETC), that allows for significant increases in input sequence length by introducing a new global-local attention mechanism between a global memory and the standard input tokens.
Longformer: The Long-Document Transformer
Following prior work on long-sequence transformers, the Longformer is evaluated on character-level language modeling, achieving state-of-the-art results on text8 and enwik8; the authors also pretrain Longformer and finetune it on a variety of downstream tasks.
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
Non-autoregressive Machine Translation with Disentangled Context Transformer
An attention-masking based model, called the Disentangled Context (DisCo) transformer, simultaneously generates all tokens given different contexts and achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.
Parallel Machine Translation with Disentangled Context Transformer
This work proposes an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts, and develops the parallel easy-first inference algorithm, which iteratively refines every token in parallel and reduces the number of required iterations.
Memory Networks
This work describes a new class of learning models called memory networks, which reason with inference components combined with a long-term memory component; they learn how to use these jointly.
Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes
This work presents an end-to-end differentiable memory access scheme, called Sparse Access Memory (SAM), that retains the representational power of the original approaches while training efficiently with very large memories, and achieves asymptotic lower bounds in space and time complexity.