Corpus ID: 235422257

Memory-efficient Transformers via Top-k Attention

Ankit Gupta, Guy Dar, Shaya Goodman, David Ciprut, Jonathan Berant
Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, it is not possible to directly use them with popular pre-trained language models trained using vanilla attention, without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention…
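The approximation the abstract describes — keeping, for each query, only its k largest attention scores before the softmax — can be sketched as follows. This is a minimal NumPy illustration of the idea, not the authors' released implementation; all names are ours, and the scaling by $\sqrt{d}$ is omitted for brevity:

```python
import numpy as np

def topk_attention(Q, K, V, k):
    """Approximate softmax attention by keeping, for each query row,
    only its k largest dot-product scores and masking the rest to -inf
    before the softmax. Ties at the k-th value may keep a few extra
    entries; this is an illustrative sketch, not the paper's code."""
    scores = Q @ K.T                                   # (n_q, n_k) dot products
    # k-th largest score per row (np.partition puts the top-k in the last k slots)
    kth = np.partition(scores, -k, axis=-1)[:, -k:].min(axis=-1, keepdims=True)
    masked = np.where(scores >= kth, scores, -np.inf)  # drop all but the top k
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V
```

Setting k equal to the number of keys recovers exact softmax attention, which is one way to sanity-check the sketch.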

References

Random Feature Attention
RFA, a linear time-and-space attention mechanism that uses random feature methods to approximate the softmax function, is proposed and explored; RFA is shown to be competitive in both accuracy and efficiency on three long-text classification datasets.
GMAT: Global Memory Augmentation for Transformers
This work proposes to augment sparse Transformer blocks with a dense attention-based $\textit{global memory}$ of length $M$ ($\ll L$) which provides an aggregate global view of the entire input sequence to each position, and empirically shows that this method leads to substantial improvement on a range of tasks.
Linformer: Self-Attention with Linear Complexity
This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
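The low-rank construction that summary refers to can be sketched as follows — a hedged NumPy illustration assuming fixed projection matrices $E, F$ (following the paper's notation), not the released Linformer code:

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Low-rank self-attention sketch: fixed projections E, F (k x n)
    compress the length-n keys and values down to k rows, so the
    attention matrix is n x k rather than n x n (linear in n for
    fixed k). Illustrative only; names are assumptions, not the API."""
    d = Q.shape[-1]
    Kp = E @ K                                    # (k, d) projected keys
    Vp = F @ V                                    # (k, d) projected values
    scores = (Q @ Kp.T) / np.sqrt(d)              # (n, k) instead of (n, n)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ Vp                                 # (n, d) output
```

Because the softmax is taken over only k projected positions, both the score matrix and the intermediate memory scale linearly with the sequence length n.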
Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
This work proposes a novel model, the Explicit Sparse Transformer, which improves the concentration of attention on the global context through explicit selection of the most relevant segments; it achieves comparable or better results than previous sparse attention methods while significantly reducing training and testing time.
Reformer: The Efficient Transformer
This work replaces dot-product attention with one based on locality-sensitive hashing, and uses reversible residual layers instead of standard residuals, which allows storing activations only once during training instead of several times, making the model much more memory-efficient and much faster on long sequences.
BP-Transformer: Modelling Long-Range Context via Binary Partitioning
By adopting a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), BP-Transformer (BPT for short) is proposed, which outperforms previous self-attention models on long text.
Augmenting Self-attention with Persistent Memory
A new model consisting solely of attention layers is proposed; it augments the self-attention layers with persistent memory vectors that play a role similar to the feed-forward layer.
Generating Long Sequences with Sparse Transformers
This paper introduces sparse factorizations of the attention matrix which reduce its quadratic cost to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Attention is All you Need
The Transformer, a simple new network architecture based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.