Corpus ID: 235422257

Memory-efficient Transformers via Top-k Attention

@article{Gupta2021MemoryefficientTV,
  title={Memory-efficient Transformers via Top-k Attention},
  author={Ankit Gupta and Guy Dar and Shaya Goodman and David Ciprut and Jonathan Berant},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.06899}
}
Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, it is not possible to directly use them with popular pre-trained language models trained using vanilla attention, without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention…
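The truncated abstract only hints at the method, so the following is a minimal sketch of one plausible reading of per-query top-k attention with query chunking: for each query, only its k highest-scoring keys receive probability mass, and queries are processed in chunks so the full score matrix is never held at once. The function name, signature, and defaults (top_k, query_chunk_size) are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of top-k attention with query chunking (illustrative only;
# names and defaults are assumptions, not the paper's reference code).
import torch
import torch.nn.functional as F

def topk_attention(q, k, v, top_k=64, query_chunk_size=1024):
    # q: (..., n_q, d), k: (..., n_k, d), v: (..., n_k, d_v)
    scale = q.shape[-1] ** -0.5
    outputs = []
    for start in range(0, q.shape[-2], query_chunk_size):
        q_chunk = q[..., start:start + query_chunk_size, :]
        # Scores for this query chunk only: (..., chunk, n_k); the full
        # (n_q x n_k) matrix is never materialised at once.
        scores = torch.matmul(q_chunk, k.transpose(-1, -2)) * scale
        # Keep each query's top-k scores; everything else is masked to -inf
        # so it contributes zero probability after the softmax.
        k_eff = min(top_k, scores.shape[-1])
        top_vals, top_idx = scores.topk(k_eff, dim=-1)
        masked = torch.full_like(scores, float('-inf'))
        masked.scatter_(-1, top_idx, top_vals)
        probs = F.softmax(masked, dim=-1)
        outputs.append(torch.matmul(probs, v))
    return torch.cat(outputs, dim=-2)

Note that with top_k equal to the number of keys this reduces exactly to vanilla softmax attention, which is consistent with the abstract's claim that the approximation can be used with models pre-trained using vanilla attention.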

