Corpus ID: 219530577

Linformer: Self-Attention with Linear Complexity

@article{Wang2020LinformerSW,
  title={Linformer: Self-Attention with Linear Complexity},
  author={Sinong Wang and Belinda Z. Li and Madian Khabsa and Han Fang and Hao Ma},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.04768}
}
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding…
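As a rough illustration of the idea summarized above, the sketch below shows a single-head, Linformer-style attention in which learned projections compress the keys and values from sequence length n down to a fixed k, so the attention map has shape n x k rather than n x n. The projection matrices E and F, the head size, and all names below are illustrative assumptions, not the paper's exact implementation.

# Minimal PyTorch sketch of linear self-attention via sequence-length projection.
# E, F (shape k x n, k << n) compress keys and values along the sequence axis;
# these names and shapes are assumptions for illustration only.
import torch

def linformer_style_attention(Q, K, V, E, F, d_head):
    # Q, K, V: (batch, n, d_head); E, F: (k, n)
    K_proj = torch.einsum("kn,bnd->bkd", E, K)  # compressed keys:   (batch, k, d_head)
    V_proj = torch.einsum("kn,bnd->bkd", F, V)  # compressed values: (batch, k, d_head)
    scores = Q @ K_proj.transpose(-2, -1) / d_head ** 0.5  # (batch, n, k): linear in n
    attn = torch.softmax(scores, dim=-1)
    return attn @ V_proj  # (batch, n, d_head)

# With n = 4096 and k = 256, the attention map is 4096 x 256 instead of 4096 x 4096.
batch, n, k, d = 2, 4096, 256, 64
Q, K, V = (torch.randn(batch, n, d) for _ in range(3))
E, F = torch.randn(k, n) / n ** 0.5, torch.randn(k, n) / n ** 0.5
out = linformer_style_attention(Q, K, V, E, F, d)
print(out.shape)  # torch.Size([2, 4096, 64])

Because the softmax is taken over only k projected positions, both time and memory for the attention map scale as O(nk) rather than O(n^2).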
Citations

Memformer: The Memory-Augmented Transformer
An Attention Free Transformer
Combiner: Full Attention Transformer with Sparse Computation Cost
Luna: Linear Unified Nested Attention
Long-Short Transformer: Efficient Transformers for Language and Vision
Memory-efficient Transformers via Top-k Attention
PairConnect: A Compute-Efficient MLP Alternative to Attention
THG: Transformer with Hyperbolic Geometry
...

References

Generating Long Sequences with Sparse Transformers
Reformer: The Efficient Transformer
Attention is All you Need
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Transformers with convolutional context for ASR
Training Deep Nets with Sublinear Memory Cost
Language Models are Unsupervised Multitask Learners
...