Corpus ID: 219530577

Linformer: Self-Attention with Linear Complexity

@article{Wang2020LinformerSW,
  title={Linformer: Self-Attention with Linear Complexity},
  author={Sinong Wang and Belinda Z. Li and Madian Khabsa and Han Fang and Hao Ma},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.04768}
}
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding…
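To make the low-rank idea concrete, here is a minimal sketch (a Linformer-style illustration under assumptions, not the authors' released implementation): the length-n key and value sequences are projected down to a fixed k before the softmax, so the attention map costs O(nk) instead of O(n^2). The tensor shapes, the projection matrices E and F_proj, and the function name are choices made for this example.

import torch

def low_rank_attention(Q, K, V, E, F_proj):
    # Q, K, V: (n, d); E, F_proj: (k, n) projections along the sequence axis.
    K_low = E @ K                                   # (k, d) compressed keys
    V_low = F_proj @ V                              # (k, d) compressed values
    scores = Q @ K_low.T / Q.shape[-1] ** 0.5       # (n, k) instead of (n, n)
    return torch.softmax(scores, dim=-1) @ V_low    # (n, d) output

n, d, k = 4096, 64, 256
Q, K, V = (torch.randn(n, d) for _ in range(3))
E, F_proj = (torch.randn(k, n) / n ** 0.5 for _ in range(2))
print(low_rank_attention(Q, K, V, E, F_proj).shape)  # torch.Size([4096, 64])

In the paper the projections are learned (and can be shared across heads and layers); the random matrices above only stand in for them to show the shapes and the cost.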
Citations

Memformer: The Memory-Augmented Transformer
TLDR: Results show that Memformer outperforms the previous long-range sequence models on WikiText-103, including Transformer-XL and compressive Transformer, and is also compatible with other self-supervised tasks to further improve the performance on language modeling.
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
TLDR: This work describes an efficient hierarchical method to compute attention in the Transformer architecture that exploits a matrix structure similar to the Hierarchical Matrix developed by the numerical analysis community, and has linear run time and memory complexity.
Adaptive Multi-Resolution Attention with Linear Complexity
TLDR: A novel and efficient structure named Adaptive Multi-Resolution Attention (AdaMRA) is proposed, which scales linearly with sequence length in time and space and leverages a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
An Attention Free Transformer
TLDR: Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot product self attention, is introduced and demonstrates competitive performance on all the benchmarks, while providing excellent efficiency at the same time.
Random Feature Attention
TLDR: RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, is proposed and explored, showing that RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets.
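As a rough illustration of the random-feature idea (a sketch under assumptions, not the paper's implementation): l2-normalised queries and keys are mapped through trigonometric random features so the softmax kernel is approximated by an inner product, and the matrix products are reordered so the cost is linear in sequence length. The feature dimension, the clamp, and the function names are choices made for this example.

import torch
import torch.nn.functional as F

def trig_features(x, W):
    # Random Fourier features for the Gaussian kernel; W rows ~ N(0, I), x: (n, d) -> (n, 2D).
    proj = x @ W.T
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1) / W.shape[0] ** 0.5

def rfa_style_attention(Q, K, V, W, eps=1e-6):
    # With unit-norm q and k, exp(q.k) is proportional to a Gaussian kernel, which the random
    # features approximate; reordering the products avoids ever forming the (n, n) matrix.
    phi_q = trig_features(F.normalize(Q, dim=-1), W)        # (n, 2D)
    phi_k = trig_features(F.normalize(K, dim=-1), W)        # (n, 2D)
    num = phi_q @ (phi_k.T @ V)                             # (n, d), cost O(n * 2D * d)
    den = (phi_q @ phi_k.sum(dim=0, keepdim=True).T).clamp_min(eps)  # (n, 1) normaliser
    return num / den

n, d, D = 2048, 64, 128
Q, K, V = (torch.randn(n, d) for _ in range(3))
W = torch.randn(D, d)
print(rfa_style_attention(Q, K, V, W).shape)  # torch.Size([2048, 64])

The clamp only guards against tiny or negative kernel estimates, which trigonometric features can produce; the paper's additional components, such as its gating variant, are omitted here.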
Combiner: Full Attention Transformer with Sparse Computation Cost
TLDR: Combiner is a drop-in replacement for attention layers in existing transformers that can be easily implemented in common frameworks; the authors show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
Luna: Linear Unified Nested Attention
TLDR: Luna is proposed, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear time and space complexity.
Long-Short Transformer: Efficient Transformers for Language and Vision
TLDR: The proposed Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks, aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Synthesizer: Rethinking Self-Attention in Transformer Models
TLDR: The true importance and contribution of the dot-product-based self-attention mechanism to the performance of Transformer models is investigated, and a model called Synthesizer that learns synthetic attention weights without token-token interactions is proposed.
Memory-efficient Transformers via Top-k Attention
TLDR: This work proposes a simple yet highly accurate approximation of vanilla attention whose accuracy is nearly identical to vanilla attention in multiple setups, including training from scratch, fine-tuning, and zero-shot inference.
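A minimal sketch of the top-k idea (an illustration under assumptions, not the paper's code): for each query, only its k largest attention scores are kept before the softmax, and queries are processed in chunks so the full score matrix is never held in memory at once. The chunk size and k below are arbitrary choices.

import torch

def topk_attention(Q, K, V, k, chunk=1024):
    # Q, K, V: (n, d). Each query attends only to the k keys with the largest scores.
    outs = []
    for q in Q.split(chunk):                         # cap peak memory at (chunk, n) scores
        scores = q @ K.T / Q.shape[-1] ** 0.5        # (chunk, n)
        vals, idx = scores.topk(k, dim=-1)           # keep the k best scores per query
        probs = torch.softmax(vals, dim=-1)          # softmax over the kept scores only
        outs.append(torch.einsum('ck,ckd->cd', probs, V[idx]))  # gather k value rows per query
    return torch.cat(outs)

n, d = 8192, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
print(topk_attention(Q, K, V, k=64).shape)  # torch.Size([8192, 64])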

References

Showing 1-10 of 32 references
Generating Long Sequences with Sparse Transformers
TLDR: This paper introduces sparse factorizations of the attention matrix which reduce the quadratic cost to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
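To illustrate the kind of factorized pattern involved (a dense-mask sketch for clarity; real implementations rely on block-sparse kernels to realize the savings), each position below attends to a local window plus every stride-th earlier position, giving roughly $O(n \sqrt{n})$ nonzero entries when the window and stride are near $\sqrt{n}$. The window and stride values are illustrative assumptions.

import torch

def strided_sparse_mask(n, window, stride):
    # Causal mask: query i may attend to the previous `window` positions and to
    # every `stride`-th position before it (one of the factorized patterns).
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    causal = j <= i
    local = (i - j) < window
    strided = (j % stride) == (stride - 1)
    return causal & (local | strided)

def masked_attention(Q, K, V, mask):
    scores = Q @ K.T / Q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float('-inf'))   # disallowed positions get zero weight
    return torch.softmax(scores, dim=-1) @ V

n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
mask = strided_sparse_mask(n, window=32, stride=32)
print(masked_attention(Q, K, V, mask).shape)  # torch.Size([1024, 64])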
Reformer: The Efficient Transformer
TLDR: This work replaces dot-product attention by one that uses locality-sensitive hashing and uses reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of several times, making the model much more memory-efficient and much faster on long sequences.
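A much-simplified sketch of the hashing idea (not Reformer itself, which also sorts and chunks the buckets and uses multiple hash rounds): shared query/key vectors are hashed with random rotations, and each query attends only to keys that landed in the same bucket. The bucket count and hashing details here are assumptions for illustration.

import torch
import torch.nn.functional as F

def lsh_buckets(x, n_buckets):
    # Random-rotation hashing: bucket = argmax over rotated directions and their negations.
    R = torch.randn(x.shape[-1], n_buckets // 2)
    proj = F.normalize(x, dim=-1) @ R
    return torch.argmax(torch.cat([proj, -proj], dim=-1), dim=-1)

def bucketed_attention(QK, V, n_buckets=32):
    # QK: (n, d) shared query/key vectors, as in LSH attention; attention stays within buckets.
    buckets = lsh_buckets(QK, n_buckets)
    out = torch.zeros_like(V)
    for b in range(n_buckets):
        idx = (buckets == b).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        q = QK[idx]
        attn = torch.softmax(q @ q.T / q.shape[-1] ** 0.5, dim=-1)
        out[idx] = attn @ V[idx]
    return out

n, d = 4096, 64
QK, V = torch.randn(n, d), torch.randn(n, d)
print(bucketed_attention(QK, V).shape)  # torch.Size([4096, 64])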
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR: This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR: A new language representation model, BERT, is introduced, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
TLDR: BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.
Transformers with convolutional context for ASR
TLDR: This paper proposes replacing the sinusoidal positional embedding for transformers with convolutionally learned input representations that provide subsequent transformer blocks with relative positional information needed for discovering long-range relationships between local concepts.
Training Deep Nets with Sublinear Memory Cost
TLDR: This work designs an algorithm that costs $O(\sqrt{n})$ memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch, and shows that it is possible to trade computation for memory, giving a more memory-efficient training algorithm with a little extra computation cost.
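A minimal sketch of the checkpointing idea using PyTorch's torch.utils.checkpoint utility (the layer sizes and the number of segments are arbitrary assumptions): activations are stored only at segment boundaries and recomputed inside each segment during the backward pass, so activation memory grows with the number of segments rather than the number of layers, at the cost of roughly one extra forward pass.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

n_layers, width, segments = 64, 1024, 8   # segments ~ sqrt(n_layers) matches the O(sqrt(n)) regime
layers = [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(n_layers)]
model = nn.Sequential(*layers)

x = torch.randn(32, width, requires_grad=True)
y = checkpoint_sequential(model, segments, x)   # forward pass keeps only segment-boundary activations
y.sum().backward()                              # backward pass recomputes the discarded activations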
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR: It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Language Models are Unsupervised Multitask Learners
TLDR: It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.