Corpus ID: 221702858

Efficient Transformers: A Survey

@article{Tay2020EfficientTA,
  title={Efficient Transformers: A Survey},
  author={Yi Tay and Mostafa Dehghani and Dara Bahri and Donald Metzler},
  journal={ArXiv},
  year={2020},
  volume={abs/2009.06732}
}
Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of "X-former" models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few - which improve upon the original Transformer…
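For reference, the quadratic cost that the surveyed "X-formers" attack comes from the (n, n) score matrix of standard scaled dot-product attention. The following is a minimal NumPy sketch of that baseline (illustrative code, not from the survey):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_attention(Q, K, V):
    """Standard scaled dot-product attention.

    Q, K, V: (n, d) arrays. The score matrix Q K^T is (n, n), which is the
    quadratic bottleneck that efficient Transformer variants target.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (n, n): O(n^2) time and memory
    return softmax(scores, axis=-1) @ V    # (n, d)

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(vanilla_attention(Q, K, V).shape)    # (8, 4)
```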

Citations

GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures
TLDR
This work demonstrates a set of modifications to the structure of a Transformer layer, producing a more efficient architecture, applies the resulting architecture to language representation learning, and shows superior performance compared to BERT models of different scales.
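GroupBERT's "grouped structures" build on the general idea of grouped transformations (as in grouped convolutions): splitting a dense projection into g independent groups cuts its parameters and FLOPs by roughly a factor of g. A hedged sketch of a generic grouped linear layer, not GroupBERT's exact block:

```python
import numpy as np

def grouped_linear(X, weights):
    """Apply a grouped linear transformation.

    X: (n, d), split into g equal feature groups; weights is a list of g
    matrices of shape (d // g, d_out // g). Total parameters drop from
    d * d_out (dense) to d * d_out / g.
    """
    g = len(weights)
    chunks = np.split(X, g, axis=-1)
    return np.concatenate([c @ w for c, w in zip(chunks, weights)], axis=-1)

n, d, d_out, g = 4, 8, 8, 2
rng = np.random.default_rng(4)
W = [rng.standard_normal((d // g, d_out // g)) for _ in range(g)]
print(grouped_linear(rng.standard_normal((n, d)), W).shape)   # (4, 8)
```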
THG: Transformer with Hyperbolic Geometry
TLDR
This work proposes a novel Transformer with Hyperbolic Geometry (THG) model, which takes advantage of both Euclidean and hyperbolic space and improves the linear transformations of self-attention.
Finetuning Pretrained Transformers into RNNs
TLDR
This work proposes a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, the softmax attention is replaced with its linear-complexity recurrent alternative and the model is then finetuned, which provides an improved trade-off between efficiency and accuracy over the standard transformer and other recurrent variants.
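The recurrent alternative referred to here is a kernelized linear attention that can be unrolled as an RNN over the sequence. Below is a hedged sketch of that recurrent view; the feature map `phi` (ELU + 1) is a stand-in for the paper's learned feature map:

```python
import numpy as np

def phi(x):
    # Illustrative positive feature map (ELU + 1); the paper instead learns
    # a small feature map. This is only a stand-in.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_rnn(Q, K, V, eps=1e-6):
    """Causal linear attention unrolled as a recurrence.

    The state S accumulates outer products phi(k_t) v_t^T and z accumulates
    phi(k_t), so each step costs O(d^2) and the whole pass O(n d^2),
    instead of the O(n^2 d) of softmax attention.
    """
    n, d = Q.shape
    S = np.zeros((d, V.shape[-1]))   # running sum of phi(k) v^T
    z = np.zeros(d)                  # running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(n):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + eps)
    return out

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((10, 4)) for _ in range(3))
print(linear_attention_rnn(Q, K, V).shape)   # (10, 4)
```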
An Attention Free Transformer
TLDR
Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot-product self-attention, is introduced and demonstrates competitive performance on all the benchmarks while providing excellent efficiency.
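AFT replaces pairwise dot-product scores with element-wise gating, so no (n, n) attention matrix is ever formed. A minimal sketch in the spirit of the "AFT-simple" variant (the full model additionally learns position biases):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_simple(Q, K, V):
    """AFT-simple style attention-free mixing (no position biases).

    K is softmax-normalized over the time axis, per feature; every position
    then receives the same weighted average of V, gated element-wise by
    sigmoid(Q). Cost is linear in sequence length.
    """
    w = np.exp(K - K.max(axis=0, keepdims=True))   # (n, d)
    w = w / w.sum(axis=0, keepdims=True)           # softmax over time
    context = (w * V).sum(axis=0, keepdims=True)   # (1, d)
    return sigmoid(Q) * context                    # (n, d)

rng = np.random.default_rng(2)
Q, K, V = (rng.standard_normal((12, 6)) for _ in range(3))
print(aft_simple(Q, K, V).shape)   # (12, 6)
```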
Efficient pre-training objectives for Transformers
TLDR
The experiments show that it is possible to efficiently train BERT-like models using a discriminative approach as in ELECTRA but without a complex generator and that ELECTRA largely benefits from a deep hyper-parameter search.
Thinking Like Transformers
TLDR
This paper proposes a computational model for the transformer-encoder in the form of a programming language, the Restricted Access Sequence Processing Language (RASP), and shows how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
Random Feature Attention
TLDR
RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, is proposed and explored, showing that RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets.
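RFA builds on random Fourier features: the exponential (softmax) kernel exp(q·k) is approximated by an inner product of randomized feature maps, which lets attention be computed in linear time by reassociating the matrix products. A hedged NumPy sketch (feature count and scaling are illustrative, and the estimator is noisy without the paper's additional gating):

```python
import numpy as np

def random_feature_map(X, W):
    """phi(x) such that phi(q) . phi(k) approximates exp(q . k).

    Uses exp(q.k) = exp(|q|^2/2) exp(|k|^2/2) exp(-|q-k|^2/2) together with
    a random Fourier feature estimate of the Gaussian factor, W ~ N(0, I).
    """
    D = W.shape[0]
    proj = X @ W.T                                                  # (n, D)
    feats = np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(D)
    return np.exp((X ** 2).sum(-1, keepdims=True) / 2.0) * feats    # (n, 2D)

def rfa_attention(Q, K, V, num_features=256, seed=0):
    """Linear-complexity attention via random features (unnormalized sketch).

    Scaling Q and K by d**-0.25 makes phi(q) . phi(k) approximate
    exp(q . k / sqrt(d)). The estimate can be noisy for small num_features.
    """
    d = Q.shape[-1]
    W = np.random.default_rng(seed).standard_normal((num_features, d))
    q = random_feature_map(Q / d ** 0.25, W)
    k = random_feature_map(K / d ** 0.25, W)
    num = q @ (k.T @ V)                  # reassociated: never forms (n, n)
    den = q @ k.sum(axis=0)              # approximate softmax normalizer
    return num / (den[:, None] + 1e-6)

rng = np.random.default_rng(3)
Q, K, V = (rng.standard_normal((128, 16)) for _ in range(3))
print(rfa_attention(Q, K, V).shape)      # (128, 16)
```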
Chasing Sparsity in Vision Transformers: An End-to-End Exploration
TLDR
This paper launches and reports the first-of-its-kind comprehensive exploration of integrating sparsity in ViTs "from end to end" by dynamically extracting and training sparse subnetworks while sticking to a fixed small parameter budget.
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
TLDR
This work describes an efficient hierarchical method to compute attention in the Transformer architecture that exploits a matrix structure similar to the Hierarchical Matrix developed by the numerical analysis community, and has linear run time and memory complexity.
∞-former: Infinite Memory Transformer
Transformers struggle when attending to long contexts, since the amount of computation grows with the context length, and therefore they cannot model long-term memories effectively. Several…

References

Showing 1-10 of 56 references
Linformer: Self-Attention with Linear Complexity
TLDR
This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
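The low-rank idea can be sketched by projecting the sequence (length) dimension of the keys and values down to a fixed k with matrices E and F, so the score matrix is (n, k) instead of (n, n). In the sketch below E and F are random stand-ins for Linformer's learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention with projected keys and values.

    E, F: (k, n) projections applied along the sequence axis, so the score
    matrix is (n, k): linear in sequence length n for a fixed k.
    """
    d = Q.shape[-1]
    K_proj = E @ K                          # (k, d)
    V_proj = F @ V                          # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)      # (n, k)
    return softmax(scores) @ V_proj         # (n, d)

n, k, d = 1024, 64, 32
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
print(linformer_attention(Q, K, V, E, F).shape)   # (1024, 32)
```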
Reformer: The Efficient Transformer
TLDR
This work replaces dot-product attention by one that uses locality-sensitive hashing and uses reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of several times, making the model much more memory-efficient and much faster on long sequences.
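The locality-sensitive hashing step can be illustrated with angular LSH: project onto a random matrix and take the argmax over the signed projections, so vectors with high cosine similarity tend to fall in the same bucket, and attention is then restricted to (chunks of) each bucket. A hedged single-round sketch that omits Reformer's multi-round hashing and chunking details:

```python
import numpy as np

def lsh_buckets(X, n_buckets, seed=0):
    """Assign an angular-LSH bucket to each row of X (single hash round).

    Project onto a random matrix with n_buckets // 2 columns and take the
    argmax over the concatenation [proj, -proj]. Similar vectors tend to
    share a bucket, so attention only needs to be computed within buckets.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[-1], n_buckets // 2))
    proj = X @ R                                          # (n, n_buckets // 2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

x = np.random.default_rng(2).standard_normal((16, 8))
print(lsh_buckets(x, n_buckets=4))   # one bucket id per position
```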
Compressive Transformers for Long-Range Sequence Modelling
TLDR
The Compressive Transformer is presented, an attentive sequence model which compresses past memories for long-range sequence learning and can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task.
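The compression function can be as simple as pooling the oldest memories at a fixed rate before they would otherwise be discarded. A hedged sketch using mean-pooling (the paper also studies convolutional and max-pooling compressors, among others):

```python
import numpy as np

def compress_memories(old_memories, rate=3):
    """Compress a block of old memories by mean-pooling groups of `rate`.

    old_memories: (m, d). Returns (m // rate, d) compressed memories that
    would be appended to a secondary compressed memory rather than being
    discarded outright, as in the Compressive Transformer.
    """
    m, d = old_memories.shape
    usable = (m // rate) * rate
    return old_memories[:usable].reshape(-1, rate, d).mean(axis=1)

mem = np.random.default_rng(6).standard_normal((9, 4))
print(compress_memories(mem, rate=3).shape)   # (3, 4)
```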
Universal Transformers
TLDR
The Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses issues of parallelizability and global receptive field, is proposed.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
The Evolved Transformer
TLDR
The Progressive Dynamic Hurdles method is developed to dynamically allocate more resources to more promising candidate models on the computationally expensive WMT 2014 English-German translation task, and the resulting architecture demonstrates consistent improvement over the Transformer on four well-established language tasks.
Augmenting Self-attention with Persistent Memory
TLDR
A new model consisting solely of attention layers is proposed, which augments the self-attention layers with persistent memory vectors that play a role similar to the feed-forward layer.
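The persistent vectors are learned key/value pairs shared across positions and concatenated to the sequence's keys and values, letting attention absorb the role of the feed-forward sublayer. A hedged sketch in which random placeholders stand in for the learned persistent parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_persistent_memory(Q, K, V, Kp, Vp):
    """Self-attention over the sequence plus N persistent key/value vectors.

    Kp, Vp: (N, d) persistent vectors (learned parameters in the paper;
    random placeholders here). They are shared across all positions and
    play a role analogous to the feed-forward sublayer.
    """
    d = Q.shape[-1]
    K_all = np.concatenate([K, Kp], axis=0)
    V_all = np.concatenate([V, Vp], axis=0)
    scores = Q @ K_all.T / np.sqrt(d)        # (n, n + N)
    return softmax(scores) @ V_all           # (n, d)

rng = np.random.default_rng(7)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
Kp, Vp = (rng.standard_normal((4, 8)) for _ in range(2))
print(attention_with_persistent_memory(Q, K, V, Kp, Vp).shape)   # (10, 8)
```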
Axial Attention in Multidimensional Transformers
TLDR
Axial Transformers is proposed, a self-attention-based autoregressive model for images and other data organized as high dimensional tensors that maintains both full expressiveness over joint distributions over data and ease of implementation with standard deep learning frameworks, while requiring reasonable memory and computation.
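Axial attention factorizes full attention over a grid into attention along one axis at a time, reducing the cost for an H x W grid from O((HW)^2) to O(HW(H + W)). A hedged sketch that attends along rows and then columns, using the inputs directly as queries, keys, and values:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    # Batched scaled dot-product attention along the second-to-last axis.
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def axial_attention(X):
    """Axial self-attention over an (H, W, d) grid (Q = K = V = X here).

    Attention is applied within each row and then within each column,
    instead of over all H * W positions jointly.
    """
    X = attend(X, X, X)                  # attend along the width axis
    Xt = np.swapaxes(X, 0, 1)            # (W, H, d)
    Xt = attend(Xt, Xt, Xt)              # attend along the height axis
    return np.swapaxes(Xt, 0, 1)

print(axial_attention(np.random.default_rng(3).standard_normal((6, 5, 4))).shape)
# (6, 5, 4)
```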
Generating Long Sequences with Sparse Transformers
TLDR
This paper introduces sparse factorizations of the attention matrix which reduce the quadratic complexity to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
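One of the factorized patterns, the strided pattern, lets each position attend to a local window plus every stride-th earlier position, giving roughly O(sqrt(n)) attended positions per row when the stride is about sqrt(n). A hedged sketch of the corresponding boolean mask:

```python
import numpy as np

def strided_sparse_mask(n, stride):
    """Boolean (n, n) causal mask for a strided sparse attention pattern.

    Position i may attend to j <= i if j lies within the previous `stride`
    positions (local part) or if (i - j) is a multiple of `stride`
    (strided part). With stride near sqrt(n), each row has O(sqrt(n)) ones.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride
    strided = (i - j) % stride == 0
    return causal & (local | strided)

print(strided_sparse_mask(8, 3).astype(int))
```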
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
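One of the two techniques, factorized embedding parameterization, is easy to quantify: instead of a V x H embedding table, ALBERT uses a V x E table followed by an E x H projection with E much smaller than H. A small arithmetic sketch with illustrative BERT-base-like sizes:

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Parameter count for a (possibly factorized) embedding table.

    Without factorization: vocab_size * hidden_size.
    With ALBERT-style factorization: vocab_size * embed_size
                                     + embed_size * hidden_size.
    """
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size

V, H, E = 30_000, 768, 128
print(embedding_params(V, H))      # 23,040,000 parameters
print(embedding_params(V, H, E))   # 3,938,304 parameters (about 5.8x fewer)
```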