Corpus ID: 232134936

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

@article{Dong2021AttentionIN,
  title={Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth},
  author={Yihe Dong and Jean-Baptiste Cordonnier and Andreas Loukas},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.03404}
}
Attention-based architectures have become ubiquitous in machine learning. Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, or paths, each involving the operation of a sequence of attention heads across layers. Using this path decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity": without skip connections or MLPs, the output converges doubly exponentially with depth to a rank-1 matrix in which all token representations coincide, whereas skip connections and MLPs counteract this degeneration.
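The claim lends itself to a quick numerical check. Below is a minimal NumPy sketch (not the authors' code; all names and sizes are illustrative) that stacks pure single-head self-attention layers, with no skip connections or MLPs, on random inputs and tracks how far the representation is from the nearest matrix with identical rows, the "token uniform" limit described above.

```python
# Minimal sketch (not the paper's code): iterate pure single-head
# self-attention, without skip connections or MLPs, on random inputs and
# watch the distance to the nearest "token uniform" (rank-1) matrix shrink.
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 64                                       # tokens, embedding width

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(X, Wq, Wk, Wv):
    A = softmax(X @ Wq @ (X @ Wk).T / np.sqrt(d))   # (n, n) row-stochastic
    return A @ X @ Wv                               # no skip, no MLP

def rank1_residual(X):
    # Distance from X to the closest matrix whose rows are all identical.
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True))

X = rng.standard_normal((n, d))
for layer in range(1, 9):
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    X = attention_layer(X, Wq, Wk, Wv)
    X = X / np.linalg.norm(X)                       # rescale for comparability
    print(f"layer {layer}: relative rank-1 residual = {rank1_residual(X):.2e}")
```

Adding a skip connection in the update (X = X + attention_layer(...)) should keep this residual from collapsing, mirroring the contrast the paper draws.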

Citations

Not All Attention Is All You Need
TLDR
It is demonstrated that lighter state-of-the-art models with nearly 80% of their self-attention layers pruned can achieve even better results on multiple tasks, including natural language understanding, document classification, named entity recognition and POS tagging, with nearly twice as fast inference.
Is Attention Better Than Matrix Decomposition?
TLDR
A series of "Hamburgers" is proposed, in which optimization algorithms for solving matrix decompositions (MDs) are employed to factorize the input representations into sub-matrices and reconstruct a low-rank embedding, serving as the basis for designing global information blocks.
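As a rough illustration of the decomposition step that summary describes, the sketch below (illustrative only; the function name, rank and iteration count are my own choices, not the paper's configuration) factorizes a feature matrix with a few non-negative matrix factorization (NMF) multiplicative updates and returns the low-rank reconstruction that would serve as global context.

```python
# Illustrative sketch only: recover a low-rank "global context" for a
# feature matrix X (d x n) via a few NMF multiplicative updates, in the
# spirit of the matrix-decomposition step summarized above.
import numpy as np

def nmf_global_context(X, r=8, iters=6, eps=1e-6):
    """Return a rank-r non-negative reconstruction D @ C of X (d x n)."""
    rng = np.random.default_rng(0)
    d, n = X.shape
    X = np.maximum(X, 0.0)                    # NMF requires non-negative input
    D = rng.random((d, r)) + eps              # dictionary (bases)
    C = rng.random((r, n)) + eps              # codes (coefficients)
    for _ in range(iters):
        C *= (D.T @ X) / (D.T @ D @ C + eps)  # standard multiplicative updates
        D *= (X @ C.T) / (D @ C @ C.T + eps)
    return D @ C                              # low-rank embedding of X

X = np.random.default_rng(1).random((64, 196))       # e.g. 64 channels, 14x14 map
print(np.linalg.matrix_rank(nmf_global_context(X)))  # about 8
```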
EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation
  • Chenhe Dong, Guangrun Wang, Hang Xu, Jiefeng Peng, Xiaozhe Ren, Xiaodan Liang
  • Computer Science
  • ArXiv
  • 2021
Pre-trained language models have shown remarkable results on various NLP tasks. Nevertheless, due to their bulky size and slow inference speed, it is hard to deploy them on edge devices.
GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures
TLDR
This work demonstrates a set of modifications to the structure of a Transformer layer that produce a more efficient architecture, applies the resulting architecture to language representation learning, and shows its superior performance compared to BERT models of different scales.
AutoBERT-Zero: Evolving BERT Backbone from Scratch
  • Jiahui Gao, Hang Xu, +5 authors Zhenguo Li
  • Computer Science
  • ArXiv
  • 2021
TLDR
This work proposes an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures and designs a Bi-branch Weight-Sharing (BIWS) training strategy for fast model evaluation.
Incorporating Residual and Normalization Layers into Analysis of Masked Language Models
Transformer architecture has become ubiquitous in the natural language processing field. To interpret Transformer-based models, their attention patterns have been extensively analyzed.
Refiner: Refining Self-attention for Vision Transformers
TLDR
This work introduces a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs, and explores attention expansion, which projects the multi-head attention maps into a higher-dimensional space to promote their diversity.
Bidirectional Attention Flow with Self-Attention
  • 2021
This work extends the BiDAF model with various optimization techniques on the SQuAD 2.0 dataset, adding character embeddings and multi-head self-attention to the model.
Generative Flows with Invertible Attentions
TLDR
This paper proposes map-based and scaled dot-product attention for unconditional and conditional generative flow models, exploiting split-based attention mechanisms to learn the attention weights and input representations on every two splits of the flow feature maps.
Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms
TLDR
The results on ImageNet and COCO show the superiority of LESA over convolution and self-attention baselines for the tasks of image recognition, object detection, and instance segmentation.

References

SHOWING 1-10 OF 58 REFERENCES
Linformer: Self-Attention with Linear Complexity
TLDR
This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
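A minimal sketch of the linear-complexity mechanism that summary describes: the length-n keys and values are projected down to a fixed length k with learned matrices before attention, so the score matrix is n-by-k rather than n-by-n. All shapes and names below are illustrative.

```python
# Sketch of low-rank, Linformer-style attention: keys and values are
# projected from sequence length n down to a fixed k, so the attention
# matrix is n x k instead of n x n.  Shapes/names are illustrative.
import torch

n, d, k = 512, 64, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
E = torch.randn(k, n) / n ** 0.5              # learned projection for keys
F = torch.randn(k, n) / n ** 0.5              # learned projection for values

scores = Q @ (E @ K).T / d ** 0.5             # (n, k) instead of (n, n)
out = torch.softmax(scores, dim=-1) @ (F @ V) # (n, d)
print(out.shape)                              # torch.Size([512, 64])
```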
Multi-Head Attention: Collaborate Instead of Concatenate
TLDR
A collaborative multi-head attention layer is proposed that enables heads to learn shared projections, reducing the computational cost and number of parameters of an attention layer, and can be used as a drop-in replacement in any transformer architecture.
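A hedged sketch of the sharing idea as I read that summary: all heads use one query projection and one key projection, and each head differs only through a per-head mixing vector that re-weights the shared dimensions. Sizes and names are illustrative, not the paper's.

```python
# Illustrative sketch of "collaborative" heads: a single shared query/key
# projection, with each head h re-weighting the shared dimensions through
# its own mixing vector M[h].  Not the paper's implementation.
import torch

n, d, d_shared, heads = 128, 64, 96, 8
X = torch.randn(n, d)
Wq = torch.randn(d, d_shared) / d ** 0.5     # shared query projection
Wk = torch.randn(d, d_shared) / d ** 0.5     # shared key projection
M = torch.rand(heads, d_shared)              # per-head mixing vectors

Q, K = X @ Wq, X @ Wk                        # computed once for all heads
attn = [torch.softmax((Q * M[h]) @ K.T / d_shared ** 0.5, dim=-1)
        for h in range(heads)]               # one (n, n) map per head
print(len(attn), attn[0].shape)
```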
Synthesizer: Rethinking Self-Attention in Transformer Models
TLDR
The true importance and contribution of the dot-product-based self-attention mechanism to the performance of Transformer models is investigated, and a model that learns synthetic attention weights without token-token interactions, called Synthesizer, is proposed.
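For concreteness, here is a hedged sketch of one way to synthesize attention weights without token-token dot products (a dense-synthesizer-style head); the layer sizes and names are my own choices, not the paper's configuration.

```python
# Sketch of a dense-synthesizer-style head: each token predicts its own row
# of attention logits from its embedding alone; no query-key dot products.
import torch
import torch.nn.functional as F

n, d = 128, 64
X = torch.randn(n, d)
W1 = torch.randn(d, d) / d ** 0.5
W2 = torch.randn(d, n) / d ** 0.5          # maps each token to n attention logits
Wv = torch.randn(d, d) / d ** 0.5

B = F.relu(X @ W1) @ W2                    # (n, n) synthetic attention logits
Y = torch.softmax(B, dim=-1) @ (X @ Wv)    # (n, d) output
print(Y.shape)
```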
The Shattered Gradients Problem: If resnets are the answer, then what is the question?
TLDR
It is shown that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise, whereas, in contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly.
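A rough sketch of how that gradient correlation could be measured on a toy 1-D input (my own setup, with arbitrary widths and depths; not the paper's protocol): compute the input-gradient of a deep plain ReLU network and of a residual network along a grid of inputs, then compare how correlated neighbouring gradients are.

```python
# Toy measurement of gradient "shattering": correlation between input
# gradients at neighbouring 1-D inputs, for a plain deep ReLU net vs. a
# residual net.  Depth/width choices are arbitrary, for illustration only.
import torch
import torch.nn as nn

def make_net(depth, width=64, residual=False):
    class Block(nn.Module):
        def __init__(self):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(width, width), nn.ReLU())
        def forward(self, h):
            return h + self.f(h) if residual else self.f(h)
    return nn.Sequential(nn.Linear(1, width), nn.ReLU(),
                         *[Block() for _ in range(depth)],
                         nn.Linear(width, 1))

xs = torch.linspace(-2.0, 2.0, 256).unsqueeze(1).requires_grad_(True)
for residual in (False, True):
    torch.manual_seed(0)
    net = make_net(depth=30, residual=residual)
    (grad,) = torch.autograd.grad(net(xs).sum(), xs)
    g = grad.squeeze()
    corr = torch.corrcoef(torch.stack([g[:-1], g[1:]]))[0, 1]
    print("residual" if residual else "plain   ",
          f"neighbour-gradient correlation: {corr.item():.3f}")
```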
Limits to Depth Efficiencies of Self-Attention
TLDR
By identifying network width as a limiting factor, the analysis indicates that solutions for dramatically increasing the width can facilitate the next leap in self-attention expressivity.
On the Relationship between Self-Attention and Convolutional Layers
TLDR
This work proves that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer, which provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
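Since the rank-collapse result above concerns exactly this building block, here is a small reference sketch of scaled dot-product attention for a single head (standard formulation, with illustrative shapes).

```python
# Scaled dot-product attention for a single head:
#   Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ V

n, d_k = 10, 64
Q, K, V = (torch.randn(n, d_k) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # torch.Size([10, 64])
```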
Attention Augmented Convolutional Networks
TLDR
It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile constrained network, while keeping the number of parameters similar.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
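A hedged sketch of the two parameter-reduction ideas that summary mentions, factorized embedding parameterization and cross-layer parameter sharing; all sizes below are illustrative, not ALBERT's actual configuration.

```python
# Sketch of ALBERT-style parameter reduction (illustrative sizes):
# (1) factorized embeddings: vocab -> small E, then project E -> hidden H;
# (2) cross-layer sharing: one encoder layer's weights reused at every depth.
import torch
import torch.nn as nn

vocab, E, H, L = 30000, 128, 768, 12

embed = nn.Sequential(nn.Embedding(vocab, E),  # V*E parameters
                      nn.Linear(E, H))         # plus E*H, instead of V*H
shared_layer = nn.TransformerEncoderLayer(d_model=H, nhead=12, batch_first=True)

tokens = torch.randint(0, vocab, (1, 16))      # (batch, sequence length)
x = embed(tokens)
for _ in range(L):                             # same weights at every layer
    x = shared_layer(x)
print(x.shape)                                 # torch.Size([1, 16, 768])
```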
Identity Mappings in Deep Residual Networks
TLDR
The propagation formulations behind the residual building blocks suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation.
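The propagation formulations that summary refers to can be written compactly (reproduced here from the pre-activation residual analysis; $x_l$ is the input to residual unit $l$, $\mathcal{F}$ its residual function, and $\mathcal{E}$ the loss):

$$x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i), \qquad \frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)\right),$$

so the forward signal $x_l$ and the gradient $\partial\mathcal{E}/\partial x_L$ both reach any earlier block additively, without passing through a product of weight matrices.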