Corpus ID: 232134936

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
Attention-based architectures have become ubiquitous in machine learning. Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, or paths, each involving the operation of a sequence of attention heads across layers. Using this path decomposition, we prove that self-attention possesses a strong inductive bias towards “token uniformity”…
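
The "token uniformity" claim is easy to observe numerically. Below is a minimal numpy sketch (not the paper's path decomposition; a toy single-head network with arbitrarily chosen sizes) that repeatedly applies softmax self-attention without skip connections or MLPs and tracks how far the representations are from a rank-one, all-rows-identical matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16                        # tokens, feature dimension (arbitrary)
X = rng.standard_normal((n, d))     # initial token representations

def residual_to_rank_one(X):
    """Frobenius distance of X from the rank-one matrix ones @ x_bar,
    i.e. from perfect "token uniformity" (all rows identical)."""
    x_bar = X.mean(axis=0, keepdims=True)
    return np.linalg.norm(X - np.ones((X.shape[0], 1)) @ x_bar)

residuals = [residual_to_rank_one(X)]
for _ in range(20):
    logits = X @ X.T / np.sqrt(d)                         # toy attention scores
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                     # row-stochastic attention
    X = A @ X                                             # pure attention: no skips, no MLP
    residuals.append(residual_to_rank_one(X))

print(residuals[0], residuals[-1])  # the residual shrinks rapidly with depth
```

This only illustrates the direction of the bias; the paper's result is the precise doubly exponential rate, and its counterpart finding that skip connections and MLPs counteract the collapse.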


Not All Attention Is All You Need

This paper proposes a novel dropout method named AttendOut to make self-attention-empowered PrLMs capable of more robust task-specific tuning, and demonstrates that state-of-the-art models with elaborate training design may achieve much stronger results.

On the Expressive Power of Self-Attention Matrices

It is shown that the self-attention matrix can provably approximate sparse matrices, where sparsity is in terms of a bounded number of nonzero elements in each row and column, and that, in order to approximate any sparse matrix up to a given precision defined in terms of preserving matrix element ratios, d grows only logarithmically with the sequence length L.

Sinkformers: Transformers with Doubly Stochastic Attention

This paper proposes to use Sinkhorn’s algorithm to make attention matrices doubly stochastic, and shows that Sinkformers enhance model accuracy in vision and natural language processing tasks and lead to a significant improvement on 3D shape classification.
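
Sinkhorn's algorithm itself is simple: alternately normalize rows and columns until both sum to one. A minimal numpy sketch, working in log space for stability (the iteration count and matrix size are arbitrary, and real Sinkformers apply this to attention logits inside a network):

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Alternately normalize rows and columns (in log space, for stability)
    so that exp(Z) approaches a doubly stochastic matrix."""
    Z = logits.copy()
    for _ in range(n_iters):
        Z = Z - np.logaddexp.reduce(Z, axis=1, keepdims=True)  # rows sum to 1
        Z = Z - np.logaddexp.reduce(Z, axis=0, keepdims=True)  # columns sum to 1
    return np.exp(Z)

rng = np.random.default_rng(0)
A = sinkhorn(rng.standard_normal((5, 5)))
print(A.sum(axis=0), A.sum(axis=1))  # both approximately vectors of ones
```

Note that one softmax is exactly a single row-normalization step; Sinkhorn iterations can be seen as extending softmax toward double stochasticity.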

Is Attention Better Than Matrix Decomposition?

A series of Hamburgers is proposed, in which the optimization algorithms for solving matrix decompositions (MDs) are employed to factorize the input representations into sub-matrices and reconstruct a low-rank embedding, to help design global information blocks.

Pruning Self-attentions into Convolutional Layers in Single Path

A novel weight-sharing scheme between MSA and convolutional operations is proposed, delivering a single-path space that encodes all candidate operations and casts the operation search problem as choosing which subset of parameters to use in each MSA layer, reducing both the computational and the optimization cost.

The Quarks of Attention

This work identifies and studies the three most important mechanisms of attention: additive activation attention, multiplicative output attention (output gating), and multiplicative synaptic attention (synaptic gating).

Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence

A transformer architecture with a mitigatory self-attention mechanism is proposed: direct mapping connections are applied in the transformer architecture to mitigate rank collapse, counteracting the loss of feature expressiveness and enhancing model performance.

Rank Diminishing in Deep Neural Networks

This work theoretically establishes a universal monotonic decreasing property of network rank from the basic rules of differential and algebraic composition, and uncovers rank deficiency of network blocks and deep function coupling in deep neural networks.

AutoBERT-Zero: Evolving BERT Backbone from Scratch

This work makes the first attempt to automatically discover novel pre-trained language model (PLM) backbone on a flexible search space containing the most fundamental operations from scratch and proposes an Operation-Priority Neural Architecture Search (OP-NAS) algorithm, which optimizes both the search algorithm and evaluation of candidate models.

On The Computational Complexity of Self-Attention

It is proved that the time complexity of self-attention is necessarily quadratic in the input length unless the Strong Exponential Time Hypothesis (SETH) is false; this holds even if the attention computation is performed only approximately, and for a variety of attention mechanisms.



Linformer: Self-Attention with Linear Complexity

This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
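
The construction can be sketched in a few lines of numpy: two projection matrices compress the length axis of the keys and values from n to k before attention, so the score matrix is n×k rather than n×n. Here E and F stand in for Linformer's learned projections but are random, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 512, 64, 32               # sequence length, head dim, projected length

Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Stand-ins for Linformer's learned projections E, F: compress length n -> k.
E = rng.standard_normal((k, n)) / np.sqrt(n)
F = rng.standard_normal((k, n)) / np.sqrt(n)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Scores are n x k, not n x n: O(n*k) time and memory instead of O(n^2).
out = softmax(Q @ (E @ K).T / np.sqrt(d)) @ (F @ V)
print(out.shape)  # (512, 64): same shape as full attention's output
```

Because k is a fixed constant independent of n, the cost is linear in sequence length.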

Multi-Head Attention: Collaborate Instead of Concatenate

A collaborative multi-head attention layer is proposed that enables heads to learn shared projections, reduces the computational cost and parameter count of an attention layer, and can be used as a drop-in replacement in any transformer architecture.

Big Bird: Transformers for Longer Sequences

It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.

Batch normalization provably avoids ranks collapse for randomly initialised deep networks

This work highlights the fact that batch normalization is an effective strategy to avoid rank collapse for both linear and ReLU networks, and derives a meaningful lower rank bound in deep linear networks.

Synthesizer: Rethinking Self-Attention in Transformer Models

The true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models is investigated and a model that learns synthetic attention weights without token-token interactions is proposed, called Synthesizer.

The Shattered Gradients Problem: If resnets are the answer, then what is the question?

It is shown that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise, whereas, in contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly.

Limits to Depth Efficiencies of Self-Attention

By identifying network width as a limiting factor, the analysis indicates that solutions for dramatically increasing the width can facilitate the next leap in self-attention expressivity.

On the Relationship between Self-Attention and Convolutional Layers

This work proves that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer, which provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

This work expresses the self-attention as a linear dot-product of kernel feature maps and makes use of the associativity property of matrix products to reduce the complexity from O(N^2) to O(N), where N is the sequence length.
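
The associativity trick can be illustrated directly: grouping the product as phi(Q) (phi(K)^T V) gives the same result as (phi(Q) phi(K)^T) V but never materializes the N×N attention matrix. A small numpy sketch using the paper's elu(x)+1 feature map (unnormalized attention; the paper also divides by phi(Q) (phi(K)^T 1), omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 256, 32                       # sequence length, head dimension
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

def phi(x):
    """elu(x) + 1, a positive feature map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

# Quadratic grouping: materializes an N x N matrix.
full = (phi(Q) @ phi(K).T) @ V
# Linear grouping: only a d x d intermediate; cost scales linearly in N.
fast = phi(Q) @ (phi(K).T @ V)

print(np.allclose(full, fast))  # True: matrix multiplication is associative
```

For autoregressive decoding, the same rearrangement turns the transformer into a recurrence over the running sums phi(K)^T V, which is the "Transformers are RNNs" observation in the title.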

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.