Corpus ID: 207852415

On the Relationship between Self-Attention and Convolutional Layers

@inproceedings{cordonnier2020relationship,
  title={On the Relationship between Self-Attention and Convolutional Layers},
  author={Jean-Baptiste Cordonnier and Andreas Loukas and Martin Jaggi},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2020}
}
Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that… 
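The paper's core construction can be illustrated with a toy numerical sketch (not the authors' code; shapes and variable names are illustrative): a multi-head self-attention layer in which each head attends with probability 1 to the pixel at a fixed relative shift, with one head per kernel position, computes exactly a K×K convolution.

```python
import numpy as np

# Hedged sketch of the convolution-as-attention construction: one attention
# head per kernel offset, hard one-hot attention on the shifted pixel, and
# per-head value projections taken from the conv kernel. Stride 1, zero padding.
rng = np.random.default_rng(0)
H, W, C_in, C_out, K = 6, 6, 3, 4, 3
X = rng.standard_normal((H, W, C_in))
kernel = rng.standard_normal((K, K, C_in, C_out))

# Reference: ordinary KxK convolution with zero padding.
pad = K // 2
Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
conv = np.zeros((H, W, C_out))
for i in range(H):
    for j in range(W):
        patch = Xp[i:i + K, j:j + K]              # (K, K, C_in) receptive field
        conv[i, j] = np.einsum('abc,abcd->d', patch, kernel)

# "Attention" version: the head for shift (dh, dw) attends only to the pixel
# at that offset; its value projection is the kernel slice for that offset.
# Summing over heads reproduces the convolution output.
attn = np.zeros((H, W, C_out))
for dh in range(-pad, pad + 1):
    for dw in range(-pad, pad + 1):
        W_val = kernel[dh + pad, dw + pad]         # (C_in, C_out) head projection
        shifted = Xp[pad + dh:pad + dh + H, pad + dw:pad + dw + W]
        attn += shifted @ W_val

print(np.allclose(conv, attn))  # → True: both layers compute the same map
```

In the paper this hard attention pattern is realized in the limit of a quadratic relative positional encoding; the sketch above skips the encoding and applies the one-hot pattern directly.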


[Re]: On the Relationship between Self-Attention and Convolutional Layers
A new variant of the attention operation, Hierarchical Attention, is proposed, which shows significantly improved performance with fewer parameters, validating the hypothesis that self-attention layers can behave like convolutional layers.
Transformed CNNs: recasting pre-trained convolutional layers with self-attention
Initializing self-attention layers as convolutional layers reduces the time spent training them and enables a smooth transition from any pretrained CNN to a functionally identical hybrid model, called a Transformed CNN (T-CNN).
Can Vision Transformers Perform Convolution?
This work constructively proves that a single ViT layer with image patches as input can perform any convolution operation, with the multi-head attention mechanism and relative positional encoding playing essential roles.
Pruning Self-attentions into Convolutional Layers in Single Path
A novel weight-sharing scheme between MSA and convolutional operations is proposed, yielding a single-path space that encodes all candidate operations and casts the operation search problem as finding which subset of parameters to use in each MSA layer, significantly reducing computational cost and optimization difficulty.
Less is More: Pay Less Attention in Vision Transformers
A hierarchical Transformer is proposed in which pure multi-layer perceptrons (MLPs) encode rich local patterns in the early stages, while self-attention modules capture longer-range dependencies in deeper layers.
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
GPSA is introduced, a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias and outperforms the DeiT on ImageNet, while offering a much improved sample efficiency.
Scaling Local Self-Attention for Parameter Efficient Visual Backbones
A new self-attention model family, HaloNets, is developed that reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark; preliminary transfer-learning experiments find that HaloNet models outperform much larger models while offering better inference performance.
RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?
The small model, RaftMLP-S, is comparable to the state-of-the-art global MLP-based model in parameter count and computational efficiency, and the fixed-input-resolution problem of global MLP-based models is tackled by using bicubic interpolation.
Do Vision Transformers See Like Convolutional Neural Networks?
Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks reveals striking differences between the two architectures, such as ViTs having more uniform representations across all layers, driven by residual connections that strongly propagate features from lower to higher layers.
KVT: k-NN Attention for Boosting Vision Transformers
A sparse attention scheme, dubbed k-NN attention, is proposed that naturally inherits the local bias of CNNs without introducing convolutional operations; it allows exploration of long-range correlations and filters out irrelevant tokens by choosing the most similar tokens from the entire image.
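The k-NN attention idea above can be sketched in a few lines (a hedged illustration, not the KVT implementation; the function name and shapes are assumptions): for each query, keep only the top-k most similar keys and softmax over those, zeroing out the rest.

```python
import numpy as np

def knn_attention(Q, K_mat, V, k):
    """Toy k-NN attention: Q, K_mat are (n, d); V is (n, d_v).
    Each query attends only to its k highest-scoring keys."""
    scores = Q @ K_mat.T / np.sqrt(Q.shape[1])          # (n, n) scaled similarities
    topk = np.argpartition(scores, -k, axis=1)[:, -k:]  # indices of top-k keys per query
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, topk, 0.0, axis=1)          # 0 inside top-k, -inf outside
    masked = scores + mask
    w = np.exp(masked - masked.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                   # softmax over surviving keys only
    return w @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))
Km = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
out = knn_attention(Q, Km, V, k=2)  # each output row mixes only 2 value rows
```

With k equal to the sequence length this reduces to ordinary dense softmax attention, which is a convenient sanity check.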


This work aims to learn structurally-sparse Long Short-Term Memory by reducing the sizes of basic structures within LSTM units, including input updates, gates, hidden states, cell states and outputs, by proposing Intrinsic Sparse Structures (ISS) in LSTMs.
Unsupervised Scalable Representation Learning for Multivariate Time Series
This paper combines an encoder based on causal dilated convolutions with a novel triplet loss employing time-based negative sampling, obtaining general-purpose representations for variable length and multivariate time series.