Corpus ID: 207852415

On the Relationship between Self-Attention and Convolutional Layers

Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that…
[Re]: On the Relationship between Self-Attention and Convolutional Layers
In this report, we perform a detailed study on the paper "On the Relationship between Self-Attention and Convolutional Layers", which provides theoretical and experimental evidence that self…
Transformed CNNs: recasting pre-trained convolutional layers with self-attention
Initializing self-attention layers as convolutional layers reduces the time spent training them and enables a smooth transition from any pretrained CNN to its functionally identical hybrid model, called Transformed CNN (T-CNN).
Less is More: Pay Less Attention in Vision Transformers
A hierarchical Transformer is proposed in which pure multi-layer perceptrons (MLPs) encode rich local patterns in the early stages, while self-attention modules capture longer dependencies in deeper layers.
Do Vision Transformers See Like Convolutional Neural Networks?
Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior…
ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks
This work is the first attempt to use a subspace attention mechanism to increase the efficiency of compact CNNs, and argues that learning separate attention maps for each feature subspace enables multi-scale and multi-frequency feature representation, which is more desirable for fine-grained image classification.
KVT: k-NN Attention for Boosting Vision Transformers
A sparse attention scheme, dubbed k-NN attention, which naturally inherits the local bias of CNNs without introducing convolutional operations, allows for the exploration of long-range correlations, and at the same time filters out irrelevant tokens by choosing the most similar tokens from the entire image.
X-volution: On the unification of convolution and self-attention
This work theoretically derives a global self-attention approximation scheme, which approximates self-attention via the convolution operation on transformed features, and establishes a multi-branch elementary module composed of both convolution and self-attention operations, capable of unifying both local and non-local feature interaction.
Multi-Head Attention: Collaborate Instead of Concatenate
A collaborative multi-head attention layer that enables heads to learn shared projections, improves the computational cost and number of parameters in an attention layer, and can be used as a drop-in replacement in any transformer architecture.
RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?
The results indicate that MLP-based models have the potential to replace CNNs by adopting inductive bias, and that the proposed model, named RaftMLP, has a good balance of computational complexity, number of parameters, and actual memory usage.
Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models
Composite attention is proposed, which unites previous relative position embedding methods under a convolutional framework, and it is found that convolutions consistently improve performance on multiple downstream tasks, replacing absolute position embeddings.


Model compression is significant for the wide adoption of Recurrent Neural Networks (RNNs) in both user devices possessing limited resources and business clusters requiring quick responses to…
Unsupervised Scalable Representation Learning for Multivariate Time Series
This paper combines an encoder based on causal dilated convolutions with a novel triplet loss employing time-based negative sampling, obtaining general-purpose representations for variable-length and multivariate time series.