Corpus ID: 233864910

Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks

@article{Guo2021BeyondSE,
  title={Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks},
  author={Meng-Hao Guo and Zheng-Ning Liu and Tai-Jiang Mu and Shimin Hu},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.02358}
}
Attention mechanisms, especially self-attention, have played an increasingly important role in deep feature representation for visual tasks. Self-attention updates the feature at each position by computing a weighted sum of features, using pairwise affinities across all positions to capture long-range dependencies within a single sample. However, self-attention has quadratic complexity and ignores potential correlations between different samples. This paper proposes a novel attention mechanism… 
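For concreteness, here is a minimal PyTorch sketch of the mechanism the abstract describes: instead of computing pairwise affinities within the input, each position attends to two small, learnable external memory units implemented as linear layers, so the cost is linear in the number of positions. The layer names, the memory size S=64, and the epsilon in the normalization are illustrative assumptions, not the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalAttention(nn.Module):
    """Sketch of external attention: two linear layers act as shared,
    learnable key/value memories (M_k, M_v) independent of the input,
    giving O(N*S) cost instead of self-attention's O(N^2)."""
    def __init__(self, d_model: int, S: int = 64):  # S: assumed memory size
        super().__init__()
        self.mk = nn.Linear(d_model, S, bias=False)  # memory M_k: features -> attention logits
        self.mv = nn.Linear(S, d_model, bias=False)  # memory M_v: attention -> output features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model), where N is the number of positions (e.g. H*W)
        attn = F.softmax(self.mk(x), dim=1)                   # normalize over positions
        attn = attn / (1e-9 + attn.sum(dim=2, keepdim=True))  # then l1-normalize over the memory axis
        return self.mv(attn)                                  # (batch, N, d_model)

# Usage: a flattened 7x7 feature map with 512 channels
x = torch.randn(2, 49, 512)
print(ExternalAttention(d_model=512)(x).shape)  # torch.Size([2, 49, 512])
```

Because the two memories are learned across the whole dataset rather than derived from each input, the mechanism can also pick up correlations between samples, which is the second limitation of self-attention the abstract points to.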

Citations

TransLoc3D : Point Cloud based Large-scale Place Recognition using Adaptive Receptive Fields
TLDR
A novel method named TransLoc3D is proposed, utilizing adaptive receptive fields with a point-wise reweighting scheme to handle objects of different sizes while suppressing noise, and an external transformer to capture long-range feature dependencies.
Excavating RoI Attention for Underwater Object Detection
TLDR
This paper adopts the external attention module, a modified self-attention with fewer parameters, and proposes a double-head structure and a positional encoding module that achieve promising performance in object detection.
S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
TLDR
This paper improves the S2-MLP vision backbone by expanding the feature map along the channel dimension, splitting the expanded feature map into several parts, and exploiting the split-attention operation to fuse these parts.
Visual Attention Network
TLDR
A novel large kernel attention (LKA) module is proposed to enable self-adaptive and long-range correlations while avoiding the shortcomings of self-attention, and a novel neural network based on LKA, the Visual Attention Network (VAN), is introduced.
MDMLP: Image Classification from Scratch on Small Datasets with MLP
TLDR
A conceptually simple and lightweight MLP-based architecture that achieves SOTA when trained from scratch on small datasets, and a novel, efficient attention mechanism based on MLPs that highlights objects in images, indicating its explanatory power.
Deep Instance Segmentation with Automotive Radar Detection Points
TLDR
An efficient method based on clustering of estimated semantic information achieves instance segmentation for sparse radar detection points, and it is shown that the performance of the proposed approach can be further enhanced by incorporating the visual multi-layer perceptron.
Can Attention Enable MLPs To Catch Up With CNNs?
TLDR
A brief history of learning architectures, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and transformers, is given, along with views on challenges and directions for new learning architectures, in the hope of inspiring future research.
Enhanced Attention Framework for Multi-Interest Sequential Recommendation
TLDR
This work proposes an Enhanced Attention (EA) framework based on two linear layers and two normalization layers, which not only reduces the high computational complexity but also captures correlations between different samples.
A Dual-fusion Semantic Segmentation Framework With GAN For SAR Images
TLDR
A network based on the widely used encoder-decoder architecture is proposed to accomplish synthetic aperture radar (SAR) image segmentation via a generative adversarial network (GAN) trained on numerous SAR and optical images.
High-resolution image-based surface defect detection method for hot-rolled strip steel
TLDR
The ResNet152 model is improved by adding a feature extraction layer at the front of the model so that it can extract information from high-resolution images more effectively.
...

References

Showing 1-10 of 103 references
A2-Nets: Double Attention Networks
TLDR
This work proposes the "double attention block", a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access features from the entire space efficiently.
Expectation-Maximization Attention Networks for Semantic Segmentation
TLDR
This paper formulates the attention mechanism in an expectation-maximization manner and iteratively estimates a much more compact set of bases upon which the attention maps are computed, an approach that is robust to input variance and friendly in memory and computation.
Dual Attention Network for Scene Segmentation
TLDR
New state-of-the-art performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context, and COCO Stuff, is achieved without using coarse data.
Squeeze-and-Attention Networks for Semantic Segmentation
TLDR
A novel squeeze-and-attention network (SANet) architecture is proposed that leverages an effective squeeze-and-attention (SA) module to account for two distinctive characteristics of segmentation: i) pixel-group attention, and ii) pixel-wise prediction.
Self-Attention Generative Adversarial Networks
TLDR
The proposed SAGAN achieves state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing the Fréchet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset.
Rethinking Attention with Performers
TLDR
Performers are introduced: Transformer architectures that can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, using only linear space and time complexity and without relying on priors such as sparsity or low-rankness.
CCNet: Criss-Cross Attention for Semantic Segmentation
TLDR
This work proposes a Criss-Cross Network (CCNet) for obtaining contextual information in a more effective and efficient way, achieving mIoU scores of 81.4 and 45.22 on the Cityscapes test set and the ADE20K validation set, respectively, both new state-of-the-art results.
A Survey on Visual Transformer
TLDR
A literature review of visual transformer models is provided, categorizing them by task and analyzing the advantages and disadvantages of these methods.
Recurrent Models of Visual Attention
TLDR
A novel recurrent neural network model is presented that extracts information from an image or video by adaptively selecting a sequence of regions or locations and processing only the selected regions at high resolution.
Squeeze-and-Excitation Networks
TLDR
This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked to form SENet architectures that generalise extremely effectively across different datasets (a sketch follows this list).
...
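As a point of contrast with external attention, here is a minimal PyTorch sketch of the Squeeze-and-Excitation block summarized in the last reference above: channel attention built from a global "squeeze" and a gated "excitation" MLP. The reduction ratio of 16 is the paper's default; the exact layer layout here is an illustrative reconstruction, not the authors' reference code.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of a Squeeze-and-Excitation block: global average pooling
    ("squeeze") yields a per-channel descriptor, a bottleneck MLP with a
    sigmoid gate ("excitation") models channel interdependencies, and the
    resulting gates rescale the input channel-wise."""
    def __init__(self, channels: int, reduction: int = 16):  # reduction=16: paper default
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C, H, W)
        b, c = x.shape[:2]
        s = x.mean(dim=(2, 3))            # squeeze: (batch, C)
        w = self.fc(s).view(b, c, 1, 1)   # excitation: per-channel gates in (0, 1)
        return x * w                      # recalibrate channel responses

# Usage: rescale a 64-channel feature map
x = torch.randn(2, 64, 8, 8)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 8, 8])
```

Where external attention replaces spatial self-attention with shared memories, the SE block attends only over channels; the two are complementary rather than competing designs.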