• Corpus ID: 233864910

Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks

  • Meng-Hao Guo, Zheng-Ning Liu, Tai-Jiang Mu, Shimin Hu
Attention mechanisms, especially self-attention, have played an increasingly important role in deep feature representation for visual tasks. Self-attention updates the feature at each position by computing a weighted sum of features using pair-wise affinities across all positions to capture the long-range dependency within a single sample. However, self-attention has quadratic complexity and ignores potential correlation between different samples. This paper proposes a novel attention mechanism… 
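A minimal sketch of the proposed mechanism, assuming the paper's formulation: the two linear layers act as learnable external memories `Mk` and `Mv` shared across samples, and the attention map is double-normalized (softmax over positions, then l1 over memory units). The memory size `S`, the variable names, and the random inputs are illustrative, not the authors' implementation.

```python
import numpy as np

def external_attention(F, Mk, Mv):
    """External attention over input features F of shape (N, d).

    Mk, Mv: (S, d) learnable external memory units (the "two linear layers").
    Cost is O(N * S * d), i.e. linear in the number of positions N,
    versus the O(N^2 * d) of self-attention.
    """
    attn = F @ Mk.T                                          # (N, S) similarities
    # Double normalization: softmax along the position axis N,
    # then l1-normalize each row along the memory axis S.
    attn = np.exp(attn - attn.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)            # softmax over N
    attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-9)   # l1 over S
    return attn @ Mv                                         # (N, d) updated features

rng = np.random.default_rng(0)
N, d, S = 6, 4, 8                     # positions, channels, memory size
F = rng.standard_normal((N, d))
Mk = rng.standard_normal((S, d))
Mv = rng.standard_normal((S, d))
out = external_attention(F, Mk, Mv)
print(out.shape)                      # (6, 4)
```

Because `Mk` and `Mv` are independent of the input, they can act as memories of the whole training set, which is how the mechanism captures correlation between different samples.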

Figures and Tables from this paper

TransLoc3D : Point Cloud based Large-scale Place Recognition using Adaptive Receptive Fields
A novel method named TransLoc3D is proposed, utilizing adaptive receptive fields with a point-wise reweighting scheme to handle objects of different sizes while suppressing noise, and an external transformer to capture long-range feature dependencies.
Excavating RoI Attention for Underwater Object Detection
This paper chooses the external attention module, a modified self-attention with reduced parameters, and proposes a double-head structure and a Positional Encoding module that achieve promising performance in object detection.
S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
This paper improves the S2-MLP vision backbone by expanding the feature map along the channel dimension, splitting the expanded feature map into several parts, and exploiting the split-attention operation to fuse these parts.
Visual Attention Network
A novel linear attention named large kernel attention (LKA) is proposed to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings and a neural network based on LKA is presented, namely Visual Attention Network (VAN).
MDMLP: Image Classification from Scratch on Small Datasets with MLP
A conceptually simple and lightweight MLP-based architecture that achieves SOTA when training from scratch on small-size datasets, together with a novel and efficient MLP-based attention mechanism that highlights objects in images, indicating its explanatory power.
Deep Instance Segmentation with Automotive Radar Detection Points
An efficient method based on clustering of estimated semantic information to achieve instance segmentation for the sparse radar detection points, and it is shown that the performance of the proposed approach can be further enhanced by incorporating the visual multi-layer perceptron.
Can Attention Enable MLPs To Catch Up With CNNs?
A brief history of learning architectures is given, covering multilayer perceptrons (MLPs), convolutional neural networks (CNNs) and transformers, along with views on challenges and directions for new learning architectures, hoping to inspire future research.
AGS-SSD: Attention-Guided Sampling for 3D Single-Stage Detector
An attention-guided downsampling method for point-cloud-based 3D object detection, named AGS-SSD, which achieves significant improvements over the baseline with novel architectures and runs at 24 frames per second for inference.
An Improved Tiered Head Pose Estimation Network with Self-Adjust Loss Function
A THESL-Net (tiered head pose estimation with self-adjustment loss network) model is proposed, gaining greater freedom during angle estimation and outperforming the state-of-the-art approaches.
Eliminating Gradient Conflict in Reference-based Line-Art Colorization
This work proposes a novel attention mechanism, Stop-Gradient Attention (SGA), outperforming the attention baseline by a large margin with better training stability, and demonstrates significant improvements in Fréchet Inception Distance and structural similarity index measure on several benchmarks.


A2-Nets: Double Attention Networks
This work proposes the "double attention block", a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access features from the entire space efficiently.
Expectation-Maximization Attention Networks for Semantic Segmentation
This paper formulates the attention mechanism in an expectation-maximization manner and iteratively estimates a much more compact set of bases upon which the attention maps are computed, which is robust to the variance of the input and is also friendly in memory and computation.
Dual Attention Network for Scene Segmentation
New state-of-the-art segmentation performance is achieved on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff, without using coarse data.
Squeeze-and-Attention Networks for Semantic Segmentation
A novel squeeze-and-attention network (SANet) architecture is proposed that leverages an effective squeeze-and-attention (SA) module to account for two distinctive characteristics of segmentation: i) pixel-group attention, and ii) pixel-wise prediction.
Self-Attention Generative Adversarial Networks
The proposed SAGAN achieves state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Fréchet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset.
Rethinking Attention with Performers
Performers are introduced: Transformer architectures that can estimate regular (softmax) full-rank attention with provable accuracy, but using only linear space and time complexity, without relying on any priors such as sparsity or low-rankness.
CCNet: Criss-Cross Attention for Semantic Segmentation
  • Zilong Huang, Xinggang Wang, Wenyu Liu
  • Computer Science
    2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
This work proposes a Criss-Cross Network (CCNet) for obtaining contextual information in a more effective and efficient way, achieving mIoU scores of 81.4 and 45.22 on the Cityscapes test set and the ADE20K validation set, respectively, which are new state-of-the-art results.
A Survey on Visual Transformer
A literature review of visual transformer models is provided, categorizing them by task and analyzing the advantages and disadvantages of these methods.
Recurrent Models of Visual Attention
A novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution is presented.
Squeeze-and-Excitation Networks
This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.
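The channel-recalibration idea described above can be sketched in a few lines. This is an illustrative numpy version of the squeeze (global average pooling) and excitation (two FC layers with a bottleneck of reduction ratio `r`, then a sigmoid gate) steps; the weight shapes and random inputs are assumptions for demonstration, not the SENet implementation.

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map.

    w1: (C//r, C) and w2: (C, C//r) are the two FC layers of the
    excitation step, where r is the channel reduction ratio.
    """
    z = x.mean(axis=(1, 2))              # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)          # bottleneck FC + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))  # FC + sigmoid -> per-channel gates in (0, 1)
    return x * s[:, None, None]          # excite: rescale each channel

rng = np.random.default_rng(1)
C, r, H, W = 8, 2, 4, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = se_block(x, w1, w2)
print(y.shape)  # (8, 4, 4)
```

Because the gates lie in (0, 1), the block can only scale channels down or leave them nearly unchanged, which is what "recalibrating channel-wise feature responses" amounts to in practice.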