• Corpus ID: 245353696

Lite Vision Transformer with Enhanced Self-Attention

  • Chenglin Yang, Yilin Wang, Jianming Zhang, He Zhang, Zijun Wei, Zhe L. Lin, Alan Loddon Yuille
Despite the impressive representation capacity of vision transformer models, current light-weight vision transformer models still suffer from inconsistent and incorrect dense predictions in local regions. We suspect that the power of their self-attention mechanism is limited in shallower and thinner networks. We propose Lite Vision Transformer (LVT), a novel light-weight transformer network with two enhanced self-attention mechanisms that improve model performance for mobile deployment. For… 
A Closer Look at Self-supervised Lightweight Vision Transformers
This work develops recipes for pre-training high-performance lightweight ViTs with a masked-image-modeling-based MAE, namely MAE-lite, and reveals that properly learned lower layers of the pre-trained models matter more than higher ones in data-sufficient downstream tasks.
EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm
A novel pyramid EATFormer backbone that only contains the proposed EA-based Transformer (EAT) block is proposed, which consists of three residual parts, i.e., Multi-Scale Region Aggregation (MSRA), Global and Local Interaction (GLI), and Feed-Forward Network (FFN) modules, to model multi-scale, interactive, and individual information separately.
Learning Target-aware Representation for Visual Tracking via Informative Interactions
A general interaction modeler (GIM) that injects the prior knowledge of reference image to different stages of the backbone network, leading to better target-perception and robust distractor-resistance of candidate feature representation with negligible computation cost is introduced.
HierAttn: Effectively Learn Representations from Stage Attention and Branch Attention for Skin Lesions Diagnosis
HierAttn is introduced, a lite hierarchical neural network that combines stage attention and branch attention to learn local and global features across a multi-stage hierarchy, achieving the best top-1 accuracy and AUC among state-of-the-art mobile networks.
DeepViT: Towards Deeper Vision Transformer
This paper proposes a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost and makes it feasible to train deeper ViTs with consistent performance improvements via minor modification to existing ViT models.
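The Re-attention idea, regenerating attention maps by mixing the per-head maps with a learnable head-to-head matrix before applying them to the values, can be sketched in a few lines of numpy (a simplified illustration, not the authors' code: the paper's Norm step is approximated by row re-normalization, and `theta` stands in for the learnable mixing matrix):

```python
import numpy as np

def re_attention(attn, theta):
    """DeepViT-style Re-attention sketch.

    attn:  (H, N, N) per-head, already-softmaxed attention maps
    theta: (H, H) head-mixing matrix (learnable in the real model)
    """
    mixed = np.einsum('hg,gnm->hnm', theta, attn)  # blend maps across heads
    mixed = np.maximum(mixed, 1e-9)                # keep rows strictly positive
    return mixed / mixed.sum(-1, keepdims=True)    # re-normalize rows to sum to 1

H, N = 4, 6
attn = np.random.rand(H, N, N)
attn /= attn.sum(-1, keepdims=True)                # mimic softmax output
theta = np.random.rand(H, H)
out = re_attention(attn, theta)
assert out.shape == (H, N, N)
assert np.allclose(out.sum(-1), 1.0)
```

Because the mixing happens on the (N, N) maps rather than the tokens, the extra cost is only an H x H matrix multiply per layer, which is why the paper reports negligible overhead.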
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
This paper proposes a dual-branch transformer to combine image patches of different sizes to produce stronger image features and develops a simple yet effective token fusion module based on cross attention, which uses a single token for each branch as a query to exchange information with other branches.
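The cross-attention fusion described above can be sketched by letting one branch's class token attend over the other branch's patch tokens (a single-head sketch with identity projections assumed, so it only illustrates the information flow, not the full module):

```python
import numpy as np

def cross_attention_fuse(cls_tok, patches):
    """CrossViT-style fusion sketch: a (1, C) class token from one branch
    queries the (M, C) patch tokens of the other branch."""
    C = cls_tok.shape[-1]
    scores = cls_tok @ patches.T / np.sqrt(C)   # (1, M) similarity scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # softmax over patch tokens
    return w @ patches                          # fused class token, shape (1, C)

fused = cross_attention_fuse(np.random.rand(1, 32), np.random.rand(49, 32))
assert fused.shape == (1, 32)
```

Using a single query token keeps the fusion cost linear in the number of patch tokens, which is the efficiency argument the paper makes.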
CvT: Introducing Convolutions to Vision Transformers
A new architecture is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs, and the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in this model.
Twins: Revisiting the Design of Spatial Attention in Vision Transformers
This work revisits the design of the spatial attention and demonstrates that a carefully devised yet simple spatial attention mechanism performs favorably against the state-of-the-art schemes.
PVTv2: Improved Baselines with Pyramid Vision Transformer
This work improves the original Pyramid Vision Transformer (PVT v1) by adding three designs: a linear complexity attention layer, an overlapping patch embedding, and a convolutional feed-forward network to reduce the computational complexity of PVT v1 to linearity and provide significant improvements on fundamental vision tasks.
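The linear-complexity attention in PVTv2 pools the keys and values to a small fixed size before attention, so cost grows linearly with the number of query tokens. A minimal single-head numpy sketch (identity Q/K/V projections assumed; the real layer adds learned projections and a depth-wise conv):

```python
import numpy as np

def avg_pool2d(x, p):
    """Average-pool an (H, W, C) map over non-overlapping p x p blocks."""
    H, W, C = x.shape
    return x.reshape(H // p, p, W // p, p, C).mean(axis=(1, 3))

def sra(x, p):
    """Spatial-reduction attention sketch: queries from every pixel,
    keys/values from the pooled map of (H*W)/p^2 tokens."""
    H, W, C = x.shape
    q = x.reshape(H * W, C)
    kv = avg_pool2d(x, p).reshape(-1, C)             # reduced key/value set
    scores = q @ kv.T / np.sqrt(C)                   # (H*W, H*W/p^2)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                    # row-wise softmax
    return (w @ kv).reshape(H, W, C)

out = sra(np.random.rand(8, 8, 16), p=4)
assert out.shape == (8, 8, 16)
```

With the pooled size held constant (PVTv2 pools to a fixed 7 x 7 grid), the score matrix has O(N) entries instead of O(N^2), which is where the linearity comes from.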
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.
VOLO: Vision Outlooker for Visual Recognition
A novel outlook attention is introduced, forming the Vision Outlooker (VOLO) architecture, which efficiently encodes finer-level features and contexts into tokens, shown to be critically beneficial to recognition performance but largely ignored by self-attention.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  • Ze Liu, Yutong Lin, B. Guo
  • Computer Science
    2021 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2021
A hierarchical Transformer whose representation is computed with shifted windows, which has the flexibility to model at various scales and has linear computational complexity with respect to image size; the design also proves beneficial for all-MLP architectures.
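The shifted-window scheme can be illustrated with a small numpy sketch: partition the feature map into non-overlapping windows (attention then runs within each window, so cost is linear in image size), and cyclically shift the map before partitioning so successive layers connect neighboring windows (illustrative only; the attention itself and the masking of wrapped-around border pixels are omitted):

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, C)

def shifted_windows(x, ws):
    """Cyclically shift by ws//2 before partitioning, as in Swin's
    alternating shifted-window layers."""
    shifted = np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))
    return window_partition(shifted, ws)

x = np.arange(8 * 8, dtype=float).reshape(8, 8, 1)
wins = window_partition(x, 4)            # 4 windows of 4 x 4
assert wins.shape == (4, 4, 4, 1)
assert np.allclose(wins[0], x[:4, :4])   # first window is the top-left block
assert shifted_windows(x, 4).shape == (4, 4, 4, 1)
```

Since each window holds a fixed number of tokens, doubling the image area doubles the window count but leaves per-window attention cost unchanged, giving the linear complexity the abstract claims.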
Stand-Alone Self-Attention in Vision Models
The results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox and is especially impactful when used in later layers.
Co-Scale Conv-Attentional Image Transformers
Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and convolution-like mechanisms, empowers image Transformers with enriched multi-scale and contextual modeling capabilities.