• Corpus ID: 233394097

Visformer: The Vision-friendly Transformer

@article{Chen2021VisformerTV,
  title={Visformer: The Vision-friendly Transformer},
  author={Zhengsu Chen and Lingxi Xie and Jianwei Niu and Xuefeng Liu and Longhui Wei and Qi Tian},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.12533}
}
The past year has witnessed the rapid development of applying the Transformer module to vision problems. While some researchers have demonstrated that Transformer-based models enjoy a favorable ability to fit data, there is still a growing body of evidence showing that these models suffer from over-fitting, especially when the training data is limited. This paper offers an empirical study by performing step-by-step operations to gradually transit a Transformer-based model to a convolution-based… 
ViR: the Vision Reservoir
  • Xian Wei, Bin Wang, +7 authors Dongping Yang
  • Computer Science
    ArXiv
  • 2021
TLDR
The novel Vision Reservoir computing (ViR) method is proposed for image classification as a parallel to ViT; without any pre-training process, ViR outperforms ViT in terms of both model and computational complexity.
A Survey on Vision Transformer
TLDR
This paper reviews these vision transformer models by categorizing them into different tasks and analyzing their advantages and disadvantages, and takes a brief look at the self-attention mechanism in computer vision, as it is the base component of the Transformer.
KVT: k-NN Attention for Boosting Vision Transformers
TLDR
A sparse attention scheme, dubbed k-NN attention, is proposed that naturally inherits the local bias of CNNs without introducing convolutional operations; it allows for the exploration of long-range correlation and filters out irrelevant tokens by choosing the most similar tokens from the entire image.
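As a rough illustration of the k-NN attention idea summarized above, the sketch below keeps only the top-k attention logits per query and masks the rest before the softmax. PyTorch is assumed, and the module layout, head count, and default k are illustrative choices rather than the paper's implementation.

```python
import torch.nn as nn

class KNNAttention(nn.Module):
    """k-NN attention sketch: each query attends only to its k most similar keys."""
    def __init__(self, dim, num_heads=8, k=32):
        super().__init__()
        self.num_heads, self.k = num_heads, k
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, tokens, dim)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        # keep the k largest logits per query, mask everything else out
        kth = attn.topk(min(self.k, N), dim=-1).values[..., -1, None]
        attn = attn.masked_fill(attn < kth, float('-inf')).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```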
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
TLDR
A cross-shaped window self-attention mechanism is developed that computes self-attention in horizontal and vertical stripes in parallel, with each stripe obtained by splitting the input feature into stripes of equal width; together the stripes form a cross-shaped window.
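A minimal sketch of the stripe idea follows, assuming PyTorch: half of the channels attend within horizontal stripes and the other half within vertical stripes, which together cover a cross-shaped region. The stripe width, the head split, and the use of a plain nn.MultiheadAttention (without the paper's positional encodings) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CrossShapedWindowAttention(nn.Module):
    """Simplified cross-shaped window attention: horizontal + vertical stripes."""
    def __init__(self, dim, num_heads=8, sw=7):
        super().__init__()
        self.sw = sw
        self.attn_h = nn.MultiheadAttention(dim // 2, num_heads // 2, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim // 2, num_heads // 2, batch_first=True)

    def _stripe_attn(self, x, attn, vertical):
        B, H, W, C = x.shape
        if vertical:                             # treat columns as rows
            x = x.transpose(1, 2)
            H, W = W, H
        sw = self.sw                             # H is assumed divisible by sw
        x = x.reshape(B, H // sw, sw, W, C).reshape(B * (H // sw), sw * W, C)
        x, _ = attn(x, x, x)                     # attention inside each stripe
        x = x.reshape(B, H // sw, sw, W, C).reshape(B, H, W, C)
        return x.transpose(1, 2) if vertical else x

    def forward(self, x):                        # x: (B, H, W, C)
        x_h, x_v = x.chunk(2, dim=-1)
        return torch.cat([self._stripe_attn(x_h, self.attn_h, False),
                          self._stripe_attn(x_v, self.attn_v, True)], dim=-1)
```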
Scaled ReLU Matters for Training Vision Transformers
TLDR
It is verified, both theoretically and empirically, that scaled ReLU in the conv-stem matters for robust ViT training: it not only improves training stability but also increases the diversity of patch tokens, boosting peak performance by a large margin while adding few parameters and FLOPs.
MPViT: Multi-Path Vision Transformer for Dense Prediction
TLDR
This work explores multi-scale patch embedding and a multi-path structure, constructing the Multi-Path Vision Transformer (MPViT), which consistently achieves superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation.
Complementary Feature Enhanced Network with Vision Transformer for Image Dehazing
  • Dong Zhao, Jia Li, Hongyu Li, Long Xu
  • Computer Science
  • 2021
TLDR
A new complementary-feature-enhanced framework is proposed, in which complementary features are learned by several complementary subtasks and then together serve to boost the performance of the primary task; a new dehazing network is designed based on this framework.
Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training
TLDR
This paper strives to liberate ViTs from pre-training by introducing CNNs’ inductive biases back into ViTs while preserving their network architectures for a higher upper bound, and by setting up more suitable optimization objectives.
ELSA: Enhanced Local Self-Attention for Vision Transformer
  • Jingkai Zhou, Pichao Wang, Fan Wang, Qiong Liu, Hao Li, Rong Jin
  • Computer Science
    ArXiv
  • 2021
TLDR
It is found that the devil lies in the generation and application of spatial attention, where relative position embeddings and the neighboring filter application are key factors; the enhanced local self-attention (ELSA), with Hadamard attention and a ghost head, is proposed.
Early Convolutions Help Transformers See Better
TLDR
This work conjectures that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p, p×p convolution applied to the input image and runs counter to typical design choices for convolutional layers in neural networks.
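To make the contrast concrete, the sketch below shows the standard patchify stem next to a small stack of stride-2 3×3 convolutions reaching the same overall downsampling, assuming PyTorch; the channel schedule and embedding width are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

# Standard ViT "patchify" stem: a single stride-p, p x p convolution (p = 16 here)
# that maps the image directly to non-overlapping patch embeddings.
patchify_stem = nn.Conv2d(3, 384, kernel_size=16, stride=16)

# Early-convolution alternative: a few stride-2 3x3 convolutions with
# nonlinearities, reaching the same 16x downsampling more gradually.
def early_conv_stem(embed_dim=384):
    chans = [3, 48, 96, 192, embed_dim]
    layers = []
    for c_in, c_out in zip(chans[:-1], chans[1:]):
        layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(embed_dim, embed_dim, 1))   # final 1x1 projection
    return nn.Sequential(*layers)
```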

References

Showing 1-10 of 69 references
AutoFormer: Searching Transformers for Visual Recognition
TLDR
This work proposes a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search; the searched models surpass recent state-of-the-art models such as ViT and DeiT in top-1 accuracy on ImageNet.
CvT: Introducing Convolutions to Vision Transformers
TLDR
A new architecture, named Convolutional vision Transformer (CvT), is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs.
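One way convolutions enter the attention block in CvT-style designs is through a convolutional projection, sketched below under the assumption of PyTorch: queries, keys, and values are produced by depthwise convolutions over the 2D token map rather than plain linear layers. The single depthwise conv per branch and the kernel size are simplifications, not the paper's exact depthwise-separable projection.

```python
import torch.nn as nn

class ConvProjection(nn.Module):
    """Sketch: depthwise convolutions produce q, k, v from a 2D token map."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.q = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.k = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.v = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)

    def forward(self, x):                          # x: (B, C, H, W) token map
        to_seq = lambda t: t.flatten(2).transpose(1, 2)   # -> (B, H*W, C)
        return to_seq(self.q(x)), to_seq(self.k(x)), to_seq(self.v(x))
```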
Scaling Vision Transformers
TLDR
A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well on few-shot learning.
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
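The distillation-token mechanism can be sketched as follows, assuming PyTorch and an off-the-shelf transformer encoder: a learned distillation token is appended next to the class token, and its output is supervised by the teacher's predictions while the class token keeps the usual label loss. Dimensions, depth, and the plain nn.TransformerEncoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DistilledViTSketch(nn.Module):
    """Class token + distillation token prepended to the patch sequence."""
    def __init__(self, embed_dim=384, num_heads=6, depth=12, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(embed_dim, num_classes)       # supervised by labels
        self.head_dist = nn.Linear(embed_dim, num_classes)  # supervised by the teacher

    def forward(self, patch_tokens):            # patch_tokens: (B, N, embed_dim)
        B = patch_tokens.shape[0]
        tokens = torch.cat([self.cls_token.expand(B, -1, -1),
                            self.dist_token.expand(B, -1, -1),
                            patch_tokens], dim=1)
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0]), self.head_dist(tokens[:, 1])
```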
Going deeper with Image Transformers
TLDR
This work builds and optimizes deeper transformer networks for image classification and investigates the interplay of architecture and optimization in such dedicated transformers, making two architecture changes that significantly improve the accuracy of deep transformers.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
TLDR
A new vision Transformer is presented that capably serves as a general-purpose backbone for computer vision and has the flexibility to model at various scales and has linear computational complexity with respect to image size.
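The windowed attention behind that linear complexity can be sketched as a simple partition step, assuming PyTorch: tokens are grouped into non-overlapping windows and attention is computed independently inside each one, with the windows shifted between consecutive blocks so information can cross window boundaries.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows.
    Attention is then computed independently inside each window, so the cost
    grows linearly with image size. H and W are assumed divisible by window_size."""
    B, H, W, C = x.shape
    ws = window_size
    x = x.reshape(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

# Between consecutive blocks the feature map can be cyclically shifted,
# e.g. torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2)),
# so that the next round of window attention mixes across old window borders.
```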
How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
TLDR
This study trains ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
PVTv2: Improved Baselines with Pyramid Vision Transformer
TLDR
This work presents new baselines by improving the original Pyramid Vision Transformer (abbreviated as PVTv1) by adding three designs, including (1) overlapping patch embedding, (2) convolutional feedforward networks, and (3) linear complexity attention layers.
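The first of those designs, overlapping patch embedding, amounts to a convolution whose kernel is larger than its stride so that neighbouring patches share pixels; a sketch follows, assuming PyTorch, with the common 7×7/stride-4 stem choice used purely for illustration.

```python
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    """Overlapping patch embedding: kernel_size > stride, so patches overlap."""
    def __init__(self, in_ch=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/stride, W/stride)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, H*W, C) token sequence
        return self.norm(x), (H, W)
```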
CoAtNet: Marrying Convolution and Attention for All Data Sizes
TLDR
CoAtNets (pronounced “coat” nets) are presented, a family of hybrid models built from two key insights: (1) depthwise convolution and self-attention can be naturally unified via simple relative attention, and (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity, and efficiency.
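The "simple relative attention" mentioned above can be illustrated as adding a learned bias, indexed by the relative offset between token positions, to the attention logits before the softmax. The single-head, unprojected sketch below assumes PyTorch and is only meant to show the bias indexing, not the paper's full block.

```python
import torch
import torch.nn as nn

class RelativeAttention2d(nn.Module):
    """Attention logits = scaled dot product + learned bias per relative (dy, dx)."""
    def __init__(self, dim, h, w):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # one learnable bias per possible relative offset
        self.rel_bias = nn.Parameter(torch.zeros((2 * h - 1) * (2 * w - 1)))
        coords = torch.stack(torch.meshgrid(torch.arange(h), torch.arange(w),
                                            indexing="ij"), dim=-1).reshape(-1, 2)
        rel = coords[:, None, :] - coords[None, :, :]            # (N, N, 2)
        idx = (rel[..., 0] + h - 1) * (2 * w - 1) + (rel[..., 1] + w - 1)
        self.register_buffer("rel_idx", idx)                     # (N, N)

    def forward(self, x):                                        # x: (B, N = h*w, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn + self.rel_bias[self.rel_idx]                # add relative bias
        return attn.softmax(dim=-1) @ v
```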
Bottleneck Transformers for Visual Recognition
TLDR
BoTNet is presented, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection, and instance segmentation; a simple adaptation of the BoTNet design for image classification is also described.
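In the spirit of that design, the sketch below replaces the spatial convolution of a bottleneck-style block with global multi-head self-attention over the flattened feature map; PyTorch, the channel sizes, and the use of plain nn.MultiheadAttention without relative position encodings are all assumptions for illustration.

```python
import torch.nn as nn

class BottleneckSelfAttentionBlock(nn.Module):
    """Bottleneck block whose spatial mixing is done by self-attention."""
    def __init__(self, in_ch, bottleneck_ch, num_heads=4):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, bottleneck_ch, 1)     # 1x1 channel reduction
        self.attn = nn.MultiheadAttention(bottleneck_ch, num_heads, batch_first=True)
        self.expand = nn.Conv2d(bottleneck_ch, in_ch, 1)     # 1x1 channel expansion

    def forward(self, x):                        # x: (B, C, H, W)
        y = self.reduce(x)
        B, C, H, W = y.shape
        seq = y.flatten(2).transpose(1, 2)       # (B, H*W, C) tokens
        seq, _ = self.attn(seq, seq, seq)        # global self-attention
        y = seq.transpose(1, 2).reshape(B, C, H, W)
        return x + self.expand(y)                # residual connection
```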