Visformer: The Vision-friendly Transformer

  • Zhengsu Chen, Lingxi Xie, Jianwei Niu, Xuefeng Liu, Longhui Wei, Qi Tian
  • Published 26 April 2021
  • Computer Science
  • 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
The past year has witnessed the rapid development of applying the Transformer module to vision problems. While some researchers have demonstrated that Transformer-based models enjoy a favorable ability to fit data, there is still growing evidence that these models suffer from over-fitting, especially when the training data is limited. This paper offers an empirical study that performs step-by-step operations to gradually transition a Transformer-based model into a convolution-based… 

Out of Distribution Performance of State of Art Vision Model

This study investigates the performance of 58 state-of-the-art computer vision models in a unified training setup, covering not only attention- and convolution-based models but also neural networks that combine convolution and attention mechanisms, sequence-based models, complementary search, and network-based methods.

A Survey on Vision Transformer

  • Kai Han, Yunhe Wang, D. Tao
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2022
This paper reviews these vision transformer models by categorizing them by task and analyzing their advantages and disadvantages, and takes a brief look at the self-attention mechanism in computer vision, as it is the base component of the Transformer.

Effective Vision Transformer Training: A Data-Centric Perspective

A novel data-centric ViT training framework is proposed that dynamically measures the “difficulty” of training samples and generates “effective” samples for models at different training stages, addressing two critical questions: how to measure the “effectiveness” of individual training examples, and how to systematically generate enough “effective” examples when they run out.

Using CNN to improve the performance of the Light-weight ViT

  • Xiaopeng Li, Shuqin Li
  • Computer Science
    2022 International Joint Conference on Neural Networks (IJCNN)
  • 2022
The proposed Multiscale Patch Embedding (MPE) module, based on a convolutional neural network, can provide higher-level features for the ViT module and improves the performance of light-weight ViT-based models such as Swin and DeiT by 2.9%-15.4%.

Transformers Meet Visual Learning Understanding: A Comprehensive Review

This review mainly investigates the current research progress of the Transformer in image and video applications, providing a comprehensive overview of the Transformer in visual learning understanding.

Regularizing self-attention on vision transformers with 2D spatial distance loss

A self-attention regularization mechanism based on two-dimensional distance information on an image is proposed, together with a new loss function, denoted Distance Loss, formulated specifically for the Transformer encoder; it outperforms a Vision Transformer of similar capacity by large margins on all tasks.
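As a rough illustration of the idea, the sketch below penalizes attention mass placed on spatially distant patches using 2D patch coordinates. The function names and the exact weighting are hypothetical; the paper's actual Distance Loss formulation may differ.

```python
import numpy as np

def patch_coords(grid):
    # (grid*grid, 2) row/col coordinates of each patch token
    ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)

def distance_loss(attn, grid):
    """Penalize attention mass placed on spatially distant patches.
    attn: (N, N) row-stochastic attention map over grid*grid patches.
    Returns the mean attention-weighted 2D distance per query token."""
    c = patch_coords(grid)
    d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)  # (N, N)
    return float((attn * d).sum() / attn.shape[0])

# Identity attention attends only to itself, so its spatial distance is zero;
# uniform attention spreads mass over distant patches and is penalized.
print(distance_loss(np.eye(9), 3))            # 0.0
print(distance_loss(np.full((9, 9), 1 / 9), 3))
```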

Enhancing Mask Transformer with Auxiliary Convolution Layers for Semantic Segmentation

This work proposes a simple yet effective architecture that introduces auxiliary branches to Mask2Former during training to capture dense local features on the encoder side and can achieve state-of-the-art performance on the ADE20K and Cityscapes datasets.

Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

This work proposes a simple but effective Self-Supervised Learning (SSL) strategy to train ViTs that, without any external annotation or external data, can significantly improve the results.

Part-based Face Recognition with Vision Transformers

By learning to extract discriminative patches, the part-based Transformer further boosts the accuracy of the Vision Transformer baseline achieving state-of-the-art accuracy on several face recognition benchmarks.

Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

The Dynamic Hybrid Vision Transformer (DHVT) is proposed, in which convolution is integrated into the patch embedding and multi-layer perceptron modules, forcing the model to capture token features as well as their neighboring features; with this design, the performance gap between CNNs and ViTs is eliminated.
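A minimal sketch of convolutional patch embedding, the kind of design DHVT and similar hybrids build on: a strided convolution whose kernel is larger than the patch stride, so each token also sees neighboring pixels. The kernel sizes and function names here are illustrative, not DHVT's actual configuration.

```python
import numpy as np

def conv_patch_embed(img, kernels, patch):
    """Strided convolution as patch embedding: each patch token is a
    weighted sum over an overlapping receptive field, so neighboring
    pixels leak into adjacent tokens (unlike a plain linear projection
    over disjoint patches).
    img: (H, W) grayscale image, kernels: (D, k, k) with k >= patch."""
    D, k, _ = kernels.shape
    H, W = img.shape
    pad = (k - patch) // 2          # pad so every patch has a full field
    x = np.pad(img, pad)
    tokens = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            field = x[i:i + k, j:j + k]
            tokens.append((kernels * field).sum(axis=(1, 2)))
    return np.stack(tokens)          # (num_patches, D)

img = np.ones((8, 8))
emb = conv_patch_embed(img, np.ones((16, 6, 6)), patch=4)
print(emb.shape)  # (4, 16): four 4x4 patches, each a 16-dim token
```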

AutoFormer: Searching Transformers for Visual Recognition

This work proposes a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search; the searched models surpass recent state-of-the-art models such as ViT and DeiT in top-1 accuracy on ImageNet.

CvT: Introducing Convolutions to Vision Transformers

A new architecture is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs; the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in this model.

Pre-Trained Image Processing Transformer

To maximally exploit the capability of the Transformer, the IPT model is presented, which utilizes the well-known ImageNet benchmark to generate a large number of corrupted image pairs; contrastive learning is introduced so the model adapts well to different image processing tasks.

Scaling Vision Transformers

A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well for few-shot transfer.

Going deeper with Image Transformers

This work builds and optimizes deeper transformer networks for image classification, investigates the interplay of architecture and optimization in such dedicated transformers, and makes two architecture changes that significantly improve the accuracy of deep transformers.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  • Ze Liu, Yutong Lin, B. Guo
  • Computer Science
    2021 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2021
A hierarchical Transformer whose representation is computed with shifted windows is proposed; it has the flexibility to model at various scales, has linear computational complexity with respect to image size, and its design also proves beneficial for all-MLP architectures.
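The shifted-window scheme can be illustrated with plain numpy: partition the feature map into non-overlapping windows, then cyclically shift by half a window so the next layer's windows straddle the previous boundaries. This is a simplified sketch; Swin also masks attention across wrapped-around regions, which is omitted here.

```python
import numpy as np

def window_partition(x, w):
    # x: (H, W, C) feature map -> (num_windows, w*w, C) non-overlapping windows
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def shifted_windows(x, w):
    # Cyclically shift by w//2 so the next layer's windows straddle
    # the previous layer's window boundaries (the shifted-window idea).
    shifted = np.roll(x, shift=(-(w // 2), -(w // 2)), axis=(0, 1))
    return window_partition(shifted, w)

x = np.arange(8 * 8 * 1).reshape(8, 8, 1)
print(window_partition(x, 4).shape)  # (4, 16, 1)
print(shifted_windows(x, 4).shape)   # (4, 16, 1)
```

Attention is then computed only within each window, which is what keeps the cost linear in image size rather than quadratic.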

PVTv2: Improved Baselines with Pyramid Vision Transformer

This work improves the original Pyramid Vision Transformer (PVT v1) by adding three designs: a linear complexity attention layer, an overlapping patch embedding, and a convolutional feed-forward network to reduce the computational complexity of PVT v1 to linearity and provide significant improvements on fundamental vision tasks.
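The linear-complexity attention idea, reducing the number of key/value tokens before attending, can be sketched as follows. Average pooling stands in for PVTv2's learned spatial-reduction layer; the names and shapes are illustrative, not the actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pooled_attention(q, kv, r):
    """Attention with spatially reduced keys/values: average-pool the
    N key/value tokens down to N//r, so the score matrix is (N, N//r)
    and the cost is O(N * N/r) instead of O(N^2).
    q: (N, d) queries, kv: (N, d) tokens used for keys and values."""
    N, d = kv.shape
    kv_small = kv.reshape(N // r, r, d).mean(axis=1)   # (N//r, d)
    scores = q @ kv_small.T / np.sqrt(d)               # (N, N//r)
    return softmax(scores) @ kv_small                  # (N, d)

q = np.random.randn(16, 8)
kv = np.random.randn(16, 8)
print(pooled_attention(q, kv, 4).shape)  # (16, 8)
```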

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.

CoAtNet: Marrying Convolution and Attention for All Data Sizes

CoAtNets (pronounced “coat” nets) are a family of hybrid models built from two key insights: depthwise convolution and self-attention can be naturally unified via simple relative attention, and vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity, and efficiency.

How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

A systematic empirical study finds that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data.