• Corpus ID: 235367962

Scaling Vision Transformers

@article{Zhai2021ScalingVT,
  title={Scaling Vision Transformers},
  author={Xiaohua Zhai and Alexander Kolesnikov and Neil Houlsby and Lucas Beyer},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.04560}
}
Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results; therefore, understanding a model’s scaling properties is a key to designing future generations effectively. While the laws for scaling Transformer language models have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up… 
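
The scaling analysis the abstract alludes to is typically summarized by fitting a simple functional form to (compute, error) measurements. As a rough illustration only, the sketch below fits a saturating power law, error = a·compute^(−b) + c, to synthetic points; the functional form, the data, and the constants are assumptions for illustration, not the paper's measurements.

# Illustrative only: fit a saturating power law error = a * compute**(-b) + c
# to synthetic points; the form, data, and constants are assumptions, not
# results from the paper.
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(compute, a, b, c):
    # c models the irreducible error floor that remains even at huge scale
    return a * compute ** (-b) + c

compute = np.array([1e2, 1e3, 1e4, 1e5, 1e6])     # hypothetical compute budgets
error = np.array([0.40, 0.27, 0.19, 0.15, 0.13])  # hypothetical downstream errors

(a, b, c), _ = curve_fit(saturating_power_law, compute, error, p0=[1.0, 0.3, 0.1])
print(f"fitted exponent b={b:.2f}, error floor c={c:.3f}")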

A Survey of Visual Transformers

TLDR
This survey comprehensively reviews over one hundred different visual Transformers according to three fundamental CV tasks and different data stream types, including the deformable attention module, which combines the best of the sparse spatial sampling of deformable convolution and the relation modeling capability of Transformers.

Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block

TLDR
This work follows a simple yet restrictive method for fine-tuning both CNN and Transformer models pretrained on ImageNet1K on CIFAR-10 and compares them with each other to understand which architecture is better when applied to real-world problems with small data.

VOLO: Vision Outlooker for Visual Recognition

TLDR
A new simple and generic architecture, termed Vision Outlooker (VOLO), which implements a novel outlook attention operation that dynamically conducts local feature aggregation in a sliding-window manner across the input image and can thereby more efficiently encode fine-level features that are essential for high-performance visual recognition.

Auto-scaling Vision Transformers without Training

TLDR
As-ViT, an auto-scaling framework for ViTs that requires no training, is proposed; it automatically discovers and scales up ViTs in an efficient and principled manner and uses a progressive tokenization strategy to train ViTs faster and cheaper.

Convolutional Bypasses Are Better Vision Transformer Adapters

TLDR
Experimental results on the VTAB-1k benchmark and few-shot learning datasets show that Convpass outperforms current language-oriented adaptation modules, demonstrating the need to tailor vision-oriented adaptation modules for vision models.

A Survey on Vision Transformer

  • Kai Han, Yunhe Wang, D. Tao
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2022
TLDR
This paper reviews these vision transformer models by categorizing them according to different tasks and analyzing their advantages and disadvantages, and takes a brief look at the self-attention mechanism in computer vision, as it is the base component of the Transformer.
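
Since several of the summaries above and below refer to self-attention as the basic building block, a minimal scaled dot-product self-attention sketch may help fix ideas; it is single-head, without the usual query/key/value projections, and purely illustrative.

# Minimal scaled dot-product self-attention: every token attends to every
# other token. Single head, no learned projections; illustrative only.
import torch
import torch.nn.functional as F

def self_attention(x):
    # x: (batch, tokens, dim)
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ x

print(self_attention(torch.randn(2, 197, 64)).shape)  # torch.Size([2, 197, 64])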

MaxViT: Multi-Axis Vision Transformer

TLDR
This paper introduces an efficient and scalable attention model that combines two components, blocked local attention and dilated global attention; the resulting MaxViT also shows strong generative modeling capability on ImageNet, demonstrating the potential of MaxViT blocks as a universal vision module.
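
The blocked local / dilated global split can be pictured as two complementary ways of grouping tokens before attention: non-overlapping windows (local) versus a sparse grid whose members are spread across the whole feature map (global). The sketch below shows only this partitioning step; the window size and shapes are illustrative, and this is not the authors' implementation.

# Token partitions behind blocked local and dilated (grid) global attention.
# Shapes and window size are illustrative.
import torch

def block_partition(x, p):
    # (B, H, W, C) -> (B * num_windows, p*p, C): attention inside each
    # contiguous p x p window is local and blocked.
    B, H, W, C = x.shape
    x = x.reshape(B, H // p, p, W // p, p, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)

def grid_partition(x, g):
    # (B, H, W, C) -> (B * num_groups, g*g, C): each group's tokens are
    # spaced H//g apart, so attention inside it spans the whole map sparsely.
    B, H, W, C = x.shape
    x = x.reshape(B, g, H // g, g, W // g, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)

x = torch.randn(2, 8, 8, 32)
print(block_partition(x, 4).shape)  # torch.Size([8, 16, 32]) -- 4 windows per image
print(grid_partition(x, 4).shape)   # torch.Size([8, 16, 32]) -- 4 sparse grids per image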

CoAtNet: Marrying Convolution and Attention for All Data Sizes

TLDR
CoAtNets (pronounced “coat” nets) are a family of hybrid models built from two key insights: that depthwise convolution and self-attention can be naturally unified via simple relative attention, and that vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity, and efficiency.

Scaling Vision with Sparse Mixture of Experts

TLDR
This work presents a Vision MoE, a sparse version of the Vision Transformer that is scalable and competitive with the largest dense networks, and proposes an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute.
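
The sparsity here comes from routing each token to only a few expert MLPs. As a rough sketch of that idea (plain top-k gating, without capacity limits or the batch-prioritized routing extension the summary mentions), the snippet below scores experts per token, keeps the top k, and mixes the selected experts' outputs with renormalized gate weights; all names and sizes are illustrative.

# Minimal top-k expert routing sketch; no capacity limits or batch
# prioritization. Sizes are illustrative.
import torch
import torch.nn.functional as F

def moe_layer(tokens, router, experts, k=2):
    # tokens: (N, dim); router: Linear(dim, num_experts)
    logits = router(tokens)                        # (N, E)
    weights, idx = logits.topk(k, dim=-1)          # keep k experts per token
    weights = F.softmax(weights, dim=-1)           # renormalize over the chosen k
    out = torch.zeros_like(tokens)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e               # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * expert(tokens[mask])
    return out

dim, num_experts = 64, 4
tokens = torch.randn(10, dim)
router = torch.nn.Linear(dim, num_experts)
experts = [torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.GELU(),
                               torch.nn.Linear(dim, dim)) for _ in range(num_experts)]
print(moe_layer(tokens, router, experts).shape)    # torch.Size([10, 64])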

ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond

TLDR
ViTAE, a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, has the intrinsic locality IB, learns local features and global dependencies collaboratively, and is scaled up to 644M parameters to obtain state-of-the-art classification performance.
...

References

Showing 1–10 of 53 references

Training data-efficient image transformers & distillation through attention

TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
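
The distillation token is an extra learned token appended alongside the class token; its output is supervised by the teacher's prediction while the class token is supervised by the ground-truth label. A minimal sketch of that wiring (hard-label variant; the encoder, heads, and sizes below are illustrative stand-ins, not the authors' code):

# Sketch: class token + distillation token prepended to patch embeddings,
# two heads, combined hard-distillation loss. All modules are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledViTSketch(nn.Module):
    def __init__(self, dim=192, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=3, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, num_classes)       # supervised by labels
        self.head_dist = nn.Linear(dim, num_classes)  # supervised by the teacher

    def forward(self, patch_tokens):                  # patch_tokens: (B, N, dim)
        B = patch_tokens.shape[0]
        tokens = torch.cat([self.cls_token.expand(B, -1, -1),
                            self.dist_token.expand(B, -1, -1),
                            patch_tokens], dim=1)
        x = self.encoder(tokens)
        return self.head(x[:, 0]), self.head_dist(x[:, 1])

def distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    # Half the loss comes from the label, half from the teacher's hard prediction.
    return 0.5 * F.cross_entropy(cls_logits, labels) + \
           0.5 * F.cross_entropy(dist_logits, teacher_logits.argmax(dim=-1))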

Going deeper with Image Transformers

TLDR
This work builds and optimizes deeper transformer networks for image classification and investigates the interplay of architecture and optimization in such dedicated transformers, making two architecture changes that significantly improve the accuracy of deep transformers.

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

TLDR
The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.
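
A key ingredient that makes such a pyramid affordable is reducing the spatial size of keys and values before attention, so each attention layer's cost drops by the square of the reduction ratio. The sketch below shows that spatial-reduction idea in isolation; module names and sizes are illustrative, not the authors' code.

# Spatial-reduction attention sketch: keys/values are downsampled with a
# strided conv before multi-head attention; queries keep full resolution.
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    def __init__(self, dim=64, heads=2, reduction=4):
        super().__init__()
        self.reduce = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, h, w):                          # x: (B, h*w, dim)
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, h, w)
        kv = self.reduce(kv).flatten(2).transpose(1, 2)  # (B, h*w / R**2, C)
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)
        return out

sra = SpatialReductionAttention()
print(sra(torch.randn(2, 16 * 16, 64), 16, 16).shape)    # torch.Size([2, 256, 64])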

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

TLDR
A new Tokens-to-Token Vision Transformer (T2T-ViT), which incorporates an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after an empirical study, and reduces the parameter count and MACs of vanilla ViT by half.
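
The tokens-to-token step restructures the token sequence into a 2D map, gathers overlapping neighborhoods ("soft split"), and re-embeds each neighborhood as a single new token, progressively shrinking the sequence. A minimal sketch of one such step, with illustrative shapes and a throwaway projection (not the authors' implementation):

# One tokens-to-token step: fold tokens back to a 2D map, gather overlapping
# patches with unfold, re-embed each patch as a new token. Illustrative only.
import torch
import torch.nn as nn

def t2t_step(tokens, h, w, dim_out, kernel=3, stride=2, padding=1):
    B, N, C = tokens.shape                          # N == h * w
    fmap = tokens.transpose(1, 2).reshape(B, C, h, w)
    patches = nn.functional.unfold(fmap, kernel_size=kernel,
                                   stride=stride, padding=padding)
    patches = patches.transpose(1, 2)               # (B, new_N, C * kernel**2)
    proj = nn.Linear(C * kernel * kernel, dim_out)  # fresh projection, demo only
    return proj(patches)

tokens = torch.randn(2, 16 * 16, 64)
print(t2t_step(tokens, 16, 16, dim_out=96).shape)   # torch.Size([2, 64, 96])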

CoAtNet: Marrying Convolution and Attention for All Data Sizes

TLDR
CoAtNets (pronounced “coat” nets) are a family of hybrid models built from two key insights: that depthwise convolution and self-attention can be naturally unified via simple relative attention, and that vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity, and efficiency.

Emerging Properties in Self-Supervised Vision Transformers

TLDR
This paper questions whether self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets), and implements DINO, a form of self-distillation with no labels, highlighting the synergy between DINO and ViTs.
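
Stripped to its core, the self-distillation objective makes a student network match the teacher's centered, sharpened output distribution on a different view of the same image, while the teacher is only ever updated as an exponential moving average of the student. A compressed sketch of those two pieces (temperatures, momentum, and the centering buffer are illustrative; this is not the authors' code):

# DINO-style self-distillation pieces: cross-view matching loss and the
# EMA teacher update. No labels are involved. Values are illustrative.
import torch
import torch.nn.functional as F

def self_distillation_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    # center is a running mean of teacher outputs, maintained elsewhere,
    # that helps prevent collapse to a single dimension.
    teacher_probs = F.softmax((teacher_out - center) / t_t, dim=-1).detach()
    student_logprobs = F.log_softmax(student_out / t_s, dim=-1)
    return -(teacher_probs * student_logprobs).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # The teacher never receives gradients; it trails the student.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)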

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

TLDR
A new scaling method is proposed that uniformly scales all dimensions of depth, width, and resolution using a simple yet highly effective compound coefficient, and its effectiveness is demonstrated by scaling up MobileNets and ResNet.
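
The compound coefficient ties the three scaling dimensions together: depth grows as α^φ, width as β^φ, and resolution as γ^φ, with the constants chosen so that α·β²·γ² ≈ 2 and each unit increase of φ roughly doubles FLOPs. The α, β, γ values below are the ones reported in the paper; the baseline depth, width, and resolution are placeholders.

# Compound scaling sketch: one coefficient phi scales depth, width, and
# resolution together. Baseline numbers are placeholders.
alpha, beta, gamma = 1.2, 1.1, 1.15     # alpha * beta**2 * gamma**2 ~= 2

def compound_scale(base_depth, base_width, base_resolution, phi):
    # FLOPs scale linearly with depth and quadratically with width and
    # resolution, so each +1 in phi roughly doubles FLOPs.
    return (round(base_depth * alpha ** phi),
            round(base_width * beta ** phi),
            round(base_resolution * gamma ** phi))

for phi in range(4):
    print(phi, compound_scale(16, 64, 224, phi))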

Bottleneck Transformers for Visual Recognition

TLDR
BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection, and instance segmentation, is presented, together with a simple adaptation of the BoTNet design for image classification.

Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks

TLDR
This work presents an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set, and reduces the computation time of self-attention from quadratic to linear in the number of elements in the set.
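
The quadratic-to-linear reduction comes from attending through a small set of m learned inducing points instead of over all n elements directly, so each attention costs O(n·m) rather than O(n²). A compressed sketch of such an induced attention block, without the residual connections and feed-forward layers of the full design (sizes illustrative, not the reference implementation):

# Induced set attention sketch: n elements attend to m inducing points and
# back, giving O(n * m) cost instead of O(n**2).
import torch
import torch.nn as nn

class InducedSetAttentionSketch(nn.Module):
    def __init__(self, dim=64, heads=4, num_inducing=16):
        super().__init__()
        self.inducing = nn.Parameter(torch.randn(1, num_inducing, dim))
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, n, dim), n can be large
        B = x.shape[0]
        inducing = self.inducing.expand(B, -1, -1)
        summary, _ = self.attn1(inducing, x, x)  # m queries over n elements
        out, _ = self.attn2(x, summary, summary) # n queries over m summaries
        return out

block = InducedSetAttentionSketch()
print(block(torch.randn(2, 1000, 64)).shape)     # torch.Size([2, 1000, 64])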

Scaling Local Self-Attention for Parameter Efficient Visual Backbones

TLDR
A new family of self-attention models, HaloNets, is developed that reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark, and preliminary transfer learning experiments find that HaloNet models outperform much larger models and have better inference performance.
...