Corpus ID: 235485156

How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

@article{Steiner2021HowTT,
  title={How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers},
  author={Andreas Steiner and Alexander Kolesnikov and Xiaohua Zhai and Ross Wightman and Jakob Uszkoreit and Lucas Beyer},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.10270}
}
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer’s weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation (“AugReg” for short) when training on smaller training datasets. We conduct a systematic empirical study…
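The AugReg recipe studied in the paper combines data augmentation (RandAugment, Mixup) with model regularization (dropout, stochastic depth). As a rough illustration only, the sketch below wires up such a configuration with the timm library; the model name and hyperparameter values are illustrative placeholders, not the paper's recommended settings.

```python
# A minimal sketch of an AugReg-style setup using the timm library (assumed
# installed); the values below are illustrative, not the paper's exact settings.
import timm
from timm.data import create_transform, Mixup

# ViT with regularization: dropout and stochastic depth ("drop path").
model = timm.create_model(
    "vit_small_patch16_224",
    pretrained=False,
    num_classes=1000,
    drop_rate=0.1,       # dropout
    drop_path_rate=0.1,  # stochastic depth
)

# Data augmentation: RandAugment inside the input pipeline ...
train_transform = create_transform(
    input_size=224,
    is_training=True,
    auto_augment="rand-m9-mstd0.5",  # RandAugment policy string
    interpolation="bicubic",
)

# ... plus Mixup/CutMix applied per batch during training:
mixup_fn = Mixup(mixup_alpha=0.2, cutmix_alpha=1.0, num_classes=1000)
# images, targets = mixup_fn(images, targets)  # inside the training loop
```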

Segmenter: Transformer for Semantic Segmentation
TLDR
This paper introduces Segmenter, a transformer model for semantic segmentation that outperforms the state of the art on the challenging ADE20K dataset and performs on par on Pascal Context and Cityscapes.
Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models
TLDR
It is shown that ViT outperforms state-of-the-art Convolutional Neural Networks (CNN) when using a small number of microstate images from the Ising model corresponding to various boundary conditions and temperatures.
Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs
TLDR
Overall, it is found that the two architectures, especially ViT, are more robust than their CNN counterparts, and frequency analysis suggests that the most robust ViT architectures tend to rely more on low-frequency features compared with CNNs.
Exploring the Limits of Out-of-Distribution Detection
TLDR
It is demonstrated that large-scale pre-trained transformers can significantly improve the state-of-the-art (SOTA) on a range of near OOD tasks across different data modalities, and a new way of using just the names of outlier classes as a sole source of information without any accompanying images is explored.
The Brownian motion in the transformer model
TLDR
A deep analysis of the transformer's multi-head self-attention (MHSA) module is given, and it is found that each token is a random variable in high-dimensional feature space and that, after layer normalization, these variables are mapped to points on the hypersphere.
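The hypersphere observation follows from layer normalization itself: with the learnable affine parameters disabled, every normalized token ends up with (approximately) the same Euclidean norm, √d. A quick numerical check, written here as an illustrative PyTorch snippet rather than code from the cited paper:

```python
import torch

torch.manual_seed(0)
d = 768
tokens = torch.randn(4, d)  # 4 tokens in d-dimensional feature space
ln = torch.nn.LayerNorm(d, elementwise_affine=False)
normed = ln(tokens)
print(normed.norm(dim=-1))  # each ~= sqrt(768) ~= 27.7: points on a hypersphere
```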
The Benchmark Lottery
TLDR
The notion of a benchmark lottery that describes the overall fragility of the ML benchmarking process is proposed, and it is argued that this might lead to biased progress in the community.
TRANSFORMER, AND MLP
  • 2021
Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation
  • Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang
  • Computer Science
  • 2021
We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find…
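The "sequence of image patches" view that these robustness analyses rely on can be made concrete with a few lines of tensor reshaping; the snippet below is a generic illustration of ViT-style patch extraction, not code from the cited work.

```python
import torch

def image_to_patches(x, patch_size=16):
    # x: (B, C, H, W) -> (B, num_patches, C * patch_size * patch_size)
    B, C, H, W = x.shape
    patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)
    return patches

x = torch.randn(2, 3, 224, 224)
print(image_to_patches(x).shape)  # torch.Size([2, 196, 768]): 196 patch tokens
```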

References

SHOWING 1-10 OF 43 REFERENCES
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
CvT: Introducing Convolutions to Vision Transformers
TLDR
A new architecture, named Convolutional vision Transformer (CvT), is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs.
CoAtNet: Marrying Convolution and Attention for All Data Sizes
TLDR
CoAtNets (pronounced “coat” nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention and (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency.
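A hedged sketch of the "simple relative attention" idea: input-dependent attention logits are combined with an input-independent, learned relative-position term, which is what lets one operator behave like both self-attention and a static (depthwise-)convolution kernel. The function below is an illustrative simplification, not CoAtNet's actual implementation.

```python
import torch
import torch.nn.functional as F

def relative_attention(x, rel_bias):
    """x: (L, d) token features; rel_bias: (L, L) learned relative-position term.

    logits = <x_i, x_j> / sqrt(d)  (input-dependent, like self-attention)
           + rel_bias[i, j]        (input-independent, like a static conv kernel)
    """
    d = x.shape[-1]
    logits = (x @ x.t()) / d ** 0.5 + rel_bias
    return F.softmax(logits, dim=-1) @ x

x = torch.randn(196, 64)
rel_bias = torch.zeros(196, 196, requires_grad=True)  # would be learned in practice
out = relative_attention(x, rel_bias)                  # (196, 64)
```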
Do Better ImageNet Models Transfer Better?
TLDR
It is found that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy, and ImageNet features are less general than previously suggested.
Fixing the train-test resolution discrepancy
TLDR
It is experimentally validated that, for a target test resolution, using a lower train resolution offers better classification at test time, and a simple yet effective and efficient strategy to optimize the classifier performance when the train and test resolutions differ is proposed.
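The train/test resolution idea can be sketched with ordinary torchvision transforms: train with random resized crops at a resolution below the intended test resolution, then evaluate (and, in FixRes, briefly fine-tune the final layers) at the larger test resolution. The specific resolutions below are illustrative, not the paper's settings.

```python
from torchvision import transforms

train_res, test_res = 160, 224  # train below the target test resolution

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(train_res),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

test_tf = transforms.Compose([
    transforms.Resize(int(test_res * 1.15)),
    transforms.CenterCrop(test_res),
    transforms.ToTensor(),
])
# In FixRes, the classifier / normalization layers are then briefly
# fine-tuned at test_res to close the remaining train-test gap.
```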
Emerging Properties in Self-Supervised Vision Transformers
TLDR
This paper questions whether self-supervised learning provides Vision Transformer (ViT) with new properties that stand out compared to convolutional networks (convnets), and introduces DINO, a simple self-supervised method that can be interpreted as a form of self-distillation with no labels.
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
TLDR
It is found that performance on vision tasks increases logarithmically with the volume of training data, and it is shown that representation learning (or pre-training) still holds a lot of promise.
A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual…
Unsupervised Data Augmentation for Consistency Training
TLDR
A new perspective on how to effectively noise unlabeled examples is presented, and it is argued that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.
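The core mechanism — keeping a model's prediction on a strongly augmented unlabeled example consistent with its prediction on a weakly augmented view — can be written as a small loss term. This is a generic consistency-training sketch under that reading, not the paper's exact objective or hyperparameters.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, unlabeled, weak_aug, strong_aug):
    # Pseudo-targets come from the weakly augmented view (no gradient) ...
    with torch.no_grad():
        target = F.softmax(model(weak_aug(unlabeled)), dim=-1)
    # ... and are matched against predictions on the strongly augmented view.
    log_pred = F.log_softmax(model(strong_aug(unlabeled)), dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```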
Big Transfer (BiT): General Visual Representation Learning
TLDR
By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.