Corpus ID: 235485156

How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

@article{Steiner2021HowTT,
  title={How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers},
  author={Andreas Steiner and Alexander Kolesnikov and Xiaohua Zhai and Ross Wightman and Jakob Uszkoreit and Lucas Beyer},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.10270}
}
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer’s weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation (“AugReg” for short) when training on smaller training datasets. We conduct a systematic empirical study…
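The augmentation and regularization knobs referred to as "AugReg" above can be pictured with a short PyTorch/torchvision sketch. This is an illustrative recipe under assumed settings (RandAugment and Mixup on the data side, dropout and weight decay on the model side), not the paper's actual implementation or hyperparameters; the helpers mixup and train_step are placeholder names.

import torch
import torch.nn.functional as F
from torchvision import transforms
from torchvision.models import vit_b_16

# Data-side "Aug": random crop/flip plus RandAugment; strength is one of the knobs to sweep.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=10),
    transforms.ToTensor(),
])

# Model-side "Reg": dropout inside the ViT plus decoupled weight decay in the optimizer.
model = vit_b_16(weights=None, dropout=0.1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

def mixup(x, y, num_classes, alpha=0.2):
    # Convex combination of two examples and of their one-hot labels.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    y_one_hot = F.one_hot(y, num_classes).float()
    return lam * x + (1 - lam) * x[perm], lam * y_one_hot + (1 - lam) * y_one_hot[perm]

def train_step(x, y, num_classes=1000):
    x, y_soft = mixup(x, y, num_classes)
    loss = torch.sum(-y_soft * F.log_softmax(model(x), dim=-1), dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()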

Citations

Discrete Representations Strengthen Vision Transformer Robustness
TLDR
Adding discrete tokens produced by a vector-quantized encoder to ViT’s input layer strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet.
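One way to read the summary above, sketched in PyTorch, is a patch-embedding layer that concatenates the usual pixel-patch projection with an embedding of the discrete codebook indices produced by a frozen vector-quantized encoder. The module name and split of dimensions are placeholders, not the paper's code.

import torch
import torch.nn as nn

class DiscreteAugmentedPatchEmbed(nn.Module):
    # Fuse continuous pixel-patch embeddings with embeddings of discrete VQ codes.
    def __init__(self, patch_dim, codebook_size, dim):
        super().__init__()
        self.pixel_proj = nn.Linear(patch_dim, dim // 2)
        self.code_embed = nn.Embedding(codebook_size, dim // 2)

    def forward(self, patches, codes):
        # patches: (B, N, patch_dim) flattened pixel patches
        # codes:   (B, N) integer indices from a frozen, pretrained VQ encoder
        return torch.cat([self.pixel_proj(patches), self.code_embed(codes)], dim=-1)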
Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding
TLDR
This work proposes PreLayerNorm, a modified patch embedding structure that ensures scale-invariant behavior of ViT and shows improved robustness under various corruptions, including contrast-varying environments.
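The summary leaves the exact structure open; one plausible sketch (my reading, not necessarily the paper's design) applies LayerNorm to each flattened pixel patch before the linear patch projection, so the embedding becomes insensitive to global contrast or scale changes of the input.

import torch.nn as nn

class PreLNPatchEmbed(nn.Module):
    def __init__(self, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        patch_dim = patch_size * patch_size * in_chans
        self.unfold = nn.Unfold(kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(patch_dim)        # normalization *before* the projection
        self.proj = nn.Linear(patch_dim, dim)

    def forward(self, x):                          # x: (B, C, H, W)
        patches = self.unfold(x).transpose(1, 2)   # (B, N, patch_dim)
        return self.proj(self.norm(patches))       # (B, N, dim)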
Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation
TLDR
It is shown that patch-based negative augmentation consistently improves the robustness of ViTs across a wide set of ImageNet-based robustness benchmarks, is complementary to traditional (positive) data augmentation, and that the two together boost performance further.
Segmenter: Transformer for Semantic Segmentation
TLDR
This paper introduces Segmenter, a transformer model for semantic segmentation that outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs
TLDR
Overall, it is found that the two architectures, especially ViT, are more robust than their CNN counterparts, and frequency analysis suggests that the most robust ViT architectures tend to rely more on low-frequency features compared with CNNs.
Vision Transformer for Classification of Breast Ultrasound Images
TLDR
ViT models are comparable to, or even better than, state-of-the-art CNNs in the classification of breast ultrasound (US) images, with performance compared across different augmentation strategies.
Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints
  Jaesin Ahn, Jiuk Hong, Jeongwoo Ju, Heechul Jung. ArXiv, 2021.
TLDR
Three types of structures for Q, K, and V embedding are proposed, demonstrating superior image classification performance in experiments compared to several state-of-the-art approaches.
Exploring the Limits of Out-of-Distribution Detection
TLDR
It is demonstrated that large-scale pre-trained transformers can significantly improve the state-of-the-art (SOTA) on a range of near OOD tasks across different data modalities, and a new way of using just the names of outlier classes as a sole source of information without any accompanying images is explored.
LiT: Zero-Shot Transfer with Locked-image Text Tuning
This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we…
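A minimal sketch of the locked-image contrastive tuning described above, assuming generic image_encoder/text_encoder callables that return fixed-size embeddings (the names, temperature, and symmetric loss weighting are placeholders, not the paper's code):

import torch
import torch.nn.functional as F

def contrastive_tuning_step(image_encoder, text_encoder, images, texts, optimizer,
                            temperature=0.07):
    with torch.no_grad():                       # the image tower is "locked" (frozen)
        img = F.normalize(image_encoder(images), dim=-1)
    txt = F.normalize(text_encoder(texts), dim=-1)

    logits = img @ txt.t() / temperature        # (B, B) image-text similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss = 0.5 * (F.cross_entropy(logits, targets) +       # image -> text
                  F.cross_entropy(logits.t(), targets))    # text -> image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # optimizer holds only text-tower parameters
    return loss.item()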
MEMO: Test Time Robustness via Adaptation and Augmentation
TLDR
This work proposes a simple approach that can be used in any test setting where the model is probabilistic and adaptable: when presented with a test example, perform different data augmentations on the data point, and adapt (all of) the model parameters by minimizing the entropy of the model’s average, or marginal, output distribution across the augmentations.
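A rough PyTorch sketch of that test-time step, under assumed augmentations and hyperparameters (the function name memo_style_adapt and the choice of crops/flips are mine, not the paper's code): augment one test input several times, average the predicted distributions, and take a gradient step that lowers the entropy of that marginal before predicting.

import torch
import torch.nn.functional as F
from torchvision import transforms
from torchvision.transforms import functional as TF

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
])

def memo_style_adapt(model, x, optimizer, n_aug=8):
    # x: a single test image tensor of shape (3, H, W)
    views = torch.stack([augment(x) for _ in range(n_aug)])      # (n_aug, 3, 224, 224)
    marginal = F.softmax(model(views), dim=-1).mean(dim=0)       # average over augmentations
    entropy = -(marginal * marginal.clamp_min(1e-12).log()).sum()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()                                             # adapt the model parameters
    with torch.no_grad():
        center = TF.resize(x, [224, 224]).unsqueeze(0)
        return model(center).argmax(dim=-1)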

References

SHOWING 1-10 OF 43 REFERENCES
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
CvT: Introducing Convolutions to Vision Transformers
TLDR
A new architecture, named Convolutional vision Transformer (CvT), is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs.
CoAtNet: Marrying Convolution and Attention for All Data Sizes
TLDR
CoAtNets (pronounced “coat” nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention and (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency.
Do Better ImageNet Models Transfer Better?
TLDR
It is found that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy, and ImageNet features are less general than previously suggested.
Fixing the train-test resolution discrepancy
TLDR
It is experimentally validated that, for a target test resolution, using a lower train resolution offers better classification at test time, and a simple yet effective and efficient strategy is proposed to optimize classifier performance when the train and test resolutions differ.
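A sketch of that strategy under assumed resolutions, with a torchvision ResNet as the example backbone (the values and the head-only fine-tuning choice are illustrative, not the paper's exact procedure): train at a lower resolution, then fine-tune the classifier at the larger test resolution so it sees objects at test-time apparent scale.

import torch
from torchvision import transforms
from torchvision.models import resnet50

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(160),         # lower train resolution
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),                # larger test resolution
    transforms.ToTensor(),
])

model = resnet50(weights=None)
# ... train the full network with train_tf at 160 px as usual ...

# Then fine-tune at the test resolution: freeze the backbone, retrain the final classifier.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
head_opt = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)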
Emerging Properties in Self-Supervised Vision Transformers
TLDR
This paper questions whether self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets), and proposes DINO, a simple self-supervised method that can be interpreted as a form of self-distillation with no labels.
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
TLDR
It is found that performance on vision tasks increases logarithmically with the volume of training data, and it is shown that representation learning (or pre-training) still holds a lot of promise.
A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual…
Unsupervised Data Augmentation for Consistency Training
TLDR
A new perspective on how to effectively noise unlabeled examples is presented and it is argued that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.
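The objective described above can be sketched as a supervised loss plus a KL consistency term between the (stop-gradient) prediction on an unlabelled example and the prediction on its strongly augmented version; the function name and the weight lam are placeholders, and the advanced augmentation itself (e.g. RandAugment) is assumed to be applied upstream.

import torch
import torch.nn.functional as F

def uda_style_loss(model, x_lab, y_lab, x_unlab, x_unlab_strong, lam=1.0):
    sup = F.cross_entropy(model(x_lab), y_lab)              # supervised term
    with torch.no_grad():                                   # fixed target, no gradient
        target = F.softmax(model(x_unlab), dim=-1)
    log_pred = F.log_softmax(model(x_unlab_strong), dim=-1)
    consistency = F.kl_div(log_pred, target, reduction="batchmean")
    return sup + lam * consistency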
Big Transfer (BiT): General Visual Representation Learning
TLDR
By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.