Corpus ID: 234742221

Vision Transformers are Robust Learners

@article{Paul2021VisionTA,
  title={Vision Transformers are Robust Learners},
  author={Sayak Paul and Pin-Yu Chen},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.07581}
}
Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision that achieve state-of-the-art (SOTA) standard accuracy with better parameter efficiency. Since self-attention helps a model systematically align the different components present inside the input data, it is natural to investigate its performance under model robustness benchmarks. In this… 
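The abstract's claim that self-attention lets a model globally align different components of the input can be made concrete with a minimal sketch of single-head scaled dot-product self-attention over image-patch embeddings. The shapes, random weights, and single-head setup below are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of single-head scaled dot-product self-attention (NumPy).
# Each token's output is a weighted mix of ALL tokens, so every patch can
# attend to every other patch in a single layer. Shapes are illustrative.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, W_q, W_k, W_v):
    """tokens: (n_tokens, d_model); W_q/W_k/W_v: (d_model, d_head)."""
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise token affinities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v                        # globally mix token values

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 64))           # e.g. 16 image-patch embeddings
W_q, W_k, W_v = (rng.normal(size=(64, 32)) for _ in range(3))
out = self_attention(patches, W_q, W_k, W_v)
print(out.shape)                              # (16, 32)
```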
Citations

Towards Robust Vision Transformer
TLDR
The Robust Vision Transformer (RVT) is proposed, a new vision transformer with superior performance, strong robustness, and better generalization ability compared with previous ViTs and state-of-the-art CNNs.
Intriguing Properties of Vision Transformers
TLDR
The effective features of ViTs are shown to stem from the flexible and dynamic receptive fields made possible by self-attention, leading to high accuracy across a range of classification datasets in both traditional and few-shot learning paradigms.
On the Adversarial Robustness of Vision Transformers
TLDR
This work provides a comprehensive study of the robustness of vision transformers (ViTs) against adversarial perturbations and suggests that convolutional or tokens-to-token blocks for learning low-level features in ViTs can improve classification accuracy, but at the cost of adversarial robustness.
A Comprehensive Study of Vision Transformers on Dense Prediction Tasks
TLDR
The extensive empirical results show that the features generated by VTs are more robust to distribution shifts, natural corruptions, and adversarial attacks in both tasks, whereas CNNs perform better at higher image resolutions in object detection.
Are Transformers More Robust Than CNNs?
TLDR
This paper challenges the previous belief that Transformers outshine CNNs in adversarial robustness, and suggests that CNNs can easily be as robust as Transformers at defending against adversarial attacks if they properly adopt Transformers' training recipes.
Deeper Insights into ViTs Robustness towards Common Corruptions
TLDR
This paper investigates how CNN-like architectural designs and CNN-based data augmentation strategies affect ViTs' robustness to common corruptions through extensive and rigorous benchmarking, and introduces a novel conditional method enabling input-varied augmentations from two angles.
On Improving Adversarial Transferability of Vision Transformers
TLDR
This work enhances the transferability of existing attacks by introducing two novel strategies specific to the architecture of ViT models, including a method that finds multiple discriminative pathways by dissecting a single ViT model into an ensemble of networks.
Can CNNs Be More Robust Than Transformers?
TLDR
This paper examines the design of Transformers to build pure CNN architectures, without any attention-like operations, that are as robust as, or even more robust than, Transformers, and develops three highly effective architecture designs for boosting robustness.
Are Vision Transformers Robust to Patch Perturbations?
TLDR
Surprisingly, it is found that vision transformers are more robust to naturally corrupted patches than CNNs, whereas they are more vulnerable to adversarial patches.
Delving Deep into the Generalization of Vision Transformers under Distribution Shifts
TLDR
This work comprehensively studies the out-of-distribution (OOD) generalization of ViTs and designs Generalization-Enhanced ViTs (GE-ViTs) with a smoother learning strategy to achieve a stable training process and obtain performance improvements on OOD data.
...

References

SHOWING 1-10 OF 73 REFERENCES
On the Adversarial Robustness of Vision Transformers
TLDR
This work provides the first comprehensive study of the robustness of vision transformers (ViTs) against adversarial perturbations and finds that ViTs possess better adversarial robustness than convolutional neural networks (CNNs).
When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations
TLDR
By promoting smoothness with a recently proposed sharpness-aware optimizer, this paper substantially improves the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning.
Understanding Robustness of Transformers for Image Classification
TLDR
It is found that when pre-trained with a sufficient amount of data, ViT models are at least as robust as their ResNet counterparts on a broad range of perturbations, and that Transformers are robust to the removal of almost any single layer.
On the Robustness of Vision Transformers to Adversarial Examples
TLDR
This paper studies the robustness of Vision Transformers to adversarial examples, and shows that an ensemble can achieve unprecedented robustness without sacrificing clean accuracy under a black-box adversary.
How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
TLDR
A systematic empirical study finds that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data.
Emerging Properties in Self-Supervised Vision Transformers
TLDR
This paper asks whether self-supervised learning provides Vision Transformers (ViTs) with new properties that stand out compared to convolutional networks (convnets), and introduces DINO, a form of self-distillation with no labels that highlights the synergy between self-supervision and ViTs.
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks
TLDR
This work thoroughly studies three key components of SRGAN – network architecture, adversarial loss, and perceptual loss – and improves each of them to derive an Enhanced SRGAN (ESRGAN), which achieves consistently better visual quality with more realistic and natural textures than SRGAN.
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
TLDR
A new Tokens-To-Token Vision Transformer (T2T-VTT), which incorporates an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study and reduces the parameter count and MACs of vanilla ViT by half.
Improved Regularization of Convolutional Neural Networks with Cutout
TLDR
This paper shows that the simple regularization technique of randomly masking out square regions of input during training, which is called cutout, can be used to improve the robustness and overall performance of convolutional neural networks.
...