Corpus ID: 235422522

Delving Deep into the Generalization of Vision Transformers under Distribution Shifts

@article{Zhang2021DelvingDI,
  title={Delving Deep into the Generalization of Vision Transformers under Distribution Shifts},
  author={Chongzhi Zhang and Mingyuan Zhang and Shanghang Zhang and Daisheng Jin and Qiang-feng Zhou and Zhongang Cai and Haiyu Zhao and Shuai Yi and Xianglong Liu and Ziwei Liu},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.07617}
}
Vision Transformers (ViTs) have achieved impressive performance on various vision tasks, yet their generalization under distribution shifts (DS) is rarely understood. In this work, we comprehensively study the out-of-distribution (OOD) generalization of ViTs. For systematic investigation, we first present a taxonomy of DS. We then perform extensive evaluations of ViT variants under different DS and compare their generalization with Convolutional Neural Network (CNN) models. Important… 
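To illustrate the kind of comparison the abstract describes (evaluating ViT variants against CNNs on out-of-distribution data), here is a minimal sketch, not the authors' code: it assumes timm-pretrained checkpoints and a hypothetical `ood_val/` ImageFolder-style directory holding a distribution-shifted validation split (e.g. corrupted or stylized images with ImageNet labels).

```python
# Hypothetical sketch: compare a pretrained ViT and a CNN on an OOD validation folder.
import torch
import timm
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

def evaluate(model, loader, device):
    """Top-1 accuracy of `model` over `loader`."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

device = "cuda" if torch.cuda.is_available() else "cpu"

# Standard ImageNet preprocessing; for simplicity both models are assumed to share it.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "ood_val" is a placeholder path to a distribution-shifted split organized by class folder.
loader = DataLoader(datasets.ImageFolder("ood_val", transform=preprocess),
                    batch_size=64, num_workers=4)

for name in ["vit_base_patch16_224", "resnet50"]:
    model = timm.create_model(name, pretrained=True).to(device)
    print(f"{name}: top-1 = {evaluate(model, loader, device):.3f}")
```

Repeating this loop over several shift types (corruption, style, background, texture) would mirror the taxonomy-driven evaluation described above; the model names and data path here are illustrative placeholders only.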
$\textrm{D}^3\textrm{Former}$: Debiased Dual Distilled Transformer for Incremental Learning
TLDR
A Debiased Dual Distilled Transformer for CIL dubbed $\textrm{D}^3\textrm{Former}$ is developed, which leverages a hybrid nested ViT design to ensure data efficiency and scalability to small as well as large datasets.
Gun identification from gunshot audios for secure public places using transformer learning
TLDR
This research focuses on gun-type (rifle, handgun, none) detection based on the audio of its shot, comparing both convolution-based and fully self-attention-based (transformer) architectures.
Self-Distilled Vision Transformer for Domain Generalization
TLDR
Inspired by the modular architecture of ViTs, a simple DG approach for ViTs is proposed, coined as self-distillation for ViTs, which reduces overfitting to source domains by easing the learning of the input-output mapping problem through curating non-zero entropy supervisory signals for intermediate transformer blocks.
INDIGO: Intrinsic Multimodality for Domain Generalization
TLDR
This work proposes IntriNsic multimodality for DomaIn GeneralizatiOn (INDIGO), a simple and elegant way of leveraging the intrinsic modality present in pre-trained multimodal networks along with the visual modality to enhance generalization to unseen domains at test time.
Toward Real-world Single Image Deraining: A New Benchmark and Beyond
TLDR
The experimental results show the differences among representative methods in image restoration performance and model complexity, validate the significance of the proposed datasets for model generalization, provide useful insights on the superiority of learning from diverse domains, and shed light on future research on real-world SID.
Can CNNs Be More Robust Than Transformers?
TLDR
This paper examines the design of Transformers to build pure CNN architectures, without any attention-like operations, that are as robust as, or even more robust than, Transformers, and develops three highly effective architecture designs for boosting robustness.
MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers
TLDR
This paper investigates backbone networks for improving the generalization of monocular depth estimation, and designs a CNN-Transformer hybrid network with a multi-level adaptive feature fusion module, called MonoFormer, which achieves state-of-the-art performance on various public datasets.
Are Vision Transformers Robust to Spurious Correlations?
TLDR
This study reveals that when pre-trained on a sufficiently large dataset, ViT models are more robust to spurious correlations than CNNs, and examines the role of the self-attention mechanism in providing robustness under spuriously correlated environments.
Domain generalization in deep learning-based mass detection in mammography: A large-scale multi-center study
TLDR
A single-source mass detection training pipeline is designed to improve the domain generalization of deep learning methods for mass detection in digital mammography and to analyze in depth the sources of domain shift in a large-scale multi-center setting.
Transformers in Medical Imaging: A Survey
TLDR
This survey reviews the use of Transformers in medical image segmentation, detection, classification, reconstruction, synthesis, registration, clinical report generation, and other tasks, and develops a taxonomy for each application.
...

References

SHOWING 1-10 OF 46 REFERENCES
Moment Matching for Multi-Source Domain Adaptation
TLDR
A new deep learning approach, Moment Matching for Multi-Source Domain Adaptation (M3SDA), which aims to transfer knowledge learned from multiple labeled source domains to an unlabeled target domain by dynamically aligning moments of their feature distributions.
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
TLDR
This paper standardizes and expands the corruption robustness topic, while showing which classifiers are preferable in safety-critical applications, and proposes a new dataset called ImageNet-P which enables researchers to benchmark a classifier's robustness to common perturbations.
Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
Semi-Supervised Domain Adaptation via Minimax Entropy
TLDR
A novel Minimax Entropy (MME) approach that adversarially optimizes an adaptive few-shot model for the semi-supervised domain adaptation (SSDA) setting, establishing a new state of the art for SSDA.
Unsupervised Domain Adaptation by Backpropagation
TLDR
The method performs very well in a series of image classification experiments, achieving a clear adaptation effect in the presence of large domain shifts and outperforming the previous state of the art on the Office datasets.
Are Transformers More Robust Than CNNs?
TLDR
This paper challenges the previous belief that Transformers outshine CNNs when measuring adversarial robustness, and suggests CNNs can easily be as robust as Transformers on defending against adversarial attacks, if they properly adopt Transformers’ training recipes.
Intriguing Properties of Vision Transformers
TLDR
Effective features of ViTs are shown to be due to flexible and dynamic receptive fields possible via self-attention mechanisms, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms.
Vision Transformers are Robust Learners
TLDR
This work uses six diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and state-of-the-art convolutional neural networks (CNNs) such as Big Transfer (Kolesnikov et al. 2020), and presents analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
Emerging Properties in Self-Supervised Vision Transformers
TLDR
This paper questions whether self-supervised learning provides new properties to Vision Transformers (ViT) that stand out compared to convolutional networks (convnets), and implements DINO, a form of self-distillation with no labels, highlighting the synergy between DINO and ViTs.
...