Corpus ID: 232417451

Rethinking Spatial Dimensions of Vision Transformers

@article{Heo2021RethinkingSD,
  title={Rethinking Spatial Dimensions of Vision Transformers},
  author={Byeongho Heo and Sangdoo Yun and Dongyoon Han and Sanghyuk Chun and Junsuk Choe and Seong Joon Oh},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.16302}
}
Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks, serving as an alternative architecture to the existing convolutional neural networks (CNNs). Since the transformer-based architecture is a recent innovation in computer vision modeling, design conventions for an effective architecture have been less studied. Starting from the successful design principles of CNNs, we investigate the role of spatial dimension conversion and its…
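As a rough illustration of the spatial-reduction principle the abstract describes (the paper's PiT model inserts pooling between transformer stages), here is a minimal PyTorch sketch. It assumes a square grid of patch tokens with no class token; the module name and layer choices are ours, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TokenPooling(nn.Module):
    """Illustrative pooling between transformer stages: reshape the
    (N, L, C) token sequence back to a 2-D grid, halve its spatial size
    with a strided depthwise conv, and double the channel dimension,
    mirroring the CNN convention the paper investigates."""
    def __init__(self, channels: int):
        super().__init__()
        # Strided depthwise conv halves H and W; pointwise conv doubles C.
        self.pool = nn.Conv2d(channels, channels, kernel_size=3,
                              stride=2, padding=1, groups=channels)
        self.expand = nn.Conv2d(channels, channels * 2, kernel_size=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        n, l, c = tokens.shape
        h = w = int(l ** 0.5)                # assumes a square token grid
        x = tokens.transpose(1, 2).reshape(n, c, h, w)
        x = self.expand(self.pool(x))        # (N, 2C, H/2, W/2)
        return x.flatten(2).transpose(1, 2)  # back to (N, L/4, 2C)

tokens = torch.randn(2, 196, 64)             # e.g. 14x14 patches, 64 dims
print(TokenPooling(64)(tokens).shape)        # torch.Size([2, 49, 128])
```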

Citations of this paper

Rethinking the Design Principles of Robust Vision Transformer
Recent advances in Vision Transformers (ViT) have shown that self-attention-based networks, which take advantage of long-range dependency modeling ability, surpass traditional convolutional neural…
Searching for Efficient Multi-Stage Vision Transformers
Vision Transformer (ViT) demonstrates that the Transformer for natural language processing can be applied to computer vision tasks and results in performance comparable to convolutional neural networks…
Vision Xformers: Efficient Attention for Image Classification
TLDR
This work modifies the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers of linear complexity, such as Performer, Linformer, and Nyströmformer, creating Vision X-formers (ViX), and shows that all three versions of ViX may be more accurate than ViT for image classification while using far fewer parameters and computational resources.
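As a hedged sketch of what swapping in a linear-complexity attention looks like, here is a Linformer-style layer in PyTorch: the key/value sequences are projected down to a fixed length k, so the attention cost grows as O(L·k) rather than O(L²). The class name and sizes are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class LinformerAttention(nn.Module):
    """Sketch of a Linformer-style attention: project the length-L
    key/value sequences down to a fixed size k, so attention costs
    O(L*k) instead of the quadratic O(L^2)."""
    def __init__(self, seq_len: int, dim: int, k: int = 64):
        super().__init__()
        self.proj_k = nn.Linear(seq_len, k, bias=False)  # acts on the L axis
        self.proj_v = nn.Linear(seq_len, k, bias=False)
        self.scale = dim ** -0.5

    def forward(self, q, k, v):
        k = self.proj_k(k.transpose(1, 2)).transpose(1, 2)  # (N, k, C)
        v = self.proj_v(v.transpose(1, 2)).transpose(1, 2)  # (N, k, C)
        attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=-1)
        return attn @ v                                     # (N, L, C)

q = k = v = torch.randn(2, 4096, 64)   # long sequence from fine patches
print(LinformerAttention(4096, 64)(q, k, v).shape)  # (2, 4096, 64)
```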
Less is More: Pay Less Attention in Vision Transformers
TLDR
A hierarchical Transformer is proposed in which pure multi-layer perceptrons (MLPs) encode rich local patterns in the early stages, while self-attention modules capture longer-range dependencies in the deeper layers.
Towards Robust Vision Transformer
TLDR
This work proposes the Robust Vision Transformer (RVT), a new vision transformer with superior performance and strong robustness, along with two new plug-and-play techniques, position-aware attention scaling and patch-wise augmentation, that further augment RVT into the variant abbreviated RVT∗.
KVT: k-NN Attention for Boosting Vision Transformers
TLDR
A sparse attention scheme, dubbed k-NN attention, naturally inherits the local bias of CNNs without introducing convolutional operations; it allows the exploration of long-range correlations while filtering out irrelevant tokens by choosing the most similar tokens from the entire image.
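A minimal PyTorch sketch of the k-NN attention idea, assuming single-head attention and pre-projected q, k, v; the function name and top-k value are illustrative:

```python
import torch

def knn_attention(q, k, v, top_k=8):
    """Sketch of k-NN attention: compute full attention logits, but keep
    only each query's top-k most similar keys and mask out the rest, so
    irrelevant tokens are filtered before the softmax."""
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale          # (N, L, L)
    kth = logits.topk(top_k, dim=-1).values[..., -1:]   # k-th largest per query
    masked = logits.masked_fill(logits < kth, float('-inf'))
    return masked.softmax(dim=-1) @ v

q = k = v = torch.randn(2, 196, 64)
print(knn_attention(q, k, v).shape)  # torch.Size([2, 196, 64])
```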
A Survey on Vision Transformer
The Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation…
Scaled ReLU Matters for Training Vision Transformers
  • Pichao Wang, Xue Wang, +5 authors Rong Jin
  • Computer Science
  • ArXiv
  • 2021
Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs). However, the training of ViTs is much harder than that of CNNs, as it is sensitive to the training…
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
TLDR
Experiments on ImageNet as well as downstream tasks prove the superiority of ViTAE, which has an intrinsic locality inductive bias and is able to learn local features and global dependencies collaboratively, over the baseline transformer and concurrent works.
Efficient Vision Transformers via Fine-Grained Manifold Distillation
TLDR
This paper proposes to excavate useful information from the teacher transformer through the relationship between images and the divided patches, and explores an efficient fine-grained manifold distillation approach that simultaneously computes cross-image, cross-patch, and randomly-selected manifolds in the teacher and student models.
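The following is only a speculative sketch of a patch-level manifold-matching loss in the spirit of this summary, comparing the normalized patch-to-patch similarity structure of student and teacher features; it is not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def manifold_loss(f_s, f_t):
    """Sketch of relational distillation: match the student's and
    teacher's patch-to-patch similarity structure instead of matching
    raw features directly."""
    def gram(f):                      # f: (N, L, C) patch features
        f = F.normalize(f, dim=-1)
        return f @ f.transpose(1, 2)  # (N, L, L) relation manifold
    return F.mse_loss(gram(f_s), gram(f_t))

print(manifold_loss(torch.randn(2, 196, 384), torch.randn(2, 196, 384)))
```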

References

SHOWING 1-10 OF 46 REFERENCES
Deep Pyramidal Residual Networks
TLDR
This research gradually increases the feature map dimension at all units to involve as many locations as possible, and proposes a novel residual unit capable of further improving classification accuracy with the new network architecture.
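A small sketch of the additive widening schedule this summary alludes to, assuming the PyramidNet-style rule that each of N units adds a fixed fraction of a total widening amount; the function and numbers are illustrative.

```python
# Sketch of an additive widening schedule: instead of doubling channels
# at a few downsampling stages, every residual unit k out of N gets a
# small increment toward a total widening amount.
def pyramid_widths(base=16, total_widening=48, num_units=9):
    return [round(base + total_widening * k / num_units)
            for k in range(num_units + 1)]

print(pyramid_widths())  # [16, 21, 27, 32, 37, 43, 48, 53, 59, 64]
```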
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
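As a hedged PyTorch sketch of the distillation-token idea: a learnable token travels through the encoder alongside the class token, and its output gets its own head (to be supervised by a teacher). Module names and sizes here are ours, not DeiT's code.

```python
import torch
import torch.nn as nn

class TokenPair(nn.Module):
    """Sketch of the distillation-token idea: the class token's head is
    matched to the true label, while the distillation token's head is
    matched to the teacher's predictions."""
    def __init__(self, dim=192, num_classes=10):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, num_classes)       # supervised by labels
        self.head_dist = nn.Linear(dim, num_classes)  # supervised by teacher

    def forward(self, patches):  # patches: (N, L, dim)
        n = patches.shape[0]
        extra = torch.cat([self.cls_token, self.dist_token], 1).expand(n, -1, -1)
        x = self.encoder(torch.cat([extra, patches], dim=1))
        return self.head(x[:, 0]), self.head_dist(x[:, 1])

logits_cls, logits_dist = TokenPair()(torch.randn(2, 196, 192))
print(logits_cls.shape, logits_dist.shape)  # (2, 10) (2, 10)
```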
Going deeper with convolutions
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting using an architecture with very small (3×3) convolution filters, and shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
TLDR
A new scaling method is proposed that uniformly scales all dimensions of depth, width, and resolution using a simple yet highly effective compound coefficient, and its effectiveness is demonstrated by scaling up MobileNets and ResNet.
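A worked toy example of the compound coefficient: one knob φ scales depth, width, and resolution together through fixed base ratios (the α, β, γ values below are the constants reported in the EfficientNet paper; the baseline numbers are made up).

```python
# Compound scaling: a single coefficient phi scales depth, width, and
# resolution together via fixed base ratios (alpha, beta, gamma are the
# grid-searched constants reported for EfficientNet-B0).
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(depth, width, resolution, phi):
    return (round(depth * alpha ** phi),
            round(width * beta ** phi),
            round(resolution * gamma ** phi))

# Scaling a hypothetical baseline (18 layers, 64 channels, 224x224 input)
# by phi = 3:
print(compound_scale(18, 64, 224, 3))  # (31, 85, 341)
```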
Stand-Alone Self-Attention in Vision Models
TLDR
The results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox and is especially impactful when used in later layers.
CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features
TLDR
Patches are cut and pasted among training images, with the ground-truth labels mixed proportionally to the area of the patches; CutMix consistently outperforms state-of-the-art augmentation strategies on CIFAR and ImageNet classification tasks, as well as on the ImageNet weakly-supervised localization task.
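A minimal PyTorch sketch of the mixing operation this summary describes: a random box from a shuffled batch is pasted into each image, and the (one-hot) labels are mixed in proportion to the pasted area. Helper names are ours.

```python
import torch

def cutmix(images, labels, alpha=1.0):
    """Sketch of CutMix: paste a random box from a shuffled batch into
    each image and mix the labels in proportion to the pasted area."""
    n, _, h, w = images.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(n)
    # Box whose area is (1 - lam) of the image, centred at a random point.
    rh, rw = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - rh // 2, 0), min(cy + rh // 2, h)
    x1, x2 = max(cx - rw // 2, 0), min(cx + rw // 2, w)
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    # Recompute lam from the clipped box so labels match the true area.
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)
    return images, lam * labels + (1 - lam) * labels[perm]

imgs = torch.randn(4, 3, 224, 224)
lbls = torch.eye(10)[torch.randint(10, (4,))]        # one-hot labels
mixed_imgs, mixed_lbls = cutmix(imgs, lbls)
print(mixed_imgs.shape, mixed_lbls.shape)  # (4, 3, 224, 224) (4, 10)
```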
Deep Residual Learning for Image Recognition
TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
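For concreteness, a minimal PyTorch sketch of the residual unit this summary refers to: the block computes F(x) and adds the identity shortcut x, which is what makes very deep stacks easier to optimize.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the residual idea: the block learns F(x) and adds the
    identity shortcut, so deeper stacks are easier to optimize."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # F(x) + x

print(ResidualBlock(64)(torch.randn(1, 64, 56, 56)).shape)
```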
Bag of Tricks for Image Classification with Convolutional Neural Networks
TLDR
This paper examines a collection of training procedure refinements and empirically evaluates their impact on the final model accuracy through ablation study, and shows that by combining these refinements together, they are able to improve various CNN models significantly.
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes, and employed a recently developed regularization method called "dropout" that proved to be very effective.