• Corpus ID: 232428161

Going deeper with Image Transformers

@article{Touvron2021GoingDW,
  title={Going deeper with Image Transformers},
  author={Hugo Touvron and Matthieu Cord and Alexandre Sablayrolles and Gabriel Synnaeve and Hervé Jégou},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.17239}
}
Transformers have recently been adapted for large-scale image classification, achieving high scores that shake up the long supremacy of convolutional neural networks. However, the optimization of vision transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization of such dedicated transformers. We make two architecture changes that significantly… 
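The two changes referred to in the full paper are LayerScale (a learnable per-channel scaling of each residual branch, initialized to a small value) and class-attention layers. Below is a minimal PyTorch-style sketch of LayerScale; the module name and the initial value `init_eps` are illustrative choices, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Per-channel learnable scaling of a residual branch (CaiT-style sketch).

    The diagonal scaling starts near zero so each residual block begins close
    to the identity, which the paper reports helps optimize deeper image
    transformers.
    """

    def __init__(self, dim: int, init_eps: float = 1e-5):
        super().__init__()
        # One learnable scale per channel, initialized to a small constant.
        self.gamma = nn.Parameter(init_eps * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * x

# Usage inside a transformer block (sketch):
#   x = x + LayerScale(dim)(attention(norm1(x)))
#   x = x + LayerScale(dim)(mlp(norm2(x)))
```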
Less is More: Pay Less Attention in Vision Transformers
TLDR
A hierarchical Transformer is proposed in which pure multi-layer perceptrons (MLPs) encode rich local patterns in the early stages, while self-attention modules capture longer-range dependencies in the deeper layers.
Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet
TLDR
By slightly tuning the structure of vision transformers and introducing token labeling, a new training objective, these models are able to achieve better results than their CNN counterparts and other transformer-based classification models with a similar amount of training parameters and computation.
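A hedged sketch of the token-labeling idea described above: in addition to the usual class-token loss, every patch token is supervised with its own location-specific soft label produced offline by a pretrained annotator. The names `token_targets` and the 0.5 auxiliary weight are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def token_labeling_loss(cls_logits, patch_logits, class_target, token_targets,
                        aux_weight: float = 0.5):
    """Combine the standard classification loss with a per-token auxiliary loss.

    cls_logits:    (B, C)     prediction from the class token
    patch_logits:  (B, N, C)  per-patch predictions
    class_target:  (B,)       image-level labels
    token_targets: (B, N, C)  soft location-specific labels from a pretrained annotator
    """
    cls_loss = F.cross_entropy(cls_logits, class_target)
    # Soft-label cross-entropy averaged over all patch tokens.
    token_loss = -(token_targets * F.log_softmax(patch_logits, dim=-1)).sum(-1).mean()
    return cls_loss + aux_weight * token_loss
```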
Improve Vision Transformers Training by Suppressing Over-smoothing
TLDR
This work investigates how to stabilize the training of vision transformers without special structural modifications, and proposes a number of techniques to alleviate the over-smoothing problem, including additional loss functions that encourage diversity, prevent loss of information, and discriminate between different patches via an additional patch classification loss for CutMix.
Scaled ReLU Matters for Training Vision Transformers
TLDR
It is verified, both theoretically and empirically, that a scaled ReLU in the conv-stem matters for robust ViT training: it not only improves training stabilization but also increases the diversity of patch tokens, thus boosting peak performance by a large margin while adding few parameters and FLOPs.
Vision Transformer for Small-Size Datasets
  • Seung Hoon Lee, Seunghyun Lee, Byung Cheol Song
  • Computer Science
    ArXiv
  • 2021
TLDR
This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively address the lack of locality inductive bias and enable vision transformers to learn from scratch even on small-size datasets.
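A rough sketch of the Locality Self-Attention idea above: a learnable softmax temperature combined with masking of each token's attention to itself. The single-head layout and initialization below are illustrative assumptions under that description, not the paper's exact module.

```python
import math
import torch
import torch.nn as nn

class LocalitySelfAttention(nn.Module):
    """Single-head self-attention with a learnable temperature and diagonal
    masking, in the spirit of the LSA module described above (sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable temperature replaces the fixed sqrt(d) scaling.
        self.temperature = nn.Parameter(torch.tensor(math.sqrt(dim)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = q @ k.transpose(-2, -1) / self.temperature
        # Mask the diagonal so a token cannot attend to itself,
        # sharpening attention onto the other tokens.
        n = x.shape[1]
        diag = torch.eye(n, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(diag, float('-inf'))
        return self.proj(attn.softmax(dim=-1) @ v)
```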
Transformer in Convolutional Neural Networks
TLDR
The Hierarchical MHSA (H-MHSA), whose representation is computed in a hierarchical manner, is proposed; the resulting backbone, TransCNN, essentially inherits the advantages of both transformers and CNNs.
KVT: k-NN Attention for Boosting Vision Transformers
TLDR
A sparse attention scheme, dubbed k-NN attention, is proposed that naturally inherits the local bias of CNNs without introducing convolutional operations; it allows for the exploration of long-range correlations while filtering out irrelevant tokens by choosing the most similar tokens from the entire image.
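A minimal sketch of the k-NN attention idea: for each query, keep only the k largest attention logits and mask out the rest before the softmax. The function name and the way `top_k` is chosen are assumptions for illustration.

```python
import torch

def knn_attention(q, k, v, top_k: int):
    """Sparse attention where each query attends only to its top-k most similar keys.

    q, k, v: (B, H, N, D) query/key/value tensors
    top_k:   number of keys kept per query
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale        # (B, H, N, N)
    # Smallest logit that survives the top-k selection, per query row.
    kth = logits.topk(top_k, dim=-1).values[..., -1:]
    # Mask everything below it before the softmax.
    logits = logits.masked_fill(logits < kth, float('-inf'))
    return logits.softmax(dim=-1) @ v
```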
A Survey of Visual Transformers
  • Yang Liu, Yao Zhang, +7 authors Zhiqiang He
  • Computer Science
    ArXiv
  • 2021
TLDR
A comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation) is provided, where a taxonomy is proposed to organize these methods according to their motivations, structures, and usage scenarios.
Searching for Efficient Multi-Stage Vision Transformers
TLDR
This work proposes ViT-ResNAS, an efficient multi-stage ViT architecture designed with neural architecture search (NAS) that achieves better accuracy-MACs and accuracy-throughput trade-offs than the original DeiT and other strong baselines of ViT.
CoAtNet: Marrying Convolution and Attention for All Data Sizes
TLDR
CoAtNets (pronounced “coat” nets) are a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention, and (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity, and efficiency.
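The "simple relative attention" insight can be sketched as adding a translation-dependent bias to the attention logits, so a single formula covers both a static depthwise-convolution-like kernel and content-based attention. The code below is a generic 1-D relative-position-bias attention written for illustration, not CoAtNet's implementation.

```python
import torch
import torch.nn as nn

class RelativeAttention1D(nn.Module):
    """Attention whose logits are content similarity plus a learned bias w[i - j]
    that depends only on the relative position (1-D illustrative version)."""

    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        # One bias value per possible relative offset in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D), N <= max_len
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * d ** -0.5
        # Look up the bias table with (i - j), shifted to be non-negative.
        idx = torch.arange(n, device=x.device)
        logits = logits + self.rel_bias[idx[:, None] - idx[None, :] + self.max_len - 1]
        return logits.softmax(dim=-1) @ v
```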

References

SHOWING 1-10 OF 87 REFERENCES
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
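A sketch of the hard-distillation objective built around the distillation token: the class-token head is supervised by the ground-truth label and the distillation-token head by the teacher's predicted label. The function name and the equal 0.5 weighting are taken as simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    """DeiT-style objective (sketch): the class-token head learns from the true
    labels while the distillation-token head learns from the teacher's hard
    decisions; the token itself interacts with the rest of the network through
    attention."""
    teacher_labels = teacher_logits.argmax(dim=-1)
    return 0.5 * F.cross_entropy(cls_logits, labels) \
         + 0.5 * F.cross_entropy(dist_logits, teacher_labels)
```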
Training Vision Transformers for Image Retrieval
TLDR
This work adopts vision transformers for generating image descriptors and trains the resulting model with a metric learning objective that combines a contrastive loss with a differential entropy regularizer, showing consistent and significant improvements of transformers over convolution-based approaches.
Improving Transformer Optimization Through Better Initialization
TLDR
This work investigates and empirically validates the source of optimization problems in the encoder-decoder Transformer architecture; it proposes a new weight initialization scheme, with theoretical justification, that enables training without warmup or layer normalization and achieves leading accuracy.
Aggregated Residual Transformations for Deep Neural Networks
TLDR
On the ImageNet-1K dataset, it is empirically shown that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy and is more effective than going deeper or wider when capacity is increased.
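Increasing cardinality amounts to splitting a bottleneck's 3×3 convolution into many parallel groups, which in modern frameworks is simply a grouped convolution. The channel sizes below are illustrative, not a specific configuration from the paper.

```python
import torch.nn as nn

# A ResNeXt-style bottleneck branch: cardinality (here 32 groups) controls how
# many parallel transformations are aggregated at roughly constant complexity.
def resnext_branch(in_ch=256, width=128, out_ch=256, cardinality=32):
    return nn.Sequential(
        nn.Conv2d(in_ch, width, kernel_size=1, bias=False),
        nn.BatchNorm2d(width), nn.ReLU(inplace=True),
        nn.Conv2d(width, width, kernel_size=3, padding=1,
                  groups=cardinality, bias=False),  # grouped 3x3 = aggregated transforms
        nn.BatchNorm2d(width), nn.ReLU(inplace=True),
        nn.Conv2d(width, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )
```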
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Image Transformer
TLDR
This work generalizes a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood, and significantly increases the size of images the model can process in practice while maintaining significantly larger receptive fields per layer than typical convolutional neural networks.
High-Performance Large-Scale Image Recognition Without Normalization
TLDR
An adaptive gradient clipping technique is developed which overcomes the instabilities of training without batch normalization, and a significantly improved class of Normalizer-Free ResNets is designed which attain significantly better performance when finetuning on ImageNet.
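A hedged sketch of adaptive gradient clipping: a gradient is rescaled whenever the ratio of its norm to the corresponding weight norm exceeds a threshold λ. The paper applies this unit-wise (per output row of each weight matrix); the simplified version below clips per whole parameter tensor, and the default λ and eps values are illustrative.

```python
import torch

def adaptive_gradient_clip_(parameters, clip_lambda: float = 0.01, eps: float = 1e-3):
    """Rescale gradients in place so that ||g|| <= clip_lambda * max(||w||, eps).

    Simplified per-tensor variant of the unit-wise AGC described above.
    """
    for p in parameters:
        if p.grad is None:
            continue
        w_norm = p.detach().norm()
        g_norm = p.grad.detach().norm()
        max_norm = clip_lambda * torch.clamp(w_norm, min=eps)
        if g_norm > max_norm:
            # Shrink the gradient onto the allowed ball.
            p.grad.mul_(max_norm / (g_norm + 1e-6))
```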
Going deeper with convolutions
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Deep Networks with Stochastic Depth
TLDR
Stochastic depth is proposed, a training procedure that enables the seemingly contradictory setup of training short networks while using deep networks at test time; it reduces training time substantially and significantly improves test error on almost all datasets used for evaluation.
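A minimal sketch of stochastic depth applied to a residual branch: during training the branch is randomly skipped with some probability, while at test time the full deep network is used. The rescaling is done at training time (inverted-dropout style, a common implementation variant) so inference needs no change; the drop probability is an illustrative default.

```python
import torch
import torch.nn as nn

class StochasticDepth(nn.Module):
    """Randomly skip a residual branch during training (probability drop_prob);
    at inference the branch is always kept, so the full depth is used."""

    def __init__(self, branch: nn.Module, drop_prob: float = 0.1):
        super().__init__()
        self.branch = branch
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()) < self.drop_prob:
            return x  # drop the whole branch for this forward pass
        out = self.branch(x)
        if self.training:
            out = out / (1.0 - self.drop_prob)  # keep the expected output unchanged
        return x + out
```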
Characterizing signal propagation to close the performance gap in unnormalized ResNets
TLDR
A simple set of analysis tools to characterize signal propagation on the forward pass is proposed, together with Scaled Weight Standardization, a technique that preserves the signal in networks with ReLU or Swish activation functions by ensuring that the per-channel activation means do not grow with depth.
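A sketch of that weight reparameterization: each convolution's weights are standardized per output channel and scaled by a nonlinearity-dependent gain divided by the square root of the fan-in, which keeps per-channel activation statistics roughly stable with depth. The fixed ReLU gain and the class layout below are assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledStdConv2d(nn.Conv2d):
    """Conv layer whose weights are standardized per output channel and rescaled,
    in the spirit of Scaled Weight Standardization (illustrative sketch)."""

    def __init__(self, *args, gain: float = 1.7139, **kwargs):
        # gain ~ sqrt(2 / (1 - 1/pi)) is the value commonly quoted for ReLU.
        super().__init__(*args, **kwargs)
        self.gain = gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        fan_in = w[0].numel()
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-6
        # Standardize per output channel, then scale by gain / sqrt(fan_in).
        w_hat = self.gain * (w - mean) / (std * fan_in ** 0.5)
        return F.conv2d(x, w_hat, self.bias, self.stride, self.padding,
                        self.dilation, self.groups)
```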