Corpus ID: 225039882

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

@inproceedings{dosovitskiy2021image,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2021}
}
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well… 
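The abstract's core idea, treating an image as a sequence of flattened patch tokens fed to a standard Transformer, can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions: the random projection matrix stands in for a learned embedding, the width of 768 matches ViT-Base, and the zero-initialized [class] token stands in for a learned parameter.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    # Rearrange into a grid of patches, then flatten each patch into one vector.
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * C)

# A 224x224 RGB image yields 14*14 = 196 patch tokens of dimension 16*16*3 = 768.
rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
tokens = patchify(image)

# Linear projection into the Transformer's embedding space, plus a
# learnable [class] token prepended to the sequence (197 tokens total).
W_embed = rng.standard_normal((768, 768)) * 0.02
cls_token = np.zeros((1, 768))
sequence = np.concatenate([cls_token, tokens @ W_embed], axis=0)
```

From here the sequence (plus position embeddings) goes through an unmodified Transformer encoder, which is the point of the paper: no convolutional backbone is needed.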
Adversarially Robust Vision Transformers
Multimodal Learning with Transformers: A Survey
A comprehensive survey of Transformer techniques oriented toward multimodal data is presented, along with a discussion of open problems and potential research directions for the community.
Three things everyone should know about Vision Transformers
Adding MLP-based patch pre-processing layers improves Bert-like self-supervised training based on patch masking and saves compute, reduces the peak memory consumption at fine-tuning time, and allows sharing the majority of weights across tasks.
Space-time Mixing Attention for Video Transformer
This work proposes a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model, and shows how to integrate two very lightweight mechanisms for global temporal-only attention which provide additional accuracy improvements at minimal computational cost.
Expanding Language-Image Pretrained Models for General Video Recognition
To capture the long-range dependencies of frames along the temporal dimension, a cross-frame attention mechanism that explicitly exchanges information across frames is proposed that is lightweight and can be plugged into pretrained language-image models seamlessly.
A Novel Transformer Network with Shifted Window Cross-Attention for Spatiotemporal Weather Forecasting
This work proposes the use of a Video Swin-Transformer, coupled with a dedicated augmentation scheme, for weather forecasting with a video transformer network; it employs gradual spatial reduction on the encoder side and cross-attention on the decoder side.
SdAE: Self-distillated Masked Autoencoder
A simple Self-distillated masked AutoEncoder network, namely SdAE, is proposed; it consists of a student branch using an encoder-decoder structure to reconstruct the missing information and a teacher branch producing latent representations of masked tokens, generalizes well, and reduces computational complexity.
Contrastive Masked Autoencoders are Stronger Vision Learners
Masked image modeling (MIM) has achieved promising results on various vision tasks; however, the limited discriminability of the learned representations shows there is still plenty of room for making a stronger vision learner.
S-Prompts Learning with Pre-trained Transformers: An Occam's Razor for Domain Incremental Learning
This paper proposes one simple paradigm (named S-Prompting) and two concrete approaches to greatly reduce forgetting in one of the most typical continual learning scenarios, i.e., domain incremental learning (DIL).
Locality Guidance for Improving Vision Transformers on Tiny Datasets
Locality guidance for VTs is proposed: inspired by the built-in local-to-global hierarchy of CNNs, the VT imitates the features of an already trained convolutional neural network (CNN) to improve its performance on tiny datasets.


Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
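The update rule summarized above, adaptive estimates of the first and second gradient moments with bias correction, fits in a few lines. A minimal NumPy sketch with the paper's default hyperparameters (β1=0.9, β2=0.999, ε=1e-8); the toy objective and learning rate are arbitrary choices for illustration:

```python
import numpy as np

def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates with bias correction."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grads      # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grads**2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)               # bias correction for zero init
    v_hat = v / (1 - beta2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, (m, v, t)

# Minimize f(x) = x^2 starting from x = 5.
x = np.array([5.0])
state = (np.zeros(1), np.zeros(1), 0)
for _ in range(2000):
    grad = 2 * x
    x, state = adam_step(x, grad, state, lr=0.05)
```

Note how the effective step size is roughly `lr` regardless of gradient magnitude, since the first moment is divided by the square root of the second; this scale invariance is the property that makes Adam's step sizes easy to tune.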
Big Transfer (BiT): General Visual Representation Learning
By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
This paper factorizes 2D self-attention into two 1D self-attentions, a novel building block that one could stack to form axial-attention models for image classification and dense prediction, and achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
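The factorization can be illustrated on a small feature map. A simplified NumPy sketch in which queries, keys, and values all equal the input; the actual axial-attention models use learned projections, multiple heads, and positional terms:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_1d(x):
    """Plain self-attention along the second-to-last axis of x (..., L, D)."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)   # (..., L, L)
    return softmax(scores) @ x

def axial_attention(x):
    """Row attention then column attention on an (H, W, D) feature map.
    Cost is O(H*W*(H+W)) instead of O((H*W)^2) for full 2D attention."""
    x = attention_1d(x)                       # attend along W within each row
    x = attention_1d(x.transpose(1, 0, 2))    # attend along H within each column
    return x.transpose(1, 0, 2)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 4))
out = axial_attention(feat)
```

Stacking a row pass and a column pass lets information propagate between any two positions in two hops, while each pass only ever attends along a single axis.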
Stand-Alone Self-Attention in Vision Models
The results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox and is especially impactful when used in later layers.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
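BERT's masked-language-model pre-training corrupts the input with an 80/10/10 rule before asking the model to reconstruct it. A small NumPy sketch of that corruption step; the function name and toy vocabulary are illustrative, not from any library:

```python
import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, rng, p=0.15):
    """BERT-style corruption: ~15% of positions become prediction targets;
    of those, 80% are replaced by [MASK], 10% by a random token, 10% kept."""
    ids = token_ids.copy()
    targets = rng.random(ids.shape) < p           # positions to predict
    roll = rng.random(ids.shape)
    ids[targets & (roll < 0.8)] = mask_id
    random_pos = targets & (roll >= 0.8) & (roll < 0.9)
    ids[random_pos] = rng.integers(0, vocab_size, size=int(random_pos.sum()))
    # the remaining ~10% of targets keep their original token id
    return ids, targets

rng = np.random.default_rng(0)
ids = rng.integers(5, 30000, size=512)            # toy sequence; id 0 = [MASK]
corrupted, targets = mask_tokens(ids, mask_id=0, vocab_size=30000, rng=rng)
```

Because some targets keep their original token, the model cannot simply learn "predict whatever is not [MASK]", which keeps its representations useful at fine-tuning time when no [MASK] token appears.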
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
End-to-End Object Detection with Transformers
This work presents a new method that views object detection as a direct set prediction problem, and demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset.
Self-Training With Noisy Student Improves ImageNet Classification
We present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images.
Acceleration of stochastic approximation by averaging
Convergence with probability one is proved for a variety of classical optimization and identification problems and it is demonstrated for these problems that the proposed algorithm achieves the highest possible rate of convergence.
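The scheme behind this result (Polyak-Ruppert averaging) is simply: run stochastic gradient descent as usual, but report the running mean of the iterates rather than the last one. A toy NumPy sketch on a noisy quadratic; the learning rate and noise scale here are arbitrary:

```python
import numpy as np

def sgd_with_averaging(grad_fn, x0, lr, steps):
    """SGD iterates x_t plus their running (Polyak-Ruppert) average."""
    x = x0.astype(float)
    x_bar = x.copy()
    for t in range(1, steps + 1):
        x = x - lr * grad_fn(x)
        x_bar += (x - x_bar) / t          # incremental mean of the iterates
    return x, x_bar

# Noisy gradient of f(x) = 0.5 * x^2, whose minimizer is x* = 0.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + rng.normal(scale=1.0, size=x.shape)
x_last, x_avg = sgd_with_averaging(noisy_grad, np.array([3.0]), lr=0.1, steps=5000)
# The averaged iterate typically sits much closer to x* than the last iterate,
# whose gradient noise never dies out under a constant learning rate.
```

Averaging filters out the stationary noise of the individual iterates, which is how the method attains the optimal asymptotic convergence rate.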
Non-local Neural Networks
This paper presents non-local operations as a generic family of building blocks for capturing long-range dependencies in computer vision and improves object detection/segmentation and pose estimation on the COCO suite of tasks.
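The generic non-local operation (in its embedded-Gaussian form) computes the response at each position as a similarity-weighted sum over all positions. A NumPy sketch on a flattened feature map; the residual connection follows the paper, while the random matrices stand in for learned embeddings:

```python
import numpy as np

def nonlocal_block(x, W_theta, W_phi, W_g):
    """Embedded-Gaussian non-local operation: every position attends to all
    others. x: (N, D) flattened feature map; W_*: (D, D) embedding matrices."""
    theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g
    scores = theta @ phi.T                        # (N, N) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ g                        # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32))                 # e.g. an 8x8 map, 32 channels
Ws = [rng.standard_normal((32, 32)) * 0.1 for _ in range(3)]
y = nonlocal_block(x, *Ws)
```

A single such block captures long-range dependencies in one step, whereas stacked convolutions need many layers for their receptive field to cover the whole map.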