Corpus ID: 229363322

Training data-efficient image transformers & distillation through attention

@inproceedings{Touvron2021TrainingDI,
  title={Training data-efficient image transformers \& distillation through attention},
  author={Hugo Touvron and Matthieu Cord and Matthijs Douze and Francisco Massa and Alexandre Sablayrolles and Herv{\'e} J{\'e}gou},
  booktitle={ICML},
  year={2021}
}
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption by the larger community. In this work, with an adequate training scheme, we produce a competitive convolution-free transformer by training on Imagenet only. We train it on a single computer in less than 3 days. Our… 
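The distillation-through-attention idea from the abstract can be summarized in a small sketch. This shows only the hard-label variant of the distillation objective; the module names (`student`, `teacher`, `cls_logits`, `dist_logits`) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    """Hard-label distillation: the class token is supervised by the ground-truth
    label, the distillation token by the teacher's predicted (hard) label; the
    two terms are weighted equally."""
    teacher_labels = teacher_logits.argmax(dim=1)             # hard teacher decisions
    loss_cls = F.cross_entropy(cls_logits, targets)           # class-token loss
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # distillation-token loss
    return 0.5 * loss_cls + 0.5 * loss_dist

# Illustrative usage with a hypothetical student that returns two heads
# (class token and distillation token) and a frozen convnet teacher:
# cls_logits, dist_logits = student(images)
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
```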
Efficient Vision Transformers via Fine-Grained Manifold Distillation
TLDR: Transfer learning results on other classification benchmarks and downstream vision tasks also demonstrate the superiority of the proposed method over the state-of-the-art algorithms.
Self-Supervised Learning with Swin Transformers
TLDR: This paper presents a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture, tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation, and shows that the learnt representations transfer to downstream tasks such as object detection and semantic segmentation.
Training Vision Transformers with Only 2040 Images
TLDR: This paper investigates how to train ViTs with limited data and gives theoretical analyses that the method (based on parametric instance discrimination) is superior to other methods in that it can capture both feature alignment and instance similarities.
Refiner: Refining Self-attention for Vision Transformers
TLDR: This work introduces a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs, and explores attention expansion that projects the multi-head attention maps to a higher-dimensional space to promote their diversity.
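As a rough illustration of the attention-expansion idea only (not the full refiner module), one can linearly project the head dimension of the attention maps to a larger number of heads and back; all shapes and the `expansion` factor below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionExpansion(nn.Module):
    """Sketch: project multi-head attention maps (B, H, N, N) to a
    higher-dimensional head space to promote diversity, then reduce back."""
    def __init__(self, num_heads: int, expansion: int = 3):
        super().__init__()
        self.expand = nn.Linear(num_heads, num_heads * expansion, bias=False)
        self.reduce = nn.Linear(num_heads * expansion, num_heads, bias=False)

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # Move the head axis last so the linear layers act on it.
        attn = attn.permute(0, 2, 3, 1)        # (B, N, N, H)
        attn = self.reduce(self.expand(attn))  # expand then reduce the head space
        return attn.permute(0, 3, 1, 2)        # back to (B, H, N, N)
```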
DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers
TLDR: This work proposes an early knowledge distillation framework, termed DearKD, to improve the data efficiency of transformers, and proposes a boundary-preserving intra-divergence loss based on DeepInversion to close the performance gap against the full-data counterpart.
A Survey of Visual Transformers
TLDR: This survey has reviewed over one hundred different visual Transformers comprehensively according to three fundamental CV tasks and different data stream types, and proposed the deformable attention module, which combines the best of the sparse spatial sampling of deformable convolution and the relation modeling capability of Transformers.
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
TLDR: A new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates an efficient deep-narrow backbone structure motivated by CNN architecture design and empirical study, and reduces the parameter count and MACs of vanilla ViT by half.
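The tokens-to-token step is commonly implemented with an unfold ("soft split") that merges each token with its spatial neighbours; the sketch below shows one such step under assumed kernel/stride values, not the full T2T-ViT backbone.

```python
import torch
import torch.nn as nn

def soft_split(tokens: torch.Tensor, h: int, w: int,
               kernel: int = 3, stride: int = 2, padding: int = 1) -> torch.Tensor:
    """One tokens-to-token step: reshape tokens onto their 2D grid, then unfold
    overlapping k x k neighbourhoods so that neighbouring tokens are merged
    into a single, higher-dimensional token."""
    b, n, c = tokens.shape
    assert n == h * w
    grid = tokens.transpose(1, 2).reshape(b, c, h, w)  # (B, C, H, W)
    patches = nn.functional.unfold(grid, kernel, stride=stride, padding=padding)
    return patches.transpose(1, 2)                     # (B, N', C * k * k)

# Example: 196 tokens on a 14x14 grid become 49 tokens of dimension 9 * C.
x = torch.randn(2, 196, 64)
print(soft_split(x, 14, 14).shape)  # torch.Size([2, 49, 576])
```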
AutoFormer: Searching Transformers for Visual Recognition
TLDR: This work proposes a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search; the discovered models surpass recent state-of-the-art models such as ViT and DeiT and achieve competitive top-1 accuracy on ImageNet.
Learned Queries for Efficient Local Attention
TLDR: A new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner, much like convolutions, and shows improvements in speed and memory complexity while achieving accuracy comparable to state-of-the-art models.
ViT-P: Rethinking Data-efficient Vision Transformers from Locality
TLDR: This work constrains the self-attention of ViT to have a multi-scale localized receptive field so that an optimal configuration can be learned, and provides empirical evidence that properly constraining the receptive field can reduce the amount of training data required by vision transformers.
...

References

Showing 1-10 of 84 references
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
TLDR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
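The "16x16 words" in the title correspond to a patch-embedding step; below is a minimal sketch of the standard way to split an image into non-overlapping 16x16 patches and linearly project them to token embeddings, with the dimensions as assumed defaults.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each patch to a
    token embedding, turning an image into a sequence of 'visual words'."""
    def __init__(self, patch_size: int = 16, in_chans: int = 3, dim: int = 768):
        super().__init__()
        # A conv with kernel = stride = patch_size is equivalent to a linear
        # projection applied independently to each flattened patch.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                     # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

# A 224x224 image yields 14 * 14 = 196 tokens of dimension 768.
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```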
Generative Pretraining From Pixels
TLDR: This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.
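A minimal sketch of the autoregressive objective described above: pixels are treated as a 1D sequence of discrete values and a causal Transformer is trained to predict each value from the preceding ones. The vocabulary size, model sizes, and quantization step below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelGPT(nn.Module):
    """Tiny decoder-only Transformer over a sequence of quantized pixel values."""
    def __init__(self, vocab: int = 512, dim: int = 128, heads: int = 4, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        n = seq.size(1)
        # Causal mask: each position may only attend to earlier positions.
        causal = torch.triu(torch.full((n, n), float("-inf"), device=seq.device), 1)
        h = self.blocks(self.embed(seq), mask=causal)
        return self.head(h)

# Next-pixel prediction loss on a flattened, colour-quantized image sequence.
seq = torch.randint(0, 512, (2, 64))   # e.g. an 8x8 image with a 512-colour palette
logits = PixelGPT()(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 512), seq[:, 1:].reshape(-1))
```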
Visual Transformers: Token-based Image Representation and Processing for Computer Vision
TLDR: This work represents images as a set of visual tokens and applies visual transformers to densely model relationships between visual semantic concepts, and finds that this paradigm of token-based image representation and processing drastically outperforms its convolutional counterparts on image classification and semantic segmentation.
Fixing the train-test resolution discrepancy
TLDR: It is experimentally validated that, for a target test resolution, using a lower train resolution offers better classification at test time, and a simple yet effective and efficient strategy to optimize the classifier performance when the train and test resolutions differ is proposed.
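A minimal sketch of the train/test-resolution recipe, assuming torchvision-style transforms: train with random resized crops at a lower resolution, evaluate at a higher resolution, and optionally fine-tune only the final classifier at the test resolution to compensate for the statistics shift. The resolutions and the freezing choice below are illustrative.

```python
import torch.nn as nn
from torchvision import transforms

train_res, test_res = 128, 224  # train lower than test, per the paper's observation

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(train_res),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

test_tf = transforms.Compose([
    transforms.Resize(int(test_res * 1.15)),  # resize, then center-crop at test time
    transforms.CenterCrop(test_res),
    transforms.ToTensor(),
])

def freeze_all_but_classifier(model: nn.Module, classifier_name: str = "fc") -> None:
    """Fine-tune only the last layer at the test resolution (illustrative)."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(classifier_name)
```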
MultiGrain: a unified image embedding for classes and instances
TLDR: A key component of MultiGrain is a pooling layer that takes advantage of high-resolution images even with a network trained at a lower resolution, and the resulting embedding provides state-of-the-art classification accuracy when fed to a linear classifier.
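The pooling layer in question is a generalized-mean (GeM) pooling; a minimal sketch follows, with the exponent `p` as an assumed default.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeMPooling(nn.Module):
    """Generalized-mean pooling over spatial positions: p = 1 is average pooling
    and large p approaches max pooling. Larger p emphasizes salient activations,
    which helps when feeding higher-resolution images to a network trained at a
    lower resolution."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        x = x.clamp(min=self.eps).pow(self.p)
        return F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p).flatten(1)  # (B, C)
```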
Attention Augmented Convolutional Networks
TLDR: It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile-constrained network, while keeping the number of parameters similar.
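Attention augmentation concatenates convolutional feature maps with self-attention feature maps along the channel axis; the sketch below is a simplified version that uses a stock multi-head attention over flattened positions and omits the paper's relative position encodings, with all sizes illustrative.

```python
import torch
import torch.nn as nn

class AugmentedConv(nn.Module):
    """Concatenate a standard convolution's output with a self-attention output
    computed over all spatial positions (simplified sketch)."""
    def __init__(self, in_ch: int, conv_ch: int, attn_ch: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size=3, padding=1)
        self.to_attn_dim = nn.Conv2d(in_ch, attn_ch, kernel_size=1)
        self.attn = nn.MultiheadAttention(attn_ch, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        conv_out = self.conv(x)                                   # (B, conv_ch, H, W)
        t = self.to_attn_dim(x).flatten(2).transpose(1, 2)        # (B, H*W, attn_ch)
        attn_out, _ = self.attn(t, t, t)                          # global self-attention
        attn_out = attn_out.transpose(1, 2).reshape(b, -1, h, w)  # back to a feature map
        return torch.cat([conv_out, attn_out], dim=1)             # channel concatenation

# Example: AugmentedConv(64, conv_ch=48, attn_ch=16) yields 48 + 16 = 64 channels.
```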
AutoAugment: Learning Augmentation Policies from Data
TLDR: This paper describes a simple procedure called AutoAugment to automatically search for improved data augmentation policies, which achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet (without additional data).
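Torchvision ships the policies found by AutoAugment, so applying a learned policy is a one-liner; the sketch below simply shows how such a policy slots into a standard training pipeline.

```python
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# Apply the ImageNet policy discovered by AutoAugment inside a normal pipeline.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    AutoAugment(policy=AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
])
```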
Circumventing Outliers of AutoAugment with Knowledge Distillation
TLDR: It is revealed that AutoAugment may remove part of the discriminative information from the training image, so insisting on the ground-truth label is no longer the best option, and knowledge distillation, which refers to the output of a teacher model, is used to guide network training.
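Knowledge distillation here follows the usual recipe of matching the teacher's softened outputs in addition to the ground truth; the sketch below is the standard temperature-scaled formulation (the paper's exact weighting may differ).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Standard soft-label distillation: blend cross-entropy on the labels with a
    KL term that pushes the student's softened distribution toward the teacher's.
    The T^2 factor keeps gradient magnitudes comparable across temperatures."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce
```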
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
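The core operation of the Transformer is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; a minimal sketch:

```python
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
    with q, k, v of shape (..., seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # pairwise similarities
    weights = scores.softmax(dim=-1)                    # attention distribution
    return weights @ v                                  # weighted sum of values
```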
Rethinking the Inception Architecture for Computer Vision
TLDR: This work explores ways to scale up networks that utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
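One of the factorizations described in the paper replaces an n x n convolution with a 1 x n convolution followed by an n x 1 convolution, cutting parameters and compute; a minimal sketch (channel sizes and activations are illustrative):

```python
import torch.nn as nn

def factorized_conv(in_ch: int, out_ch: int, n: int = 3) -> nn.Sequential:
    """Replace an n x n convolution with a 1 x n followed by an n x 1 convolution,
    reducing the per-position cost from n*n to 2*n multiplications."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(1, n), padding=(0, n // 2)),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=(n, 1), padding=(n // 2, 0)),
        nn.ReLU(inplace=True),
    )
```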
...