Corpus ID: 229363322

Training data-efficient image transformers & distillation through attention

@inproceedings{Touvron2021TrainingDI,
  title={Training data-efficient image transformers \& distillation through attention},
  author={Hugo Touvron and M. Cord and M. Douze and Francisco Massa and Alexandre Sablayrolles and Herv{\'e} J{\'e}gou},
  booktitle={ICML},
  year={2021}
}
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption by the larger community. In this work, with an adequate training scheme, we produce a competitive convolution-free transformer by training on Imagenet only. We train it on a single computer in less than 3 days. Our…
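As a rough illustration of the hard-label distillation objective behind the paper's "distillation through attention", the sketch below combines a cross-entropy term on the student's class-token head with a cross-entropy term that supervises a separate distillation-token head using the teacher's hard predictions. Function and tensor names are illustrative assumptions, not the authors' reference code.

    import torch.nn.functional as F

    def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
        """cls_logits, dist_logits: (B, C) logits from the student's class-token and
        distillation-token heads; teacher_logits: (B, C) logits from a frozen teacher."""
        # Standard supervised cross-entropy on the class-token head.
        ce_labels = F.cross_entropy(cls_logits, labels)
        # The teacher's hard predictions supervise the distillation-token head.
        teacher_hard = teacher_logits.argmax(dim=-1)
        ce_teacher = F.cross_entropy(dist_logits, teacher_hard)
        # Equal weighting of the two terms in the hard-distillation variant.
        return 0.5 * ce_labels + 0.5 * ce_teacher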
Efficient Vision Transformers via Fine-Grained Manifold Distillation
TLDR: This paper proposes to excavate useful information from the teacher transformer through the relationship between images and the divided patches, and explores an efficient fine-grained manifold distillation approach that simultaneously calculates cross-image, cross-patch, and random-selected manifolds in the teacher and student models.
Self-Supervised Learning with Swin Transformers
TLDR: This paper presents a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture, tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation, and enables evaluating the learnt representations on downstream tasks such as object detection and semantic segmentation.
Refiner: Refining Self-attention for Vision Transformers
TLDR: This work introduces a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs, and explores attention expansion, which projects the multi-head attention maps to a higher-dimensional space to promote their diversity.
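The "attention expansion" mentioned in this summary can be pictured as a learned mixing of the per-head attention maps into a larger set of maps before reducing them back; the sketch below is only a guess at that mechanism under assumed shapes and layer choices, not the paper's actual refiner module.

    import torch
    import torch.nn as nn

    class AttentionExpansion(nn.Module):
        def __init__(self, num_heads: int, expansion: int = 2):
            super().__init__()
            expanded = num_heads * expansion
            # 1x1 convolutions act as linear maps over the head dimension
            # of the (B, num_heads, N, N) attention maps.
            self.expand = nn.Conv2d(num_heads, expanded, kernel_size=1)
            self.reduce = nn.Conv2d(expanded, num_heads, kernel_size=1)

        def forward(self, attn: torch.Tensor) -> torch.Tensor:
            # attn: (B, num_heads, N, N) softmax attention maps.
            refined = self.reduce(self.expand(attn))
            return refined.softmax(dim=-1)  # re-normalize each attention row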
Scaling Vision Transformers
TLDR: A ViT model with two billion parameters is successfully trained, which attains a new state of the art on ImageNet of 90.45% top-1 accuracy and performs well on few-shot learning.
AutoFormer: Searching Transformers for Visual Recognition
TLDR: This work proposes a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search; the searched models surpass recent state-of-the-art models such as ViT and DeiT in top-1 accuracy on ImageNet.
Co-advise: Cross Inductive Bias Distillation
TLDR: Equipped with this cross inductive bias distillation method, the vision transformers (termed CivT) outperform all previous transformers of the same architecture on ImageNet.
Training Vision Transformers for Image Retrieval
TLDR: This work adopts vision transformers for generating image descriptors and trains the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy regularizer, and shows consistent and significant improvements of transformers over convolution-based approaches.
Less is More: Pay Less Attention in Vision Transformers
TLDR: A hierarchical Transformer is proposed in which pure multi-layer perceptrons (MLPs) are used to encode rich local patterns in the early stages, while self-attention modules are applied to capture longer dependencies in the deeper layers.
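To make the division of labor in this summary concrete, here is a toy sketch in which early stages are residual token-wise MLP blocks and later stages are self-attention blocks; the depths, widths, and absence of downsampling are simplifications, not the paper's architecture.

    import torch.nn as nn

    def mlp_block(dim: int) -> nn.Module:
        # Token-wise feed-forward block used in the early stages.
        return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                             nn.GELU(), nn.Linear(4 * dim, dim))

    class AttnBlock(nn.Module):
        # Self-attention block used in the deeper stages.
        def __init__(self, dim: int, heads: int = 4):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):
            h = self.norm(x)
            out, _ = self.attn(h, h, h)
            return x + out

    class EarlyMLPLateAttention(nn.Module):
        def __init__(self, dim: int = 192, mlp_depth: int = 4, attn_depth: int = 4):
            super().__init__()
            self.early = nn.ModuleList(mlp_block(dim) for _ in range(mlp_depth))
            self.late = nn.ModuleList(AttnBlock(dim) for _ in range(attn_depth))

        def forward(self, tokens):  # tokens: (B, N, dim)
            for blk in self.early:
                tokens = tokens + blk(tokens)  # local patterns via MLPs
            for blk in self.late:
                tokens = blk(tokens)           # longer dependencies via attention
            return tokens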
Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet
TLDR: By slightly tuning the structure of vision transformers and introducing token labeling, a new training objective, these models are able to achieve better results than their CNN counterparts and other transformer-based classification models with a similar number of parameters and computations.
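The token-labeling objective described here can be read as an auxiliary per-patch classification loss added to the usual class-token loss; the sketch below assumes soft per-token targets and a simple weighting, both illustrative choices rather than the paper's exact formulation.

    import torch.nn.functional as F

    def token_labeling_loss(cls_logits, token_logits, labels, token_targets, beta=0.5):
        """cls_logits: (B, C); token_logits: (B, N, C); labels: (B,) class indices;
        token_targets: (B, N, C) soft per-token target distributions."""
        cls_loss = F.cross_entropy(cls_logits, labels)
        token_log_probs = F.log_softmax(token_logits, dim=-1)
        # Average cross-entropy between each patch token and its own target.
        aux_loss = -(token_targets * token_log_probs).sum(dim=-1).mean()
        return cls_loss + beta * aux_loss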
Incorporating Convolution Designs into Visual Transformers
TLDR: A new Convolution-enhanced image Transformer (CeiT) is proposed, which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.

References

Showing 1-10 of 84 references
Generative Pretraining From Pixels
TLDR: This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.
Fixing the train-test resolution discrepancy
TLDR: It is experimentally validated that, for a target test resolution, using a lower train resolution offers better classification at test time, and a simple yet effective and efficient strategy is proposed to optimize the classifier performance when the train and test resolutions differ.
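A minimal data-pipeline sketch of the train/test resolution recipe this summary describes: training crops are sampled at a lower resolution than the test-time center crop (in practice the classifier is then fine-tuned at the test resolution). The specific sizes below are placeholder values, not the paper's settings.

    from torchvision import transforms

    train_res, test_res = 160, 224  # assumed example resolutions

    train_tf = transforms.Compose([
        transforms.RandomResizedCrop(train_res),   # lower-resolution training crops
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

    test_tf = transforms.Compose([
        transforms.Resize(int(test_res * 1.15)),   # resize, then center-crop at test time
        transforms.CenterCrop(test_res),
        transforms.ToTensor(),
    ])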
MultiGrain: a unified image embedding for classes and instances
TLDR: A key component of MultiGrain is a pooling layer that takes advantage of high-resolution images with a network trained at a lower resolution; the resulting embedding provides state-of-the-art classification accuracy when fed to a linear classifier.
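A pooling layer of this kind is commonly implemented as generalized-mean (GeM) pooling, which interpolates between average and max pooling via an exponent p; the sketch below is a generic GeM layer with a learnable exponent, offered as an illustration rather than the MultiGrain implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GeMPool(nn.Module):
        def __init__(self, p: float = 3.0, eps: float = 1e-6):
            super().__init__()
            self.p = nn.Parameter(torch.tensor(p))  # learnable pooling exponent
            self.eps = eps

        def forward(self, x):  # x: (B, C, H, W) feature map
            x = x.clamp(min=self.eps).pow(self.p)
            # Mean over spatial positions, then the inverse power.
            return F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p).flatten(1)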
Attention Augmented Convolutional Networks
TLDR: It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile-constrained network, while keeping the number of parameters similar.
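As a sketch of the augmentation idea summarized above, the module below runs a convolutional branch and a self-attention branch over flattened spatial positions in parallel and concatenates them along channels; the channel split, head count, and absence of relative position encodings are simplifying assumptions.

    import torch
    import torch.nn as nn

    class AttentionAugmentedConv(nn.Module):
        def __init__(self, in_ch: int, conv_ch: int, attn_ch: int, heads: int = 4):
            super().__init__()
            # attn_ch must be divisible by heads.
            self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size=3, padding=1)
            self.to_tokens = nn.Conv2d(in_ch, attn_ch, kernel_size=1)
            self.attn = nn.MultiheadAttention(attn_ch, heads, batch_first=True)

        def forward(self, x):  # x: (B, C, H, W)
            b, _, h, w = x.shape
            conv_out = self.conv(x)
            tokens = self.to_tokens(x).flatten(2).transpose(1, 2)  # (B, H*W, attn_ch)
            attn_out, _ = self.attn(tokens, tokens, tokens)
            attn_out = attn_out.transpose(1, 2).reshape(b, -1, h, w)
            return torch.cat([conv_out, attn_out], dim=1)  # concatenate along channels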
AutoAugment: Learning Augmentation Policies from Data
TLDR: This paper describes a simple procedure called AutoAugment to automatically search for improved data augmentation policies, which achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet (without additional data).
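The learned ImageNet policy from this line of work ships with torchvision; a minimal usage sketch (torchvision >= 0.10 assumed, applied before tensor conversion since the op works on PIL images):

    from torchvision import transforms
    from torchvision.transforms import AutoAugment, AutoAugmentPolicy

    augment = transforms.Compose([
        transforms.RandomResizedCrop(224),
        AutoAugment(policy=AutoAugmentPolicy.IMAGENET),  # learned augmentation sub-policies
        transforms.ToTensor(),
    ])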
Circumventing Outliers of AutoAugment with Knowledge Distillation
TLDR: It is revealed that AutoAugment may remove part of the discriminative information from the training image, so insisting on the ground-truth label is no longer the best option, and knowledge distillation is therefore used, with the output of a teacher model guiding network training.
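The knowledge-distillation term this reference builds on is, in its generic soft-label form, a temperature-scaled KL divergence between teacher and student outputs; a minimal sketch with an illustrative temperature and no ground-truth term:

    import torch.nn.functional as F

    def soft_distillation_loss(student_logits, teacher_logits, T: float = 3.0):
        # Student is pushed toward the teacher's softened output distribution.
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        # KL divergence, scaled by T^2 as is conventional for distillation.
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T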
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
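For reference, the scaled dot-product attention at the core of the Transformer, written out directly (single head, no masking):

    import math

    def scaled_dot_product_attention(q, k, v):
        # q, k, v: (B, N, d) query, key and value tensors.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return scores.softmax(dim=-1) @ v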
Rethinking the Inception Architecture for Computer Vision
TLDR: This work explores ways to scale up networks that aim at utilizing the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
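An example of the convolution factorization this work popularized: replacing a 3x3 convolution with a 1x3 followed by a 3x1, which covers the same receptive field with fewer parameters (channel counts below are arbitrary placeholders):

    import torch.nn as nn

    # 3x3 receptive field built from two cheaper asymmetric convolutions.
    factorized_3x3 = nn.Sequential(
        nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1)),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0)),
    )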
Deep Residual Learning for Image Recognition
TLDR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.
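A basic residual block in the spirit of this reference, where stacked convolutions learn a residual that is added back to the identity path (channel-preserving case only, for brevity):

    import torch.nn as nn

    class BasicResidualBlock(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(x + self.body(x))  # identity shortcut plus learned residual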
Bag of Tricks for Image Classification with Convolutional Neural Networks
TLDR: This paper examines a collection of training procedure refinements and empirically evaluates their impact on the final model accuracy through ablation study, and shows that by combining these refinements together, they are able to improve various CNN models significantly.
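Two of the refinements commonly associated with this line of work, label smoothing and cosine learning-rate decay, expressed in current PyTorch (>= 1.10 assumed); the model, learning rate, and schedule length are placeholder values:

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 1000)  # stand-in for a CNN backbone + classifier
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=120)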