Corpus ID: 229363322

Training data-efficient image transformers & distillation through attention

@inproceedings{Touvron2021TrainingDI,
  title={Training data-efficient image transformers \& distillation through attention},
  author={Hugo Touvron and Matthieu Cord and Matthijs Douze and Francisco Massa and Alexandre Sablayrolles and Hervé Jégou},
  booktitle={ICML},
  year={2021}
}
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption by the larger community. In this work, with an adequate training scheme, we produce a competitive convolution-free transformer by training on ImageNet only. We train it on a single computer in less than 3 days. Our…
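The "distillation through attention" in the title refers to DeiT's token-based hard distillation: the student vision transformer carries an extra distillation token whose output head is trained against the hard predictions of a convnet teacher, while the usual class token is trained against the ground-truth labels. Below is a minimal sketch of that objective, assuming a two-head student and a frozen teacher; the function and variable names are illustrative, not the authors' code.

# Sketch of DeiT-style hard-label distillation (illustrative names, not the
# reference implementation). The student ViT is assumed to return two sets of
# logits: one from its class-token head and one from its distillation-token head.
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    # Class-token head learns from the ground-truth labels.
    loss_cls = F.cross_entropy(cls_logits, targets)
    # Distillation-token head learns from the teacher's hard predictions.
    teacher_labels = teacher_logits.argmax(dim=-1)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    # The paper's hard-distillation variant weights the two terms equally.
    return 0.5 * loss_cls + 0.5 * loss_dist

# Typical use inside a training step (teacher kept frozen, e.g. a convnet):
#   with torch.no_grad():
#       teacher_logits = teacher(images)
#   cls_logits, dist_logits = student(images)   # assumed two-head student
#   loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)

Citations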
Efficient Vision Transformers via Fine-Grained Manifold Distillation
TLDR
This paper proposes to excavate useful information from the teacher transformer through the relationships between images and their divided patches, and explores an efficient fine-grained manifold distillation approach that simultaneously computes cross-image, cross-patch, and randomly selected manifolds in the teacher and student models.
Self-Supervised Learning with Swin Transformers
TLDR
This paper presents a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture, tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation, and enables the learnt representations to be evaluated on downstream tasks such as object detection and semantic segmentation.
A Survey of Visual Transformers
Yang Liu, Yao Zhang, +7 authors Zhiqiang He · Computer Science · ArXiv, 2021
TLDR
A comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation) is provided, where a taxonomy is proposed to organize these methods according to their motivations, structures, and usage scenarios.
Refiner: Refining Self-attention for Vision Transformers
TLDR
This work introduces a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs, and explores attention expansion that projects the multi-head attention maps to a higher-dimensional space to promote their diversity.
Scaling Vision Transformers
TLDR
A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well on few-shot learning.
AutoFormer: Searching Transformers for Visual Recognition
TLDR
This work proposes a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search; the resulting models surpass recent state-of-the-art models such as ViT and DeiT in top-1 accuracy on ImageNet.
Co-advise: Cross Inductive Bias Distillation
TLDR
Equipped with this cross inductive bias distillation method, the vision transformers (termed CivT) outperform all previous transformers of the same architecture on ImageNet.
Training Vision Transformers for Image Retrieval
TLDR
This work adopts vision transformers for generating image descriptors and trains the resulting model with a metric-learning objective that combines a contrastive loss with a differential entropy regularizer, showing consistent and significant improvements of transformers over convolution-based approaches.
Less is More: Pay Less Attention in Vision Transformers
TLDR
A hierarchical Transformer is proposed in which pure multi-layer perceptrons (MLPs) encode rich local patterns in the early stages, while self-attention modules capture longer-range dependencies in the deeper layers.
Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet
TLDR
By slightly tuning the structure of vision transformers and introducing token labeling, a new training objective, these models achieve better results than their CNN counterparts and other transformer-based classification models with a similar number of parameters and similar computation.

References

Showing 1–10 of 84 references
Generative Pretraining From Pixels
TLDR
This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.
Fixing the train-test resolution discrepancy
TLDR
It is experimentally validated that, for a target test resolution, using a lower train resolution offers better classification at test time, and a simple yet effective and efficient strategy is proposed to optimize classifier performance when the train and test resolutions differ.
MultiGrain: a unified image embedding for classes and instances
TLDR
A key component of MultiGrain is a pooling layer that takes advantage of high-resolution images with a network trained at a lower resolution; the resulting embedding provides state-of-the-art classification accuracy when fed to a linear classifier.
Attention Augmented Convolutional Networks
TLDR
It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile-constrained network, while keeping the number of parameters similar.
AutoAugment: Learning Augmentation Policies from Data
TLDR
This paper describes a simple procedure called AutoAugment to automatically search for improved data augmentation policies, which achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet (without additional data).
Circumventing Outliers of AutoAugment with Knowledge Distillation
TLDR
It is revealed that AutoAugment may remove part of the discriminative information from a training image, so insisting on the ground-truth label is no longer the best option; knowledge distillation, which refers to the output of a teacher model to guide network training, is therefore used.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Rethinking the Inception Architecture for Computer Vision
TLDR
This work explores ways to scale up networks that utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
Deep Residual Learning for Image Recognition
TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Bag of Tricks for Image Classification with Convolutional Neural Networks
TLDR
This paper examines a collection of training-procedure refinements and empirically evaluates their impact on final model accuracy through ablation studies, showing that combining these refinements significantly improves various CNN models.