Publications
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
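The distillation token described above can be sketched in a few lines: alongside the usual class token, a second learned token is prepended to the patch sequence, and after the transformer its output is supervised by the teacher's prediction while the class token is supervised by the ground-truth label. This is a minimal NumPy illustration of the token layout only (shapes and names are illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 196, 64  # e.g. 14x14 patches of a 224px image (illustrative)

patch_tokens = rng.standard_normal((num_patches, dim))
cls_token = rng.standard_normal((1, dim))   # learned; its output is matched to the label
dist_token = rng.standard_normal((1, dim))  # learned; its output is matched to the teacher

# Both extra tokens attend to the patches through the transformer layers,
# which is how the student "learns from the teacher through attention".
tokens = np.concatenate([cls_token, dist_token, patch_tokens], axis=0)
assert tokens.shape == (num_patches + 2, dim)
```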
Emerging Properties in Self-Supervised Vision Transformers
TLDR
This paper asks whether self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets), and introduces DINO, a simple self-supervised method based on self-distillation with no labels.
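The "self-distillation with no labels" idea can be sketched as follows: a student network is trained to match the output distribution of a teacher (itself an exponential moving average of the student), where the teacher's output is centered and sharpened with a low temperature to avoid collapse. A minimal sketch of the loss and EMA update, with illustrative temperatures and momentum (not the authors' code):

```python
import numpy as np

def softmax(x, temp):
    z = (x - x.max()) / temp
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
dim = 8
student_logits = rng.standard_normal(dim)
teacher_logits = rng.standard_normal(dim)
center = np.zeros(dim)

# Teacher output: centered, then sharpened with a lower temperature than the student.
p_t = softmax(teacher_logits - center, temp=0.04)
p_s = softmax(student_logits, temp=0.1)
loss = -(p_t * np.log(p_s)).sum()  # cross-entropy between the two, no labels involved

# Teacher parameters and the center are updated by exponential moving average;
# shown here for the center only.
m = 0.99
center = m * center + (1 - m) * teacher_logits
```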
Fixing the train-test resolution discrepancy
TLDR
It is experimentally validated that, for a given target test resolution, training at a lower resolution yields better classification accuracy at test time, and a simple yet effective and efficient strategy is proposed to optimize classifier performance when the train and test resolutions differ.
Going deeper with Image Transformers
TLDR
This work builds and optimizes deeper transformer networks for image classification, investigates the interplay between the architecture and optimization of such dedicated transformers, and makes two architecture changes that significantly improve the accuracy of deep transformers.
ResMLP: Feedforward networks for image classification with data-efficient training
TLDR
ResMLP is a simple residual network that alternates a linear layer in which image patches interact, independently and identically across channels, and a two-layer feed-forward network in which channels interact independently per patch.
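The alternation described in this TLDR (a cross-patch linear layer applied identically per channel, then a per-patch two-layer MLP across channels) fits in a few lines. This is a hedged sketch with random weights; the paper's Affine normalization and layer-scale parameters are omitted for brevity:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def resmlp_block(x, A, W1, W2):
    # x: (patches, channels)
    # Cross-patch linear layer: the same (patches x patches) matrix A
    # mixes patches, identically and independently for every channel.
    x = x + A @ x
    # Two-layer feed-forward network mixing channels, applied
    # independently to each patch.
    x = x + gelu(x @ W1) @ W2
    return x

rng = np.random.default_rng(0)
p, c = 16, 32  # illustrative sizes
x = rng.standard_normal((p, c))
out = resmlp_block(
    x,
    0.1 * rng.standard_normal((p, p)),
    0.1 * rng.standard_normal((c, 4 * c)),
    0.1 * rng.standard_normal((4 * c, c)),
)
assert out.shape == (p, c)
```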
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
TLDR
GPSA is introduced, a form of positional self-attention that can be equipped with a "soft" convolutional inductive bias; it outperforms DeiT on ImageNet while offering much improved sample efficiency.
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference
TLDR
This work designs a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime and proposes LeViT, a hybrid neural network for fast inference image classification that significantly outperforms existing convnets and vision transformers.
Fixing the train-test resolution discrepancy: FixEfficientNet
TLDR
This strategy is advantageously combined with recent training recipes from the literature and significantly outperforms the initial architecture with the same number of parameters, and establishes the new state of the art for ImageNet with a single crop.
XCiT: Cross-Covariance Image Transformers
TLDR
This work proposes a "transposed" version of self-attention that operates across feature channels rather than tokens: the interactions are based on the cross-covariance matrix between keys and queries, giving linear complexity in the number of tokens and allowing efficient processing of high-resolution images.
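The channel-wise attention described above can be sketched directly: instead of an n×n token-token attention map, a d×d map is built from the cross-covariance of ℓ2-normalized keys and queries, so cost grows linearly with the number of tokens. A simplified NumPy sketch (the learned temperature, multiple heads, and projections of the actual model are omitted):

```python
import numpy as np

def softmax(x, axis):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_covariance_attention(Q, K, V, tau=1.0):
    # Q, K, V: (n_tokens, d). Normalize each channel over the tokens
    # so the d x d cross-covariance entries stay bounded.
    Qh = Q / (np.linalg.norm(Q, axis=0, keepdims=True) + 1e-8)
    Kh = K / (np.linalg.norm(K, axis=0, keepdims=True) + 1e-8)
    A = softmax((Kh.T @ Qh) / tau, axis=0)  # (d, d): attention over channels
    return V @ A                            # (n_tokens, d): linear in n_tokens

rng = np.random.default_rng(0)
n, d = 196, 32  # many tokens, few channels (illustrative)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = cross_covariance_attention(Q, K, V)
assert out.shape == (n, d)
```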
ResNet strikes back: An improved training procedure in timm
TLDR
This paper re-evaluates the performance of the vanilla ResNet-50 when trained with a procedure that integrates such advances, and shares competitive training settings and pre-trained models in the timm open-source library, in the hope that they will serve as better baselines for future work.