Publications
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
TLDR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
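The core idea in the title — treating an image as a sequence of 16x16 "words" — amounts to splitting the image into non-overlapping patches, flattening each, and linearly projecting it into a token embedding. A minimal NumPy sketch of that patch-embedding step follows; the patch size, embedding dimension, and zero-initialised class/position embeddings are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an image of shape (H, W, C) into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)          # group the two grid axes
    return patches.reshape(-1, patch_size * patch_size * C)

def embed_patches(patches, proj, cls_token, pos_embed):
    """Linearly project patches, prepend a class token, add position embeddings."""
    tokens = patches @ proj                             # (N, D) patch embeddings
    tokens = np.vstack([cls_token, tokens])             # (N + 1, D) with [CLS] first
    return tokens + pos_embed

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
patches = patchify(img)                                 # 14 * 14 = 196 patches of dim 768
proj = rng.standard_normal((16 * 16 * 3, 64)) * 0.02    # toy embedding dim of 64
cls = np.zeros((1, 64))                                 # placeholder learnable [CLS] token
pos = np.zeros((197, 64))                               # placeholder position embeddings
tokens = embed_patches(patches, proj, cls, pos)         # (197, 64) token sequence
```

The resulting token sequence is what a standard Transformer encoder then processes, with no image-specific layers beyond this embedding.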
iCaRL: Incremental Classifier and Representation Learning
TLDR: iCaRL can learn many classes incrementally over a long period of time where other strategies quickly fail; unlike earlier works, it is not fundamentally limited to fixed data representations and is therefore compatible with deep learning architectures.
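At test time iCaRL classifies with a nearest-mean-of-exemplars rule: each class is represented by the mean feature vector of its stored exemplars, and a sample is assigned to the class with the closest mean. A minimal NumPy sketch of that rule, using synthetic features (the full method's exemplar selection and distillation training are not shown):

```python
import numpy as np

def nme_classify(features, exemplar_sets):
    """Nearest-mean-of-exemplars: assign each feature vector to the class whose
    L2-normalised exemplar mean is closest in Euclidean distance."""
    means = np.stack([ex.mean(axis=0) / np.linalg.norm(ex.mean(axis=0))
                      for ex in exemplar_sets])                     # (K, D) class means
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    dists = np.linalg.norm(feats[:, None, :] - means[None, :, :], axis=2)
    return dists.argmin(axis=1)                                     # class index per sample

# Synthetic, well-separated feature clusters standing in for a learned representation.
rng = np.random.default_rng(0)
class0 = rng.standard_normal((20, 8)) + np.array([5.0] + [0.0] * 7)
class1 = rng.standard_normal((20, 8)) + np.array([0.0, 5.0] + [0.0] * 6)
preds = nme_classify(np.vstack([class0[:5], class1[:5]]), [class0, class1])
```

Because the class means are recomputed from exemplars under the current feature extractor, this rule stays valid as the representation itself is updated across increments.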
Seed, Expand and Constrain: Three Principles for Weakly-Supervised Image Segmentation
TLDR: It is shown experimentally that training a deep convolutional neural network using the proposed loss function leads to substantially better segmentations than previous state-of-the-art methods on the challenging PASCAL VOC 2012 dataset.
MLP-Mixer: An all-MLP Architecture for Vision
TLDR: It is shown that while convolutions and attention are each sufficient for good performance, neither is necessary, and MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.
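A Mixer layer alternates two MLPs: a token-mixing MLP applied along the patch axis (mixing spatial information) and a channel-mixing MLP applied along the channel axis, each with a residual connection. A minimal NumPy sketch of one such block, with LayerNorm omitted and ReLU standing in for the paper's GELU; all shapes and weights are illustrative assumptions:

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer MLP; ReLU stands in for the paper's GELU for brevity."""
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2):
    """One Mixer layer on x of shape (patches S, channels C). LayerNorm omitted."""
    x = x + mlp(x.T, tok_w1, tok_w2).T   # token-mixing MLP acts along the patch axis
    x = x + mlp(x, ch_w1, ch_w2)         # channel-mixing MLP acts along the channel axis
    return x

rng = np.random.default_rng(0)
S, C, Ds, Dc = 196, 64, 128, 256         # toy sequence length and hidden widths
x = rng.standard_normal((S, C))
y = mixer_block(
    x,
    rng.standard_normal((S, Ds)) * 0.02, rng.standard_normal((Ds, S)) * 0.02,
    rng.standard_normal((C, Dc)) * 0.02, rng.standard_normal((Dc, C)) * 0.02,
)
```

The transpose before the token-mixing MLP is the whole trick: the same kind of MLP mixes across patches in one sub-layer and across channels in the next, with no convolution or attention anywhere.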
The Open Images Dataset V4
TLDR: In-depth comprehensive statistics about the dataset are provided, the quality of the annotations is validated, it is studied how the performance of several modern models evolves with increasing amounts of training data, and two applications made possible by having unified annotations of multiple types coexisting in the same images are demonstrated.
Big Transfer (BiT): General Visual Representation Learning
TLDR: By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.
S4L: Self-Supervised Semi-Supervised Learning
TLDR: It is shown that S4L and existing semi-supervised methods can be jointly trained, yielding a new state-of-the-art result on semi-supervised ILSVRC-2012 with 10% of labels.
Revisiting Self-Supervised Visual Representation Learning
TLDR: This study revisits numerous previously proposed self-supervised models, conducts a thorough large-scale study, and uncovers multiple crucial insights about standard recipes for CNN design that do not always translate to self-supervised representation learning.
Are we done with ImageNet?
TLDR: A significantly more robust procedure for collecting human annotations of the ImageNet validation set is developed, which finds the original ImageNet labels to no longer be the best predictors of this independently-collected set, indicating that their usefulness in evaluating vision models may be nearing an end.
Scaling Vision Transformers
TLDR: A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well on few-shot learning.