Corpus ID: 233444273

Emerging Properties in Self-Supervised Vision Transformers

@article{Caron2021EmergingPI,
  title={Emerging Properties in Self-Supervised Vision Transformers},
  author={Mathilde Caron and Hugo Touvron and Ishan Misra and Hervé Jégou and Julien Mairal and Piotr Bojanowski and Armand Joulin},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.14294}
}
In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) [16] that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets… 
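The first observation above, that the self-attention of a self-supervised ViT delineates objects, can be probed with nothing more than a pretrained checkpoint. Below is a minimal sketch in PyTorch; it assumes the authors' publicly released torch.hub entry point (facebookresearch/dino) and the get_last_selfattention helper from that repository, names which come from the released code rather than from the abstract excerpt above.

import torch
from PIL import Image
from torchvision import transforms

# Load a self-supervised ViT-S/16 checkpoint (assumed hub entry point).
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((480, 480)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
img = preprocess(Image.open("input.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    # Attention of the last block: (1, num_heads, num_tokens, num_tokens).
    attn = model.get_last_selfattention(img)

# Keep only the [CLS]-to-patch attention and fold it back onto the patch grid.
num_heads = attn.shape[1]
cls_attn = attn[0, :, 0, 1:]          # (num_heads, num_patches)
grid = 480 // 16                      # 30x30 patches for a /16 model at 480 px
cls_attn = cls_attn.reshape(num_heads, grid, grid)

# Each head tends to highlight a different, roughly object-shaped region.
masks = cls_attn / cls_attn.amax(dim=(1, 2), keepdim=True)
print(masks.shape)                    # e.g. torch.Size([6, 30, 30]) for ViT-S/16

Upsampling and thresholding these per-head maps is one way to reproduce the segmentation-like visualizations the abstract alludes to.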
Self-Supervised Learning with Swin Transformers
TLDR
This paper presents a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture, tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation, and enables the learnt representations to be evaluated on downstream tasks such as object detection and semantic segmentation.
Efficient Self-supervised Vision Transformers for Representation Learning
TLDR
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning and proposes a new pre-training task of region matching which allows the model to capture fine-grained region dependencies and as a result significantly improves the quality of the learned vision representations.
Self-Distilled Self-Supervised Representation Learning
TLDR
The method, Self-Distilled Self-Supervised Learning (SDSSL), outperforms competitive baselines (SimCLR, BYOL and MoCo v3) using ViT on various tasks and datasets, and leads to superior performance not only in the final layers but also in most of the lower layers.
MST: Masked Self-Supervised Transformer for Visual Representation
TLDR
This paper proposes a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning.
Intriguing Properties of Vision Transformers
TLDR
Effective features of ViTs are shown to be due to the flexible and dynamic receptive fields made possible by self-attention mechanisms, leading to high accuracy across a range of classification datasets in both traditional and few-shot learning paradigms.
Vision Transformers are Robust Learners
TLDR
This work studies the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples, and conducts a comprehensive performance comparison between ViT models and state-of-the-art convolutional neural networks (CNNs) such as Big Transfer (BiT).
Semi-Supervised Vision Transformers
TLDR
A joint semi-supervised learning framework, Semiformer, which contains a Transformer branch, a convolutional branch and a carefully designed fusion module for knowledge sharing between the branches, and is compatible with most modern Transformer and convolutional neural architectures.
Scaled ReLU Matters for Training Vision Transformers
TLDR
It is verified, both theoretically and empirically, that scaled ReLU in the conv-stem matters for robust ViT training: it not only improves training stability but also increases the diversity of patch tokens, boosting peak performance by a large margin while adding few parameters and FLOPs.
SLIP: Self-supervision meets Language-Image Pre-training
TLDR
This work introduces SLIP, a multi-task learning framework combining self-supervised learning and CLIP pre-training, and finds that SLIP enjoys the best of both worlds: better performance than either self-supervision or language supervision alone.
Characterizing and Improving the Robustness of Self-Supervised Learning through Background Augmentations
TLDR
This work investigates a class of simple yet highly effective “background augmentations”, which encourage models to focus on semantically relevant content by discouraging them from focusing on image backgrounds, and shows that background augmentations lead to substantial improvements in performance across a spectrum of state-of-the-art self-supervised methods on a variety of tasks.

References

SHOWING 1-10 OF 85 REFERENCES
Self-supervised Pretraining of Visual Features in the Wild
TLDR
The final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B random images with 512 GPUs, achieves 84.2% top-1 accuracy, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real-world setting.
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
Whitening for Self-Supervised Representation Learning
TLDR
This paper proposes a different direction and a new loss function for self-supervised learning which is based on the whitening of the latent-space features and empirically shows that this loss accelerates self-supervised training and the learned representations are much more effective for downstream tasks than previously published work.
Learning Representations by Predicting Bags of Visual Words
TLDR
This work shows that the process of image discretization into visual words can provide the basis for very powerful self-supervised approaches in the image domain, thus allowing further connections to be made to related methods from the NLP domain that have been extremely successful so far.
Billion-scale semi-supervised learning for image classification
TLDR
This paper proposes a pipeline, based on a teacher/student paradigm, that leverages a large collection of unlabelled images to improve the performance for a given target architecture, like ResNet-50 or ResNeXt.
Unsupervised Learning by Predicting Noise
TLDR
This paper introduces a generic framework to train deep networks end-to-end with no supervision: it fixes a set of target representations, called Noise As Targets (NAT), and constrains the deep features to align to them.
Self-labelling via simultaneous clustering and representation learning
TLDR
The proposed novel and principled learning formulation is able to self-label visual data so as to train highly competitive image representations without manual labels and yields the first self-supervised AlexNet that outperforms the supervised Pascal VOC detection baseline.
Local Aggregation for Unsupervised Learning of Visual Embeddings
TLDR
This work describes a method that trains an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space, while allowing dissimilar instances to separate.
Unsupervised Deep Learning by Neighbourhood Discovery
TLDR
This work introduces a generic unsupervised deep learning approach to training deep models without the need for any manual label supervision, which progressively discovers sample-anchored/centred neighbourhoods to reason about and learn the underlying class decision boundaries iteratively and accumulatively.
Unsupervised Pre-Training of Image Features on Non-Curated Data
TLDR
This work proposes a new unsupervised approach which leverages self-supervision and clustering to capture complementary statistics from large-scale data and validates its approach on 96 million images from YFCC100M, achieving state-of-the-art results among unsupervised methods on standard benchmarks.