Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin. 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) [16] that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets… 
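The self-distillation scheme behind this paper (DINO) pairs a student network with a teacher whose weights are an exponential moving average (EMA) of the student's, so no labels are needed. A minimal sketch of that update, with plain lists standing in for real parameter tensors:

```python
# Sketch of a DINO-style teacher update: each teacher parameter is an
# exponential moving average (EMA) of the corresponding student one.
def ema_update(teacher, student, momentum):
    """Move each teacher parameter toward the student's, in place."""
    for name, s_params in student.items():
        t_params = teacher[name]
        for i, s in enumerate(s_params):
            t_params[i] = momentum * t_params[i] + (1.0 - momentum) * s

student = {"w": [1.0, 2.0]}
teacher = {"w": [0.0, 0.0]}
ema_update(teacher, student, momentum=0.5)  # DINO uses ~0.996 in practice
print(teacher["w"])  # -> [0.5, 1.0]
```

The small momentum here is only for arithmetic clarity; in training, a momentum close to 1 makes the teacher a slow, stable average of past students.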

Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

This work proposes a simple yet effective self-supervised learning (SSL) strategy to train ViTs that, without any external annotation, can significantly improve results.

Patch-level Representation Learning for Self-supervised Vision Transformers

SelfPatch is designed and demonstrated to significantly improve the performance of existing SSL methods on various visual tasks, including object detection and semantic segmentation, and to improve the recent self-supervised ViT method DINO.

Self-Supervised Learning with Swin Transformers

This paper presents a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture, tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation, and shows that the learnt representations transfer to downstream tasks such as object detection and semantic segmentation.

Efficient Self-supervised Vision Transformers for Representation Learning

This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning and proposes a new pre-training task of region matching which allows the model to capture fine-grained region dependencies and as a result improves the quality of the learned vision representations.

A Closer Look at Self-supervised Lightweight Vision Transformers

This work mainly develops recipes for pre-training high-performance lightweight ViTs with masked-image-modeling-based MAE, namely MAE-lite, and reveals that properly learned lower layers of the pre-trained models matter more than higher ones in data-sufficient downstream tasks.

Exploring Feature Self-relation for Self-supervised Transformer

Instead of conducting self-supervised learning solely on feature embeddings from multiple views, this work utilizes feature self-relations, i.e., pixel- and channel-level relations between features. Self-relation-based learning further enhances the relation-modeling ability of ViTs, resulting in strong representations that stably improve performance on multiple downstream tasks.

RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training

This paper incorporates local feature learning into self-supervised vision transformers via Reconstructive Pre-training (RePre), which extends contrastive frameworks by adding a branch for reconstructing raw image pixels in parallel with the existing contrastive objective.

Visual Representation Learning with Self-Supervised Attention for Low-Label High-Data Regime

This paper is the first to question if self-supervised vision transformers (SSL-ViTs) can be adapted to two important computer vision tasks in the low-label, high-data regime: few-shot image classification and zero-shot image retrieval.

Position Labels for Self-Supervised Vision Transformer

This work proposes to train ViT to recognize the positional label of each patch of the input image; this apparently simple task actually yields a meaningful self-supervisory signal.
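One way to set up such a pretext task — an illustrative sketch, not necessarily the paper's exact recipe — is to shuffle an image's patches and ask the network to classify each patch's original grid position:

```python
import random

# Sketch of a position-label pretext task: shuffle patches and predict
# each patch's original raster index. Strings stand in for patch tensors.
def make_position_task(patches, rng):
    """Return (shuffled_patches, position_labels) for one image."""
    order = list(range(len(patches)))
    rng.shuffle(order)
    shuffled = [patches[i] for i in order]
    # order[k] is the original raster index of shuffled patch k,
    # i.e. the class the network must predict for that patch
    return shuffled, order

rng = random.Random(0)
patches = ["p0", "p1", "p2", "p3"]
shuffled, labels = make_position_task(patches, rng)
print(all(shuffled[k] == patches[labels[k]] for k in range(4)))  # -> True
```

The labels come for free from the patchification itself, which is what makes the task self-supervised.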

Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?

RELICv2 is the first representation learning method to consistently outperform the supervised baseline in a like-for-like comparison across a range of standard ResNet architectures, and it is shown that, despite using ResNet encoders, RELICv2 is comparable to state-of-the-art self-supervised vision transformers.

Self-supervised Pretraining of Visual Features in the Wild

The final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B random images using 512 GPUs, achieves 84.2% top-1 accuracy, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real-world setting.

SEED: Self-supervised Distillation For Visual Representation

This paper proposes a new learning paradigm, named SElf-SupErvised Distillation (SEED), in which a larger network is leveraged to transfer its representational knowledge into a smaller architecture in a self-supervised fashion, and shows that SEED dramatically boosts the performance of small networks on downstream tasks.
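The core of this kind of self-supervised distillation can be sketched as matching the student's similarity distribution to the teacher's via cross-entropy; the temperature and scores below are illustrative values, not the paper's settings:

```python
import math

# Sketch of SEED-style distillation: the student is trained to match the
# teacher's softmaxed similarity scores (e.g. over a queue of negatives).
def softmax(scores, temperature):
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_scores, student_scores, t=0.2):
    p = softmax(teacher_scores, t)   # teacher's soft targets
    q = softmax(student_scores, t)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# the loss is lowest when the student reproduces the teacher's scores
aligned = distill_loss([0.9, 0.1, -0.3], [0.9, 0.1, -0.3])
shifted = distill_loss([0.9, 0.1, -0.3], [-0.3, 0.1, 0.9])
print(aligned < shifted)  # -> True
```

Because the targets are soft distributions rather than hard labels, the small network inherits the relational structure of the teacher's embedding space.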

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

Whitening for Self-Supervised Representation Learning

This paper proposes a different direction and a new loss function for self-supervised learning, based on whitening of the latent-space features, and empirically shows that this loss accelerates self-supervised training and that the learned representations are much more effective on downstream tasks than previously published work.
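The central operation — decorrelating batch features so their covariance is the identity, which prevents representational collapse without negative pairs — can be sketched with ZCA whitening (a minimal illustration; the paper's exact whitening procedure may differ):

```python
import numpy as np

# Sketch of feature whitening for SSL: transform a batch of features so
# their empirical covariance becomes (approximately) the identity.
def whiten(z, eps=1e-5):
    """ZCA-whiten a batch of features z of shape (n, d)."""
    z = z - z.mean(axis=0)                    # center each dimension
    cov = (z.T @ z) / (z.shape[0] - 1)        # (d, d) covariance
    vals, vecs = np.linalg.eigh(cov)          # eigendecomposition
    w = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return z @ w                              # decorrelated features

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 8))  # correlated batch
zw = whiten(z)
cov = (zw.T @ zw) / (zw.shape[0] - 1)
print(np.allclose(cov, np.eye(8), atol=1e-2))  # near-identity covariance
```

With all feature dimensions forced to unit variance and zero correlation, a trivial constant solution can no longer minimize the loss.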

Learning Representations by Predicting Bags of Visual Words

This work shows that the process of image discretization into visual words can provide the basis for very powerful self-supervised approaches in the image domain, thus allowing further connections to be made to related methods from the NLP domain that have been extremely successful so far.

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, and introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
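SimCLR's contrastive objective (NT-Xent) treats the two augmented views of each image as a positive pair among all other in-batch embeddings; a NumPy sketch, with batch size and temperature chosen only for illustration:

```python
import numpy as np

# Sketch of the NT-Xent loss: 2N embeddings (two views per image), each
# view must identify its partner by temperature-scaled cosine similarity.
def nt_xent(z1, z2, temperature=0.5):
    z = np.concatenate([z1, z2])                       # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize
    sim = z @ z.T / temperature                        # cosine logits
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), targets]))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 16))
loss_mismatched = nt_xent(z1, rng.normal(size=(4, 16)))
loss_matched = nt_xent(z1, z1 + 0.01 * rng.normal(size=(4, 16)))
print(loss_matched < loss_mismatched)  # aligned views give a lower loss
```

Every other embedding in the batch serves as a negative, which is why the paper finds large batch sizes so beneficial.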

Billion-scale semi-supervised learning for image classification

This paper proposes a pipeline, based on a teacher/student paradigm, that leverages a large collection of unlabelled images to improve the performance of a given target architecture, like ResNet-50 or ResNeXt.

What makes for good views for contrastive learning

This paper uses empirical analysis to better understand the importance of view selection, and argues that the mutual information (MI) between views should be reduced while keeping task-relevant information intact, and devise unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI.

Unsupervised Learning by Predicting Noise

This paper introduces a generic framework to train deep networks, end-to-end, with no supervision, to fix a set of target representations, called Noise As Targets (NAT), and to constrain the deep features to align to them.
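The idea can be sketched as matching each image's feature to a distinct fixed target vector; the greedy matcher below is a simple stand-in for the assignment solver used in the paper, and the orthonormal targets are chosen only to keep the demo deterministic:

```python
import numpy as np

# Sketch of Noise As Targets (NAT): fix a set of target vectors and
# (re)assign each feature to the closest unused target, so features
# spread out over the target set instead of collapsing.
def assign_targets(features, targets):
    """Greedily match each feature row to a distinct target row."""
    sim = features @ targets.T                 # (n, n) dot-product scores
    assignment = np.full(len(features), -1)
    used = set()
    for idx in np.argsort(-sim, axis=None):    # best scores first
        i, j = divmod(int(idx), sim.shape[1])
        if assignment[i] == -1 and j not in used:
            assignment[i] = j
            used.add(j)
    return assignment

rng = np.random.default_rng(0)
targets = np.eye(8)  # orthonormal targets for clarity; the paper samples
                     # random points on the unit sphere
features = targets[::-1] + 0.01 * rng.normal(size=(8, 8))  # shuffled copies
print(assign_targets(features, targets))  # -> [7 6 5 4 3 2 1 0]
```

Training then pulls each feature toward its assigned target, while the assignment step keeps the mapping one-to-one.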

S2-BNN: Bridging the Gap Between Self-Supervised Real and 1-bit Neural Networks via Guided Distribution Calibration

This paper presents a novel guided learning paradigm that distills from real-valued networks to binary networks on the final prediction distribution, to minimize the loss and obtain desirable accuracy on BNNs.