Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning

@article{Xie2021PropagateYE,
  title={Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning},
  author={Zhenda Xie and Yutong Lin and Zheng Zhang and Yue Cao and Stephen Lin and Han Hu},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={16679-16688}
}
  • Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, Han Hu
  • Published 19 November 2020
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Contrastive learning methods for unsupervised visual representation learning have reached remarkable levels of transfer performance. We argue that the power of contrastive learning has yet to be fully unleashed, as current methods are trained only on instance-level pretext tasks, leading to representations that may be sub-optimal for downstream tasks requiring dense pixel predictions. In this paper, we introduce pixel-level pretext tasks for learning dense feature representations. The first… 
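Since only the start of the method description survives in this excerpt, a minimal NumPy sketch of pixel-level contrastive learning in the spirit of the abstract may help. The function name, tensor shapes, distance threshold, and temperature below are illustrative assumptions, not the authors' exact implementation.

import numpy as np

def pixel_contrast_loss(feat1, feat2, coords1, coords2,
                        dist_thresh=0.7, temperature=0.3):
    # feat1, feat2: (N, C) L2-normalized per-pixel features from two
    # augmented views of the same image.
    # coords1, coords2: (N, 2) pixel coordinates warped back to the
    # original image space, so distances are comparable across views.
    sim = feat1 @ feat2.T / temperature                 # (N, N) similarities
    dist = np.linalg.norm(coords1[:, None] - coords2[None, :], axis=-1)
    pos = dist < dist_thresh                            # spatially close -> positive pair
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    n_pos = np.maximum(pos.sum(axis=1), 1)              # avoid division by zero
    per_pixel = -(log_prob * pos).sum(axis=1) / n_pos
    return per_pixel[pos.any(axis=1)].mean()            # average over pixels with positives

The key design choice this sketch captures is that positives are defined by spatial proximity between pixels of two views rather than by instance identity, which is what makes the pretext task pixel-level.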

Citations

Unsupervised Person Re-identification via Simultaneous Clustering and Consistency Learning
TLDR
This work designs a pretext task for unsupervised re-identification by learning visual consistency from still images and temporal consistency during the training process, such that the clustering network can separate the images into semantic clusters automatically.
Dense Siamese Network
TLDR
DenseSiam shows that simple location correspondence and interacted region embeddings are effective enough to learn similarity, surpassing state-of-the-art segmentation methods by 2.1 mIoU with only 28% of their training cost.
Separated Contrastive Learning for Organ-at-Risk and Gross-Tumor-Volume Segmentation with Limited Annotation
TLDR
A separated region-level contrastive learning scheme, SepaReg, is proposed; its core is to separate each image into regions and encode each region separately. Experiments demonstrate the effectiveness of the model, which consistently achieves better results than state-of-the-art approaches.
Deeply Unsupervised Patch Re-Identification for Pre-training Object Detectors.
TLDR
Deeply Unsupervised Patch Re-ID (DUPR) is presented, a simple yet effective method for unsupervised visual representation learning that outperforms state-of-the-art unsupervised pre-training methods, and even ImageNet supervised pre-training, on various downstream tasks related to object detection.
FisheyePixPro: Self-supervised pretraining using Fisheye images for semantic segmentation
TLDR
This is the first attempt to pretrain a contrastive-learning-based model directly on fisheye images in a self-supervised manner, and it achieves a significant improvement over the PixPro model.
Exploring Set Similarity for Dense Self-supervised Representation Learning
TLDR
This paper proposes to explore set similarity (SetSim) for dense self-supervised representation learning, generalizing pixel-wise similarity learning to set-wise similarity learning to improve robustness, since sets contain more semantic and structural information.
Joint Learning of Localized Representations from Medical Images and Reports
TLDR
Localized representation learning from Vision and Text (LoVT) is proposed, to the best of the authors' knowledge the first text-supervised pre-training method that targets localized medical imaging tasks; it performs best on 11 of the 18 studied tasks, making it the preferred choice for localized tasks.
Panoramic Panoptic Segmentation: Towards Complete Surrounding Understanding via Unsupervised Contrastive Learning
TLDR
This work introduces panoramic panoptic segmentation as the most holistic form of scene understanding, in terms of both field of view and image-level understanding for standard camera-based input, and proposes a framework that allows model training on standard pinhole images and transfers the learned features to a different domain.
Efficient Self-supervised Vision Transformers for Representation Learning
TLDR
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning, and proposes a new pre-training task of region matching that allows the model to capture fine-grained region dependencies and, as a result, improves the quality of the learned visual representations.
Learning Where to Learn in Cross-View Self-Supervised Learning
TLDR
This paper reinterprets the projection head in SSL as a per-pixel projection and predicts a set of spatial alignment maps from the original features by this weight-sharing projection head, so that the projected embeddings can be exactly aligned and thus better guide feature learning.

References

Showing 1-10 of 39 references
A Simple Framework for Contrastive Learning of Visual Representations
TLDR
It is shown that the composition of data augmentations plays a critical role in defining effective predictive tasks, that introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and that contrastive learning benefits from larger batch sizes and more training steps than supervised learning.
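As a concrete illustration of the summary above, here is a compact NumPy sketch of a SimCLR-style NT-Xent loss with a nonlinear projection head; the two-layer head, temperature value, and row-pairing convention are illustrative assumptions rather than the paper's exact setup.

import numpy as np

def project(h, W1, W2):
    # Learnable nonlinear transformation g(.) between the representation h
    # and the contrastive loss: a two-layer MLP with ReLU, then L2-normalize.
    z = W2 @ np.maximum(W1 @ h, 0.0)
    return z / np.linalg.norm(z)

def nt_xent(z, temperature=0.5):
    # z: (2B, D) normalized projections; rows 2k and 2k+1 are two augmented
    # views of image k.
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)           # a sample is never its own positive
    idx = np.arange(z.shape[0])
    pos = idx ^ 1                            # partner row: 0<->1, 2<->3, ...
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return np.mean(logsumexp - sim[idx, pos])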
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
TLDR
This paper proposes an online algorithm, SwAV, that takes advantage of contrastive methods without requiring pairwise comparisons to be computed, using a swapped prediction mechanism in which the cluster assignment of one view is predicted from the representation of another view.
Unsupervised Representation Learning by Predicting Image Rotations
TLDR
This work proposes to learn image features by training ConvNets to recognize the 2D rotation applied to their input image, and demonstrates both qualitatively and quantitatively that this apparently simple task provides a very powerful supervisory signal for semantic feature learning.
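A tiny sketch of this rotation pretext task, under the assumption that any image classifier is then trained on the generated labels (the function name and layout are hypothetical):

import numpy as np

def make_rotation_batch(images):
    # images: (B, H, W, C). Produces 4B rotated copies and the rotation
    # class (0..3, i.e. k * 90 degrees) a network must predict.
    rotated, labels = [], []
    for img in images:
        for k in range(4):
            rotated.append(np.rot90(img, k, axes=(0, 1)))
            labels.append(k)
    return np.stack(rotated), np.array(labels)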
Unsupervised Visual Representation Learning by Context Prediction
TLDR
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
Generative Pretraining From Pixels
TLDR
This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.
Fully Convolutional Networks for Semantic Segmentation
TLDR
It is shown that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation.
What Makes for Good Views for Contrastive Learning?
TLDR
This paper uses empirical analysis to better understand the importance of view selection, argues that the mutual information (MI) between views should be reduced while keeping task-relevant information intact, and devises unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI.
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
TLDR
A novel unsupervised learning approach is presented that builds features suitable for object detection and classification; to facilitate the transfer of features to other tasks, the context-free network (CFN), a siamese-ennead convolutional neural network, is introduced.
Local Aggregation for Unsupervised Learning of Visual Embeddings
TLDR
This work describes a method that trains an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space, while allowing dissimilar instances to separate.
Feature Pyramid Networks for Object Detection
TLDR
This paper exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost and achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles.