CASTing Your Model: Learning to Localize Improves Self-Supervised Representations

  • Ramprasaath R. Selvaraju, Karan Desai, Justin Johnson, Nikhil Naik
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Recent advances in self-supervised learning (SSL) have largely closed the gap with supervised ImageNet pretraining. Despite their success, these methods have been primarily applied to unlabeled ImageNet images, and show marginal gains when trained on larger sets of uncurated images. We hypothesize that current SSL methods perform best on iconic images, and struggle on complex scene images with many objects. Analyzing contrastive SSL methods shows that they have poor visual grounding and receive…

Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations
The results show that an approach like MoCo works surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets, and that MoCo learns spatially structured representations when trained with a multi-crop strategy.
Unsupervised Object-Level Representation Learning from Scene Images
Object-level Representation Learning (ORL) is introduced, a new self-supervised learning framework for scene images that significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks.
Characterizing and Improving the Robustness of Self-Supervised Learning through Background Augmentations
This work investigates a class of simple, yet highly effective “background augmentations”, which encourage models to focus on semantically-relevant content by discouraging them from focusing on image backgrounds, and shows that background augmentations lead to substantial improvements in performance across a spectrum of state-of-the-art self-supervised methods on a variety of tasks.
Leveraging background augmentations to encourage semantic focus in self-supervised contrastive learning
This work investigates a class of simple, yet highly effective “background augmentations”, which encourage models to focus on semantically-relevant content by discouraging them from focusing on image backgrounds, and demonstrates that background augmentations improve robustness to a number of out-of-distribution settings.
Object-aware Contrastive Learning for Debiased Scene Representation
A novel object-aware contrastive learning framework that first localizes objects in a self-supervised manner and then debiases scene correlations via appropriate data augmentations that take the inferred object locations into account. The framework is shown to be effective when trained on multi-object images or evaluated on images with shifted backgrounds (and distributions).
Efficient Visual Pretraining with Contrastive Detection
This work introduces a new self-supervised objective, contrastive detection, which tasks representations with identifying object-level features across augmentations, leading to state-of-the-art transfer performance from ImageNet to COCO, while requiring up to 5× less pretraining.
Semi-weakly Supervised Contrastive Representation Learning for Retinal Fundus Images
This work considers weak labels in the form of pseudolabels and proposes a semi-weakly supervised contrastive learning (SWCL) framework for representation learning using semi-weakly annotated images, which surpasses all prior self-supervised methods and standard cross-entropy training, while closing the gaps with ImageNet pretraining.
Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
A multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training is proposed. Stacking multiple attention layers is shown to be effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing hand accuracies.
Dissecting Image Crops
The aim of this work is to dissect the fundamental impact of spatial crops, and there are also a number of practical implications to the work, such as detecting image manipulations and equipping neural network researchers with a better understanding of shortcut learning.
Object-Aware Cropping for Self-Supervised Learning
  • Shlok Mishra, Anshul Shah, +4 authors Dilip Krishnan
  • Computer Science
  • 2021
A core component of the recent success of self-supervised learning is cropping data augmentation, which selects sub-regions of an image to be used as positive views in the self-supervised loss. The…


Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases
This work demonstrates that approaches like MOCO and PIRL learn occlusion-invariant representations, but they fail to capture viewpoint and category instance invariance, which are crucial components for object recognition, and proposes an approach to leverage unstructured videos to learn representations that possess higher viewpoint invariance.
Scaling and Benchmarking Self-Supervised Visual Representation Learning
It is shown that by scaling on various axes (including data size and problem 'hardness'), one can largely match or even exceed the performance of supervised pre-training on a variety of tasks such as object detection, surface normal estimation and visual navigation using reinforcement learning.
Unsupervised Representation Learning by Predicting Image Rotations
This work proposes to learn image features by training ConvNets to recognize the 2D rotation applied to the input image, and demonstrates both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning.
DeepUSPS: Deep Robust Unsupervised Saliency Prediction With Self-Supervision
This work proposes a two-stage mechanism for robust unsupervised object saliency prediction, where the first stage involves refinement of the noisy pseudo labels generated from different handcrafted methods, and shows that this self-learning procedure outperforms all the existing unsupervised methods over different datasets.
Unsupervised Pre-Training of Image Features on Non-Curated Data
This work proposes a new unsupervised approach which leverages self-supervision and clustering to capture complementary statistics from large-scale data and validates its approach on 96 million images from YFCC100M, achieving state-of-the-art results among unsupervised methods on standard benchmarks.
Exploring the Limits of Weakly Supervised Pretraining
This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images and shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.
Self-labelling via simultaneous clustering and representation learning
The proposed novel and principled learning formulation is able to self-label visual data so as to train highly competitive image representations without manual labels and yields the first self-supervised AlexNet that outperforms the supervised Pascal VOC detection baseline.
Context Encoders: Feature Learning by Inpainting
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Deep Unsupervised Saliency Detection: A Multiple Noisy Labeling Perspective
This work presents a novel perspective on unsupervised saliency detection through learning from multiple noisy labelings generated by "weak" and "noisy" unsupervised handcrafted saliency methods.
Supervision by Fusion: Towards Unsupervised Learning of Deep Salient Object Detector
  • Dingwen Zhang, Junwei Han, Yu Zhang
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
  • 2017
This paper makes the earliest effort to train a deep salient object detector without using any human annotation; the resulting detector can approach the same network trained with full supervision and even outperform a number of fully supervised state-of-the-art approaches.