CASTing Your Model: Learning to Localize Improves Self-Supervised Representations

@inproceedings{selvaraju2021casting,
  title={CASTing Your Model: Learning to Localize Improves Self-Supervised Representations},
  author={Ramprasaath R. Selvaraju and Karan Desai and Justin Johnson and Nikhil Vijay Naik},
  booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}
Recent advances in self-supervised learning (SSL) have largely closed the gap with supervised ImageNet pretraining. Despite their success, these methods have been applied primarily to unlabeled ImageNet images and show only marginal gains when trained on larger sets of uncurated images. We hypothesize that current SSL methods perform best on iconic images and struggle on complex scene images with many objects. Analyzing contrastive SSL methods shows that they have poor visual grounding and receive…


Exploring Localization for Self-supervised Fine-grained Contrastive Learning

Cross-view saliency alignment (CVSA) is introduced: a contrastive learning framework that first crops and swaps saliency regions of images as a novel view-generation step, and then guides the model to localize on foreground objects via a cross-view alignment loss.

Learning Background Invariance Improves Generalization and Robustness in Self-Supervised Learning on ImageNet and Beyond

Through a systematic, comprehensive investigation, it is shown that background augmentations lead to improved generalization with substantial improvements in performance across a spectrum of state-of-the-art self-supervised methods (MoCo-v2, BYOL, SwAV) on a variety of tasks, even enabling performance on par with the supervised baseline.

Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations

The results show that an approach like MoCo works surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed, and (iii) general versus domain-specific datasets, and that MoCo learns spatially structured representations when trained with a multi-crop strategy.

Unsupervised Object-Level Representation Learning from Scene Images

Object-level Representation Learning (ORL), a new self-supervised learning framework for scene images, is introduced; it significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pretraining on several downstream tasks.

Characterizing and Improving the Robustness of Self-Supervised Learning through Background Augmentations

This work investigates a class of simple, yet highly effective “background augmentations”, which encourage models to focus on semantically-relevant content by discouraging them from focusing on image backgrounds, and shows that background augmentations lead to substantial improvements in performance across a spectrum of state-of-the-art self-supervised methods on a variety of tasks.

Good helper is around you: Attention-driven Masked Image Modeling

Attention-driven Masking and Throwing Strategy (AMT), a plug-and-play module for masked image modeling, improves the linear probing accuracy of MAE, outperforms both MAE and SimMIM, and achieves superior performance on downstream detection and segmentation tasks.

MSR: Making Self-supervised learning Robust to Aggressive Augmentations

This work proposes a new SSL paradigm that counteracts the impact of semantic shift by balancing the roles of weakly and aggressively augmented pairs, allowing the model to better embrace aggressive augmentations while neutralizing the semantic-shift problem.

Leveraging background augmentations to encourage semantic focus in self-supervised contrastive learning

This work investigates a class of simple, yet highly effective “background augmentations”, which encourage models to focus on semantically-relevant content by discouraging them from focusing on image backgrounds, and demonstrates that background augmentations improve robustness to a number of out of distribution settings.

Object-aware Contrastive Learning for Debiased Scene Representation

A novel object-aware contrastive learning framework is proposed that first localizes objects in a self-supervised manner and then debiases scene correlations via data augmentations that account for the inferred object locations; experiments demonstrate the effectiveness of the framework when trained on multi-object images or evaluated on background- (and distribution-) shifted images.

Efficient Visual Pretraining with Contrastive Detection

This work introduces a new self-supervised objective, contrastive detection, which tasks representations with identifying object-level features across augmentations, leading to state-of-the-art transfer accuracy on a variety of downstream tasks, while requiring up to 10× less pretraining.



Distilling Localization for Self-Supervised Representation Learning

This paper visualizes and diagnoses classification errors, and proposes a data-driven approach for learning invariance to backgrounds: it first estimates foreground saliency in images and then creates augmentations by copy-and-pasting the foreground onto a variety of backgrounds.
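The copy-and-paste idea above can be sketched in a few lines of NumPy. This is an illustrative alpha-composite only, assuming a precomputed binary saliency mask; the function name and toy data are ours, not the paper's.

```python
import numpy as np

def paste_foreground(image, mask, background):
    """Composite the salient foreground of `image` (selected by a binary
    `mask` of shape (H, W)) onto a new `background` of the same size."""
    assert image.shape == background.shape
    m = mask[..., None].astype(image.dtype)  # broadcast mask over channels
    return m * image + (1.0 - m) * background

# Toy example: a 2x2 RGB image whose top row is "salient"
img = np.ones((2, 2, 3), dtype=np.float32)
mask = np.array([[1, 1], [0, 0]])
bg = np.zeros_like(img)
out = paste_foreground(img, mask, bg)
# top row keeps the foreground; bottom row comes from the background
assert np.array_equal(out[0], np.ones((2, 3), dtype=np.float32))
assert np.array_equal(out[1], np.zeros((2, 3), dtype=np.float32))
```

In practice the mask would come from a saliency estimator and the background from a pool of other images, yielding many augmented views of the same foreground object.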

Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases

This work demonstrates that approaches like MOCO and PIRL learn occlusion-invariant representations, but they fail to capture viewpoint and category instance invariance which are crucial components for object recognition, and proposes an approach to leverage unstructured videos to learn representations that possess higher viewpoint invariance.

Scaling and Benchmarking Self-Supervised Visual Representation Learning

It is shown that by scaling on various axes (including data size and problem 'hardness'), one can largely match or even exceed the performance of supervised pre-training on a variety of tasks such as object detection, surface normal estimation and visual navigation using reinforcement learning.

Unsupervised Representation Learning by Predicting Image Rotations

This work proposes to learn image features by training ConvNets to recognize the 2d rotation that is applied to the image that it gets as input, and demonstrates both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning.
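The rotation pretext task described above is simple enough to sketch directly (an illustrative NumPy version; the function name and toy image are ours): each input is rotated by 0, 90, 180, or 270 degrees, and the network must classify which rotation was applied.

```python
import numpy as np

def make_rotation_batch(image):
    """Given one image of shape (H, W, C), return its four 90-degree
    rotations and the rotation-class labels (0..3) used as supervision."""
    views = [np.rot90(image, k=k, axes=(0, 1)) for k in range(4)]
    labels = np.arange(4)  # label k means a rotation of k * 90 degrees
    return views, labels

# Toy 4x4 single-channel "image"
img = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
views, labels = make_rotation_batch(img)
assert len(views) == 4 and views[0].shape == (4, 4, 1)
assert np.array_equal(views[2], np.rot90(img, 2, axes=(0, 1)))
```

A classifier trained on (view, label) pairs must recognize object orientation, which is the "surprisingly powerful" supervisory signal the summary refers to.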

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

This paper proposes an online algorithm, SwAV, that takes advantage of contrastive methods without requiring pairwise comparisons to be computed; it uses a swapped prediction mechanism in which the cluster assignment of one view is predicted from the representation of another view.
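The swapped prediction mechanism reduces to a cross-entropy between one view's code and the other view's prediction. A minimal NumPy sketch, with names and toy values of our choosing (real SwAV computes soft codes via Sinkhorn-Knopp rather than the one-hot targets used here):

```python
import numpy as np

def swapped_prediction_loss(p1, p2, q1, q2):
    """SwAV-style swapped prediction: the cluster assignment (code) of one
    view supervises the softmax prediction of the other view.
    p1, p2: predicted cluster probabilities for views 1 and 2, shape (K,)
    q1, q2: target codes (soft assignments) for views 1 and 2, shape (K,)"""
    ce = lambda q, p: -np.sum(q * np.log(p + 1e-12))  # cross-entropy
    return 0.5 * (ce(q2, p1) + ce(q1, p2))  # note the swap: q2 vs p1, q1 vs p2

# Toy example with K = 3 prototypes
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.6, 0.3, 0.1])
q1 = np.array([1.0, 0.0, 0.0])  # one-hot codes for illustration only
q2 = np.array([1.0, 0.0, 0.0])
loss = swapped_prediction_loss(p1, p2, q1, q2)
assert loss > 0
```

Because each view's code supervises the other view's prediction, no pairwise feature comparisons across a batch or memory bank are needed.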

DeepUSPS: Deep Robust Unsupervised Saliency Prediction With Self-Supervision

This work proposes a two-stage mechanism for robust unsupervised object saliency prediction, where the first stage refines the noisy pseudo-labels generated by different handcrafted methods, and shows that this self-learning procedure outperforms all existing unsupervised methods across different datasets.

What Makes for Good Views for Contrastive Learning?

This paper uses empirical analysis to better understand the importance of view selection, argues that the mutual information (MI) between views should be reduced while keeping task-relevant information intact, and devises unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI.

Unsupervised Pre-Training of Image Features on Non-Curated Data

This work proposes a new unsupervised approach which leverages self-supervision and clustering to capture complementary statistics from large-scale data, and validates the approach on 96 million images from YFCC100M, achieving state-of-the-art results among unsupervised methods on standard benchmarks.

Exploring the Limits of Weakly Supervised Pretraining

This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images and shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.

Self-labelling via simultaneous clustering and representation learning

The proposed novel and principled learning formulation is able to self-label visual data so as to train highly competitive image representations without manual labels and yields the first self-supervised AlexNet that outperforms the supervised Pascal VOC detection baseline.