Corpus ID: 236772573

Object-aware Contrastive Learning for Debiased Scene Representation

Sangwoo Mo, Hyun Bin Kang, Kihyuk Sohn, Chun-Liang Li, Jinwoo Shin
Contrastive self-supervised learning has shown impressive results in learning visual representations from unlabeled images by enforcing invariance against different data augmentations. However, the learned representations are often contextually biased to the spurious scene correlations of different objects or object and background, which may harm their generalization on the downstream tasks. To tackle the issue, we develop a novel object-aware contrastive learning framework that first (a… 
PreViTS: Contrastive Pretraining with Video Tracking Supervision
This work proposes PreViTS, an SSL framework that utilizes an unsupervised tracking signal for selecting clips containing the same object, which helps better utilize temporal transformations of objects.
Can domain adaptation make object recognition work for everyone?
The inefficacy of standard DA methods at Geographical DA is demonstrated, highlighting the need for specialized geographical adaptation solutions to address the challenge of making object recognition work for everyone.
CYBORGS: Contrastively Bootstrapping Object Representations by Grounding in Segmentation
This work uses segmentation masks to train a model with a mask-dependent contrastive loss, and uses the partially trained model to bootstrap better masks to improve segmentation throughout pretraining.
Using the Order of Tomographic Slices as a Prior for Neural Networks Pre-Training
This work proposes SortingLoss, a pre-training method that operates on slices instead of volumes, so that a model can be fine-tuned on a sparse set of slices without the interpolation step.


Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases
This work demonstrates that approaches like MOCO and PIRL learn occlusion-invariant representations, but they fail to capture viewpoint and category instance invariance which are crucial components for object recognition, and proposes an approach to leverage unstructured videos to learn representations that possess higher viewpoint invariance.
Leveraging background augmentations to encourage semantic focus in self-supervised contrastive learning
This work investigates a class of simple, yet highly effective “background augmentations”, which encourage models to focus on semantically-relevant content by discouraging them from focusing on image backgrounds, and demonstrates that background augmentations improve robustness to a number of out of distribution settings.
Online Object Representations with Contrastive Learning
A self-supervised approach for learning representations of objects from monocular videos is proposed, and it is found that, given a limited set of objects, object correspondences emerge naturally under contrastive learning without requiring explicit positive pairs.
i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning
i-Mix is proposed, a simple yet effective domain-agnostic regularization strategy for improving contrastive representation learning that consistently improves the quality of learned representations across domains, including image, speech, and tabular data.
CASTing Your Model: Learning to Localize Improves Self-Supervised Representations
Comparative Attention-Supervised Tuning (CAST) is proposed, which uses unsupervised saliency maps to intelligently sample crops, and to provide grounding supervision via a Grad-CAM attention loss to overcome contrastive SSL methods' limitations.
SaliencyMix: A Saliency Guided Data Augmentation Strategy for Better Regularization
This work proposes SaliencyMix, a saliency-guided data augmentation strategy for which state-of-the-art top-1 error is reported; it uses a saliency map to carefully select a representative image patch and mixes this patch with a target image, which leads the model to learn more appropriate feature representations.
CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features
Patches are cut and pasted among training images, and the ground truth labels are mixed proportionally to the area of the patches; CutMix consistently outperforms state-of-the-art augmentation strategies on CIFAR and ImageNet classification tasks, as well as on the ImageNet weakly-supervised localization task.
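The cut-and-paste step summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes grayscale images stored as 2D lists of equal size and one-hot labels stored as lists, and the function name `cutmix` and the parameter `lam` (the target share of the first image, supplied by the caller) are hypothetical names chosen for this example.

```python
import random

def cutmix(image_a, label_a, image_b, label_b, lam, rng=random):
    """Paste a rectangular patch of image_b into a copy of image_a.

    Assumptions (not from the source): images are equally sized 2D lists,
    labels are one-hot lists, and lam in [0, 1] is the target share of image_a.
    """
    height, width = len(image_a), len(image_a[0])
    # Side lengths chosen so the patch covers roughly (1 - lam) of the image.
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    # Random patch centre; the box is clipped to the image bounds.
    cy, cx = rng.randrange(height), rng.randrange(width)
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)
    # Copy image_a row by row so the inputs are left unmodified.
    mixed = [row[:] for row in image_a]
    for y in range(top, bottom):
        mixed[y][left:right] = image_b[y][left:right]
    # Recompute the label weight from the actual (clipped) patch area.
    patch_area = (bottom - top) * (right - left)
    weight_a = 1.0 - patch_area / (height * width)
    mixed_label = [weight_a * a + (1.0 - weight_a) * b
                   for a, b in zip(label_a, label_b)]
    return mixed, mixed_label, weight_a
```

Because the patch is clipped at the image border, the label weight is derived from the pasted area rather than from `lam` directly, which keeps the labels proportional to what is actually visible in the mixed image.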
Self-supervised object detection from audio-visual correspondence
This work extracts a supervisory signal from audio-visual data, using the audio component to “teach” the object detector, and outperforms previous unsupervised and weakly-supervised detectors for the task of object detection and sound source localization.
Evaluating Weakly Supervised Object Localization Methods Right
It is argued that the WSOL task is ill-posed with only image-level labels, and a new evaluation protocol is proposed where full supervision is limited to only a small held-out set not overlapping with the test set.
Revisiting Self-Supervised Visual Representation Learning
This study revisits numerous previously proposed self-supervised models, conducts a thorough large-scale study, and uncovers multiple crucial insights about standard recipes for CNN design that do not always translate to self-supervised representation learning.