Transitive Invariance for Self-Supervised Visual Representation Learning

@article{Wang2017TransitiveIF,
  title={Transitive Invariance for Self-Supervised Visual Representation Learning},
  author={Xiaolong Wang and Kaiming He and Abhinav Gupta},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1338-1347}
}
  • Xiaolong Wang, Kaiming He, Abhinav Gupta
  • Published 9 August 2017
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
Learning visual representations with self-supervised learning has become popular in computer vision. The idea is to design auxiliary tasks where labels are free to obtain. Most of these tasks end up providing data to learn specific kinds of invariance useful for recognition. In this paper, we propose to exploit different self-supervised approaches to learn representations invariant to (i) inter-instance variations (two objects in the same class should have similar features) and (ii) intra-instance variations (viewpoint, pose, deformation, illumination). We organize the data into a graph of objects mined from videos, connected by two types of edges corresponding to the two types of invariance: "different instances with a similar viewpoint and category" and "different viewpoints of the same instance". Applying simple transitivity on this graph yields pairs of images exhibiting richer visual invariance, which we use to train a Triplet-Siamese network.
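The transitivity trick described in the abstract is compact enough to sketch in a few lines. Below is a minimal illustration, assuming toy node identifiers and hand-built edge sets; the names here are illustrative, not the authors' pipeline, which mines millions of objects from hundreds of thousands of videos:

```python
# Minimal sketch of the graph-transitivity idea, on toy data.
# Node ids and edge sets are illustrative, not the paper's actual pipeline.

# inter-instance edges: different objects, similar viewpoint and category
inter = {("car_A_view0", "car_B_view0")}
# intra-instance edges: different viewpoints of the same tracked object
intra = {("car_A_view0", "car_A_view1"), ("car_B_view0", "car_B_view2")}

def neighbors(node, edges):
    """Nodes linked to `node`, treating edges as undirected."""
    return {b for a, b in edges if a == node} | {a for a, b in edges if b == node}

def transitive_pairs(inter, intra):
    """If A' -intra- A -inter- B -intra- B', emit (A', B') as a new positive
    pair: different instances AND different viewpoints, i.e. richer invariance
    than either edge type alone provides."""
    pairs = set()
    for a, b in inter:
        for x in neighbors(a, intra):
            for y in neighbors(b, intra):
                pairs.add((x, y))
    return pairs

print(transitive_pairs(inter, intra))
# -> {('car_A_view1', 'car_B_view2')}
```

Pairs composed this way relate two different instances seen from two different viewpoints, which is exactly the kind of supervision neither edge type supplies on its own.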
Distilling Localization for Self-Supervised Representation Learning
TLDR
This paper visualizes and diagnoses classification errors, and proposes a data-driven approach for learning invariance to backgrounds: it first estimates foreground saliency in images and then creates augmentations by copy-pasting the foreground onto a variety of backgrounds.
Self-Supervised Learning of Video-Induced Visual Invariances
TLDR
Training models using different variants of the proposed framework on videos from the YouTube-8M (YT8M) dataset obtains state-of-the-art self-supervised transfer learning results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB), using only 1000 labels per task.
Trading robust representations for sample complexity through self-supervised visual experience
TLDR
The results suggest that equivalence sets other than class labels, which are abundant in unlabeled visual experience, can be used for self-supervised learning of semantically relevant image embeddings.
Scaling and Benchmarking Self-Supervised Visual Representation Learning
TLDR
It is shown that by scaling on various axes (including data size and problem 'hardness'), one can largely match or even exceed the performance of supervised pre-training on a variety of tasks such as object detection, surface normal estimation and visual navigation using reinforcement learning.
Self-supervised Spatiotemporal Feature Learning by Video Geometric Transformations
TLDR
A novel 3D-ConvNet-based, fully self-supervised framework to learn spatiotemporal video features without using any human-labeled annotations; it outperforms state-of-the-art fully self-supervised methods on both the UCF101 and HMDB51 datasets, achieving 62.9% and 33.7% accuracy respectively.
Evolving Losses for Unsupervised Video Representation Learning
TLDR
An unsupervised representation evaluation metric is proposed that uses distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law, and produces results similar to weakly supervised, task-specific metrics.
What Should Not Be Contrastive in Contrastive Learning
TLDR
This work introduces a contrastive learning framework which does not require prior knowledge of specific, task-dependent invariances, and learns to capture varying and invariant factors for visual representations by constructing separate embedding spaces, each of which is invariant to all but one augmentation.
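As a rough illustration of the "separate embedding spaces" idea in this TLDR, here is a minimal PyTorch-style sketch; the class name, augmentation list, and dimensions are hypothetical, not the paper's actual architecture. One shared backbone feeds several projection heads, and the head for augmentation k is trained to remain sensitive to k while being invariant to all the others:

```python
# Hedged sketch of per-augmentation embedding spaces; all names here
# (MultiHeadEncoder, AUGMENTATIONS, dims) are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

AUGMENTATIONS = ["color", "rotation", "crop"]  # assumed augmentation set

class MultiHeadEncoder(nn.Module):
    def __init__(self, feat_dim=512, emb_dim=128):
        super().__init__()
        # shared backbone; LazyLinear infers its input size on first use
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        # one projection head per embedding space; the head for `aug` stays
        # sensitive to `aug` but is trained to be invariant to all other augmentations
        self.heads = nn.ModuleDict({aug: nn.Linear(feat_dim, emb_dim)
                                    for aug in AUGMENTATIONS})

    def forward(self, x):
        h = self.backbone(x)
        return {aug: F.normalize(head(h), dim=-1)
                for aug, head in self.heads.items()}

# usage: each embedding space gets its own contrastive loss, with positives
# chosen so that only the corresponding augmentation is allowed to vary
encoder = MultiHeadEncoder()
embeddings = encoder(torch.randn(4, 3, 32, 32))  # dict: aug name -> (4, 128) tensor
```

The design choice this sketch highlights is that invariances are not baked into a single representation; downstream tasks can pick the embedding space whose preserved factor matters to them.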
Self-Supervised Visual Representations for Cross-Modal Retrieval
TLDR
The experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like classification, but also that the learned representations are better for cross-modal retrieval than supervised pre-training of the network on the ImageNet dataset.
Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency
TLDR
A novel contrastive learning method that explores cross-video relations by using cycle-consistency for general image representation learning, allowing positive sample pairs to be collected across different video instances, which the authors hypothesize leads to higher-level semantics.

References

Showing 1-10 of 67 references
Unsupervised Visual Representation Learning by Context Prediction
TLDR
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
Building high-level features using large scale unsupervised learning
TLDR
Contrary to what appears to be a widely-held intuition, the experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper presents an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
Colorization as a Proxy Task for Visual Understanding
TLDR
This work investigates and improves self-supervision as a drop-in replacement for ImageNet pretraining, focusing on automatic colorization as the proxy task, and presents the first in-depth analysis of self-supervision via colorization, concluding that formulation of the loss, training details, and network architecture play important roles in its effectiveness.
Learning to See by Moving
TLDR
It is found that using the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt with class-label as supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching.
Learning Image Representations Tied to Ego-Motion
TLDR
This work proposes to exploit proprioceptive motor signals to provide unsupervised regularization in convolutional neural networks that learn visual representations from egocentric video, enforcing that the learned features exhibit equivariance, i.e., they respond predictably to transformations associated with distinct ego-motions.
Unsupervised Discovery of Mid-Level Discriminative Patches
TLDR
The paper experimentally demonstrates the effectiveness of discriminative patches as an unsupervised mid-level visual representation, suggesting that it could be used in place of visual words for many tasks.
Learning Features by Watching Objects Move
TLDR
Inspired by the human visual system, this work shows that low-level motion-based grouping cues can be used to learn an effective visual representation that significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.
Computational Baby Learning
TLDR
A computational model for slightly supervised object detection, based on prior-knowledge modelling, exemplar learning, and learning with video contexts, which beats state-of-the-art fully trained performance by learning from very few samples per object category along with about 20,000 unlabeled videos.
Unsupervised Learning of Edges
TLDR
This work presents a simple yet effective approach for training edge detectors without human supervision, and shows that when using a deep network for the edge detector, this approach provides a novel pre-training scheme for object detection.