Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency
Haiping Wu, Xiaolong Wang
Corpus ID: 234482727
Recent works have advanced the performance of self-supervised representation learning by a large margin. At the core of these methods is intra-image invariance learning: two different transformations of one image instance are treated as a positive sample pair, and various tasks are designed to learn invariant representations by comparing the pair. Analogously, for video data, representations of frames from the same video are trained to be closer than frames from other videos, i.e… 
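The intra-image invariance learning described above, treating two augmented views of the same image as a positive pair and all other images in the batch as negatives, is commonly trained with an InfoNCE-style contrastive loss. A minimal NumPy sketch of that loss (the function name, batch contents, and temperature value are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE contrastive loss for a batch of paired embeddings.

    z1[i] and z2[i] are embeddings of two augmented views of the same
    image (the positive pair); every other row serves as a negative.
    """
    # L2-normalize so dot products become cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)

    logits = z1 @ z2.T / temperature             # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Softmax cross-entropy where the positive for row i is column i
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 32))
# Simulate two "augmentations" as noisy copies of the same features
z1 = features + 0.05 * rng.normal(size=features.shape)
z2 = features + 0.05 * rng.normal(size=features.shape)
print(info_nce_loss(z1, z2))
```

Minimizing this loss pulls the two views of each image together while pushing apart views of different images; the cross-video cycle-consistency idea in the paper extends the choice of positives beyond a single image instance.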
Targeted Supervised Contrastive Learning for Long-Tailed Recognition
Targeted supervised contrastive learning (TSC) is proposed, which improves the uniformity of the feature distribution on the hypersphere and achieves state-of-the-art performance on long-tailed recognition tasks.
Mind Your Clever Neighbours: Unsupervised Person Re-identification via Adaptive Clustering Relationship Modeling
A novel clustering relationship modeling framework for unsupervised person Re-ID where the relation between unlabeled images is explored based on a graph correlation learning (GCL) module and the refined features are used for clustering to generate high-quality pseudo-labels.


Cycle-Contrast for Self-Supervised Video Representation Learning
It is demonstrated that the video representation learned by CCL can be transferred well to downstream tasks of video understanding, outperforming previous methods in nearest neighbour retrieval and action recognition tasks on UCF101, HMDB51 and MMAct.
Watching the World Go By: Representation Learning from Unlabeled Videos
Video Noise Contrastive Estimation is proposed, a method for using unlabeled video to learn strong, transferable single image representations that demonstrate improvements over recent unsupervised single image techniques, as well as over fully supervised ImageNet pretraining, across a variety of temporal and non-temporal tasks.
Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases
This work demonstrates that approaches like MoCo and PIRL learn occlusion-invariant representations but fail to capture viewpoint and category-instance invariance, which are crucial components for object recognition, and proposes an approach that leverages unstructured videos to learn representations with higher viewpoint invariance.
Self-supervised Video Representation Learning by Pace Prediction
This paper addresses the problem of self-supervised video representation learning from a new perspective, video pace prediction, and introduces contrastive learning to push the model towards discriminating different paces by maximizing agreement on similar video content.
Self-supervised Learning for Video Correspondence Flow
A simple information bottleneck is introduced that forces the model to learn robust features for correspondence matching and prevents it from learning trivial solutions; training on additional data probes the upper bound, further demonstrating significant improvements on video segmentation.
Memory-augmented Dense Predictive Coding for Video Representation Learning
A new architecture and learning framework Memory-augmented Dense Predictive Coding (MemDPC) is proposed for the self-supervised learning from video, in particular for representations for action recognition, trained with a predictive attention mechanism over the set of compressed memories.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
This paper presents an approach for learning a visual representation from the raw spatiotemporal signals in videos using a convolutional neural network, and shows that this method captures information that is temporally varying, such as human pose.
Transitive Invariance for Self-Supervised Visual Representation Learning
This paper proposes to generate a graph with millions of objects mined from hundreds of thousands of videos and argues to organize and reason the data with multiple variations to exploit different self-supervised approaches to learn representations invariant to inter-instance variations.
Video Representation Learning by Recognizing Temporal Transformations
This work promotes an accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions by introducing the following transformations: forward-backward playback, random frame skipping, and uniform frame skipping.
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles
This paper introduces a new self-supervised task called Space-Time Cubic Puzzles, which requires a network to arrange permuted 3D spatio-temporal crops and thereby learn both the spatial appearance and the temporal relations of video frames.