Unsupervised Representation Learning by Sorting Sequences

@inproceedings{Lee2017UnsupervisedRL,
  title={Unsupervised Representation Learning by Sorting Sequences},
  author={Hsin-Ying Lee and Jia-Bin Huang and Maneesh Kumar Singh and Ming-Hsuan Yang},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={667-676}
}
We present an unsupervised representation learning approach using videos without semantic labels. We leverage the temporal coherence as a supervisory signal by formulating representation learning as a sequence sorting task. We take temporally shuffled frames (i.e., in non-chronological order) as inputs and train a convolutional neural network to sort the shuffled sequences. Similar to comparison-based sorting algorithms, we propose to extract features from all frame pairs and aggregate them to… 
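The abstract describes a pretext task: sample a few frames, shuffle them, and train a network to recover the temporal order. A minimal sketch of the resulting label space is below; the 4-frame setup and the treatment of a sequence and its reverse as one class (4!/2 = 12 classes) follow the paper's formulation, while the function names and representation are ours for illustration.

```python
from itertools import permutations

def sorting_task_labels(n_frames=4):
    """Enumerate the candidate orders for the sequence sorting task.

    A permutation and its reverse are treated as the same class, since
    both are temporally coherent, leaving n! / 2 classes (12 for n = 4).
    """
    classes = {}
    for p in permutations(range(n_frames)):
        # Canonical key: the lexicographically smaller of (p, reversed p).
        canon = min(p, tuple(reversed(p)))
        if canon not in classes:
            classes[canon] = len(classes)
    return classes

def label_for(shuffled_order, classes):
    """Map a shuffled frame order to its class index."""
    order = tuple(shuffled_order)
    canon = min(order, tuple(reversed(order)))
    return classes[canon]
```

The network then sees the shuffled frames and is trained with a standard cross-entropy loss over these 12 classes; the feature extractor learned this way is what gets transferred to downstream tasks.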
Unsupervised Learning of Visual Representations by Solving Shuffled Long Video-Frames Temporal Order Prediction
TLDR
A model for learning visual representations by solving an order prediction task is proposed; it concatenates the frame pairs instead of the feature pairs, which makes it possible to apply a 3D-CNN to extract features from each frame pair.
Unsupervised Learning of Video Representations via Dense Trajectory Clustering
TLDR
This paper proposes to adapt two top-performing objectives in this class, instance recognition and local aggregation, to the video domain, and forms clusters in the IDT space, using heuristic-based IDT descriptors as an unsupervised prior in the iterative local aggregation algorithm.
Video Representation Learning by Recognizing Temporal Transformations
TLDR
This work promotes an accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions by introducing the following transformations: forward-backward playback, random frame skipping, and uniform frame skipping.
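The summary above names three concrete temporal transformations the network must tell apart from the original clip. An illustrative sketch of how such transformed versions could be generated is below; the function and parameter names are ours, not the paper's API, and frames are represented as an arbitrary list of identifiers.

```python
import random

def temporal_transformations(frames, skip=2, seed=0):
    """Produce the three transformed versions named in the summary:
    forward-backward playback, uniform frame skipping, random frame skipping.
    """
    rng = random.Random(seed)
    # Forward-backward playback: play the clip in reverse.
    reversed_play = list(reversed(frames))
    # Uniform frame skipping: keep every `skip`-th frame.
    uniform_skip = frames[::skip]
    # Random frame skipping: keep a random subset, preserving order.
    keep = sorted(rng.sample(range(len(frames)), max(1, len(frames) // skip)))
    random_skip = [frames[i] for i in keep]
    return reversed_play, uniform_skip, random_skip
```

A discriminator trained to distinguish the original clip from these variants must attend to motion rather than static appearance, which is the point of the pretext task.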
Evolving Losses for Unsupervised Video Representation Learning
TLDR
An unsupervised representation evaluation metric is proposed, using distribution matching to a large unlabeled dataset as a prior constraint based on Zipf's law; it produces results similar to weakly supervised, task-specific metrics.
Representation Learning via Global Temporal Alignment and Cycle-Consistency
TLDR
A weakly supervised method for representation learning based on aligning temporal sequences of the same process as well as two applications of the temporal alignment framework, namely 3D pose reconstruction and fine-grained audio/visual retrieval.
SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning
TLDR
This paper uniquely regards the signals as the foundation of contrastive learning and derives a particular form named Sequence Contrastive Learning (SeCo), which shows superior results under the linear protocol on action recognition (Kinetics), untrimmed activity recognition (ActivityNet), and object tracking (OTB-100).
Self-Supervised Video Representation Learning by Uncovering Spatio-Temporal Statistics
TLDR
This paper proposes a novel pretext task to address the self-supervised video representation learning problem, inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field and needs only impressions of rough spatial locations to understand visual content.
Cycle-Contrast for Self-Supervised Video Representation Learning
TLDR
It is demonstrated that the video representation learned by CCL can be transferred well to downstream tasks of video understanding, outperforming previous methods in nearest neighbour retrieval and action recognition tasks on UCF101, HMDB51 and MMAct.
RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning
TLDR
A new way to perceive the playback speed and exploit the relative speed between two video clips as labels is proposed to provide more effective and stable supervision for representation learning and ensure the learning of appearance features.
PreViTS: Contrastive Pretraining with Video Tracking Supervision
TLDR
This work proposes PreViTS, an SSL framework that utilizes an unsupervised tracking signal for selecting clips containing the same object, which helps better utilize temporal transformations of objects.

References

Showing 1-10 of 47 references
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
Unsupervised Visual Representation Learning by Context Prediction
TLDR
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
Unsupervised Learning of Video Representations using LSTMs
TLDR
This work uses Long Short Term Memory networks to learn representations of video sequences and evaluates the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets.
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
TLDR
A novel unsupervised learning approach to build features suitable for object detection and classification and to facilitate the transfer of features to other tasks, the context-free network (CFN), a siamese-ennead convolutional neural network is introduced.
Context Encoders: Feature Learning by Inpainting
TLDR
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Unsupervised Visual Representation Learning by Graph-Based Consistent Constraints
TLDR
This paper proposes to use a cycle consistency criterion for mining positive pairs and geodesic distance in the graph for hard negative mining, and shows that the mined positive and negative image pairs can provide accurate supervisory signals for learning effective representations using Convolutional Neural Networks (CNNs).
Learning Image Representations Tied to Ego-Motion
TLDR
This work proposes to exploit proprioceptive motor signals to provide unsupervised regularization in convolutional neural networks that learn visual representations from egocentric video, enforcing that the learned features exhibit equivariance, i.e., that they respond predictably to transformations associated with distinct ego-motions.
Learning Image Matching by Simply Watching Video
TLDR
An unsupervised learning-based approach to the ubiquitous computer vision problem of image matching that achieves performance surprisingly comparable to traditional, empirically designed methods.
Self-Supervised Video Representation Learning with Odd-One-Out Networks
TLDR
A new self-supervised CNN pre-training technique based on a novel auxiliary task called odd-one-out learning, which learns temporal representations for videos that generalizes to other related tasks such as action recognition.
Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning
TLDR
The results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.