Corpus ID: 235485193

Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

Martine Toering, Ioannis Gatopoulos, Maarten Stol, Vincent Tao Hu
Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in visual representation learning. However, they are not well suited to exploiting the rich dynamical structure of video, since they operate on many augmented instances. In this paper we propose “Video Cross-Stream Prototypical Contrasting”, a novel method which predicts consistent prototype assignments from both RGB and optical flow views… 
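The core idea of predicting consistent prototype assignments across the RGB and optical flow streams can be sketched as a swapped-prediction loss. The sketch below is a minimal NumPy illustration, not the paper's implementation: all names are hypothetical, codes are computed with a plain softmax (SwAV-style methods instead use a Sinkhorn-Knopp equal-partition step), and the stop-gradient on the target codes is omitted since nothing is trained here.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings / prototypes onto the unit hypersphere.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def swapped_prediction_loss(z_rgb, z_flow, prototypes, temperature=0.1):
    """Cross-stream swapped prediction: each stream predicts the
    prototype assignment ("code") of the other stream.

    Simplification: soft codes via softmax instead of the
    Sinkhorn-Knopp step used in SwAV-style methods; in training,
    the target codes q_* would be treated as constants (stop-grad).
    """
    z_rgb, z_flow = l2_normalize(z_rgb), l2_normalize(z_flow)
    C = l2_normalize(prototypes)
    s_rgb = z_rgb @ C.T            # (batch, K) similarities to K prototypes
    s_flow = z_flow @ C.T
    q_rgb = softmax(s_rgb / temperature)    # target codes
    q_flow = softmax(s_flow / temperature)
    p_rgb = softmax(s_rgb / temperature)    # predicted assignments
    p_flow = softmax(s_flow / temperature)
    # Cross-entropy both ways: RGB predicts flow's code and vice versa.
    loss = -0.5 * (np.sum(q_flow * np.log(p_rgb + 1e-9), axis=1)
                   + np.sum(q_rgb * np.log(p_flow + 1e-9), axis=1)).mean()
    return loss
```

With random embeddings for a batch of clips, `swapped_prediction_loss` returns a positive scalar; minimizing it pulls the two streams' prototype assignments for the same video toward agreement.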


Representation Learning with Video Deep InfoMax
This paper finds that drawing views from both natural-rate sequences and temporally-downsampled sequences yields results on Kinetics-pretrained action recognition tasks which match or outperform prior state-of-the-art methods that use more costly large-time-scale transformer models.
Cycle-Contrast for Self-Supervised Video Representation Learning
It is demonstrated that the video representation learned by CCL can be transferred well to downstream tasks of video understanding, outperforming previous methods in nearest neighbour retrieval and action recognition tasks on UCF101, HMDB51 and MMAct.
Memory-augmented Dense Predictive Coding for Video Representation Learning
A new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), is proposed for self-supervised learning from video, in particular for action recognition representations, trained with a predictive attention mechanism over a set of compressed memories.
Evolving Losses for Unsupervised Video Representation Learning
An unsupervised representation evaluation metric is proposed, using distribution matching to a large unlabeled dataset as a prior constraint based on Zipf's law; it produces results similar to those of weakly-supervised, task-specific metrics.
Cross Pixel Optical Flow Similarity for Self-Supervised Learning
This work uses motion cues in the form of optical flow to supervise representations of static images, and achieves state-of-the-art results in self-supervision using motion cues, competitive results for self-supervision in general, and overall state-of-the-art results in self-supervised pretraining for semantic image segmentation.
Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction.
With the self-supervised pre-trained 3DRotNet from large datasets, recognition accuracy is improved by 20.4% on UCF101 and 16.7% on HMDB51, respectively, compared to models trained from scratch.
Unsupervised Learning of Video Representations via Dense Trajectory Clustering
This paper proposes to adapt two top-performing objectives in this class, instance recognition and local aggregation, to the video domain, and forms clusters in the IDT space, using heuristic-based IDT descriptors as an unsupervised prior in the iterative local aggregation algorithm.
Self-supervised Video Representation Learning by Pace Prediction
This paper addresses self-supervised video representation learning from a new perspective, video pace prediction, and introduces contrastive learning to push the model towards discriminating different paces while maximizing agreement on similar video content.
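The pace-prediction pretext relies on sampling clips from the same video at different playback paces (frame strides). A minimal sketch of such a sampler, with an illustrative function name and signature not taken from the paper:

```python
import numpy as np

def sample_pace_clip(video, clip_len, pace, start=0):
    """Sample a clip of up to `clip_len` frames at a given playback pace.

    `video` is an array of frames indexed along axis 0; pace == 1 is
    the natural frame rate, pace > 1 skips frames (faster playback).
    Indices past the end of the video are dropped.
    """
    idx = start + pace * np.arange(clip_len)
    idx = idx[idx < len(video)]
    return video[idx]
```

A pace-prediction model would then be trained to classify which pace a clip was sampled at, while a contrastive term encourages agreement between clips drawn from the same underlying video.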
Self-supervised Learning for Video Correspondence Flow
A simple information bottleneck is introduced that forces the model to learn robust features for correspondence matching and prevents it from learning trivial solutions; the upper bound is also probed by training on additional data, demonstrating further significant improvements on video segmentation.
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles
This paper introduces a new self-supervised task called Space-Time Cubic Puzzles, which requires a network to arrange permuted 3D spatio-temporal crops and thereby learn both the spatial appearance and the temporal relations of video frames.