Corpus ID: 235485193

Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

@article{Toering2021SelfsupervisedVR,
  title={Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting},
  author={Martine Toering and Ioannis Gatopoulos and Maarten Stol and Vincent Tao Hu},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.10137}
}
Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning. However, they are not suitable for exploiting the rich dynamical structure of video, as operations are done on many augmented instances. In this paper we propose “Video Cross-Stream Prototypical Contrasting”, a novel method which predicts consistent prototype assignments from both RGB and optical flow views…
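The abstract describes predicting consistent prototype assignments across the RGB and optical flow streams. The sketch below illustrates one plausible form of such a cross-stream objective; it assumes a SwAV-style swapped-prediction loss with Sinkhorn-balanced assignments, and all names (rgb_feat, flow_feat, prototypes) and hyperparameters are illustrative assumptions rather than details taken from the paper.

```python
# Hedged sketch: one way a cross-stream prototypical contrasting loss could look,
# assuming SwAV-style swapped prediction between RGB and flow features.
# All names and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn.functional as F


@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    """Balanced soft assignments of samples to prototypes (Sinkhorn-Knopp)."""
    q = torch.exp(scores / eps).t()           # (K, B)
    q /= q.sum()
    n_protos, n_samples = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)       # normalize over samples per prototype
        q /= n_protos
        q /= q.sum(dim=0, keepdim=True)       # normalize over prototypes per sample
        q /= n_samples
    return (q * n_samples).t()                # (B, K) assignment targets


def cross_stream_loss(rgb_feat, flow_feat, prototypes, temp=0.1):
    """Each stream predicts the prototype assignment computed from the other stream."""
    z_rgb = F.normalize(rgb_feat, dim=1)      # (B, D) RGB embeddings
    z_flow = F.normalize(flow_feat, dim=1)    # (B, D) optical-flow embeddings
    c = F.normalize(prototypes, dim=1)        # (K, D) prototype vectors
    s_rgb, s_flow = z_rgb @ c.t(), z_flow @ c.t()
    q_rgb, q_flow = sinkhorn(s_rgb), sinkhorn(s_flow)
    # Swapped prediction: RGB predicts flow's assignment and vice versa.
    loss = -(q_flow * F.log_softmax(s_rgb / temp, dim=1)).sum(dim=1).mean()
    loss = loss - (q_rgb * F.log_softmax(s_flow / temp, dim=1)).sum(dim=1).mean()
    return loss / 2
```

In such a setup, rgb_feat and flow_feat would come from two separate video encoders applied to the same clips, with prototypes learned jointly as a trainable matrix.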

References

SHOWING 1-10 OF 87 REFERENCES
Memory-augmented Dense Predictive Coding for Video Representation Learning
TLDR: A new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), is proposed for self-supervised learning from video, in particular representations for action recognition, trained with a predictive attention mechanism over a set of compressed memories.
Evolving Losses for Unsupervised Video Representation Learning
TLDR: An unsupervised representation evaluation metric is proposed using distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law, which produces results similar to weakly-supervised, task-specific ones.
Cross Pixel Optical Flow Similarity for Self-Supervised Learning
TLDR: This work uses motion cues in the form of optical flow to supervise representations of static images, achieving state-of-the-art results for self-supervision using motion cues, competitive results for self-supervision in general, and overall state-of-the-art self-supervised pretraining for semantic image segmentation.
Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction.
TLDR: With the self-supervised pre-trained 3DRotNet from large datasets, recognition accuracy is boosted by 20.4% on UCF101 and 16.7% on HMDB51, respectively, compared to models trained from scratch.
Self-supervised Video Representation Learning by Pace Prediction
TLDR: This paper addresses self-supervised video representation learning from a new perspective, video pace prediction, and introduces contrastive learning to push the model towards discriminating different paces by maximizing agreement on similar video content.
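As a rough illustration of the pace-prediction pretext task summarized above, the sketch below samples frames at one of several strides and trains a classifier on the stride used; the names (PACES, sample_clip, backbone) and the set of paces are assumptions, and the contrastive term mentioned in the summary is omitted.

```python
# Hedged sketch of pace prediction: classify the frame-sampling stride of a clip.
# Names and the set of paces are illustrative assumptions.
import torch
import torch.nn.functional as F

PACES = (1, 2, 4, 8)                          # candidate strides used as class labels


def sample_clip(video, pace, clip_len=16):
    """video: (C, T, H, W); take clip_len frames at the given stride."""
    max_start = video.size(1) - pace * clip_len
    start = torch.randint(0, max_start + 1, (1,)).item()
    idx = torch.arange(start, start + pace * clip_len, pace)
    return video.index_select(1, idx)


def pace_prediction_loss(backbone, videos):
    """videos: list of (C, T, H, W) tensors; backbone outputs len(PACES) logits."""
    labels = torch.randint(0, len(PACES), (len(videos),))
    clips = torch.stack(
        [sample_clip(v, PACES[int(l)]) for v, l in zip(videos, labels)])
    logits = backbone(clips)                  # (B, len(PACES))
    return F.cross_entropy(logits, labels)
```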
Self-supervised Learning for Video Correspondence Flow
TLDR: A simple information bottleneck is introduced that forces the model to learn robust features for correspondence matching and prevents it from learning trivial solutions; the upper bound is further probed by training on additional data, demonstrating significant improvements on video segmentation.
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles
TLDR: This paper introduces a new self-supervised task called Space-Time Cubic Puzzles, which requires a network to arrange permuted 3D spatio-temporal crops so that it learns both the spatial appearance and the temporal relations of video frames.
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction
TLDR: A self-supervised spatiotemporal learning technique which leverages the chronological order of videos to learn the spatiotemporal representation of a video by predicting the order of shuffled clips from that video.
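The clip-order task above can likewise be illustrated with a short sketch: shuffle a few clips from the same video and classify which permutation was applied. The three-clip setting, the 6-way head, and all names are assumptions made for illustration.

```python
# Hedged sketch of clip-order prediction: classify the permutation applied to
# three clips from the same video. Names and sizes are illustrative assumptions.
import itertools
import torch
import torch.nn.functional as F

PERMS = list(itertools.permutations(range(3)))     # 3! = 6 possible orders


def order_prediction_loss(backbone, head, clips):
    """clips: (B, 3, C, T, H, W); backbone embeds a clip, head outputs 6 logits."""
    labels = torch.randint(0, len(PERMS), (clips.size(0),))
    shuffled = torch.stack(
        [clips[i, list(PERMS[int(l)])] for i, l in enumerate(labels)])
    feats = [backbone(shuffled[:, j]) for j in range(3)]   # per-clip embeddings
    logits = head(torch.cat(feats, dim=1))                 # (B, 6)
    return F.cross_entropy(logits, labels)
```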
Unsupervised Learning From Video With Deep Neural Embeddings
TLDR: The Video Instance Embedding framework, which trains deep nonlinear embeddings on video sequence inputs, is presented; a two-pathway model with both static and dynamic processing paths is shown to be optimal, and the results suggest that deep neural embeddings are a promising approach to unsupervised video learning for a wide variety of task domains.
Video Representation Learning by Dense Predictive Coding
TLDR: With a single stream (RGB only), DPC-pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin and approaching the performance of a baseline pre-trained on ImageNet.