Corpus ID: 235458509

MoDist: Motion Distillation for Self-supervised Video Representation Learning

Fanyi Xiao, Joseph Tighe, Davide Modolo
We present MoDist, a novel method to explicitly distill motion information into self-supervised video representations. Compared to previous video representation learning methods, which mostly learn motion cues implicitly from RGB inputs, we show that the representation learned with our MoDist method focuses more on foreground motion regions and thus generalizes better to downstream tasks. To achieve this, MoDist enriches standard contrastive learning objectives for RGB video clips…
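The abstract says MoDist enriches a standard contrastive objective over RGB clips with an explicit motion signal. As a point of reference, here is a minimal NumPy sketch of the generic InfoNCE contrastive loss applied across two views of the same clips; this is not MoDist's actual objective, and the embeddings, names, and temperature are illustrative assumptions only.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.07):
    """Generic InfoNCE loss: the positive key for each query shares its
    row index; all other keys in the batch act as negatives."""
    # L2-normalize embeddings so dot products are cosine similarities
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal of the similarity matrix
    return -np.mean(np.diag(log_prob))

# Toy stand-ins for encoder outputs of two views of the same clips
# (e.g. an RGB view and a motion view); purely hypothetical data.
rng = np.random.default_rng(0)
rgb_emb = rng.normal(size=(8, 16))
motion_emb = rgb_emb + 0.1 * rng.normal(size=(8, 16))  # correlated views
loss = info_nce(rgb_emb, motion_emb)
```

Because the two views are correlated, the diagonal (positive-pair) similarities dominate and the loss is low; with unrelated keys it would sit near log N.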


DynamoNet: Dynamic Action and Motion Network
A novel unified spatio-temporal 3D-CNN architecture (DynamoNet) is introduced that jointly optimizes video classification and motion representation learning by predicting future frames as a multi-task learning problem.
Video Representation Learning by Dense Predictive Coding
With a single stream (RGB only), DPC-pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin and approaching the performance of a baseline pretrained on ImageNet.
Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
This paper proposes to learn visual features by regressing both motion and appearance statistics along spatial and temporal dimensions, given only the input video data, and shows that the approach can significantly improve the performance of C3D when applied to video classification tasks.
Memory-augmented Dense Predictive Coding for Video Representation Learning
A new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), is proposed for self-supervised learning from video, in particular representations for action recognition, trained with a predictive attention mechanism over a set of compressed memories.
Evolving Losses for Unsupervised Video Representation Learning
An unsupervised representation evaluation metric is proposed using distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law, which produces similar results to weakly-supervised, task-specific ones.
Self-supervised Video Representation Learning by Context and Motion Decoupling
This work develops a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task that improves the quality of the learned video representation, and finds motion prediction to be a strong regularization for video networks.
Self-supervised Spatiotemporal Feature Learning by Video Geometric Transformations
A novel 3D-ConvNet-based, fully self-supervised framework to learn spatiotemporal video features without using any human-labeled annotations; it outperforms the state of the art among fully self-supervised methods on both the UCF101 and HMDB51 datasets, achieving 62.9% and 33.7% accuracy respectively.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
This paper presents an approach for learning a visual representation from raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures temporally varying information, such as human pose.
Cross Pixel Optical Flow Similarity for Self-Supervised Learning
This work uses motion cues, in the form of optical flow, to supervise representations of static images; it achieves state-of-the-art results in self-supervision using motion cues, competitive results for self-supervision in general, and overall state-of-the-art self-supervised pretraining for semantic image segmentation.
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles
This paper introduces a new self-supervised task called Space-Time Cubic Puzzles, which requires a network to arrange permuted 3D spatio-temporal crops, thereby learning both the spatial appearance and temporal relations of video frames.