A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

@article{Feichtenhofer2021ALS,
  title={A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning},
  author={Christoph Feichtenhofer and Haoqi Fan and Bo Xiong and Ross B. Girshick and Kaiming He},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={3298--3308}
}
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv… 
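The objective described above amounts to treating clips sampled at different times of the same video as a positive pair. A minimal sketch of the contrastive (InfoNCE-style) instantiation of this idea follows; it is not the authors' code, and the encoder and its output dimension are assumptions:

import torch
import torch.nn.functional as F

# Sketch of the temporal-persistency objective: two clips from the same video
# form a positive pair; clips from other videos in the batch act as negatives.
def temporal_persistency_loss(encoder, clip_a, clip_b, temperature=0.1):
    # clip_a, clip_b: [B, C, T, H, W] clips drawn from the same B videos
    z_a = F.normalize(encoder(clip_a), dim=1)  # [B, D] clip embeddings
    z_b = F.normalize(encoder(clip_b), dim=1)
    logits = z_a @ z_b.t() / temperature       # [B, B] similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)    # diagonal entries are the positives

The same positive-pair construction plugs into the non-contrastive frameworks studied in the paper (e.g. BYOL- or SwAV-style objectives) by swapping the loss.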
Exploring Temporal Granularity in Self-Supervised Video Representation Learning
TLDR
This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations, and reveals the impact of temporal granularity with three major findings.
Masked Autoencoders As Spatiotemporal Learners
TLDR
It is shown that the MAE method can learn strong representations with almost no inductive bias on spacetime, that spacetime-agnostic random masking performs best, and that the general framework of masked autoencoding can be a unified methodology for representation learning with minimal domain knowledge.
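The spacetime-agnostic random masking mentioned in the TLDR simply drops a large fraction of spatiotemporal patch tokens uniformly at random, with no structure along space or time. A rough, illustrative sketch (names and shapes are assumptions, not the paper's code):

import torch

def random_spacetime_mask(tokens, mask_ratio=0.9):
    # tokens: [B, N, D] flattened spatiotemporal patch embeddings of a video
    B, N, D = tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]   # uniform random subset per sample
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx                      # the encoder sees only the visible tokens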
Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision
TLDR
This paper designs a region-based pretext task which requires the model to transform instance representations from one view to another, guided by context features, and introduces a simple network design that successfully reconciles the simultaneous learning process of both holistic and local representations.
Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity
TLDR
This work formulates three novel continuity-related pretext tasks, i.e. continuity justification, discontinuity localization, and missing section approximation, that jointly supervise a shared backbone for video representation learning and encourage the backbone network to learn local and long-range motion and context representations.
The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning
TLDR
A contrastive framework to learn audiovisual representations from unlabeled videos is presented; lossy spatio-temporal transformations that do not corrupt the temporal coherency of videos are the most effective, and performance scales with higher temporal resolution and stronger transformation intensity.
Self-supervised Video Transformer
TLDR
Self-supervised Video Transformer (SVT) is the first approach to alleviate the dependency on negative samples or dedicated memory banks; it supports slow-fast video processing within a single architecture using dynamically adjusted positional encodings and models long-term relationships along spatiotemporal dimensions.
iBoot: Image-bootstrapped Self-Supervised Video Representation Learning
TLDR
The typical video-based SSL design and objective are modified to encourage the video encoder to subsume the semantic content of an image-based model trained on a general domain, enabling the model to learn strong spatial and temporal information without relying on labeled video data.
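One way to read "subsume the semantic content of an image-based model" is as feature matching against a frozen image teacher. A loose sketch under that reading (all names are placeholders, not the paper's implementation):

import torch
import torch.nn.functional as F

def image_bootstrap_loss(video_encoder, frozen_image_model, clip):
    # clip: [B, C, T, H, W]; the frozen image model embeds each frame separately
    z_video = F.normalize(video_encoder(clip), dim=1)              # [B, D]
    with torch.no_grad():
        frames = clip.transpose(1, 2).flatten(0, 1)                # [B*T, C, H, W]
        z_img = frozen_image_model(frames)                         # [B*T, D]
        z_img = z_img.view(clip.size(0), clip.size(2), -1).mean(1) # average over frames
        z_img = F.normalize(z_img, dim=1)
    return (2.0 - 2.0 * (z_video * z_img).sum(dim=1)).mean()       # cosine-matching loss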
Controllable Augmentations for Video Representation Learning
TLDR
This paper proposes a framework to jointly utilize local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations, and introduces local-global temporal order dependency to further bridge the gap between clip-level and video-level representations for robust temporal modeling.
Long-Short Temporal Contrastive Learning of Video Transformers
TLDR
It is empirically demonstrated that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K.
Less than Few: Self-Shot Video Instance Segmentation
TLDR
This work proposes to automatically learn to find appropriate support videos given a query to bypass the need for labelled examples in few-shot video understanding at run time, and outlines a simple self-supervised learning method to generate an embedding space well-suited for unsupervised retrieval of relevant samples.
...

References

SHOWING 1-10 OF 101 REFERENCES
Video Representation Learning by Dense Predictive Coding
TLDR
With a single stream (RGB only), DPC-pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin and approaching the performance of a baseline pre-trained on ImageNet.
Spatiotemporal Contrastive Video Representation Learning
TLDR
This work proposes a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames, and proposes a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time.
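The temporally consistent spatial augmentation described above can be illustrated by sampling the crop and flip parameters once per clip and reusing them for every frame (a simplified stand-in, not the paper's implementation):

import torch

def consistent_crop_flip(clip, crop=112):
    # clip: [T, C, H, W] frames of a single video clip
    T, C, H, W = clip.shape
    top = torch.randint(0, H - crop + 1, (1,)).item()   # sampled once per clip
    left = torch.randint(0, W - crop + 1, (1,)).item()
    flip = torch.rand(1).item() < 0.5
    out = clip[:, :, top:top + crop, left:left + crop]  # identical crop for every frame
    if flip:
        out = torch.flip(out, dims=[3])                 # identical flip for every frame
    return out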
Local Aggregation for Unsupervised Learning of Visual Embeddings
TLDR
This work describes a method that trains an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space, while allowing dissimilar instances to separate.
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
TLDR
This paper proposes an online algorithm, SwAV, that takes advantage of contrastive methods without requiring pairwise comparisons to be computed, and uses a swapped prediction mechanism where it predicts the cluster assignment of a view from the representation of another view.
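The swapped prediction mechanism can be sketched as follows; the assignment codes are normally produced by the Sinkhorn-Knopp procedure, for which a plain softmax stands in here (a simplification, not the reference implementation):

import torch
import torch.nn.functional as F

def swapped_prediction_loss(scores_a, scores_b, temperature=0.1):
    # scores_*: [B, K] similarities between features of one view and K prototypes
    with torch.no_grad():
        q_a = F.softmax(scores_a / temperature, dim=1)   # codes for view A
        q_b = F.softmax(scores_b / temperature, dim=1)   # codes for view B
    p_a = F.log_softmax(scores_a / temperature, dim=1)
    p_b = F.log_softmax(scores_b / temperature, dim=1)
    # predict A's code from B's scores and B's code from A's scores
    return -0.5 * ((q_a * p_b).sum(dim=1).mean() + (q_b * p_a).sum(dim=1).mean())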
Evolving Losses for Unsupervised Video Representation Learning
TLDR
An unsupervised representation evaluation metric is proposed that uses distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law, and produces results similar to those of weakly-supervised, task-specific metrics.
Watching the World Go By: Representation Learning from Unlabeled Videos
TLDR
Video Noise Contrastive Estimation is proposed, a method for using unlabeled video to learn strong, transferable single image representations that demonstrate improvements over recent unsupervised single image techniques, as well as over fully supervised ImageNet pretraining, across a variety of temporal and non-temporal tasks.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
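The temporal order verification task reduces to a binary classification problem over ordered versus shuffled frame tuples; a toy sketch of the data construction (the frame encoder and classifier are omitted, and the sampling constraints of the original method are not reproduced):

import torch

def order_verification_batch(frame_triplets):
    # frame_triplets: [B, 3, C, H, W] temporally ordered triplets of frames
    labels = torch.randint(0, 2, (frame_triplets.size(0),))   # 1 = ordered, 0 = shuffled
    out = frame_triplets.clone()
    for i, lab in enumerate(labels):
        if lab == 0:
            perm = torch.randperm(3)
            while torch.equal(perm, torch.arange(3)):          # avoid the identity order
                perm = torch.randperm(3)
            out[i] = frame_triplets[i, perm]
    return out, labels   # a binary classifier is then trained to predict the label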
Self-Supervised MultiModal Versatile Networks
TLDR
This work learns representations using self-supervision by leveraging three modalities naturally present in videos (vision, audio, and language) and incorporates a novel process of deflation, so that the networks can be effortlessly applied to visual data in the form of either a video or a static image.
Unsupervised Learning of Spatiotemporally Coherent Metrics
TLDR
This work focuses on feature learning from unlabeled video data, using the assumption that adjacent video frames contain semantically similar information, and establishes a connection between slow feature learning and metric learning.
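The connection between slowness and metric learning is commonly expressed as a contrastive hinge loss that pulls adjacent frames together and pushes unrelated frames at least a margin apart (a generic formulation, not the paper's exact loss):

import torch.nn.functional as F

def slowness_loss(z_t, z_next, z_other, margin=1.0):
    # z_t, z_next: [B, D] embeddings of temporally adjacent frames of one video
    # z_other:     [B, D] embeddings of frames from unrelated videos
    pos = F.pairwise_distance(z_t, z_next)        # should be small
    neg = F.pairwise_distance(z_t, z_other)       # should exceed the margin
    return (pos + F.relu(margin - neg)).mean()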
Learning Spatiotemporal Features via Video and Text Pair Discrimination
TLDR
A general cross-modal pair discrimination (CPD) framework is proposed to capture the correlation between a video and its associated text, which yields a remarkable performance gain for action recognition on UCF101 and HMDB51 compared with the existing state-of-the-art self-supervised training methods.
...