A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
@article{Feichtenhofer2021ALS,
  title={A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning},
  author={Christoph Feichtenhofer and Haoqi Fan and Bo Xiong and Ross B. Girshick and Kaiming He},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={3298-3308}
}
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv…
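The abstract describes an objective that encourages temporally-persistent features: clips sampled from the same video should map to similar embeddings. As a hedged illustration only (the study plugs this positive-pair choice into four image-based frameworks rather than prescribing one loss), a minimal NumPy sketch of an InfoNCE-style version of that idea might look like the following; the function name `temporal_persistence_infonce` is hypothetical and not from the paper.

```python
import numpy as np

def temporal_persistence_infonce(z1, z2, temperature=0.1):
    """InfoNCE-style loss treating two clips of the same video as positives.

    z1, z2: (N, D) embeddings of two clips drawn from the same N videos.
    Hypothetical sketch of the 'temporally-persistent features' objective;
    the paper instantiates the idea inside existing contrastive frameworks,
    not in this exact standalone form.
    """
    # Normalize embeddings to unit length so dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (N, N): row i vs. all clips from batch
    # Softmax cross-entropy with the positive pairs on the diagonal.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Under this sketch, embeddings of matched clips from the same video yield a lower loss than mismatched pairings, which is exactly the temporal-persistence pressure the abstract refers to.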
73 Citations
Exploring Temporal Granularity in Self-Supervised Video Representation Learning
- Computer Science · ArXiv
- 2021
This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations, and reveals the impact of temporal granularity with three major findings.
Masked Autoencoders As Spatiotemporal Learners
- Computer Science · ArXiv
- 2022
It is shown that the MAE method can learn strong representations with almost no inductive bias on spacetime, that spacetime-agnostic random masking performs best, and that the general framework of masked autoencoding can be a unified methodology for representation learning with minimal domain knowledge.
Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision
- Computer Science · ArXiv
- 2021
This paper designs a region-based pretext task which requires the model to transform instance representations from one view to another, guided by context features, and introduces a simple network design that successfully reconciles the simultaneous learning process of both holistic and local representations.
Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity
- Computer Science · ArXiv
- 2021
This work formulates three novel continuity-related pretext tasks, i.e. continuity justification, discontinuity localization, and missing section approximation, that jointly supervise a shared backbone for video representation learning and encourage the backbone network to learn local and long-range motion and context representations.
The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning
- Computer Science · ArXiv
- 2021
A contrastive framework to learn audiovisual representations from unlabeled videos is presented; lossy spatio-temporal transformations that do not corrupt the temporal coherency of videos are the most effective, and the approach scales with higher temporal resolution and stronger transformation intensity.
Self-supervised Video Transformer
- Computer Science · ArXiv
- 2021
The proposed Self-supervised Video Transformer (SVT) is the first approach to alleviate the dependency on negative samples or dedicated memory banks; it supports slow-fast video processing within a single architecture using dynamically adjusted positional encoding, and supports long-term relationship modeling along spatiotemporal dimensions.
iBoot: Image-bootstrapped Self-Supervised Video Representation Learning
- Computer Science · ArXiv
- 2022
The typical video-based SSL design and objective are modified to encourage the video encoder to subsume the semantic content of an image-based model trained on a general domain, enabling the model to learn strong spatial and temporal information without relying on labeled video data.
Controllable Augmentations for Video Representation Learning
- Computer Science · ArXiv
- 2022
This paper proposes a framework to jointly utilize local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations, and introduces local-global temporal order dependency to further bridge the gap between clip-level and video-level representations for robust temporal modeling.
Long-Short Temporal Contrastive Learning of Video Transformers
- Computer Science · ArXiv
- 2021
It is empirically demonstrated that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K.
Less than Few: Self-Shot Video Instance Segmentation
- Computer Science · ArXiv
- 2022
This work proposes to automatically learn to find appropriate support videos given a query to bypass the need for labelled examples in few-shot video understanding at run time, and outlines a simple self-supervised learning method to generate an embedding space well-suited for unsupervised retrieval of relevant samples.
References
Showing 1-10 of 101 references
Video Representation Learning by Dense Predictive Coding
- Computer Science · 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
- 2019
With single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.
Spatiotemporal Contrastive Video Representation Learning
- Computer Science · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This work proposes a temporally consistent spatial augmentation method that imposes strong spatial augmentations on each frame of the video while maintaining temporal consistency across frames, and a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time.
Local Aggregation for Unsupervised Learning of Visual Embeddings
- Computer Science · 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This work describes a method that trains an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space, while allowing dissimilar instances to separate.
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
- Computer Science · NeurIPS
- 2020
This paper proposes an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons, and uses a swapped prediction mechanism where it predicts the cluster assignment of a view from the representation of another view.
Evolving Losses for Unsupervised Video Representation Learning
- Computer Science · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
An unsupervised representation evaluation metric is proposed that uses distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law; the evolved losses produce results similar to weakly-supervised, task-specific ones.
Watching the World Go By: Representation Learning from Unlabeled Videos
- Computer Science · ArXiv
- 2020
Video Noise Contrastive Estimation is proposed, a method that uses unlabeled video to learn strong, transferable single-image representations, demonstrating improvements over recent unsupervised single-image techniques, as well as over fully supervised ImageNet pretraining, across a variety of temporal and non-temporal tasks.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
- Computer Science · ECCV
- 2016
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
Self-Supervised MultiModal Versatile Networks
- Computer Science · NeurIPS
- 2020
This work learns representations using self-supervision by leveraging three modalities naturally present in videos, namely vision, audio, and language, and incorporates a novel deflation process so that the networks can be effortlessly applied to visual data in the form of video or a static image.
Unsupervised Learning of Spatiotemporally Coherent Metrics
- Computer Science · 2015 IEEE International Conference on Computer Vision (ICCV)
- 2015
This work focuses on feature learning from unlabeled video data, using the assumption that adjacent video frames contain semantically similar information, and establishes a connection between slow feature learning and metric learning.
Learning Spatiotemporal Features via Video and Text Pair Discrimination
- Computer Science · ArXiv
- 2020
A general cross-modal pair discrimination (CPD) framework to capture this correlation between a video and its associated text, which yields a remarkable performance gain for action recognition on UCF101 and HMDB51 compared with the existing state-of-the-art self-supervised training methods.