Long-Short Temporal Contrastive Learning of Video Transformers
@article{Wang2021LongShortTC,
  title   = {Long-Short Temporal Contrastive Learning of Video Transformers},
  author  = {Jue Wang and Gedas Bertasius and Du Tran and Lorenzo Torresani},
  journal = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2021},
  pages   = {13990-14000}
}
Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with…
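The truncated abstract does not spell out the training objective, but the title points to a contrastive loss that pairs short and long temporal views of the same video. A minimal sketch of such an objective, assuming an InfoNCE-style formulation with in-batch negatives (names and details are illustrative, not the paper's exact method):

```python
import torch
import torch.nn.functional as F

def long_short_contrastive_loss(z_short, z_long, temperature=0.1):
    """InfoNCE-style loss pairing each short-clip embedding with the
    long-clip embedding of the same video (illustrative sketch).

    z_short, z_long: (B, D) embeddings of a short and a long temporal
    view of the same B videos, e.g. from a video transformer followed
    by a projection head.
    """
    z_short = F.normalize(z_short, dim=1)
    z_long = F.normalize(z_long, dim=1)
    logits = z_short @ z_long.t() / temperature  # (B, B) similarities
    labels = torch.arange(z_short.size(0), device=z_short.device)
    # Diagonal entries are positives (same video); the rest are negatives.
    return F.cross_entropy(logits, labels)
```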
20 Citations
Video Transformers: A Survey
- Computer Science, ArXiv
- 2022
This survey analyses and summarizes the main contributions and trends in adapting Transformers to model video data, exploring how videos are embedded and tokenized, and finding a very widespread use of large CNN backbones to reduce dimensionality and a predominance of patches and frames as tokens.
Spatio-Temporal Crop Aggregation for Video Representation Learning
- Computer Science, ArXiv
- 2022
This work proposes Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that enjoys high scalability at both training and inference time, and demonstrates that its video representation yields state-of-the-art performance with linear, non-linear, and k-NN probing on common action classification datasets.
Exploring Temporal Granularity in Self-Supervised Video Representation Learning
- Computer Science, ArXiv
- 2021
This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations, and reveals the impact of temporal granularity with three major findings.
It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training
- Computer Science, ArXiv
- 2022
This work explicitly investigates motion cues in videos as an extra prediction target and proposes an encoder-regressor-decoder pipeline for self-supervised video transformer pre-training; extensive experimental results show that the method learns generalized video representations.
Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning
- Computer Science, 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
- 2022
The proposed Hierarchically Decoupled Spatial-Temporal Contrast (HDC) makes substantial improvements over directly learning spatial-temporal features as a whole and achieves competitive performance when compared with other state-of-the-art unsupervised methods.
Transformers in Vision: A Survey
- Computer Science, ACM Comput. Surv.
- 2022
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding.
M3Video: Masked Motion Modeling for Self-Supervised Video Representation Learning
- Computer Science, ArXiv
- 2022
This paper presents a new self-supervised learning task, called Masked Motion Modeling (M3Video), for learning representations by enforcing the model to predict the motion of moving objects in the masked regions, and improves accuracy when pre-training for 400 epochs.
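The summary above does not specify what the motion target looks like; one plausible stand-in (purely an assumption for illustration, not necessarily the paper's target) is to regress a simple motion signal such as per-patch frame differences for the masked patches:

```python
import torch
import torch.nn.functional as F

def masked_motion_targets(frames, mask, patch=16):
    """Hypothetical motion targets for masked patches: mean absolute
    frame difference pooled per patch (an assumed stand-in, since the
    paper's exact motion target is not given in the summary above).

    frames: (T, C, H, W) clip; mask: (T-1, H//patch, W//patch) boolean
    mask over the patch grid of the first T-1 frames.
    """
    # Crude per-pixel motion map from consecutive-frame differences.
    diff = (frames[1:] - frames[:-1]).abs().mean(dim=1, keepdim=True)
    # Average the motion map within each patch.
    patch_motion = F.avg_pool2d(diff, patch).squeeze(1)  # (T-1, h, w)
    return patch_motion[mask]  # regression targets for masked patches
```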
Hierarchical Self-supervised Representation Learning for Movie Understanding
- Computer Science, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
This paper proposes a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of the authors' hierarchical movie understanding model (based on [37]), and demonstrates the effectiveness of the contextualized event features on LVU tasks.
Self-supervised Video-centralised Transformer for Video Face Clustering
- Computer Science, ArXiv
- 2022
Results show that the video-centralised transformer surpasses all previous state-of-the-art methods on both benchmarks, exhibiting a self-attentive understanding of face videos.
Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos
- Computer Science, ArXiv
- 2022
This work shows that it is possible to use a multi-modal model to tackle a task it was not designed for, which may lead to the generalization of MDETR to additional downstream tasks.
References
Showing 1-10 of 72 references
ViViT: A Video Vision Transformer
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.
Spatiotemporal Contrastive Video Representation Learning
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This work proposes a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining temporal consistency across frames, and a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time.
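As a rough illustration of "temporally consistent spatial augmentation", the sketch below samples one set of crop and flip parameters per clip and applies it to every frame (function and parameter choices are assumptions, not the paper's exact recipe):

```python
import torch
import torchvision.transforms.functional as TF
from torchvision.transforms import RandomResizedCrop

def consistent_spatial_augment(clip, out_size=224):
    """Apply ONE randomly sampled crop + flip to every frame of a clip,
    so the spatial augmentation is consistent across time (sketch).

    clip: (T, C, H, W) tensor of frames from one video clip.
    """
    # Sample augmentation parameters once per clip, not per frame.
    i, j, h, w = RandomResizedCrop.get_params(
        clip[0], scale=(0.3, 1.0), ratio=(0.75, 1.333))
    flip = torch.rand(1).item() < 0.5
    frames = []
    for frame in clip:
        frame = TF.resized_crop(frame, i, j, h, w, [out_size, out_size])
        if flip:
            frame = TF.hflip(frame)
        frames.append(frame)
    return torch.stack(frames)
```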
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework
- Computer Science, ACM Multimedia
- 2020
The proposed Inter-Intra Contrastive (IIC) framework can train spatio-temporal convolutional networks to learn video representations and outperforms current state-of-the-art results by a large margin.
TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning
- Computer Science, IEEE Transactions on Image Processing
- 2022
A novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL) is proposed, which jointly models inter-snippet and intra-snippet temporal dependencies for temporal representation learning with a hybrid graph contrastive learning strategy.
Video Transformer Network
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
- 2021
Inspired by recent developments in vision transformers, VTN is presented, a transformer-based framework for video recognition that enables whole-video analysis via a single end-to-end pass while requiring 1.5× fewer GFLOPs.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
- Computer Science, NeurIPS
- 2021
The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
A self-supervised spatiotemporal learning technique which leverages the chronological order of videos to learn spatiotemporal representations by predicting the order of shuffled clips from a video.
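The clip-order pretext task can be sketched as follows: sample a few clips in temporal order, shuffle them with a known permutation, and train a head to classify which permutation was applied (module and names are illustrative assumptions):

```python
import itertools
import random
import torch
import torch.nn as nn

N_CLIPS = 3
PERMS = list(itertools.permutations(range(N_CLIPS)))  # 3! = 6 classes

def shuffle_clips(clip_feats):
    """clip_feats: (N_CLIPS, D) features of temporally ordered clips.
    Returns the shuffled features and the permutation label."""
    label = random.randrange(len(PERMS))
    perm = torch.tensor(PERMS[label])
    return clip_feats[perm], label

class OrderPredictionHead(nn.Module):
    """Classify the applied permutation from concatenated clip features."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(N_CLIPS * dim, len(PERMS))

    def forward(self, shuffled_feats):  # (B, N_CLIPS, D)
        return self.fc(shuffled_feats.flatten(1))
```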
TEINet: Towards an Efficient Architecture for Video Recognition
- Computer Science, AAAI
- 2020
The proposed TEINet achieves good recognition accuracy on the evaluated datasets while preserving high efficiency; it captures temporal structure flexibly and effectively and remains efficient at model inference.
Is Space-Time Attention All You Need for Video Understanding?
- Computer Science, ICML
- 2021
This paper presents a convolution-free approach to video classification built exclusively on self-attention over space and time, and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.
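The "divided attention" design can be sketched as temporal self-attention followed by spatial self-attention inside each block, each acting along one axis of the (time, space) token grid (a simplified pre-norm rendering; names and the omitted MLP/class-token details are assumptions):

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Sketch of divided space-time attention: each patch token first
    attends across frames (time), then across patches within its frame
    (space), with residual connections around both steps."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, T, S, D) grid of patch tokens
        B, T, S, D = x.shape
        # Temporal attention: same spatial location, across frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
        nt = self.norm1(xt)
        xt = xt + self.time_attn(nt, nt, nt, need_weights=False)[0]
        x = xt.reshape(B, S, T, D).permute(0, 2, 1, 3)
        # Spatial attention: patches within each frame attend to each other.
        xs = x.reshape(B * T, S, D)
        ns = self.norm2(xs)
        xs = xs + self.space_attn(ns, ns, ns, need_weights=False)[0]
        return xs.reshape(B, T, S, D)
```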
Video Representation Learning by Dense Predictive Coding
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
- 2019
With single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.