Long-Short Temporal Contrastive Learning of Video Transformers

Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani. CVPR 2022.

Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with…

Video Transformers: A Survey

This survey analyses and summarizes the main contributions and trends for adapting Transformers to model video data, and explores how videos are embedded and tokenized, finding a very widespread use of large CNN backbones to reduce dimensionality and a predominance of patches and frames as tokens.

Spatio-Temporal Crop Aggregation for Video Representation Learning

This work proposes Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that enjoys high scalability at both training and inference time, and demonstrates that its video representation yields state-of-the-art performance with linear, non-linear, and k-NN probing on common action classification datasets.

Exploring Temporal Granularity in Self-Supervised Video Representation Learning

This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations, and reveals the impact of temporal granularity with three major findings.

It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training

This work explicitly investigates motion cues in videos as an extra prediction target and proposes an encoder-regressor-decoder pipeline for self-supervised video transformer pre-training; extensive experimental results prove that the method learns generalized video representations.

Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning

The proposed Hierarchically Decoupled Spatial-Temporal Contrast (HDC) makes substantial improvements over directly learning spatial-temporal features as a whole and achieves competitive performance when compared with other state-of-the-art unsupervised methods.

Transformers in Vision: A Survey

This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding.

M3Video: Masked Motion Modeling for Self-Supervised Video Representation Learning

This paper presents a new self-supervised learning task, called Masked Motion Modeling (M3Video), for learning representations by enforcing the model to predict the motion of moving objects in the masked regions, and improves accuracy when pre-training for 400 epochs.

Hierarchical Self-supervised Representation Learning for Movie Understanding

This paper proposes a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of the authors' hierarchical movie understanding model (based on [37]), and demonstrates the effectiveness of the contextualized event features on LVU tasks.

Self-supervised Video-centralised Transformer for Video Face Clustering

Results show the performance of the video-centralised transformer has surpassed all previous state-of-the-art methods on both benchmarks, exhibiting a self-attentive understanding of face videos.

Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos

This work shows that it is possible to use a multi-modal model to tackle a task that it was not designed for, and may lead to the generalization of MDETR to additional downstream tasks.

ViViT: A Video Vision Transformer

This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.

Spatiotemporal Contrastive Video Representation Learning

This work proposes a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining temporal consistency across frames, and proposes a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time.
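The core of a temporally consistent spatial augmentation is that one set of random augmentation parameters is sampled per clip and reused for every frame. A minimal sketch of that idea (the function name and the pure-Python frame representation are illustrative assumptions, not the paper's implementation):

```python
import random

def temporally_consistent_crop(frames, crop_h, crop_w, seed=None):
    """Apply the SAME random crop to every frame of a clip, so the spatial
    augmentation is strong but temporally consistent across frames.
    frames: list of 2-D grids (list of rows) of equal size."""
    rng = random.Random(seed)
    H, W = len(frames[0]), len(frames[0][0])
    # sample the crop offsets once per clip, not once per frame
    top = rng.randrange(H - crop_h + 1)
    left = rng.randrange(W - crop_w + 1)
    return [[row[left:left + crop_w] for row in f[top:top + crop_h]]
            for f in frames]
```

Sampling per clip rather than per frame is what keeps motion cues intact: the crop window moves with none of the frames, so temporal structure inside the clip is preserved.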

Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

The proposed Inter-Intra Contrastive (IIC) framework can train spatio-temporal convolutional networks to learn video representations and outperforms current state-of-the-art results by a large margin.

TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning

A novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL) is proposed, which jointly models the inter-snippet and intra-snippet temporal dependencies for temporal representation learning with a hybrid graph contrastive learning strategy.

Video Transformer Network

Inspired by recent developments in vision transformers, VTN is presented, a transformer-based framework for video recognition that enables whole video analysis, via a single end-to-end pass, while requiring 1.5× fewer GFLOPs.

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.

Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction

A self-supervised spatiotemporal learning technique which leverages the chronological order of videos to learn the spatiotemporal representation of a video by predicting the order of shuffled clips from the video.
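The clip-order pretext task can be sketched as a sample generator: split a video into clips, shuffle them, and use the permutation index as the classification label. This is a minimal stdlib sketch under assumed names, not the paper's code:

```python
import itertools
import random

def make_order_prediction_sample(video, clip_len=8, num_clips=3, seed=None):
    """Split a video (here: a flat list of frames) into `num_clips`
    consecutive clips, shuffle them, and return
    (shuffled_clips, permutation_label) as one pretext-task sample."""
    rng = random.Random(seed)
    clips = [video[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]
    # the label space is all num_clips! orderings of the clips
    perms = list(itertools.permutations(range(num_clips)))
    label = rng.randrange(len(perms))
    shuffled = [clips[i] for i in perms[label]]
    return shuffled, label
```

A network trained on such samples must recover the original chronology from appearance and motion alone, which is why the learned features transfer to action recognition.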

TEINet: Towards an Efficient Architecture for Video Recognition

The proposed TEINet achieves good recognition accuracy on these datasets while preserving high efficiency; it captures temporal structure flexibly and effectively and remains efficient at model inference.

Is Space-Time Attention All You Need for Video Understanding?

This paper presents a convolution-free approach to video classification built exclusively on self-attention over space and time, and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.
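"Divided attention" factorizes full space-time attention into two cheaper passes: each token first attends over the same patch location across frames, then over all patches within its own frame. A minimal NumPy sketch of that factorization, with single-head attention and no learned projections (so it illustrates only the attention pattern, not the full block):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over the second-to-last axis
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def divided_space_time_attention(x):
    """x: (T, S, D) tokens for T frames, S patches per frame, dim D."""
    # temporal attention: each spatial location attends across frames
    xt = x.transpose(1, 0, 2)                     # (S, T, D)
    x = x + attention(xt, xt, xt).transpose(1, 0, 2)  # residual add
    # spatial attention: each frame attends across its own patches
    return x + attention(x, x, x)                 # scores are (T, S, S)
```

The factorized form costs O(T^2·S + T·S^2) comparisons per block instead of O(T^2·S^2) for joint space-time attention, which is what makes it practical at video resolutions.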

Video Representation Learning by Dense Predictive Coding

With single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.