Video Transformers: A Survey

Javier Selva, Anders S. Johansen, Sergio Escalera, Kamal Nasrollahi, Thomas Baltzer Moeslund, Albert Clapés
Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the… 

Transformers in Time Series: A Survey

This paper systematically reviews Transformer schemes for time series modeling, highlighting their strengths as well as their limitations, and is the first work to comprehensively and systematically summarize the recent advances of Transformers for modeling time series data.

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

A comprehensive survey of large-scale pre-trained multi-modal models with a focus on data, objectives, network architectures, and knowledge-enhanced pre-training.

Spatiotemporal Decouple-and-Squeeze Contrastive Learning for Semi-Supervised Skeleton-based Action Recognition

This work proposes a novel Spatiotemporal Decouple-and-Squeeze Contrastive Learning (SDS-CL) framework to comprehensively learn more abundant representations of skeleton-based actions by jointly contrasting spatial-squeezing features, temporal-squeezing features, and global features.

Object-ABN: Learning to Generate Sharp Attention Maps for Action Recognition

This paper proposes an extension of the Attention Branch Network that uses instance segmentation to generate sharper attention maps for action recognition, introducing a new mask loss that makes the generated attention maps close to the instance segmentation result.

Recur, Attend or Convolve? On Whether Temporal Modeling Matters for Cross-Domain Robustness in Action Recognition

The combined results of the experiments indicate that sound physical inductive bias such as recurrence in temporal modeling may be advantageous when robustness to domain shift is important for the task.

Review of Typical Vehicle Detection Algorithms Based on Deep Learning

The advantages and disadvantages of several representative algorithm models are introduced, followed by a summary of and outlook on research into Transformer-based object detection algorithms, which are attracting growing interest.

Use of Vision Transformers in Deep Learning Applications

An underdeveloped but highly crucial topic of study, namely multi-sensory data stream handling, is outlined along with current challenges that could incite research.

Neural Architecture Search for Transformers: A Survey

An in-depth literature review of approximately 50 state-of-the-art Neural Architecture Search methods is provided, targeting the Transformer model and its family of architectures such as Bidirectional Encoder Representations from Transformers (BERT) and Vision Transformers.

Less is More: Facial Landmarks can Recognize a Spontaneous Smile

MeshSmileNet, a transformer architecture, is proposed to address the limitations of earlier methods and achieves state-of-the-art performance on the UVA-NEMO, BBC, MMI Facial Expression, and SPOS datasets.



Self-Supervised Learning for Videos: A Survey

This survey provides a review of existing approaches on self-supervised learning focusing on the video domain and summarizes these methods into four different categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement.

TokenLearner: Adaptive Space-Time Tokenization for Videos

A novel visual representation learning approach that relies on a handful of adaptively learned tokens, is applicable to both image and video understanding tasks, and achieves competitive results at significantly reduced computational cost.

AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing

This comprehensive survey paper explains core concepts such as pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods, presents a new taxonomy of T-PTLMs, and gives a brief overview of various benchmarks, both intrinsic and extrinsic.

Space-time Mixing Attention for Video Transformer

This work proposes a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model, and shows how to integrate two very lightweight mechanisms for global temporal-only attention that provide additional accuracy improvements at minimal computational cost.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics, and a new Two-Stream Inflated 3D ConvNet that is based on 2D ConvNet inflation is introduced.

Recurring the Transformer for Video Action Recognition

A novel Recurrent Vision Transformer framework based on spatial-temporal representation learning for video action recognition, equipped with an attention gate that builds interaction between the current frame input and the previous hidden state, thus aggregating global-level inter-frame features through the hidden state over time.

Cross-Architecture Self-supervised Video Representation Learning

This paper introduces a temporal self-supervised learning module that explicitly predicts an edit distance between two video sequences in temporal order, which enables the model to learn a rich temporal representation that strongly complements the video-level representation learned by the CACL.

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

This paper shows that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP) and shows that data quality is more important than data quantity for SSVP.

DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition

This work presents a novel end-to-end Transformer-based Directed Attention (DirecFormer) framework that consistently achieves state-of-the-art (SOTA) results compared with recent action recognition methods.

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

A novel Unified transFormer (UniFormer) that seamlessly integrates the merits of 3D convolution and spatiotemporal self-attention in a concise transformer format, and achieves a preferable balance between computation and accuracy.