Video Transformers: A Survey
@article{Selva2022VideoTA,
  title   = {Video Transformers: A Survey},
  author  = {Javier Selva and Anders S. Johansen and Sergio Escalera and Kamal Nasrollahi and Thomas Baltzer Moeslund and Albert Clap{\'e}s},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2201.05991}
}
Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the…
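To make the quadratic-cost claim above concrete, here is a minimal sketch (plain PyTorch; all dimensions are illustrative assumptions, not taken from any surveyed model) that materializes the full space-time attention matrix for a short clip. The N × N matrix is the bottleneck: doubling the number of frames doubles N and quadruples the matrix's memory.

```python
import torch

# Toy dimensions (hypothetical, for illustration only).
T, H, W, D = 16, 14, 14, 64          # frames, spatial grid, channel dim
N = T * H * W                        # number of space-time tokens

x = torch.randn(N, D)                # flattened video tokens
q, k, v = x, x, x                    # single-head self-attention, no projections

# The N x N attention matrix is the quadratic bottleneck: doubling the
# number of frames doubles N and quadruples this tensor's memory.
attn = torch.softmax(q @ k.T / D**0.5, dim=-1)    # shape (N, N)
out = attn @ v                                    # shape (N, D)
print(attn.shape)                                 # torch.Size([3136, 3136])
```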
16 Citations
Transformers in Time Series: A Survey
- Computer Science · ArXiv
- 2022
This paper systematically reviews Transformer schemes for time series modeling, highlighting their strengths as well as their limitations; it is the first work to comprehensively and systematically summarize the recent advances of Transformers for modeling time series data.
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
- Computer Science
- 2023
A comprehensive survey of large-scale pre-trained multi-modal big models, with a focus on data, objectives, network architectures, and knowledge-enhanced pre-training.
Spatiotemporal Decouple-and-Squeeze Contrastive Learning for Semi-Supervised Skeleton-based Action Recognition
- Computer Science · IEEE Transactions on Neural Networks and Learning Systems
- 2023
This work proposes a novel Spatiotemporal Decouple-and-Squeeze Contrastive Learning (SDS-CL) framework to comprehensively learn richer representations of skeleton-based actions by jointly contrasting spatial-squeezing features, temporal-squeezing features, and global features.
Object-ABN: Learning to Generate Sharp Attention Maps for Action Recognition
- Computer Science · IEICE Trans. Inf. Syst.
- 2023
This paper proposes an extension of the Attention Branch Network that uses instance segmentation to generate sharper attention maps for action recognition, introducing a new mask loss that pulls the generated attention maps toward the instance segmentation results.
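A minimal sketch of such a mask loss, assuming a simple L2 distance between the normalized attention map and the segmentation foreground (the function name, normalization, and distance are assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def sharp_attention_mask_loss(attn_map: torch.Tensor,
                              instance_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical mask loss: penalize the distance between the generated
    attention map and the instance-segmentation foreground so that
    attention sharpens around objects."""
    # Normalize each attention map to [0, 1] before comparison (assumed).
    attn_map = attn_map / (attn_map.amax(dim=(-2, -1), keepdim=True) + 1e-6)
    return F.mse_loss(attn_map, instance_mask.float())

loss = sharp_attention_mask_loss(torch.rand(2, 1, 28, 28),
                                 torch.randint(0, 2, (2, 1, 28, 28)))
print(loss.item())
```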
Recur, Attend or Convolve? On Whether Temporal Modeling Matters for Cross-Domain Robustness in Action Recognition
- Computer Science · 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
- 2023
The combined results of the experiments indicate that a sound physical inductive bias, such as recurrence in temporal modeling, may be advantageous when robustness to domain shift is important for the task.
Review of Typical Vehicle Detection Algorithms Based on Deep Learning
- Computer Science · Journal of Engineering Research and Reports
- 2022
The advantages and disadvantages of several representative algorithm models are introduced, and the rapidly growing body of research on Transformer-based object detection is summarized along with its future prospects.
Use of Vision Transformers in Deep Learning Applications
- Computer Science
- 2022
An underdeveloped but highly crucial topic of study, multi-sensory data stream handling, is outlined, along with current challenges that could incite further research.
Neural Architecture Search for Transformers: A Survey
- Computer Science · IEEE Access
- 2022
An in-depth literature review of approximately 50 state-of-the-art Neural Architecture Search methods is provided, targeting the Transformer model and its family of architectures such as Bidirectional Encoder Representations from Transformers (BERT) and Vision Transformers.
Less is More: Facial Landmarks can Recognize a Spontaneous Smile
- Computer Science · BMVC
- 2022
A MeshSmileNet framework, a transformer architecture, is proposed to address the above limitations; it achieves state-of-the-art performance on the UVA-NEMO, BBC, MMI Facial Expression, and SPOS datasets.
References
Showing 1–10 of 285 references
Self-Supervised Learning for Videos: A Survey
- Computer Science · ACM Computing Surveys
- 2022
This survey reviews existing approaches to self-supervised learning in the video domain and groups these methods into four categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement.
TokenLearner: Adaptive Space-Time Tokenization for Videos
- Computer Science · NeurIPS
- 2021
A novel visual representation learning approach which relies on a handful of adaptively learned tokens, is applicable to both image and video understanding tasks, and achieves competitive results at significantly reduced computational cost.
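The following is a hedged sketch of the adaptive-tokenization idea, assuming a single 1×1 convolution produces the spatial attention maps; the layer sizes and normalization are illustrative, not TokenLearner's exact configuration:

```python
import torch
import torch.nn as nn

class TokenLearnerSketch(nn.Module):
    """Minimal sketch in the spirit of TokenLearner: learn S spatial
    attention maps and pool the feature map under each, reducing H*W
    tokens to S adaptively learned tokens."""
    def __init__(self, channels: int, num_tokens: int = 8):
        super().__init__()
        # A single 1x1 conv producing one attention map per token (assumed).
        self.attn = nn.Conv2d(channels, num_tokens, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from a frame or clip slice.
        maps = torch.sigmoid(self.attn(x)).flatten(2)        # (B, S, H*W)
        maps = maps / (maps.sum(dim=-1, keepdim=True) + 1e-6)
        feats = x.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        return maps @ feats                                  # (B, S, C) learned tokens

tokens = TokenLearnerSketch(channels=64)(torch.randn(2, 64, 14, 14))
print(tokens.shape)  # torch.Size([2, 8, 64])
```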
AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing
- Computer Science · ArXiv
- 2021
This comprehensive survey paper explains various core concepts such as pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods; presents a new taxonomy of Transformer-based pretrained language models (T-PTLMs); and gives a brief overview of various benchmarks, both intrinsic and extrinsic.
Space-time Mixing Attention for Video Transformer
- Computer Science · NeurIPS
- 2021
This work proposes a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence, hence inducing no overhead compared to an image-based Transformer model, and shows how to integrate two very lightweight mechanisms for global temporal-only attention that provide additional accuracy improvements at minimal computational cost.
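One way to expose neighboring-frame evidence to per-frame attention at linear cost is to shift a fraction of channels across adjacent frames. The sketch below follows that general channel-shift idea under assumed shapes; it is not the paper's exact space-time mixing scheme:

```python
import torch

def temporal_channel_shift(x: torch.Tensor, shift_frac: float = 0.25) -> torch.Tensor:
    """Hedged sketch of lightweight temporal mixing via channel shifting:
    move a fraction of channels one frame forward and another fraction one
    frame backward, so per-frame (spatial-only) attention still sees
    neighboring-frame features. Cost is linear in the number of frames."""
    b, t, n, c = x.shape                 # batch, frames, tokens, channels
    k = int(c * shift_frac)
    out = x.clone()
    out[:, 1:, :, :k] = x[:, :-1, :, :k]            # shift first k channels forward
    out[:, :-1, :, k:2 * k] = x[:, 1:, :, k:2 * k]  # shift next k channels backward
    return out

print(temporal_channel_shift(torch.randn(2, 8, 196, 64)).shape)
```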
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
- Computer Science · 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
A new Two-Stream Inflated 3D ConvNet (I3D), based on 2D ConvNet inflation, is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.
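The inflation trick itself is simple enough to sketch: each pretrained 2D filter is repeated along a new temporal axis and rescaled so that a video of identical repeated frames reproduces the 2D network's activations (the function name and shapes below are assumptions for illustration):

```python
import torch

def inflate_conv2d_to_3d(weight_2d: torch.Tensor, t: int) -> torch.Tensor:
    """Inflate a 2D conv kernel (O, I, kH, kW) to 3D (O, I, t, kH, kW),
    following the bootstrapping idea described for I3D: repeat each 2D
    filter t times along the temporal axis and divide by t so that a video
    of identical frames yields the same response as the 2D network."""
    return weight_2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t

w2d = torch.randn(64, 3, 7, 7)        # e.g. an ImageNet-pretrained first conv
w3d = inflate_conv2d_to_3d(w2d, t=7)  # (64, 3, 7, 7, 7)
print(w3d.shape)
```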
Recurring the Transformer for Video Action Recognition
- Computer Science · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
A novel Recurrent Vision Transformer framework based on spatial-temporal representation learning for video action recognition, equipped with an attention gate that builds interaction between the current frame input and the previous hidden state, thereby aggregating global inter-frame features through the hidden state over time.
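A hedged sketch of such an attention gate, assuming cross-attention from the current frame's tokens to the hidden state and a sigmoid gate over their concatenation (the paper's exact parameterization may differ):

```python
import torch
import torch.nn as nn

class GatedRecurrentAttentionSketch(nn.Module):
    """Hypothetical attention gate between the current frame input and the
    previous hidden state, loosely following the recurrent ViT idea above."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # x_t, h_prev: (B, N, D). Cross-attend frame tokens to the hidden state.
        ctx, _ = self.attn(x_t, h_prev, h_prev)
        g = torch.sigmoid(self.gate(torch.cat([x_t, ctx], dim=-1)))
        return g * ctx + (1 - g) * h_prev   # gated hidden-state update

h = GatedRecurrentAttentionSketch(64)(torch.randn(2, 196, 64), torch.randn(2, 196, 64))
print(h.shape)  # torch.Size([2, 196, 64])
```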
Cross-Architecture Self-supervised Video Representation Learning
- Computer Science · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
This paper introduces a temporal self-supervised learning module that explicitly predicts an edit distance between two video sequences in temporal order, enabling the model to learn a rich temporal representation that strongly complements the video-level representation learned by CACL.
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
- Computer Science · ArXiv
- 2022
This paper shows that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP), and finds that data quality is more important than data quantity for SSVP.
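The tube-masking idea behind this can be sketched as masking the same random spatial patches in every frame at a very high ratio, so a masked patch cannot be recovered by copying it from a neighboring frame (a simplification; the shapes and ratio handling are assumptions):

```python
import torch

def tube_mask(t: int, h: int, w: int, ratio: float = 0.9) -> torch.Tensor:
    """Hypothetical simplification of VideoMAE-style tube masking: mask the
    same random subset of spatial patches in every frame ("tubes") at the
    very high ratio the paper advocates."""
    num_patches = h * w
    num_masked = int(num_patches * ratio)
    perm = torch.randperm(num_patches)
    spatial_mask = torch.zeros(num_patches, dtype=torch.bool)
    spatial_mask[perm[:num_masked]] = True          # (H*W,) shared by all frames
    return spatial_mask.unsqueeze(0).expand(t, -1)  # (T, H*W)

mask = tube_mask(t=8, h=14, w=14, ratio=0.9)
print(mask.shape, mask.float().mean().item())  # torch.Size([8, 196]) ~0.9
```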
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition
- Computer Science · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
This work presents a novel end-to-end Transformer-based Directed Attention (DirecFormer) framework that consistently achieves state-of-the-art (SOTA) results compared with recent action recognition methods.
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
- Computer Science · ICLR
- 2022
A novel Unified transFormer (UniFormer) that seamlessly integrates the merits of 3D convolution and spatiotemporal self-attention in a concise transformer format, achieving a preferable balance between computation and accuracy.
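A hedged sketch of the trade-off such unified designs exploit: a cheap local token mixer (depthwise 3D convolution) for shallow stages and global self-attention for deep stages. The block design below is an assumption for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LocalOrGlobalBlockSketch(nn.Module):
    """Hypothetical block: local relation aggregation via depthwise 3D conv
    in shallow stages, global spatiotemporal self-attention in deep stages."""
    def __init__(self, dim: int, local: bool):
        super().__init__()
        if local:
            # Depthwise 3D conv: cheap, local, convolution-like inductive bias.
            self.mixer = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        else:
            self.mixer = None
            self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        if self.mixer is not None:
            return x + self.mixer(x)                    # local token mixing
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, T*H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)      # global spatiotemporal attention
        return x + out.transpose(1, 2).reshape(b, c, t, h, w)

blk = LocalOrGlobalBlockSketch(dim=32, local=True)
print(blk(torch.randn(1, 32, 4, 8, 8)).shape)  # torch.Size([1, 32, 4, 8, 8])
```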