Is Space-Time Attention All You Need for Video Understanding?
@inproceedings{Bertasius2021IsSA,
  title     = {Is Space-Time Attention All You Need for Video Understanding?},
  author    = {Gedas Bertasius and Heng Wang and Lorenzo Torresani},
  booktitle = {ICML},
  year      = {2021}
}
We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.
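Taken literally, the "divided attention" scheme applies temporal self-attention (across frames, at a fixed spatial location) and then spatial self-attention (across patches, within one frame) inside every block. The snippet below is a minimal sketch of that factorization, assuming a (batch, frames, patches, dim) patch-embedding layout and using torch.nn.MultiheadAttention; the classification token, MLP sub-layer, and other details of the actual TimeSformer block are omitted.

```python
import torch
import torch.nn as nn


class DividedSpaceTimeAttention(nn.Module):
    """Illustrative divided space-time attention block (sketch only)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) frame-level patch embeddings
        b, t, p, d = x.shape

        # Temporal attention: each patch attends to the patches at the same
        # spatial location in the other frames -> fold patches into batch.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        h = self.norm_t(xt)
        xt = xt + self.temporal_attn(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Spatial attention: each patch attends to all patches of its own
        # frame -> fold frames into batch.
        xs = x.reshape(b * t, p, d)
        h = self.norm_s(xs)
        xs = xs + self.spatial_attn(h, h, h, need_weights=False)[0]
        return xs.reshape(b, t, p, d)
```

The folding pattern above is the essence of the divided scheme: attention cost grows with frames plus patches per block rather than with their product, which is what makes joint space-time modeling tractable.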
300 Citations
ViViT: A Video Vision Transformer
- ICCV 2021
This work shows how to effectively regularise the model during training and leverage pretrained image models to train on comparatively small datasets, achieving state-of-the-art results on multiple video classification benchmarks.
VideoLightFormer: Lightweight Action Recognition using Transformers
- arXiv 2021
This work proposes a novel, lightweight action recognition architecture, VideoLightFormer, which carefully extends the 2D convolutional Temporal Segment Network with transformers, while maintaining spatial and temporal video structure throughout the entire model.
Space-time Mixing Attention for Video Transformer
- NeurIPS 2021
This work proposes a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence, and hence induces no overhead compared to an image-based Transformer model, and shows how to integrate two very lightweight mechanisms for global temporal-only attention that provide additional accuracy improvements at minimal computational cost.
Video Swin Transformer
- arXiv 2021
The proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models, and achieves state-of-the-art accuracy on a broad range of video recognition benchmarks.
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
- ICCV 2021
Space-Time Crop & Attend (STiCA) is introduced, a method that simulates spatial augmentations much more efficiently directly in feature space; transformer-based attention improves performance significantly and is well suited for processing feature crops.
Long-Short Temporal Contrastive Learning of Video Transformers
- arXiv 2021
It is empirically demonstrated that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K.
VidTr: Video Transformer Without Convolutions
- ICCV 2021
A Video Transformer with separable attention for video classification is introduced, together with standard-deviation-based top-K pooling for attention (pool_topK_std), which reduces computation by dropping non-informative features along the temporal dimension.
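As a rough illustration of the idea summarised above, the sketch below scores each temporal position by the standard deviation of its token features and keeps only the top-K frames; the scoring rule, tensor layout, and function name are assumptions for illustration rather than VidTr's exact pooling operator.

```python
import torch


def topk_std_temporal_pool(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k most "informative" frames, scored by feature std (sketch)."""
    # x: (batch, frames, tokens, dim) video token features
    b, t, n, d = x.shape
    # Score each frame by how much its token features vary; frames with
    # near-constant features are treated as non-informative and dropped.
    scores = x.reshape(b, t, n * d).std(dim=-1)           # (batch, frames)
    keep = scores.topk(k, dim=1).indices                  # (batch, k)
    keep, _ = keep.sort(dim=1)                            # preserve temporal order
    idx = keep[:, :, None, None].expand(-1, -1, n, d)     # (batch, k, tokens, dim)
    return x.gather(dim=1, index=idx)                     # (batch, k, tokens, dim)
```

Pooling of this kind shortens the temporal sequence that later attention layers must process, which is where the claimed computation savings come from.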
Exploring Stronger Feature for Temporal Action Localization
- arXiv 2021
Transformer-based methods can achieve better classification performance than convolution-based ones but cannot generate accurate action proposals; extracting features at larger frame resolution to reduce the loss of spatial information can also effectively improve the performance of temporal action localization.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
- NeurIPS 2021
The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.
Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories
- CVPR 2021
This work proposes a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration, which enables the learning of long-range dependencies beyond a single clip and significantly improves the accuracy of video classification at a negligible computational overhead.
References
Showing 1-10 of 81 references
A2-Nets: Double Attention Networks
- NeurIPS 2018
This work proposes the "double attention block", a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access featuresFrom the entire space efficiently.
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
- CVPR 2017
A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video and outperforms other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation
- NeurIPS 2019
A lightweight and memory-friendly architecture for action recognition is presented that performs on par with or better than current architectures while using only a fraction of the resources, together with a temporal aggregation module that models temporal dependencies in a video at very small additional computational cost.
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
- ECCV 2018
It is shown that it is possible to replace many of the 3D convolutions with low-cost 2D convolutions, suggesting that temporal representation learning on high-level "semantic" features is more useful.
On the Relationship between Self-Attention and Convolutional Layers
- ICLR 2020
This work proves that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer, which provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.
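For reference, the core of that argument can be sketched as follows (notation adapted, not the paper's exact statement): with relative positional encodings, each head h can be steered to attend only to the pixel at one fixed shift \delta_h of the query position q, so the multi-head output reduces to a convolution.

```latex
% Hedged sketch in adapted notation: if head h attends only to the pixel at
% relative shift \delta_h of the query position q, the multi-head
% self-attention output at q becomes
\[
  \operatorname{MHSA}(X)_q
    = \sum_{h=1}^{N_h} X_{q+\delta_h}\, W^{(h)}_{\mathrm{val}} W^{(h)}_{\mathrm{out}},
\]
% which, when the N_h = K^2 shifts \delta_h enumerate a K \times K
% neighbourhood, is exactly a K \times K convolution whose kernel slice at
% offset \delta_h is W^{(h)}_{\mathrm{val}} W^{(h)}_{\mathrm{out}}.
```

Informally, then, K^2 heads suffice to reproduce a K x K convolutional layer, provided the positional encoding lets each head localize on a single relative offset.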
Attention Augmented Convolutional Networks
- ICCV 2019
It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile-constrained network, while keeping the number of parameters similar.
Long-Term Feature Banks for Detailed Video Understanding
- CVPR 2019
This paper proposes a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds.
Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification
- arXiv 2017
Experimental results on the challenging Kinetics dataset demonstrate that the proposed temporal modeling approaches can significantly improve existing approaches in the large-scale video recognition tasks.
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
- CVPR 2017
A new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.
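The inflation step itself can be illustrated in a few lines: a pretrained 2D kernel is repeated along a new temporal axis and rescaled so that a video of identical frames yields the same activations as the source image model. This is a sketch under that assumption; the function name and layer handling below are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn


def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int) -> nn.Conv3d:
    """Illustrative 2D -> 3D kernel inflation (sketch only, groups=1)."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # Repeat the 2D kernel over time and divide by the temporal extent so
        # that a constant-in-time input reproduces the original 2D response.
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1)
        conv3d.weight.copy_(w / time_kernel)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```

Applying this to every convolution of a pretrained 2D backbone yields a 3D network that starts from image-level features rather than random weights, which is the motivation for inflation.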
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
- ECCV 2020
This paper factorizes 2D self-attention into two 1D self-attentions, a novel building block that can be stacked to form axial-attention models for image classification and dense prediction, and achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
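The factorization summarised above can be sketched as 1D attention along the height axis followed by 1D attention along the width axis; this omits the position-sensitive terms and other details of Axial-DeepLab, and the layer choices and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AxialAttention2d(nn.Module):
    """Illustrative axial (height-then-width) self-attention (sketch only)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.height_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.width_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim) feature map
        b, h, w, d = x.shape

        # 1D attention along the height axis: every column is a sequence.
        col = x.permute(0, 2, 1, 3).reshape(b * w, h, d)
        col = col + self.height_attn(col, col, col, need_weights=False)[0]
        x = col.reshape(b, w, h, d).permute(0, 2, 1, 3)

        # 1D attention along the width axis: every row is a sequence.
        row = x.reshape(b * h, w, d)
        row = row + self.width_attn(row, row, row, need_weights=False)[0]
        return row.reshape(b, h, w, d)
```

The two axial passes cost roughly O(HW·H) and O(HW·W) respectively, versus O((HW)^2) for full 2D self-attention, which is what makes stacking such blocks on large feature maps practical.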