Corpus ID: 231861462

Is Space-Time Attention All You Need for Video Understanding?

@inproceedings{Bertasius2021IsSA,
  title={Is Space-Time Attention All You Need for Video Understanding?},
  author={Gedas Bertasius and Heng Wang and Lorenzo Torresani},
  booktitle={ICML},
  year={2021}
}
We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named “TimeSformer,” adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads…
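The divided-attention scheme summarized above is simple enough to sketch. The PyTorch block below is a minimal illustration, not the authors' released TimeSformer code: it assumes patch tokens of shape (batch, frames, patches, dim), omits the classification token and positional embeddings, and the class name DividedSpaceTimeBlock and layer sizes are made up for the example. Each block applies temporal attention (each patch location attends across frames), then spatial attention (each frame attends across its own patches), each with a residual connection, followed by the usual Transformer MLP.

```python
# Minimal sketch of "divided" space-time attention (illustrative only, not
# the released TimeSformer code). Input: patch tokens of shape
# (batch, frames, patches, dim); class token and positional embeddings omitted.
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):  # hypothetical name
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.temporal_norm = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):  # x: (B, T, N, D)
        B, T, N, D = x.shape
        # Temporal attention: each patch location attends across the T frames.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        tn = self.temporal_norm(t)
        t, _ = self.temporal_attn(tn, tn, tn)
        x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: within each frame, patches attend to each other.
        s = x.reshape(B * T, N, D)
        sn = self.spatial_norm(s)
        s, _ = self.spatial_attn(sn, sn, sn)
        x = x + s.reshape(B, T, N, D)
        # Standard Transformer MLP with a residual connection.
        return x + self.mlp(x)


# Example: 2 clips, 8 frames, 14x14 = 196 patches of dimension 768.
block = DividedSpaceTimeBlock()
out = block(torch.randn(2, 8, 196, 768))
print(out.shape)  # torch.Size([2, 8, 196, 768])
```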

Figures from this paper

Citations

ViViT: A Video Vision Transformer
TLDR
This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.
VideoLightFormer: Lightweight Action Recognition using Transformers
TLDR
This work proposes a novel, lightweight action recognition architecture, VideoLightFormer, which carefully extends the 2D convolutional Temporal Segment Network with transformers, while maintaining spatial and temporal video structure throughout the entire model.
Space-time Mixing Attention for Video Transformer
TLDR
This work proposes a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model, and shows how to integrate two very lightweight mechanisms for global temporal-only attention which provide additional accuracy improvements at minimal computational cost.
Video Swin Transformer
TLDR
The proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models, and achieves state-of-the-art accuracy on a broad range of video recognition benchmarks.
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
TLDR
Space-Time Crop & Attend (STiCA) is introduced, a method to simulate spatial augmentations much more efficiently directly in feature space; transformer-based attention improves performance significantly and is well suited for processing feature crops.
Long-Short Temporal Contrastive Learning of Video Transformers
TLDR
It is empirically demonstrated that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K.
VidTr: Video Transformer Without Convolutions
TLDR
A Video Transformer with separable attention for video classification is introduced, together with standard-deviation-based topK pooling for attention (pooltopK_std), which reduces computation by dropping non-informative features along the temporal dimension (a rough sketch follows this list).
Exploring Stronger Feature for Temporal Action Localization
TLDR
Transformer-based methods can achieve better classification performance than convolution-based ones but cannot generate accurate action proposals; extracting features at a larger frame resolution to reduce the loss of spatial information can also effectively improve the performance of temporal action localization.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
TLDR
The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.
Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories
TLDR
This work proposes a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration, which enables the learning of long-range dependencies beyond a single clip and significantly improves the accuracy of video classification at a negligible computational overhead.
...
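As flagged in the VidTr entry above, the following is a rough sketch of what standard-deviation-based top-K temporal pooling can look like. It is an interpretation for illustration only, not the VidTr implementation: the token layout (batch, frames, patches, dim), the scoring rule (mean per-frame standard deviation of token features), and the name pool_topk_std are all assumptions.

```python
# Rough sketch of std-based top-K temporal pooling (an interpretation for
# illustration; not the VidTr implementation). Frames whose token features
# vary the least are assumed to be uninformative and are dropped.
import torch


def pool_topk_std(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """tokens: (batch, frames, patches, dim) -> (batch, k, patches, dim)."""
    # Score each frame by the average standard deviation of its token features.
    scores = tokens.std(dim=-1).mean(dim=-1)                # (batch, frames)
    # Keep the k highest-scoring frames, restoring temporal order.
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values  # (batch, k)
    idx = idx[:, :, None, None].expand(-1, -1, tokens.size(2), tokens.size(3))
    return tokens.gather(1, idx)


# Example: keep 8 of 16 frames from a (2, 16, 196, 768) token tensor.
print(pool_topk_std(torch.randn(2, 16, 196, 768), k=8).shape)
# torch.Size([2, 8, 196, 768])
```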

References

SHOWING 1-10 OF 81 REFERENCES
A2-Nets: Double Attention Networks
TLDR
This work proposes the "double attention block", a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access features from the entire space efficiently.
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
TLDR
A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video and outperforms other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation
TLDR
A lightweight and memory-friendly architecture for action recognition is presented that performs on par with or better than current architectures using only a fraction of the resources, and a temporal aggregation module is proposed to model temporal dependencies in a video at very small additional computational cost.
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
TLDR
It is shown that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions, suggesting that temporal representation learning on high-level “semantic” features is more useful.
On the Relationship between Self-Attention and Convolutional Layers
TLDR
This work proves that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer, which provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.
Attention Augmented Convolutional Networks
TLDR
It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile-constrained network, while keeping the number of parameters similar.
Long-Term Feature Banks for Detailed Video Understanding
TLDR
This paper proposes a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds.
Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification
TLDR
Experimental results on the challenging Kinetics dataset demonstrate that the proposed temporal modeling approaches can significantly improve existing approaches in the large-scale video recognition tasks.
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
TLDR
I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics, and a new Two-Stream Inflated 3D ConvNet that is based on 2D ConvNet inflation is introduced.
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
TLDR
This paper factorizes 2D self-attention into two 1D self-attentions, a novel building block that one could stack to form axial-attention models for image classification and dense prediction, and achieves state-of-the-art results on Mapillary Vistas and Cityscapes (a minimal sketch of the factorization appears after this list).
...
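The Axial-DeepLab entry above factorizes 2D self-attention into two 1D passes; the sketch below illustrates only that factorization. It is not the Axial-DeepLab implementation (which additionally uses position-sensitive attention), and the class name AxialAttention2d, the (batch, channels, height, width) layout, and the sizes are assumptions.

```python
# Minimal sketch of axial attention: full 2D self-attention replaced by a
# height-axis pass followed by a width-axis pass (illustrative only; the
# Axial-DeepLab paper additionally uses position-sensitive attention).
import torch
import torch.nn as nn


class AxialAttention2d(nn.Module):  # hypothetical name
    def __init__(self, dim=64, heads=8):
        super().__init__()
        self.height_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.width_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        # Height axis: each column of pixels attends along its height.
        h = x.permute(0, 3, 2, 1).reshape(B * W, H, C)
        h, _ = self.height_attn(h, h, h)
        x = x + h.reshape(B, W, H, C).permute(0, 3, 2, 1)
        # Width axis: each row of pixels attends along its width.
        w = x.permute(0, 2, 3, 1).reshape(B * H, W, C)
        w, _ = self.width_attn(w, w, w)
        return x + w.reshape(B, H, W, C).permute(0, 3, 1, 2)


# Cost drops from O((H*W)^2) for full 2D attention to O(H*W*(H + W)).
attn = AxialAttention2d()
print(attn(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```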