Long Movie Clip Classification with State-Space Video Models
@article{Islam2022LongMC,
  title   = {Long Movie Clip Classification with State-Space Video Models},
  author  = {Md. Mohaiminul Islam and Gedas Bertasius},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2204.01692}
}
Most modern video recognition models are designed to operate on short video clips (e.g., 5-10s in length). Thus, it is challenging to apply such models to long movie understanding tasks, which typically require sophisticated long-range temporal reasoning. The recently introduced video transformers partially address this issue by using long-range temporal self-attention. However, due to the quadratic cost of self-attention, such models are often costly and impractical to use. Instead, we propose…
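The truncation cuts off the proposal, but the trade-off the abstract sets up is concrete: self-attention over L tokens costs O(L²) in time and memory, whereas a state-space layer processes the same sequence as a linear recurrence in O(L). Below is a minimal sketch of that generic recurrence in JAX (illustrative names and shapes, not the paper's actual ViS4mer layer):

```python
import jax
import jax.numpy as jnp
from jax import lax

def ssm_layer(u, A, B, C):
    """Linear state-space recurrence x_k = A x_{k-1} + B u_k, y_k = C x_k.

    Runs in O(L) for a length-L input, versus O(L^2) for self-attention.
    A: (N, N) state matrix, B: (N,) input map, C: (N,) readout.
    """
    def step(x, u_k):
        x = A @ x + B * u_k   # update hidden state with the new input
        return x, C @ x       # carry state forward, emit scalar output

    x0 = jnp.zeros(A.shape[0])
    _, y = lax.scan(step, x0, u)  # sequential form; S4/S5 compute this faster
    return y

# Toy usage with random parameters (hypothetical values).
kB, kC, ku = jax.random.split(jax.random.PRNGKey(0), 3)
N, L = 4, 16
A = 0.9 * jnp.eye(N)                # stable toy state matrix
B = jax.random.normal(kB, (N,))
C = jax.random.normal(kC, (N,))
u = jax.random.normal(ku, (L,))
y = ssm_layer(u, A, B, C)           # shape (L,)
```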
13 Citations
Efficient Movie Scene Detection using State-Space Transformers
- Computer Science · ArXiv
- 2022
The proposed TranS4mer model outperforms all prior methods on three movie scene detection datasets (MovieNet, BBC, and OVSD), while also being 2× faster and requiring 3× less GPU memory than standard Transformer models.
Selective Structured State-Spaces for Long-Form Video Understanding
- Computer Science
- 2023
A novel Selective S5 model that employs a lightweight mask generator to adaptively select informative image tokens, resulting in more efficient and accurate modeling of long-term spatiotemporal dependencies in videos, and a novel long-short masked contrastive learning (LSMCL) approach that enables the model to predict longer temporal context using shorter input videos.
Spatio-Temporal Crop Aggregation for Video Representation Learning
- Computer Science · ArXiv
- 2022
This work proposes Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that enjoys high scalability at both training and inference time and demonstrates that its video representation yields state-of-the-art performance with linear, non-linear, and KNN probing on common action classification and video understanding datasets.
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
- Computer Science · ArXiv
- 2022
A new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) is introduced to better adapt pre-trained models for long-form VideoQA; it achieves state-of-the-art performance while being superior in computational efficiency and interpretability.
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
- Computer Science · ArXiv
- 2022
Analysis reveals the effectiveness of its components and its higher efficiency in long video grounding: the proposed CONE system improves inference speed by 2x on Ego4d-NLQ and 15x on MAD while maintaining SOTA performance.
HierVL: Learning Hierarchical Video-Language Embeddings
- Computer Science · ArXiv
- 2023
HierVL is proposed, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations between seconds-long video clips and their accompanying text, and that successfully transfers to multiple challenging downstream tasks in both zero-shot and fine-tuned settings.
S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces
- Computer Science · NeurIPS
- 2022
S4ND is proposed, a new multidimensional SSM layer that extends the continuous-signal modeling ability of SSMs to multidimensional data, including images and videos, and demonstrates strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models.
Movies2Scenes: Using Movie Metadata to Learn Scene Representation
- Computer Science
- 2022
This work proposes a novel contrastive learning approach that uses movie metadata to learn a general-purpose scene representation that consistently outperforms existing state-of-the-art methods on a diverse set of tasks evaluated using multiple benchmark datasets.
Simplified State Space Layers for Sequence Modeling
- Computer Science · ArXiv
- 2022
A state space layer that can leverage efficient and widely implemented parallel scans, allowing S5 to match the computational efficiency of S4, while also achieving state-of-the-art performance on several long-range sequence modeling tasks.
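The parallel-scan idea is worth spelling out: a diagonal linear recurrence x_k = a_k · x_{k−1} + b_k is an associative composition of affine maps, so the whole sequence can be resolved in logarithmic parallel depth instead of L sequential steps. A sketch of the generic trick (not the exact S5 layer) using jax.lax.associative_scan:

```python
import jax
import jax.numpy as jnp
from jax import lax

def combine(left, right):
    """Compose two affine maps x -> a*x + b, left applied first:
    a_r * (a_l * x + b_l) + b_r = (a_r * a_l) * x + (a_r * b_l + b_r)."""
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

def linear_recurrence(a, b):
    """All states x_k of x_k = a[k] * x_{k-1} + b[k] with x_{-1} = 0,
    computed via an associative scan (logarithmic parallel depth)."""
    _, x = lax.associative_scan(combine, (a, b))
    return x

# Check against the sequential definition on random inputs.
ka, kb = jax.random.split(jax.random.PRNGKey(1))
a = jax.random.uniform(ka, (8,))
b = jax.random.normal(kb, (8,))
x_seq, states = 0.0, []
for k in range(8):
    x_seq = a[k] * x_seq + b[k]
    states.append(x_seq)
assert jnp.allclose(linear_recurrence(a, b), jnp.array(states), atol=1e-5)
```

S5 diagonalizes the state matrix, so a and b become vectors per time step; the same elementwise combine then applies across all state dimensions at once.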
Simple Hardware-Efficient Long Convolutions for Sequence Modeling
- Computer Science · ArXiv
- 2023
It is found that simple interventions, such as squashing the kernel weights, result in smooth kernels and recover SSM performance on a range of tasks including the Long Range Arena, image classification, language modeling, and brain data modeling.
59 References
ViViT: A Video Vision Transformer
- Computer Science · 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.
Is Space-Time Attention All You Need for Video Understanding?
- Computer Science · ICML
- 2021
This work presents a convolution-free approach to video classification built exclusively on self-attention over space and time, which adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches.
ECO: Efficient Convolutional Network for Online Video Understanding
- Computer Science · ECCV
- 2018
A network architecture that takes long-term content into account while enabling fast per-video processing, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.
Two-Stream Collaborative Learning With Spatial-Temporal Attention for Video Classification
- Computer Science · IEEE Transactions on Circuits and Systems for Video Technology
- 2019
The two-stream collaborative learning with spatial-temporal attention (TCLSTA) approach adaptively learns the fusion weights of the static and motion streams, exploiting the strong complementarity between static and motion information to improve video classification, and achieves the best performance compared with more than 10 state-of-the-art methods.
SmallBigNet: Integrating Core and Contextual Views for Video Classification
- Computer Science · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
The SmallBig network outperforms a number of recent state-of-the-art approaches in terms of accuracy and/or efficiency; sharing convolution between the small and big view branches improves model compactness and alleviates overfitting.
Towards Long-Form Video Understanding
- Computer Science · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
A framework for modeling long-form videos and evaluation protocols on large-scale datasets are introduced, and it is shown that existing state-of-the-art short-term models are limited for long-form tasks.
Video Swin Transformer
- Computer Science · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
This paper advocates an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches that compute self-attention globally, even with spatial-temporal factorization.
VideoBERT: A Joint Model for Video and Language Representation Learning
- Computer Science · 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
VideoGraph: Recognizing Minutes-Long Human Activities in Videos
- Computer Science · ArXiv
- 2019
The graph, its nodes, and its edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation, and it is demonstrated that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.
Efficiently Modeling Long Sequences with Structured State Spaces
- Computer Science · ICLR
- 2022
The Structured State Space sequence model (S4) is proposed based on a new parameterization for the SSM, and it is shown that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths.
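For context on how the recurrence sketched earlier becomes usable: S4 starts from a continuous-time SSM x′(t) = A x(t) + B u(t) and discretizes it with a step size Δ via the bilinear transform before applying the discrete recurrence. A sketch with dense matrices for clarity (S4's actual contribution is a structured parameterization that avoids materializing and inverting these matrices):

```python
import jax.numpy as jnp

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin) discretization of x'(t) = A x(t) + B u(t):

        A_bar = (I - dt/2 * A)^(-1) (I + dt/2 * A)
        B_bar = (I - dt/2 * A)^(-1) * dt * B

    yielding the discrete recurrence x_k = A_bar x_{k-1} + B_bar u_k.
    Dense inverse shown for illustration only; infeasible at S4's scale.
    """
    I = jnp.eye(A.shape[0])
    inv = jnp.linalg.inv(I - (dt / 2.0) * A)
    A_bar = inv @ (I + (dt / 2.0) * A)
    B_bar = inv @ (dt * B)
    return A_bar, B_bar
```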