Long Movie Clip Classification with State-Space Video Models

Md. Mohaiminul Islam, Gedas Bertasius
Most modern video recognition models are designed to operate on short video clips (e.g., 5-10s in length). Thus, it is challenging to apply such models to long movie understanding tasks, which typically require sophisticated long-range temporal reasoning. The recently introduced video transformers partially address this issue by using long-range temporal self-attention. However, due to the quadratic cost of self-attention, such models are often costly and impractical to use. Instead, we propose… 
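The quadratic cost mentioned in the abstract can be made concrete with a little arithmetic. The sketch below is only illustrative: the function names and the tokenization (1 sampled frame per second, 196 tokens per frame) are assumptions made for the example and are not taken from the paper.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def sequential_steps(num_tokens: int) -> int:
    """Steps for a model that visits each token once (linear in length)."""
    return num_tokens


# Hypothetical tokenization: 1 sampled frame per second, 196 tokens per frame.
TOKENS_PER_SECOND = 196

short_clip = 10 * TOKENS_PER_SECOND  # tokens in a 10-second clip
long_clip = 60 * TOKENS_PER_SECOND   # tokens in a 60-second clip

# How much each cost grows when the clip becomes 6x longer.
pair_growth = attention_pairs(long_clip) // attention_pairs(short_clip)
step_growth = sequential_steps(long_clip) // sequential_steps(short_clip)

print(pair_growth, step_growth)  # 36 6
```

Under these assumed numbers, a clip that is 6 times longer needs 36 times as many attention scores, while a linear-time pass grows only 6 times.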

Efficient Movie Scene Detection using State-Space Transformers

The proposed TranS4mer model outperforms all prior methods on three movie scene detection datasets (MovieNet, BBC, and OVSD) while also being 2× faster and requiring 3× less GPU memory than standard Transformer models.

Selective Structured State-Spaces for Long-Form Video Understanding

A novel Selective S5 model that employs a lightweight mask generator to adaptively select informative image tokens, resulting in more efficient and accurate modeling of long-term spatiotemporal dependencies in videos, together with a novel long-short masked contrastive learning (LSMCL) approach that enables the model to predict longer temporal context using shorter input videos.

Spatio-Temporal Crop Aggregation for Video Representation Learning

This work proposes Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that enjoys high scalability at both training and inference time and demonstrates that its video representation yields state-of-the-art performance with linear, non-linear, and KNN probing on common action classification and video understanding datasets.

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

A new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) is introduced to better adapt pre-trained models for long-form VideoQA; it achieves state-of-the-art performance and is superior in computational efficiency and interpretability.

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

Analysis reveals the effectiveness of components and higher efficiency in long video grounding as the proposed CONE system improves the inference speed by 2x on Ego4d-NLQ and 15x on MAD while keeping the SOTA performance of CONE.

HierVL: Learning Hierarchical Video-Language Embeddings

HierVL is proposed, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations between seconds-long video clips and their accompanying text and successfully transfers to multiple challenging downstream tasks in both zero-shot and fine-tuned settings.

S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces

S4ND is proposed, a new multidimensional SSM layer that extends the continuous-signal modeling ability of SSMs to multidimensional data, including images and videos, and demonstrates strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models.

Movies2Scenes: Using Movie Metadata to Learn Scene Representation

This work proposes a novel contrastive learning approach that uses movie metadata to learn a general-purpose scene representation that consistently outperforms existing state-of-the-art methods on a diverse set of tasks evaluated using multiple benchmark datasets.

Simplified State Space Layers for Sequence Modeling

A state space layer that can leverage efficient and widely implemented parallel scans, allowing S5 to match the computational efficiency of S4, while also achieving state-of-the-art performance on several long-range sequence modeling tasks.

Simple Hardware-Efficient Long Convolutions for Sequence Modeling

It is found that simple interventions--such as squashing the kernel weights--result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling.

ViViT: A Video Vision Transformer

This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.

Is Space-Time Attention All You Need for Video Understanding?

This work presents a convolution-free approach to video classification built exclusively on self-attention over space and time, which adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches.

ECO: Efficient Convolutional Network for Online Video Understanding

A network architecture that takes long-term content into account and enables fast per-video processing at the same time and achieves competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.

Two-Stream Collaborative Learning With Spatial-Temporal Attention for Video Classification

The two-stream collaborative learning with spatial-temporal attention (TCLSTA) approach achieves the best performance compared with more than 10 state-of-the-art methods and adaptively learns the fusion weights of static and motion streams, thus exploiting the strong complementarity between static and motion information to improve video classification.

SmallBigNet: Integrating Core and Contextual Views for Video Classification

The SmallBig network outperforms a number of recent state-of-the-art approaches, in terms of accuracy and/or efficiency, and proposes to share convolution in the small and big view branch, which improves model compactness as well as alleviates overfitting.

Towards Long-Form Video Understanding

A framework for modeling long-form videos and evaluation protocols on large-scale datasets are introduced, and it is shown that existing state-of-the-art short-term models are limited for long-form tasks.

Video Swin Transformer

  • Ze Liu, Jia Ning, ..., Han Hu
  • Computer Science
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2022
This paper advocates an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization.

VideoBERT: A Joint Model for Video and Language Representation Learning

This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.

VideoGraph: Recognizing Minutes-Long Human Activities in Videos

The graph, its nodes and edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation, and it is demonstrated that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.

Efficiently Modeling Long Sequences with Structured State Spaces

The Structured State Space sequence model (S4) is proposed based on a new parameterization for the SSM, and it is shown that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths.
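
As a rough illustration of what a state space model computes, the sketch below runs a discrete linear SSM with a scalar state in two equivalent ways: as a step-by-step recurrence and as a causal convolution with the kernel K_j = c · a^j · b. This is a generic textbook form under simplifying assumptions (scalar state, zero initial state, already-discretized parameters); it is not the S4 parameterization, and the function names are hypothetical.

```python
import math


def ssm_recurrence(u, a, b, c):
    """Run x_k = a * x_{k-1} + b * u_k, y_k = c * x_k with x_{-1} = 0."""
    x = 0.0
    ys = []
    for u_k in u:
        x = a * x + b * u_k
        ys.append(c * x)
    return ys


def ssm_convolution(u, a, b, c):
    """Same outputs via a causal convolution with kernel K_j = c * a**j * b."""
    kernel = [c * (a ** j) * b for j in range(len(u))]
    return [
        sum(kernel[j] * u[k - j] for j in range(k + 1))
        for k in range(len(u))
    ]


inputs = [1.0, 2.0, 3.0]
y_rec = ssm_recurrence(inputs, a=0.5, b=1.0, c=2.0)
y_conv = ssm_convolution(inputs, a=0.5, b=1.0, c=2.0)

print(y_rec)   # [2.0, 5.0, 8.5]
print(y_conv)  # [2.0, 5.0, 8.5]
assert all(math.isclose(p, q) for p, q in zip(y_rec, y_conv))
```

The recurrence is convenient for step-by-step inference, while the convolutional view allows the whole sequence to be processed at once; the naive convolution here is quadratic in sequence length and is shown only to check the equivalence.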