Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, Kevin P. Murphy
European Conference on Computer Vision
Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. [...] Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network with 2D convolutions, suggesting that temporal representation learning on high-level “semantic” features is more useful.
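The speed side of this result can be illustrated with a rough cost count. The sketch below is not the paper's architecture: the four layer shapes are hypothetical, and the comparison covers only multiply-accumulate operations, not accuracy. It assumes 3 × 3 spatial kernels and a temporal kernel size of 3 for 3D layers and 1 for 2D layers.

```python
# Hypothetical layer shapes: (c_in, c_out, frames, height, width) of the output.
# Early (bottom) layers have larger feature maps than late (top) layers.
LAYERS = [
    (16, 32, 16, 56, 56),
    (32, 64, 16, 28, 28),
    (64, 128, 8, 14, 14),
    (128, 256, 4, 7, 7),
]


def layer_macs(layer, kt):
    """Multiply-accumulates of a kt x 3 x 3 convolution (bias ignored)."""
    c_in, c_out, frames, height, width = layer
    return c_in * c_out * kt * 3 * 3 * frames * height * width


def network_macs(temporal_sizes):
    """Total cost when layer i uses temporal kernel size temporal_sizes[i]."""
    return sum(layer_macs(l, kt) for l, kt in zip(LAYERS, temporal_sizes))


all_3d = network_macs([3, 3, 3, 3])        # 3D convolutions everywhere
top_heavy = network_macs([1, 1, 3, 3])     # 2D at the bottom, 3D at the top
bottom_heavy = network_macs([3, 3, 1, 1])  # 3D at the bottom, 2D at the top

print(all_3d, top_heavy, bottom_heavy)
```

Under these assumed shapes the top-heavy variant is the cheapest of the three, because the layers it converts to 2D are the ones that operate on the largest feature maps.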

Learning Efficient Video Representation with Video Shuffle Networks

A parameter-free plug-in component that efficiently reallocates the inputs of 2D convolution so that its receptive field can be extended to the temporal dimension, and can be flexibly inserted into popular 2D CNNs, forming the Video Shuffle Networks (VSN).

Global evaluate-and-rescale network: an efficient model for action recognition

This paper aims to explore an efficient architecture of 3D-CNN for action recognition, and presents Global Evaluate-and-Rescale (GER) Network, which is able to automatically extract the key frames of input data.

Spatio-Temporal Attention Networks for Action Recognition and Detection

A spatio-temporal attention (STA) network that learns discriminative feature representations for actions by characterizing the beneficial information at both the frame level and the channel level, which enhances the learning capability of 3D convolutions on complex videos.

3D CNNs with Adaptive Temporal Feature Resolutions

  • Mohsen Fayyaz, Emad Bahrami Rad, Juergen Gall
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This work introduces a differentiable Similarity Guided Sampling (SGS) module, which can be plugged into any existing 3D CNN architecture, and improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy.

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

Explores a novel spatial-temporal network (StNet) architecture for both local and global spatial-temporal modeling in videos; it outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity.

Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition

A unified framework is developed for both 2D-CNN and 3D-CNN action models, which removes bells and whistles and provides a common ground for fair comparison; the analysis reveals that a significant leap has been made in efficiency for action recognition, but not in accuracy.

Gate-Shift Networks for Video Action Recognition

An extensive evaluation of the proposed Gate-Shift Module studies its effectiveness in video action recognition, achieving state-of-the-art results on the Something-Something-V1 and Diving48 datasets and competitive results on EPIC-Kitchens with far less model complexity.

Making a Case for 3D Convolutions for Object Segmentation in Videos

It is shown that 3D CNNs can be effectively applied to dense video prediction tasks such as salient object segmentation, and a simple yet effective encoder-decoder network architecture consisting entirely of 3D convolutions that can be trained end-to-end using a standard cross-entropy loss is proposed.

MVFNet: Multi-View Fusion Network for Efficient Video Recognition

A novel multi-view fusion (MVF) module is introduced to exploit video dynamics using separable convolution for efficiency and the proposed MVFNet can achieve state-of-the-art performance with 2D CNN's complexity.

V4D: 4D Convolutional Neural Networks for Video-level Representation Learning

This paper designs a new 4D residual block able to capture inter-clip interactions, which could enhance the representation power of the original clip-level 3D CNNs, and introduces the training and inference methods for the proposed V4D.



Large-Scale Video Classification with Convolutional Neural Networks

This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

This paper devises multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters on the spatial domain (equivalent to a 2D CNN) plus 3 × 1 × 1 convolutions that construct temporal connections between adjacent feature maps in time.
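The saving from this factorization can be checked with simple arithmetic. The sketch below counts weights only (bias ignored) for one full 3 × 3 × 3 convolution versus a 1 × 3 × 3 convolution followed by a 3 × 1 × 1 convolution; the channel widths are hypothetical values chosen for illustration, not taken from the paper.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a 3D convolution with a kt x kh x kw kernel (bias ignored)."""
    return c_in * c_out * kt * kh * kw


# Hypothetical channel widths, used only for illustration.
C_IN, C_OUT = 64, 64

full = conv3d_params(C_IN, C_OUT, 3, 3, 3)       # joint spatiotemporal kernel
spatial = conv3d_params(C_IN, C_OUT, 1, 3, 3)    # spatial part, like a 2D conv
temporal = conv3d_params(C_OUT, C_OUT, 3, 1, 1)  # temporal part across frames
factorized = spatial + temporal

print(full, factorized, full / factorized)  # 110592 49152 2.25
```

With equal input and output widths the ratio is 27 / 12 = 2.25 regardless of the width chosen, since both counts scale with the same channel product.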

Action Recognition with Dynamic Image Networks

A novel four-stream CNN architecture that can learn from RGB and optical flow frames as well as from their dynamic image representations, achieving state-of-the-art performance on UCF101 and HMDB51.

Convolutional Two-Stream Network Fusion for Video Action Recognition

A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed and evaluated on standard benchmarks, where it achieves state-of-the-art results.

Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks

Factorized spatio-temporal convolutional networks (FstCN) are proposed that factorize the original 3D convolution kernel learning into a sequential process of learning 2D spatial kernels in the lower layers, followed by learning 1D temporal kernels in the upper layers.

C3D: Generic Features for Video Analysis

The Convolution 3D (C3D) feature is proposed: a generic spatio-temporal feature obtained by training a deep 3-dimensional convolutional network on a large annotated video dataset comprising objects, scenes, actions, and other frequently occurring concepts. The features encapsulate appearance and motion cues and perform well on different video classification tasks.

Beyond short snippets: Deep networks for video classification

This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics. The paper introduces a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation.
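A minimal sketch of the inflation idea for a single filter, in plain Python: a 2D kernel is repeated along a new time axis and divided by the number of repeats, so that a static clip (one frame repeated) produces the same response as the 2D kernel does on that frame. The helper names and the example values are hypothetical; this is an illustration of the principle, not the authors' code.

```python
def inflate_2d_kernel(kernel_2d, t):
    """Repeat a 2D kernel t times along time and rescale each copy by 1/t."""
    return [[[w / t for w in row] for row in kernel_2d] for _ in range(t)]


def response_2d(kernel_2d, patch):
    """Dot product of a 2D kernel with an image patch of the same size."""
    return sum(k * p
               for k_row, p_row in zip(kernel_2d, patch)
               for k, p in zip(k_row, p_row))


def response_3d(kernel_3d, clip):
    """Dot product of a 3D kernel with a clip: sum of per-frame responses."""
    return sum(response_2d(k2d, frame) for k2d, frame in zip(kernel_3d, clip))


# Hypothetical 3 x 3 kernel and image patch.
kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
frame = [[3.0, 1.0, 4.0], [1.0, 5.0, 9.0], [2.0, 6.0, 5.0]]

T = 4
static_clip = [frame] * T  # the same frame repeated T times
inflated = inflate_2d_kernel(kernel, T)

r2 = response_2d(kernel, frame)
r3 = response_3d(inflated, static_clip)
print(r2, r3)  # -20.0 -20.0
```

Because the responses agree on static clips, weights from a pretrained 2D image model give the inflated 3D filters a sensible starting point before training on video.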

ConvNet Architecture Search for Spatiotemporal Feature Learning

This paper presents an empirical ConvNet architecture search for spatiotemporal feature learning, culminating in a deep 3D Residual ConvNet that outperforms C3D by a good margin on Sports-1M, UCF101, HMDB51, THUMOS14, and ASLAN while being 2 times faster at inference time, 2 times smaller in model size, and having a more compact representation.