Spatiotemporal Fusion in 3D CNNs: A Probabilistic View

Yizhou Zhou, Xiaoyan Sun, Chong Luo, Zhengjun Zha, Wenjun Zeng
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Despite their success in still-image recognition, deep neural networks for spatiotemporal tasks (such as human action recognition in videos) have long suffered from low efficacy and efficiency. Recently, researchers have put more effort into analyzing the importance of different components in 3D convolutional neural networks (3D CNNs) in order to design more powerful spatiotemporal learning backbones. Among these components, spatiotemporal fusion is one of the essentials. It controls…


A Comprehensive Review of Recent Deep Learning Techniques for Human Activity Recognition

This survey covers recent convolution-free methods that replace convolutional networks with transformer networks, which have achieved state-of-the-art results on many human action recognition datasets.

Human Action Recognition from Various Data Modalities: A Review

This paper presents a comprehensive survey of recent progress in deep learning methods for HAR based on the type of input data modality, including the fusion-based and the co-learning-based frameworks.

Human Action Recognition Algorithm Based on Multi-Feature Map Fusion

This work proposes an improved ResNeXt-based human action recognition method built on multi-feature map fusion, which achieves higher accuracy than most state-of-the-art algorithms.

ASNet: Auto-Augmented Siamese Neural Network for Action Recognition

This framework backpropagates salient patches and randomly cropped samples in the same iteration to perform gradient compensation, alleviating the adverse gradient effects of non-informative samples.

ST-ABN: Visual Explanation Taking into Account Spatio-temporal Information for Video Recognition

Experimental results on the Something-Something V1 and V2 datasets demonstrate that ST-ABN enables visual explanation that takes spatial and temporal information into account simultaneously, while also improving recognition performance.

Shuffle-invariant Network for Action Recognition in Videos

This article proposes a novel action recognition method, the shuffle-invariant network, which adopts a multitask framework comprising one feature backbone network and three task branches: local critical feature shuffle-invariant learning, adversarial learning, and an action classification network.

STRNet: Triple-stream Spatiotemporal Relation Network for Action Recognition

This work proposes a novel architecture, the spatial temporal relation network (STRNet), which learns explicit information about appearance, motion, and especially temporal relations, and applies a flexible and effective strategy to fuse the complementary information from multiple pathways.

Discovering Dynamic Salient Regions for Spatio-Temporal Graph Neural Networks

This model learns nodes that dynamically attach to well-delimited salient regions, which are relevant for a higher-level task, without using any object-level supervision, and shows superior performance to previous graph neural networks models for video classification.

Motion-Driven Visual Tempo Learning for Video-Based Action Recognition

This work proposes a Temporal Correlation Module (TCM) that can be easily embedded into current action recognition backbones in a plug-and-play manner to extract action visual tempo from low-level, single-layer backbone features.

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

This work determines whether current video datasets have sufficient data for training very deep convolutional neural networks with spatio-temporal three-dimensional (3D) kernels, and argues that deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, stimulating advances in computer vision for videos.

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

This paper devises multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters in the spatial domain (equivalent to a 2D CNN) plus 3 × 1 × 1 convolutions that construct temporal connections across adjacent feature maps in time.
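The savings from this factorization can be illustrated with a short parameter-count comparison (a minimal sketch; the channel sizes below are illustrative assumptions, not values from the paper):

```python
# Compare parameter counts of a full 3D convolution against its
# Pseudo-3D factorization into a spatial (1x3x3) plus a temporal
# (3x1x1) convolution. Bias terms are omitted for simplicity.

def conv_params(c_in, c_out, kt, kh, kw):
    """Number of weights in a 3D convolution with kernel (kt, kh, kw)."""
    return c_in * c_out * kt * kh * kw

C_IN = C_OUT = 64  # illustrative channel sizes

full_3d = conv_params(C_IN, C_OUT, 3, 3, 3)        # full 3x3x3 kernel
factored = (conv_params(C_IN, C_OUT, 1, 3, 3)      # 1x3x3 spatial
            + conv_params(C_OUT, C_OUT, 3, 1, 1))  # 3x1x1 temporal

print(full_3d, factored)             # 110592 49152
print(round(factored / full_3d, 3))  # 0.444
```

The factorized block keeps the spatiotemporal receptive field of the 3×3×3 kernel while using roughly 12/27 of its parameters, which is why such decompositions train faster and overfit less on limited video data.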

MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition

A Mixed Convolutional Tube (MiCT) is proposed that integrates 2D CNNs with the 3D convolution module to generate deeper and more informative feature maps, while reducing training complexity in each round of spatio-temporal fusion.

Learning Spatiotemporal Features with 3D Convolutional Networks

The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

It is shown that many of the 3D convolutions can be replaced with low-cost 2D convolutions, suggesting that temporal representation learning on high-level “semantic” features is more useful.

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

This work explores a novel spatial-temporal network (StNet) architecture for both local and global spatial-temporal modeling in videos; it outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity.

Spatiotemporal Residual Networks for Video Action Recognition

The novel spatiotemporal ResNet is introduced and evaluated using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.

Spatio-Temporal Channel Correlation Networks for Action Classification

By fine-tuning this network, this work beats the performance of generic and recent 3D CNN methods that were trained on large video datasets and fine-tuned on the target datasets, e.g., HMDB51/UCF101 and Kinetics.