Learning Spatiotemporal Features with 3D Convolutional Networks

Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri. 2015 IEEE International Conference on Computer Vision (ICCV).

We propose a simple yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large-scale supervised video dataset. […] Finally, the learned features are conceptually very simple and easy to train and use.

Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition

The 3D ResNets trained on Kinetics did not suffer from overfitting despite the model's large number of parameters, and achieved better performance than relatively shallow networks such as C3D.

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

This paper devises multiple variants of bottleneck building blocks in a residual learning framework, simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters in the spatial domain (equivalent to a 2D CNN) plus 3 × 1 × 1 convolutions that construct temporal connections across adjacent feature maps in time.
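The parameter saving behind this factorization can be sketched with a quick count; the channel sizes below (64 in, 64 out) are illustrative assumptions, not values from the paper:

```python
# Hypothetical sketch of the pseudo-3D parameter saving: one full 3x3x3 kernel
# versus the factorized 1x3x3 (spatial) + 3x1x1 (temporal) pair.
def conv3d_params(kt, kh, kw, c_in, c_out):
    # weight count of a kt x kh x kw 3D convolution (bias omitted)
    return kt * kh * kw * c_in * c_out

full_3d = conv3d_params(3, 3, 3, 64, 64)                              # 110592
pseudo_3d = conv3d_params(1, 3, 3, 64, 64) + conv3d_params(3, 1, 1, 64, 64)  # 49152
print(full_3d, pseudo_3d)  # the factorized pair uses fewer than half the weights
```

A side benefit of the spatial/temporal split is that the 1 × 3 × 3 filters can be initialized from 2D image-pretrained weights.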

Optimizing Spatiotemporal Feature Learning in 3D Convolutional Neural Networks With Pooling Blocks

This work proposes the Pooling Block (PB) as an enhanced pooling operation for optimizing action recognition with 3D CNNs, yielding a significant improvement in 3D CNN performance with a comparatively small increase in the number of trainable parameters.

Using efficient group pseudo-3D network to learn spatio-temporal features

This paper proposes an efficient group pseudo-3D (GP3D) convolution that reduces model size and requires less computational power, and builds GP3D on MobileNetV3 to extend 2D pre-trained parameters directly to the 3D convolutional network.

Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification

By fine-tuning this network, the proposed video convolutional network T3D outperforms generic and recent 3D CNN methods that were trained on large video datasets and fine-tuned on target datasets such as HMDB51/UCF101.

3D CNNs with Adaptive Temporal Feature Resolutions

This work introduces a differentiable Similarity Guided Sampling (SGS) module, which can be plugged into any existing 3D CNN architecture, and improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy.

Spatiotemporal Pyramid Pooling in 3D Convolutional Neural Networks for Action Recognition

A new network architecture, called STPP-net, is developed that outperforms the original 3D ConvNets by a large margin on three large-scale video classification/action recognition benchmarks including HMDB51, UCF101, and Kinetics.

Semi-CNN Architecture for Effective Spatio-Temporal Learning in Action Recognition

The empirical results on the action recognition dataset UCF-101 demonstrate that the fusion of 1D, 2D, and 3D convolutions outperforms a 3D model of the same depth, with fewer parameters and reduced overfitting.

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

This work determines whether current video datasets have sufficient data for training very deep convolutional neural networks with spatio-temporal three-dimensional (3D) kernels, and argues that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet and stimulate advances in computer vision for videos.

Resource Efficient 3D Convolutional Neural Networks

The results of this study show that these models can be utilized for different types of real-world applications since they provide real-time performance with considerable accuracies and memory usage.

Large-Scale Video Classification with Convolutional Neural Networks

This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up training.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.

3D Convolutional Neural Networks for Human Action Recognition

A novel 3D CNN model for action recognition is developed that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
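As a minimal illustration of such a 3D convolution (a naive single-channel sketch, not the paper's implementation), note how the kernel slides over adjacent frames as well as spatial positions:

```python
import numpy as np

# Naive single-channel 3D convolution over a (frames, height, width) clip,
# illustrating how a 3x3x3 kernel aggregates information across adjacent
# frames as well as across space. Real networks vectorize this, of course.
def conv3d(clip, kernel):
    t, h, w = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i + kt, j:j + kh, k:k + kw] * kernel)
    return out

clip = np.random.rand(8, 16, 16)   # 8 frames of 16x16 pixels
kernel = np.random.rand(3, 3, 3)
print(conv3d(clip, kernel).shape)  # (6, 14, 14): the temporal extent shrinks too
```

Unlike a 2D convolution applied frame by frame, each output value here depends on three consecutive frames, which is what lets the filter respond to motion.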

Learning Human Pose Estimation Features with Convolutional Networks

This paper introduces a new architecture for human pose estimation based on a multi-layer convolutional network and a modified learning technique that learns low-level features.

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

DeCAF, an open-source implementation of deep convolutional activation features, along with all associated network parameters, is released to enable vision researchers to conduct experiments with deep representations across a range of visual concept learning paradigms.

Very Deep Convolutional Networks for Large-Scale Image Recognition

This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis

This paper presents an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data, and finds that the method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations.

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.

Beyond short snippets: Deep networks for video classification

This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.

ImageNet classification with deep convolutional neural networks

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.