Beyond short snippets: Deep networks for video classification

  • Joe Yue-Hei Ng, Matthew J. Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici
  • Published 30 March 2015
  • Computer Science
  • 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Convolutional neural networks (CNNs) have been extensively applied to image recognition problems, giving state-of-the-art results on recognition, detection, segmentation, and retrieval. In this work we propose and evaluate several deep neural network architectures that combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full-length videos. The first method explores various convolutional temporal feature pooling…
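The first method the abstract names, convolutional temporal feature pooling, can be sketched at its simplest: run an image CNN on every frame, then collapse the time axis with an order-invariant pooling operation such as element-wise max, yielding one fixed-length descriptor per video regardless of clip length. A minimal NumPy sketch, where the shapes and the random "CNN features" are purely illustrative stand-ins:

```python
import numpy as np

# Hypothetical per-frame features: T frames, each a D-dimensional
# descriptor (e.g. the last-layer activations of a pretrained image CNN).
T, D = 120, 2048
rng = np.random.default_rng(0)
frame_features = rng.standard_normal((T, D)).astype(np.float32)

def temporal_max_pool(features: np.ndarray) -> np.ndarray:
    """Collapse the time axis with an element-wise max, producing a
    single fixed-length video descriptor for any number of frames."""
    return features.max(axis=0)

video_descriptor = temporal_max_pool(frame_features)
print(video_descriptor.shape)  # (2048,)
```

Because the max is taken independently per feature dimension, the pooled descriptor has the same length whether the clip has 30 frames or 300, which is what makes this kind of pooling suitable for full-length videos.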

Multiscale Deep Alternative Neural Network for Large-Scale Video Classification
The multiscale deep alternative neural network (DANN), a novel architecture combining the strengths of convolutional and recurrent neural networks to build a deep network that can collect rich context hierarchies for video classification, is introduced.
Combining Very Deep Convolutional Neural Networks and Recurrent Neural Networks for Video Classification
Experimental results show that the network architecture using local features extracted by the pre-trained CNN and ConvLSTM for making use of temporal information can achieve the best accuracy in video classification.
Fully convolutional networks for action recognition
A novel two-stream fully convolutional network architecture for action recognition is designed that significantly reduces parameter count while preserving performance, achieving state-of-the-art results on two challenging datasets.
A Study on the use of State-of-the-Art CNNs with Fine Tuning for Spatial Stream Generation for Activity Recognition
  • M. Ranjit, G. Ganapathy
  • Computer Science
    2019 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT)
  • 2019
This paper studies how state-of-the-art networks such as ResNet50, InceptionV3, and MobileNet perform, with fine-tuning, as spatial feature extractors for activity recognition in videos using LRCN with stacked LSTMs.
Fusing Multi-Stream Deep Networks for Video Classification
A multi-stream framework is proposed to fully utilize the rich multimodal information in videos and it is demonstrated that the adaptive fusion method using the class relationship as a regularizer outperforms traditional alternatives that estimate the weights in a "free" fashion.
Exploiting Image-trained CNN Architectures for Unconstrained Video Classification
The proposed late fusion of CNN- and motion-based features can further increase the mean average precision (mAP) on MED'14 from 34.95% to 38.74% and achieves the state-of-the-art classification performance on the challenging UCF-101 dataset.
Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos
This work proposes a new enhancement to Convolutional LSTM networks that supports accommodation of multiple convolutional kernels and layers, and proposes an attention-based mechanism that is specifically designed for the multi-kernel extension.
Dense Convolutional Networks for Efficient Video Analysis
A lightweight network architecture framework for learning spatiotemporal features from video that merges long-term content into any network feature map, keeping the model as small and as fast as possible while maintaining accuracy.
Recurrent Residual Module for Fast Inference in Videos
This work proposes a framework called Recurrent Residual Module (RRM) to accelerate the CNN inference for video recognition tasks, which has a novel design of using the similarity of the intermediate feature maps of two consecutive frames to largely reduce the redundant computation.

References

Large-Scale Video Classification with Convolutional Neural Networks
This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up training.
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
ImageNet classification with deep convolutional neural networks
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Going deeper with convolutions
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
3D Convolutional Neural Networks for Human Action Recognition
A novel 3D CNN model for action recognition is developed that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
Speech recognition with deep recurrent neural networks
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Visualizing and Understanding Convolutional Networks
A novel visualization technique is introduced that gives insight into the function of intermediate feature layers and the operation of the classifier in large Convolutional Network models, used in a diagnostic role to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
Action Classification in Soccer Videos with Long Short-Term Memory Recurrent Neural Networks
Experimental results show that the proposed approach for action classification in soccer videos outperforms classification methods of related works, and that the combination of the two features (BoW and dominant motion) leads to a classification rate of 92%.
HMDB: A large video database for human motion recognition
This paper uses the largest action video database to date, with 51 action categories comprising around 7,000 manually annotated clips extracted from sources ranging from digitized movies to YouTube, to evaluate the performance of two representative computer vision systems for action recognition and to explore their robustness under various conditions.
Sequential Deep Learning for Human Action Recognition
A fully automated deep model, which learns to classify human actions without using any prior knowledge is proposed, which outperforms existing deep models, and gives comparable results with the best related works.