Corpus ID: 11797475

Two-Stream Convolutional Networks for Action Recognition in Videos

@inproceedings{Simonyan2014TwoStreamCN,
  title={Two-Stream Convolutional Networks for Action Recognition in Videos},
  author={Karen Simonyan and Andrew Zisserman},
  booktitle={NIPS},
  year={2014}
}
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used…
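The abstract names the key design but not its details. Below is a minimal, illustrative PyTorch sketch of the two-stream idea: a spatial stream that sees a single RGB frame and a temporal stream that sees a stack of dense optical-flow fields, with late fusion of class scores. The toy backbone, input sizes, 10-frame flow stack, and fusion by averaging softmax scores are assumptions for illustration, not the paper's exact configuration.

# A minimal PyTorch sketch of the two-stream idea (illustrative assumptions only:
# toy backbone, 224x224 inputs, a 10-frame flow stack, and fusion by averaging
# the two streams' softmax scores; this is not the paper's exact configuration).
import torch
import torch.nn as nn


def small_convnet(in_channels: int, num_classes: int) -> nn.Sequential:
    # Toy per-stream backbone standing in for the paper's deeper ConvNet.
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2, padding=3),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, stride=2, padding=2),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(256, num_classes),
    )


class TwoStreamNet(nn.Module):
    def __init__(self, num_classes: int = 101, flow_len: int = 10):
        super().__init__()
        # Spatial stream: a single RGB frame (3 channels).
        self.spatial = small_convnet(3, num_classes)
        # Temporal stream: a stack of L dense optical-flow fields
        # (2L channels: horizontal and vertical displacement per frame pair).
        self.temporal = small_convnet(2 * flow_len, num_classes)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # Late fusion by averaging the class scores from the two streams.
        return (self.spatial(rgb).softmax(1) + self.temporal(flow).softmax(1)) / 2


model = TwoStreamNet()
scores = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))  # (2, 101)

Averaging is only one possible late-fusion choice here; the per-stream scores could equally be combined by a separately trained classifier.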

Citations

Frame-skip Convolutional Neural Networks for action recognition
  • Yinan Liu, Q. Wu, L. Tang
  • 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2017
TLDR
A novel video dynamics mining strategy that takes advantage of motion tracking in the video is proposed; a frame-skip scheme is introduced to the ConvNets, which stack different modalities of optical flow to build a novel motion representation.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Action Recognition in Videos Using Multi-stream Convolutional Neural Networks
TLDR
A different pre-training procedure is developed for one of the streams, using visual rhythm images extracted from Kinetics, a large and challenging video dataset, with the aim of classifying trimmed videos based on the action being performed by one or more agents.
Fully convolutional networks for action recognition
TLDR
A novel two-stream fully convolutional network architecture for action recognition is designed that significantly reduces the number of parameters while maintaining performance, achieving state-of-the-art results on two challenging datasets.
Convolutional Two-Stream Network Fusion for Video Action Recognition
TLDR
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed and evaluated on standard benchmarks, where it achieves state-of-the-art results.
AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos
TLDR
The proposed pooling method is shown to consistently improve on baseline pooling methods, with both RGB- and optical-flow-based ConvNets, and in combination with complementary video representations.
Action Recognition Using Co-trained Deep Convolutional Neural Networks
TLDR
This work proposes a novel semi-supervised learning approach that allows multiple streams to supervise each other in a co-training strategy, so that training proceeds simultaneously in the two modalities, and demonstrates the effectiveness of the approach through extensive experiments on the UCF101 and HMDB datasets.
Pooling the Convolutional Layers in Deep ConvNets for Action Recognition
TLDR
Two state-of-the-art ConvNets, the very deep spatial net (VGGNet) and the temporal net from the Two-Stream ConvNet, are utilized for action representation, and a new frame-diff layer is proposed, with accuracy evaluated on the UCF101 and HMDB51 datasets.
Action segmentation and understanding in RGB videos with convolutional neural networks
TLDR
This work proposes three techniques for accelerating a modern action recognition pipeline and selects two of the reviewed deep learning works: a Convolutional Neural Network framework that uses a small number of video frames to obtain robust predictions, and new Graphics Processing Unit (GPU) video decoding software developed by NVIDIA.

References

Showing 1-10 of 37 references
3D Convolutional Neural Networks for Human Action Recognition
TLDR
A novel 3D CNN model for action recognition is developed that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
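A minimal PyTorch sketch of the 3D-convolution idea summarized in this TLDR: a single Conv3d filter slides over both space and time, so each response depends on several adjacent frames at once. The tensor sizes, channel counts, and 3x3x3 kernel below are illustrative assumptions, not the cited model's actual configuration.

# A minimal sketch of 3D convolution over a video clip (sizes are assumptions).
import torch
import torch.nn as nn

clip = torch.randn(4, 3, 16, 112, 112)  # batch, channels, frames, height, width

# The 3D filter spans space and time, so each response mixes adjacent frames.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
features = conv3d(clip)                         # (4, 64, 16, 112, 112)
pooled = nn.MaxPool3d(kernel_size=2)(features)  # (4, 64, 8, 56, 56)

# By contrast, a 2D convolution applied frame by frame cannot mix information
# across neighbouring frames, which is where the motion cue is lost.
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
per_frame = torch.stack([conv2d(clip[:, :, t]) for t in range(clip.shape[2])], dim=2)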
Large-Scale Video Classification with Convolutional Neural Networks
TLDR
This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis
TLDR
This paper presents an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data, and finds that this method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations.
Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice
Convolutional Learning of Spatio-temporal Features
TLDR
A model that learns latent representations of image sequences from pairs of successive images is introduced, allowing it to scale to realistic image sizes whilst using a compact parametrization.
Action Recognition with Stacked Fisher Vectors
TLDR
Experimental results demonstrate the effectiveness of SFV, and the combination of the traditional FV and SFV outperforms state-of-the-art methods on the evaluated datasets by a large margin.
Learning realistic human actions from movies
TLDR
A new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, is presented and shown to improve state-of-the-art results on the standard KTH action dataset.
Return of the Devil in the Details: Delving Deep into Convolutional Nets
TLDR
It is shown that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, resulting in an analogous performance boost, and that the dimensionality of the CNN output layer can be reduced significantly without an adverse effect on performance.
Deep Fisher Networks for Large-Scale Image Classification
TLDR
This paper proposes a version of the state-of-the-art Fisher vector image encoding that can be stacked in multiple layers; it significantly improves on standard Fisher vectors and obtains results competitive with deep convolutional networks at a smaller computational learning cost.
Deep Learning of Invariant Spatio-Temporal Features from Video
TLDR
The ST-DBN achieves superior performance on discriminative and generative tasks, including action recognition and video denoising, compared to convolutional deep belief networks (CDBNs) applied on a per-frame basis, and has superior feature-invariance properties.