Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

João Carreira and Andrew Zisserman
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. […] We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from…
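As a rough illustration of what "inflating" a 2D filter into 3D can look like, the sketch below expands a 2D convolution kernel along a new temporal axis. The repeat-and-rescale rule (copy the 2D weights along time, then divide by the temporal depth) and the name `inflate_2d_kernel` are assumptions made for this example; the excerpt above does not spell out the procedure, so treat this as a minimal sketch rather than the authors' implementation.

```python
import numpy as np


def inflate_2d_kernel(kernel_2d: np.ndarray, time_depth: int) -> np.ndarray:
    """Expand a 2D conv kernel (out, in, kh, kw) to 3D (out, in, t, kh, kw).

    Hypothetical helper: the weights are repeated `time_depth` times along a
    new temporal axis and divided by `time_depth` (an assumed bootstrapping
    rule), so a clip made of identical frames yields the same response as the
    original 2D filter applied to a single frame.
    """
    if kernel_2d.ndim != 4:
        raise ValueError("expected a kernel of shape (out, in, kh, kw)")
    if time_depth < 1:
        raise ValueError("time_depth must be at least 1")
    # Insert the temporal axis, then copy the 2D weights along it.
    kernel_3d = np.repeat(kernel_2d[:, :, np.newaxis, :, :], time_depth, axis=2)
    # Rescale so that summing over time recovers the original 2D weights.
    return kernel_3d / time_depth


if __name__ == "__main__":
    weights_2d = np.random.default_rng(0).normal(size=(8, 3, 7, 7))
    weights_3d = inflate_2d_kernel(weights_2d, time_depth=7)
    print(weights_3d.shape)
```

With this rescaling, summing the inflated kernel over its temporal axis gives back the 2D kernel, which is what allows pretrained image weights to serve as a starting point for the video model.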

Evaluating the Feasibility of Deep Learning for Action Recognition in Small Datasets

This work aims to check the real feasibility of employing deep learning in the context of small-sized action recognition datasets, and performs a thorough empirical analysis in which distinct network architectures with hyperparameter optimization are investigated, as well as different data pre-processing techniques and fusion methods.

A Comprehensive Study of Deep Video Action Recognition

A comprehensive survey of over 200 existing papers on deep learning for video action recognition is provided, starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models.

Improved two-stream model for human action recognition

This paper attempts to design and implement a new two-stream model by using an LSTM-based model in its spatial stream to extract both spatial and temporal features in RGB frames, and implements a DenseNet in the temporal stream to improve the recognition accuracy.

Multi-Task Learning of Generalizable Representations for Video Action Recognition

This work uses optical flow and RGB frames as auxiliary supervision, naming the resulting model Reversed Two-Stream Networks (Rev2Net), and constrains the discrepancy of the multi-task features in a self-supervised manner.

Temporal Segment Networks for Action Recognition in Videos

The proposed framework, called temporal segment network (TSN), aims to model long-range temporal structure with a new segment-based sampling and aggregation scheme; it won the video classification track at the ActivityNet challenge 2016 among 24 teams.

Is Appearance Free Action Recognition Possible?

This work motivates a novel architecture that revives explicit recovery of optical flow within a contemporary design for best performance on AFD and RGB, and empirically validates the resulting ES2-X3D design.

Enabling Detailed Action Recognition Evaluation Through Video Dataset Augmentation

The Human-centric Analysis Toolkit (HAT) is introduced, which enables evaluation of the learned background bias without the need for new manual video annotation; HAT is open-sourced so that the community can leverage its metrics to design more robust and generalizable human action recognition models.

Action Machine: Rethinking Action Recognition in Trimmed Videos

This work presents a conceptually simple, general and high-performance framework for action recognition in trimmed videos, aiming at person-centric modeling, and extends the Inflated 3D ConvNet by adding a branch for human pose estimation and a 2D CNN for pose-based action recognition.

VideoLightFormer: Lightweight Action Recognition using Transformers

This work proposes a novel, lightweight action recognition architecture, VideoLightFormer, which carefully extends the 2D convolutional Temporal Segment Network with transformers, while maintaining spatial and temporal video structure throughout the entire model.



Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Learning realistic human actions from movies

A new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids and multi-channel non-linear SVMs, is presented and shown to improve state-of-the-art results on the standard KTH action dataset.

The Kinetics Human Action Video Dataset

The dataset, its statistics, and how it was collected are described, and some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset are given.

Long-Term Temporal Convolutions for Action Recognition

It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields, and the importance of high-quality optical flow estimation for learning accurate action models are also examined.

Large-Scale Video Classification with Convolutional Neural Networks

This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.

Convolutional Two-Stream Network Fusion for Video Action Recognition

A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed and evaluated on standard benchmarks, where it achieves state-of-the-art results.

Beyond short snippets: Deep networks for video classification

This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

This work introduces UCF101, which is currently the largest dataset of human actions, and provides baseline action recognition results on this new dataset using a standard bag-of-words approach, with an overall performance of 44.5%.