Deep Local Video Feature for Action Recognition

@article{lan2017deep,
  title={Deep Local Video Feature for Action Recognition},
  author={Zhenzhong Lan and Yi Zhu and Alexander G. Hauptmann and S. Newsam},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year={2017}
}
  • Published 25 January 2017
We investigate the problem of representing an entire video using CNN features for human action recognition. End-to-end learning of CNN/RNNs is currently not possible for whole videos due to GPU memory limitations and so a common practice is to use sampled frames as inputs along with the video labels as supervision. However, the global video labels might not be suitable for all of the temporally local samples as the videos often contain content besides the action of interest. We therefore… 
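The common practice the abstract describes can be sketched as follows: frames are sampled from the video, and each sampled frame inherits the single video-level label as supervision. This is a minimal illustrative helper (the function name and parameters are assumptions, not from the paper), shown to make concrete why a global label may not fit every temporally local sample.

```python
def sample_frames_with_video_label(num_frames, video_label, num_samples=3):
    """Pair uniformly sampled frame indices with the video-level label.

    Hypothetical helper illustrating the practice the abstract
    critiques: every sampled frame inherits the single global label,
    even if the action is not visible at that timestamp."""
    # Evenly spaced indices across the whole video.
    step = (num_frames - 1) / (num_samples - 1)
    return [(round(i * step), video_label) for i in range(num_samples)]

pairs = sample_frames_with_video_label(num_frames=300, video_label="HighJump")
# A frame sampled mid-video gets the label "HighJump" even if it only
# shows the athlete walking to the runway.
```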


Hierarchical Temporal Pooling for Efficient Online Action Recognition

This study improves the accuracy and efficiency of two-stream ConvNet action recognition by investigating effective video-level representations; the resulting HTP-Net (RGB) offers competitive recognition accuracy while being approximately 1-2 orders of magnitude faster than other state-of-the-art single-stream action recognition methods.

Texture-Based Input Feature Selection for Action Recognition

A novel method is proposed to identify task-irrelevant input content that increases domain discrepancy; it outperforms existing action recognition models on the HMDB-51 and Penn Action datasets.

Hidden Two-Stream Convolutional Networks for Action Recognition

This paper presents a novel CNN architecture that implicitly captures motion information between adjacent frames and directly predicts action classes without explicitly computing optical flow, and significantly outperforms the previous best real-time approaches.

Evolution of Trajectories: A Novel Representation for Deep Action Recognition

A novel temporal representation captures a long-term interval of motion and integrates the trajectories of motion therein, surpassing current state-of-the-art approaches with 71.76% on HMDB51.

A Comprehensive Study of Deep Video Action Recognition

A comprehensive survey of over 200 existing papers on deep learning for video action recognition is provided, starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models.

End-to-end Video-level Representation Learning for Action Recognition

This paper builds upon two-stream ConvNets and proposes Deep networks with Temporal Pyramid Pooling (DTPP), an end-to-end video-level representation learning approach that addresses the problems of partial-observation training and single-temporal-scale modeling in action recognition.
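A temporal pyramid pooling layer of the kind the DTPP summary mentions can be sketched like this: frame features are max-pooled over progressively finer temporal partitions and the results concatenated into a fixed-length video descriptor. The level choices and pooling operator here are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def temporal_pyramid_pool(frame_features, levels=(1, 2, 4)):
    """Sketch of temporal pyramid pooling over per-frame features.

    frame_features: (T, D) array of T frame descriptors.
    Each pyramid level splits the timeline into that many contiguous
    chunks; each chunk is max-pooled, and all pooled vectors are
    concatenated into one fixed-length video-level descriptor."""
    pooled = []
    for level in levels:
        for chunk in np.array_split(frame_features, level, axis=0):
            pooled.append(chunk.max(axis=0))  # max-pool within the chunk
    return np.concatenate(pooled)

feats = np.random.rand(30, 128)            # 30 frames, 128-D features
video_desc = temporal_pyramid_pool(feats)  # 128 * (1 + 2 + 4) = 896-D
```

The descriptor length is independent of the number of frames, which is what makes such pooling useful for whole-video representation.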

Deeply-Supervised CNN Model for Action Recognition with Trainable Feature Aggregation

A deeply-supervised CNN model integrating a trainable feature aggregation module provides a promising solution for recognizing actions in videos and outperforms state-of-the-art methods.

3D Convolutional Two-Stream Network for Action Recognition in Videos

The proposed architecture follows the two-stream network, combining a novel 3D Convolutional Network (ConvNet) with a pyramid pooling layer to form an end-to-end feature learning method that preserves the complete temporal context of human actions in videos.

Video Understanding via Convolutional Temporal Pooling Network and Multimodal Feature Fusion

A new end-to-end convolutional neural network architecture for video classification is presented, together with a multimodal feature fusion model that concatenates video-level features with those provided in the challenge dataset.

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
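The core Temporal Segment Networks idea (sparse sampling of snippets across video segments, combined by a consensus function) can be sketched as follows; the averaging consensus and the made-up score values are illustrative assumptions, and the real method also involves softmax normalization and a separate optical-flow stream.

```python
import numpy as np

def tsn_consensus(snippet_scores):
    """Sketch of TSN's segmental consensus: the video is split into
    segments, one snippet is scored per segment, and a consensus
    (here, a simple average) over snippet scores yields the
    video-level prediction."""
    return np.mean(snippet_scores, axis=0)

# Three segments, hypothetical class scores over 4 action classes.
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.2, 0.6, 0.1, 0.1],
                   [0.1, 0.5, 0.3, 0.1]])
video_scores = tsn_consensus(scores)
predicted_class = int(np.argmax(video_scores))
```

Because only a few snippets are sampled per video, this avoids the GPU-memory problem of end-to-end training on whole videos that the abstract above describes.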

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
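The spatial and temporal networks of a two-stream architecture are typically combined by late fusion of their class scores. A minimal sketch, assuming weighted averaging of per-stream probabilities (the 1.5 temporal weight and the example numbers are assumptions, not values from the cited paper):

```python
import numpy as np

def fuse_two_streams(spatial_probs, temporal_probs, temporal_weight=1.5):
    """Sketch of two-stream late fusion: class probabilities from the
    appearance (RGB) stream and the motion (optical-flow) stream are
    combined by a weighted average and renormalized."""
    fused = spatial_probs + temporal_weight * temporal_probs
    return fused / fused.sum()  # renormalize to a distribution

spatial = np.array([0.6, 0.3, 0.1])   # hypothetical RGB-stream softmax
temporal = np.array([0.2, 0.7, 0.1])  # hypothetical flow-stream softmax
fused = fuse_two_streams(spatial, temporal)
```

Weighting the temporal stream more heavily reflects the finding quoted above that the optical-flow network performs very well despite limited training data.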

Deep Temporal Linear Encoding Networks

A new video representation, temporal linear encoding (TLE), is embedded inside CNNs as a new layer; it captures the appearance and motion throughout entire videos and outperforms current state-of-the-art methods on both datasets.

Efficient Large Scale Video Classification

This work proposes two models for frame-level and video-level classification: the former is a highly efficient mixture of experts, while the latter is based on long short-term memory neural networks.

Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

This work proposes a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos, and achieves very competitive performance on two popular and challenging benchmarks.

Large-Scale Video Classification with Convolutional Neural Networks

This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.

Convolutional Two-Stream Network Fusion for Video Action Recognition

A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance on standard benchmarks where this architecture achieves state-of-the-art results is evaluated.

Long-Term Temporal Convolutions for Action Recognition

It is demonstrated that LTC-CNN models with increased temporal extents improve action recognition accuracy; the study also examines the impact of different low-level representations, such as raw pixel values and optical flow vector fields, and the importance of high-quality optical flow estimation for learning accurate action models.

Exploiting Image-trained CNN Architectures for Unconstrained Video Classification

The proposed late fusion of CNN- and motion-based features further increases the mean average precision (mAP) on MED'14 from 34.95% to 38.74% and achieves state-of-the-art classification performance on the challenging UCF-101 dataset.

Action recognition with trajectory-pooled deep-convolutional descriptors

This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features, and achieves superior performance to the state of the art on these datasets.