Deep Local Video Feature for Action Recognition
@article{Lan2017DeepLV, title={Deep Local Video Feature for Action Recognition}, author={Zhenzhong Lan and Yi Zhu and Alexander G. Hauptmann and S. Newsam}, journal={2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)}, year={2017}, pages={1219-1225} }
We investigate the problem of representing an entire video using CNN features for human action recognition. End-to-end learning of CNN/RNNs is currently not possible for whole videos due to GPU memory limitations and so a common practice is to use sampled frames as inputs along with the video labels as supervision. However, the global video labels might not be suitable for all of the temporally local samples as the videos often contain content besides the action of interest. We therefore…
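The common practice the abstract describes can be sketched as follows. This is a minimal, assumed PyTorch-style example, not the paper's actual pipeline (the backbone, frame count, and class count are placeholders): every sampled frame is supervised with the same global video label, which is exactly the mismatch the paper targets.

```python
import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    """Toy 2D CNN standing in for a full backbone such as a ResNet."""
    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(16, num_classes)

    def forward(self, frames):                    # frames: (B, 3, H, W)
        x = self.features(frames).flatten(1)      # (B, 16)
        return self.fc(x)                         # (B, num_classes)

def train_step(model, video, video_label, optimizer, num_samples=3):
    """Sample a few frames and supervise each one with the global video label."""
    t = video.shape[0]                            # video: (T, 3, H, W)
    idx = torch.randint(0, t, (num_samples,))
    logits = model(video[idx])                    # one prediction per sampled frame
    labels = torch.full((num_samples,), video_label, dtype=torch.long)
    loss = nn.functional.cross_entropy(logits, labels)  # same global label for every frame
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = FrameClassifier()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss = train_step(model, torch.randn(40, 3, 224, 224), video_label=5, optimizer=opt)
```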
95 Citations
Hierarchical Temporal Pooling for Efficient Online Action Recognition
- Computer Science, MMM
- 2019
This study improves the accuracy and efficiency of action recognition within the two-stream ConvNets framework by investigating effective video-level representations; the resulting HTP-Net (RGB) offers competitive recognition accuracy while being approximately 1-2 orders of magnitude faster than other state-of-the-art single-stream action recognition methods.
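For illustration, a hedged sketch of hierarchical temporal pooling over per-frame features (the actual HTP-Net design may differ): per-frame descriptors are repeatedly max-pooled in adjacent pairs until a single video-level feature remains.

```python
import torch
import torch.nn.functional as F

def hierarchical_temporal_pool(frame_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats: (T, D) per-frame CNN features -> (D,) video descriptor."""
    x = frame_feats.t().unsqueeze(0)           # (1, D, T) for 1-D pooling
    while x.shape[-1] > 1:
        if x.shape[-1] % 2 == 1:               # pad odd lengths so pairs line up
            x = F.pad(x, (0, 1), value=float("-inf"))
        x = F.max_pool1d(x, kernel_size=2)     # halve the temporal dimension
    return x.squeeze()                         # (D,)

feats = torch.randn(7, 512)                    # e.g. 7 sampled frames, 512-D each
video_feat = hierarchical_temporal_pool(feats) # (512,)
```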
Texture-Based Input Feature Selection for Action Recognition
- Computer Science
- 2023
A novel method is proposed to identify task-irrelevant input content that increases domain discrepancy; it outperforms existing action recognition models on the HMDB-51 and Penn Action datasets.
Hidden Two-Stream Convolutional Networks for Action Recognition
- Computer Science, ACCV
- 2018
This paper presents a novel CNN architecture that implicitly captures motion information between adjacent frames and directly predicts action classes without explicitly computing optical flow, and significantly outperforms the previous best real-time approaches.
Evolution of Trajectories: A Novel Representation for Deep Action Recognition
- Computer Science, ACM Multimedia
- 2017
A novel temporal representation that captures a long-term interval of motion and integrates the motion trajectories within it, surpassing the current state-of-the-art approaches with 71.76% on HMDB51.
A Comprehensive Study of Deep Video Action Recognition
- Computer Science, ArXiv
- 2020
A comprehensive survey of over 200 existing papers on deep learning for video action recognition is provided, starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models.
End-to-end Video-level Representation Learning for Action Recognition
- Computer Science, 2018 24th International Conference on Pattern Recognition (ICPR)
- 2018
This paper builds upon two-stream ConvNets and proposes Deep networks with Temporal Pyramid Pooling (DTPP), an end-to-end video-level representation learning approach, to address problems of partial observation training and single temporal scale modeling in action recognition.
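A minimal sketch of temporal pyramid pooling over frame-level features, in the spirit of DTPP (the pyramid levels and the use of max pooling here are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def temporal_pyramid_pool(frame_feats: torch.Tensor, levels=(1, 2, 4)) -> torch.Tensor:
    """frame_feats: (T, D) -> fixed-length video descriptor of size D * sum(levels)."""
    x = frame_feats.t().unsqueeze(0)                    # (1, D, T)
    pooled = []
    for bins in levels:
        p = F.adaptive_max_pool1d(x, output_size=bins)  # (1, D, bins)
        pooled.append(p.flatten(1))                     # (1, D * bins)
    return torch.cat(pooled, dim=1).squeeze(0)          # (D * sum(levels),)

video_feat = temporal_pyramid_pool(torch.randn(25, 256))  # -> (1792,)
```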
Deeply-Supervised CNN Model for Action Recognition with Trainable Feature Aggregation
- Computer Science, IJCAI
- 2018
A deeply-supervised CNN model with a trainable feature aggregation module provides a promising solution for recognizing actions in videos and outperforms state-of-the-art methods.
3D Convolutional Two-Stream Network for Action Recognition in Videos
- Computer Science, 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI)
- 2019
The proposed architecture follows the two-stream design, combining a novel 3D Convolutional Network (ConvNet) with a pyramid pooling layer to learn behavioral features end-to-end while preserving the complete temporal context of human actions in videos.
Video Understanding via Convolutional Temporal Pooling Network and Multimodal Feature Fusion
- Computer Science, CoVieW@MM
- 2018
A new end-to-end convolutional neural network architecture for video classification is presented, along with a multimodal feature fusion model that concatenates the learned video-level features with those provided in the challenge dataset.
Multi-teacher knowledge distillation for compressed video action recognition based on deep learning
- Computer Science, J. Syst. Archit.
- 2020
28 References
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
- Computer Science, ECCV
- 2016
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.…
Two-Stream Convolutional Networks for Action Recognition in Videos
- Computer Science, NIPS
- 2014
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
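The two-stream idea can be sketched as follows, with tiny placeholder backbones and softmax-score averaging standing in for the fusion used in practice (the real networks and the number of stacked flow fields differ):

```python
import torch
import torch.nn as nn

def tiny_net(in_channels: int, num_classes: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, num_classes),
    )

num_classes = 101
spatial_net = tiny_net(3, num_classes)          # single RGB frame
temporal_net = tiny_net(2 * 10, num_classes)    # 10 flow fields, x/y channels stacked

rgb = torch.randn(1, 3, 224, 224)
flow_stack = torch.randn(1, 20, 224, 224)

# Late fusion: average the softmax scores of the two streams.
scores = (spatial_net(rgb).softmax(dim=1) + temporal_net(flow_stack).softmax(dim=1)) / 2
pred = scores.argmax(dim=1)
```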
Deep Temporal Linear Encoding Networks
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
A new video representation, called temporal linear encoding (TLE) and embedded inside of CNNs as a new layer, which captures the appearance and motion throughout entire videos, and outperforms current state-of-the-art methods on both datasets.
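A very rough sketch of the aggregation step: per-segment features are combined with an element-wise product and then encoded. The real TLE layer uses bilinear models; a plain linear projection is substituted here purely for illustration.

```python
import torch
import torch.nn as nn

class TLEHead(nn.Module):
    def __init__(self, feat_dim: int = 512, num_classes: int = 101):
        super().__init__()
        self.encode = nn.Linear(feat_dim, feat_dim)   # stand-in for bilinear encoding
        self.classify = nn.Linear(feat_dim, num_classes)

    def forward(self, segment_feats: torch.Tensor) -> torch.Tensor:
        # segment_feats: (B, S, D) features from S video segments
        agg = torch.prod(segment_feats, dim=1)        # element-wise product over segments
        enc = torch.relu(self.encode(agg))
        return self.classify(enc)

logits = TLEHead()(torch.randn(4, 3, 512))            # 4 videos, 3 segments each
```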
Efficient Large Scale Video Classification
- Computer Science, ArXiv
- 2015
This work proposes two models for frame-level and video-level classification: the former is a highly efficient mixture of experts, while the latter is based on long short-term memory (LSTM) neural networks.
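A minimal sketch of the video-level idea, assuming per-frame CNN features fed to an LSTM whose final hidden state is classified (a simplification of the paper's model):

```python
import torch
import torch.nn as nn

class FrameLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):            # (B, T, feat_dim) per-frame features
        _, (h_n, _) = self.lstm(frame_feats)   # h_n: (1, B, hidden)
        return self.fc(h_n[-1])                # (B, num_classes)

logits = FrameLSTMClassifier()(torch.randn(2, 30, 512))   # 2 videos, 30 frames each
```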
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
- Computer Science, ACM Multimedia
- 2015
This work proposes a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos, and achieves very competitive performance on two popular and challenging benchmarks.
Large-Scale Video Classification with Convolutional Neural Networks
- Computer Science, 2014 IEEE Conference on Computer Vision and Pattern Recognition
- 2014
This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Convolutional Two-Stream Network Fusion for Video Action Recognition
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed and evaluated on standard benchmarks, where it achieves state-of-the-art results.
Long-Term Temporal Convolutions for Action Recognition
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2018
It is demonstrated that LTC-CNN models with increased temporal extents improve action recognition accuracy; the work also studies the impact of different low-level representations, such as raw pixel values and optical flow vector fields, and the importance of high-quality optical flow estimation for learning accurate action models.
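An illustrative sketch of long-term temporal convolutions: 3D convolutions applied to a long clip so the receptive field spans an extended temporal extent. Layer sizes below are assumptions, not those of the LTC-CNN in the paper.

```python
import torch
import torch.nn as nn

ltc = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # input: (B, C, T, H, W)
    nn.MaxPool3d(kernel_size=2),
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(32, 101),
)

clip = torch.randn(1, 3, 60, 58, 58)    # 60-frame clip: longer temporal extent than short 16-frame inputs
logits = ltc(clip)                      # (1, 101)
```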
Exploiting Image-trained CNN Architectures for Unconstrained Video Classification
- Computer Science, BMVC
- 2015
The proposed late fusion of CNN- and motion-based features can further increase the mean average precision (mAP) on MED'14 from 34.95% to 38.74% and achieves the state-of-the-art classification performance on the challenging UCF-101 dataset.
Action recognition with trajectory-pooled deep-convolutional descriptors
- Computer Science, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2015
This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features, and achieves superior performance to the state of the art on these datasets.
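A rough sketch of the trajectory-pooling idea: conv feature maps are sampled at the points of a pre-computed trajectory and pooled into a single descriptor (TDD's normalization schemes and multi-scale details are omitted, and the toy trajectory below is purely illustrative).

```python
import torch

def trajectory_pooled_descriptor(feat_maps: torch.Tensor, trajectory) -> torch.Tensor:
    """feat_maps: (T, C, H, W) conv feature maps; trajectory: list of (t, y, x) map coordinates."""
    samples = torch.stack([feat_maps[t, :, y, x] for t, y, x in trajectory])  # (L, C)
    return samples.mean(dim=0)                                               # (C,) descriptor

maps = torch.randn(15, 64, 28, 28)
traj = [(t, 10 + t % 3, 12) for t in range(15)]         # toy trajectory across 15 frames
tdd = trajectory_pooled_descriptor(maps, traj)          # (64,)
```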