Multi-Modal Three-Stream Network for Action Recognition

Muhammad Usman Khalid and Jie Yu
2018 24th International Conference on Pattern Recognition (ICPR)
Human action recognition in video is an active yet challenging research topic due to the high variation and complexity of the data. In this paper, a novel video-based action recognition framework utilizing complementary cues is proposed to handle this complex problem. Inspired by the successful two-stream networks for action classification, additional pose features are studied and fused to enhance understanding of human action in a more abstract and semantic way. Towards practices, not only ground…
Three-Stream Graph Convolutional Networks for Zero-Shot Action Recognition
  • Na Wu, K. Kawamoto
  • Computer Science
  • 2020 Joint 11th International Conference on Soft Computing and Intelligent Systems and 21st International Symposium on Advanced Intelligent Systems (SCIS-ISIS)
  • 2020
This paper proposes a three-stream graph convolutional network that processes both video RGB image data and skeleton data of the human body and shows that the model provides better accuracy than a baseline model.
Zero-Shot Action Recognition with Three-Stream Graph Convolutional Networks †
This paper proposes a three-stream graph convolutional network that processes both RGB and skeleton data and predicts the final results for ZSAR, showing that the model can learn from human experience, which makes it more accurate.
CoReHAR: A Hybrid Deep Network for Video Action Recognition
CoReHAR is proposed, a novel Human Action Recognition method that employs both deep convolutional and recurrent neural networks on raw video frames; compared to state-of-the-art methods, it shows a considerable efficiency increase.
Improving Deep Learning Approaches for Human Activity Recognition based on Natural Language Processing of Action Labels
This paper shows that the information contained in the label descriptions of action classes (action labels) can be exploited to extract information about their similarity, which can be used to steer the learning process and improve activity recognition performance.
A Novel Parameter Initialization Technique Using RBM-NN for Human Action Recognition
This paper proposes a novel parameter initialization technique using the Maxout activation function that shows an improved recognition rate when compared to other state-of-the-art learning models.
Weighted voting of multi-stream convolutional neural networks for video-based action recognition using optical flow rhythms
A multi-stream architecture based on the weighted voting of convolutional neural networks is proposed to deal with the problem of recognizing human actions in videos, introducing a new stream, Optical Flow Rhythm, alongside other streams for diversity.
Beyond 2D: Fusion of Monocular 3D Pose, Motion and Appearance for Human Action Recognition
  • Wei Lin, Jie Yu
  • Computer Science
  • 2019 22nd International Conference on Information Fusion (FUSION)
  • 2019
It is shown that the proposed three-stream fusion of 3D pose, motion and appearance outperforms state-of-the-art methods on the sub-JHMDB, Penn Action and NTU RGB+D datasets.
Pose Guided Dynamic Image Network for Human Action Recognition in Person Centric Videos
An attempt is made to explore pose estimation and video representation using dynamic images, serving the dual purpose of preserving privacy and decreasing the network load when transferring videos for analysis.
Inferring Tasks and Fluents in Videos by Learning Causal Relations
A novel model is proposed to jointly infer object fluents and complex tasks in videos, and a structural SVM framework is adopted to jointly train the task, fluent, cause, and effect parameters.
From Human Pose to On-Body Devices for Human-Activity Recognition
This paper proposes to fine-tune deep architectures, trained using sequences of human poses from a large dataset and their derivatives, for solving HAR on inertial measurements from on-body devices.


P-CNN: Pose-Based CNN Features for Action Recognition
A new Pose-based Convolutional Neural Network descriptor (P-CNN) for action recognition that aggregates motion and appearance information along tracks of human body parts is proposed.
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Joint action recognition and pose estimation from video
A spatial-temporal And-Or graph model is introduced to represent action at three scales and achieves state-of-the-art accuracy in action recognition while also improving pose estimation.
Pose-conditioned Spatio-Temporal Attention for Human Action Recognition
It is shown that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself, and a temporal attention mechanism learns how to fuse LSTM features over time.
Towards Understanding Action Recognition
It is found that high-level pose features greatly outperform low/mid level features; in particular, pose over time is critical, but current pose estimation algorithms are not yet reliable enough to provide this information.
Cross-View Action Modeling, Learning, and Recognition
A novel multi-view spatio-temporal and-or graph (MST-AOG) representation for cross-view action recognition, which takes advantage of the 3D human skeleton data obtained from Kinect cameras to avoid annotating enormous multi-view video frames; the recognition itself does not need 3D information and is based on 2D video input.
Convolutional Two-Stream Network Fusion for Video Action Recognition
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance is evaluated on standard benchmarks, where this architecture achieves state-of-the-art results.
An Approach to Pose-Based Action Recognition
This work improves a state-of-the-art method for estimating human joint locations from videos by incorporating additional segmentation cues and temporal constraints to select the "best" estimate, localizing body joints more accurately than existing methods.
From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
A novel approach for analyzing human actions in non-scripted, unconstrained video settings is presented, based on volumetric, x-y-t patch classifiers termed actemes, showing significant improvement over state-of-the-art low-level features while providing spatiotemporal localization as additional output, which sheds further light on detailed action understanding.