Pose and Joint-Aware Action Recognition

Anshul B. Shah, Shlok Kumar Mishra, Ankan Bansal, Jun-Cheng Chen, Ramalingam Chellappa, Abhinav Shrivastava
2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Recent progress on action recognition has mainly focused on RGB and optical-flow features. In this paper, we approach the problem of joint-based action recognition. Unlike other modalities, the constellation of joints and their motion provides succinct human-motion information for activity recognition. We present a new model for joint-based action recognition, which first extracts motion features from each joint separately through a shared motion encoder before performing collective…

Pose-Appearance Relational Modeling for Video Action Recognition

A Pose-Appearance Relational Network (PARNet) is proposed, which models the correlation between human pose and image appearance and combines the benefits of these two modalities to improve robustness on unconstrained real-world videos.

Deep Learning-Enabled Multitask System for Exercise Recognition and Counting

A deep-learning multitask model for exercise recognition and repetition counting is proposed; it estimates human pose, identifies physical activities, and counts repeated motions, obtaining state-of-the-art results.

Distillation of human–object interaction contexts for action recognition

This paper proposes the Global-Local Interaction Distillation Network (GLIDN), which learns human and object interactions across space and time via knowledge distillation for holistic HOI understanding and outperforms baseline and counterpart approaches.

Learning Visual Representations for Transfer Learning by Suppressing Texture

This paper proposes to augment training with images whose texture has been suppressed using classic anisotropic-diffusion methods, addressing shortcomings of self-supervised learning, and suggests that this approach helps learn representations that transfer better.

STIT: Spatio-Temporal Interaction Transformers for Human-Object Interaction Recognition in Videos

This work proposes the Spatio-Temporal Interaction Transformer (STIT) network, which learns human and object context at each frame and then learns higher-level relations between spatial context representations at different time steps, capturing long-term dependencies across frames.

Skeleton Based Human Activity Prediction in Gait Thermal images using Siamese Networks

P. Srihari, J. Harikiran
2022 6th International Conference on Electronics, Communication and Aerospace Technology

This research work proposes a human activity recognition system using Siamese networks on gait skeleton thermal images, which achieves better accuracy than CNN+LSTM, LRCN, and Inflated 3D CNN.
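The Siamese idea above can be illustrated with a toy sketch: both inputs pass through one shared embedding, and a distance in embedding space indicates whether two skeleton sequences show the same activity. The embedding below (a weighted pooling of 2D joint coordinates) and the sample skeletons are hypothetical stand-ins for the paper's network branches and thermal-image data.

```python
def embed(skeleton, weights):
    """Toy shared embedding: weighted pooling of 2D joint coordinates.
    A real Siamese branch would be a learned network over the image."""
    return [sum(w * joint[i] for w, joint in zip(weights, skeleton))
            for i in range(2)]

def siamese_distance(skel_a, skel_b, weights):
    """Both inputs go through the SAME embedding (shared weights);
    similar activities should map to nearby points."""
    ea, eb = embed(skel_a, weights), embed(skel_b, weights)
    return sum((x - y) ** 2 for x, y in zip(ea, eb)) ** 0.5

weights = [0.5, 0.3, 0.2]                            # shared by both branches
walk_a = [(0.0, 1.0), (0.2, 0.8), (0.4, 0.6)]        # three joints, (x, y)
walk_b = [(0.1, 1.0), (0.2, 0.9), (0.4, 0.6)]        # similar walking pose
run    = [(1.0, 0.2), (1.2, 0.1), (1.5, 0.0)]        # dissimilar running pose

same = siamese_distance(walk_a, walk_b, weights)
diff = siamese_distance(walk_a, run, weights)
# The matching pair lands much closer in embedding space than the mismatch.
```

In training, a contrastive or triplet loss would pull matching pairs together and push mismatched pairs apart; here the shared weights alone already make the comparison consistent.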

Two-Branch Stacked Transformer for 2D Skeleton-based Action Recognition

A model is proposed that combines short action-snippets, which store meaningful information about human body transitions, with a deep network of two parallel Transformer branches that thoroughly learn the temporal correlation of skeletal representations in the upper and lower body parts, enabling the handling of partial occlusions in skeleton data and boosting HAR performance.

2D Skeleton-based Action Recognition Using Action-Snippets and Sequential Deep Learning

This paper proposes a discriminative representation of the action-snippet (i.e., a very short sequence) that captures meaningful characteristics of human pose and body transition, and employs deep sequential neural networks (DSNNs) to thoroughly learn the temporal relations of action-snippets across a whole sequence.

Action Recognition with Joints-Pooled 3D Deep Convolutional Descriptors

This paper proposes to incorporate joint positions with currently popular deep-learned features to form discriminative descriptors, and demonstrates that the resulting joints-pooled 3D deep convolutional descriptors (JDDs) are more effective and robust than the original 3D CNN features and other competing features.

Dynamic Motion Representation for Human Action Recognition

The experimental results show that training a convolutional neural network with the dynamic motion representation outperforms state-of-the-art action recognition models on the HMDB, JHMDB, UCF-101, and AVA datasets.

PA3D: Pose-Action 3D Machine for Video Recognition

This work proposes a concise Pose-Action 3D Machine (PA3D), which can effectively encode multiple pose modalities within a unified 3D framework, and consequently learn spatio-temporal pose representations for action recognition.

IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos

The main idea is to let the pose stream decide how much and which appearance information is used in integration based on whether the given pose information is reliable or not, and show that the proposed IntegralAction achieves highly robust performance across in-context and out-of-context action video datasets.
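The gating idea can be sketched in a few lines: a confidence signal from the pose stream controls how much the fused feature draws from pose versus appearance. The scalar `pose_confidence` and the sigmoid gate below are illustrative assumptions, not the paper's exact mechanism.

```python
import math

def sigmoid(x):
    """Squash a real-valued confidence score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def pose_gated_fusion(pose_feat, appearance_feat, pose_confidence):
    """Pose-driven gating (illustrative): the more reliable the pose
    estimate, the more the fused feature relies on the pose stream;
    unreliable pose shifts the weight toward appearance."""
    g = sigmoid(pose_confidence)
    return [g * p + (1.0 - g) * a for p, a in zip(pose_feat, appearance_feat)]

# With a highly confident pose estimate, the fused feature follows pose.
fused = pose_gated_fusion([1.0, 0.0], [0.0, 1.0], 5.0)
```

This is the general pattern behind reliability-aware fusion: the gate degrades gracefully to the appearance stream when pose estimation fails, which is what makes the method robust on out-of-context videos.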

Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection

This paper proposes a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images; it introduces a Markov chain model that adds these cues successively.

RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos

The proposed recurrent pose-attention network (RPAN) is an end-to-end recurrent network that exploits important spatio-temporal evolutions of human pose to assist action recognition in a unified framework, and it outperforms recent state-of-the-art methods on these challenging datasets.

P-CNN: Pose-Based CNN Features for Action Recognition

A new Pose-based Convolutional Neural Network descriptor (P-CNN) for action recognition that aggregates motion and appearance information along tracks of human body parts is proposed.
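The aggregation step behind part-based descriptors like this can be sketched as follows: for each body-part track, per-frame features are pooled over time with element-wise max and min, and the pooled vectors from all parts are concatenated. The `part_tracks` input layout (nested lists of per-frame feature vectors) is an assumption for illustration.

```python
def part_descriptor(part_tracks):
    """Build a fixed-length descriptor from variable-length tracks.
    Per body-part track, take the element-wise max and min over time,
    then concatenate across all parts (in the spirit of P-CNN's
    min/max aggregation of per-frame CNN features)."""
    descriptor = []
    for track in part_tracks:              # track: per-frame feature vectors
        dims = len(track[0])
        # Element-wise max over time for this part.
        descriptor.extend(max(frame[d] for frame in track) for d in range(dims))
        # Element-wise min over time for this part.
        descriptor.extend(min(frame[d] for frame in track) for d in range(dims))
    return descriptor

# One part, two frames, 2-D features: max gives [3, 2], min gives [1, 0].
desc = part_descriptor([[[1, 2], [3, 0]]])
```

The payoff is a fixed-length vector regardless of clip duration, so a standard linear classifier can be trained on top.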

Towards Understanding Action Recognition

It is found that high-level pose features greatly outperform low/mid level features, in particular, pose over time is critical, but current pose estimation algorithms are not yet reliable enough to provide this information.

2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning

It is shown that a single architecture can be used to solve the two problems efficiently while still achieving state-of-the-art results, and that end-to-end optimization leads to significantly higher accuracy than separate learning.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
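The late-fusion idea behind two-stream networks can be sketched in a few lines: each stream scores the action classes independently, and the per-class softmax outputs are averaged. The toy scores below stand in for the outputs of the spatial (RGB) and temporal (optical-flow) ConvNets; the original work also considers SVM-based fusion of the stacked scores.

```python
import math

def softmax(scores):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_two_stream(spatial_scores, temporal_scores):
    """Late fusion: average the per-class softmax outputs of the
    spatial (appearance) stream and the temporal (motion) stream."""
    p_spatial = softmax(spatial_scores)
    p_temporal = softmax(temporal_scores)
    return [(a + b) / 2.0 for a, b in zip(p_spatial, p_temporal)]

# Toy scores for three action classes from each stream.
spatial = [2.5, 0.5, 0.1]    # appearance cues strongly favour class 0
temporal = [0.2, 1.8, 0.3]   # motion cues favour class 1
fused = fuse_two_stream(spatial, temporal)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Averaging probabilities rather than raw scores keeps the two streams on a common scale, which is why this simple scheme works even when the streams are trained separately.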