Quo Vadis, Skeleton Action Recognition?

@article{gupta_quovadis_ijcv,
  title={Quo Vadis, Skeleton Action Recognition?},
  author={Pranay Gupta and Anirudh Thatipelli and Aditya Aggarwal and Shubhanshu Maheshwari and Neel Trivedi and Sourav Das and Ravi Kiran Sarvadevabhatla},
  journal={International Journal of Computer Vision},
  pages={2097--2112}
}
In this paper, we study current and upcoming frontiers across the landscape of skeleton-based human action recognition. To study skeleton-action recognition in the wild, we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. We extend our study to include out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. We also introduce Metaphorics, a… 

NTU60-X: Towards Skeleton-based Recognition of Subtle Human Actions

The effectiveness of NTU60-X in overcoming the aforementioned bottleneck and improving state-of-the-art performance, both overall and on the hitherto worst-performing action categories, is demonstrated.

The AMIRO Social Robotics Framework: Deployment and Evaluation on the Pepper Robot

The AMIRO social robotics framework is presented, designed in a modular and robust way for assistive care scenarios, and includes robotic services for navigation, person detection and recognition, multi-lingual natural language interaction and dialogue management, as well as activity recognition and general behavior composition.

Graph Laplacian-Improved Convolutional Residual Autoencoder for Unsupervised Human Action and Emotion Recognition

A convolutional residual autoencoder that models the skeletal geometry across the temporal dynamics of the data without relying on computationally expensive recurrent architectures and implements a Graph Laplacian Regularization leveraging upon the implicit skeleton joints connectivity, further improving the robustness of the feature embeddings learned without using action or emotion labels.
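The graph Laplacian regularization idea can be illustrated with a short sketch (not the paper's implementation; the 5-joint chain skeleton, embedding size, and all names are illustrative assumptions): the penalty tr(Zᵀ L Z) pulls the embeddings of physically connected joints towards each other.

```python
import numpy as np

# Hypothetical 5-joint chain skeleton; A[i, j] = 1 if joints i and j are connected.
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # combinatorial graph Laplacian

def laplacian_penalty(Z, L):
    # tr(Z^T L Z) = 0.5 * sum over edges of ||z_i - z_j||^2:
    # small when connected joints get similar embeddings.
    return float(np.trace(Z.T @ L @ Z))

Z = np.random.default_rng(0).normal(size=(5, 8))  # one 8-d embedding per joint
reg = laplacian_penalty(Z, L)                     # would be added to the reconstruction loss
```

In a full autoencoder this term is weighted and summed with the reconstruction loss, so no action or emotion labels are needed.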

ConfLab: A Data Collection Concept, Dataset, and Benchmark for Machine Analysis of Free-Standing Social Interactions in the Wild

The Conference Living Lab (ConfLab) is proposed, a new concept for multimodal, multisensor data collection of in-the-wild free-standing social conversations, and its benchmarks showcase some of the open research tasks related to in-the-wild privacy-preserving social data analysis.

Skeleton-based Action Recognition in Non-contextual, In-the-wild and Dense Joint Scenarios

This thesis introduces two new pose-based human action recognition datasets, NTU60-X and NTU120-X, which extend the largest existing action recognition dataset, NTU-RGBD, and appropriately modifies the state-of-the-art approaches to enable training on the introduced datasets.

Holistic Interaction Transformer Network for Action Detection

This paper proposes a novel multi-modal HIT network that leverages the largely ignored, but critical hand and pose information essential to most human actions and significantly outperforms previous approaches on the J-HMDB, UCF101-24, and MultiSports datasets.

Learning from the Best: Contrastive Representations Learning Across Sensor Locations for Wearable Activity Recognition

This work proposes a method that, through a contrastive loss combined with the classification loss during joint training, exploits information from sensors that are only present during the training process and are unavailable during later use of the system.
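A minimal sketch of the idea, assuming an InfoNCE-style contrastive term and a simple weighted sum with cross-entropy (the function names, temperature, and weighting are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(za, zb, tau=0.1):
    # InfoNCE-style: for each time window, the embedding from the test-time sensor (za)
    # should be most similar to the same window's embedding from the
    # training-only sensor (zb) -- the positive pair on the diagonal.
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau
    idx = np.arange(len(za))
    return float(-np.log(softmax(logits, axis=1)[idx, idx]).mean())

def classification_loss(logits, labels):
    # Standard cross-entropy on the activity-classifier head.
    p = softmax(logits, axis=1)
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())

def joint_loss(za, zb, logits, labels, lam=0.5):
    # lam balances the two objectives; the paper's exact weighting is not assumed here.
    return classification_loss(logits, labels) + lam * contrastive_loss(za, zb)
```

At deployment only the branch producing `za` and the classifier head are needed; the training-only sensor contributes solely through the contrastive term.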

Rank-GCN for Robust Action Recognition

A robust skeleton-based action recognition method with a graph convolutional network (GCN) built on a new adjacency matrix, called Rank-GCN, which achieves not only performance improvements but also robustness against swapping, location shifting, and dropping of certain nodes.

Cross-Skeleton Interaction Graph Aggregation Network for Representation Learning of Mouse Social Behaviour

A Cross-Skeleton Interaction Graph Aggregation Network (CS-IGANet) is proposed to learn the abundant dynamics of freely interacting mice, and a novel Interaction-Aware Transformer (IAT) is designed to dynamically learn the graph-level representation of social behaviours and update the node-level representation, guided by the proposed interaction-aware self-attention mechanism.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics, and a new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation is introduced.

Mimetics: Towards Understanding Human Actions Out of Context

This paper proposes to benchmark action recognition methods in such absence of context and introduces a novel dataset, Mimetics, consisting of mimed actions for a subset of 50 classes from the Kinetics benchmark, and shows that applying a shallow neural network with a single temporal convolution over body pose features transferred to the action recognition problem performs surprisingly well.

PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding

A new large-scale benchmark (PKU-MMD) for continuous multi-modality 3D human action understanding is presented, covering a wide range of complex human activities with well-annotated information to benefit future research on action detection for the community.

Interpretable 3D Human Action Analysis with Temporal Convolutional Networks

  • Tae Soo Kim, A. Reiter
  • 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017
This work proposes to use a new class of models known as Temporal Convolutional Neural Networks (TCN) for 3D human action recognition, and aims to take a step towards a spatio-temporal model that is easier to understand, explain and interpret.

Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition

A novel two-stream adaptive graph convolutional network (2s-AGCN) for skeleton-based action recognition that increases the flexibility of the model for graph construction and brings more generality to adapt to various data samples.
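A toy sketch of the adaptive-adjacency idea in one graph-convolution layer, assuming a fixed skeleton adjacency plus a freely learned offset (2s-AGCN additionally learns a data-dependent term, omitted here; all sizes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, C_in, C_out, T = 5, 3, 8, 4        # joints, in/out channels, frames (toy sizes)

# Fixed physical-skeleton adjacency (chain with self-loops), row-normalized.
A = np.eye(V)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
A = A / A.sum(axis=1, keepdims=True)

B = rng.normal(scale=0.01, size=(V, V))   # freely learned adjacency offset
W = rng.normal(size=(C_in, C_out))        # pointwise feature transform

def agcn_layer(x, A, B, W):
    # One adaptive graph-conv layer: aggregate joint features over A + B,
    # then apply the channel transform W. x has shape (T, V, C_in).
    return np.einsum("uv,tvc,cd->tud", A + B, x, W)

x = rng.normal(size=(T, V, C_in))
y = agcn_layer(x, A, B, W)                # shape (T, V, C_out)
```

Because `B` is an unconstrained parameter, gradient descent can create or suppress joint-to-joint connections that the fixed skeleton graph does not encode, which is the source of the added flexibility.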

Towards Understanding Action Recognition

It is found that high-level pose features greatly outperform low/mid level features, in particular, pose over time is critical, but current pose estimation algorithms are not yet reliable enough to provide this information.

View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition

A novel view adaptation scheme is proposed, which automatically determines the virtual observation viewpoints over the course of an action in a learning-based, data-driven manner, together with a two-stream scheme (referred to as VA-fusion) that fuses the scores of the two networks to provide the final prediction, obtaining enhanced performance.

NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis

A large-scale dataset for RGB+D human action recognition, with more than 56 thousand video samples and 4 million frames collected from 40 distinct subjects, is introduced, and a new recurrent neural network structure is proposed to model the long-term temporal correlation of the features for each body part and utilize them for better action classification.

ActionXPose: A Novel 2D Multi-view Pose-based Algorithm for Real-time Human Action Recognition

ActionXPose is one of the first algorithms to exploit 2D human poses for HAR; it achieves real-time performance, is robust to camera movement and to changes in subject proximity, viewpoint, and subject appearance, and provides a high degree of generalization.