NTU-X: an enhanced large-scale dataset for improving pose-based recognition of subtle human actions

@inproceedings{Trivedi2021NTUXAE,
  title={NTU-X: an enhanced large-scale dataset for improving pose-based recognition of subtle human actions},
  author={Neel Trivedi and Anirudh Thatipelli and Ravi Kiran Sarvadevabhatla},
  booktitle={Proceedings of the Twelfth Indian Conference on Computer Vision, Graphics and Image Processing},
  year={2021}
}
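A minimal sketch of citing this entry from a LaTeX document (the bibliography filename `refs.bib` and the `plain` style are assumptions, not part of the source):

```latex
% refs.bib contains the entry above (key: Trivedi2021NTUXAE)
\documentclass{article}
\begin{document}
Fine-grained joints matter for recognizing subtle actions~\cite{Trivedi2021NTUXAE}.
\bibliographystyle{plain}
\bibliography{refs}
\end{document}
```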
The lack of fine-grained joints (facial joints, hand fingers) is a fundamental performance bottleneck for state-of-the-art skeleton action recognition models. Despite this bottleneck, the community's efforts seem to be invested only in devising novel architectures. To specifically address this bottleneck, we introduce two new pose-based human action datasets - NTU60-X and NTU120-X. Our datasets extend the largest existing action recognition dataset, NTU RGB+D. In addition to the 25 body joints… 


Skeleton-based Action Recognition in Non-contextual, In-the-wild and Dense Joint Scenarios

This thesis introduces two new pose-based human action recognition datasets, NTU60-X and NTU120-X, which extend the largest existing action recognition dataset, NTU RGB+D, and appropriately modifies state-of-the-art approaches to enable training on the introduced datasets.

NTU-DensePose: A New Benchmark for Dense Pose Action Recognition

This paper proposes NTU-DensePose, a dense-pose-based action recognition dataset that automatically annotates 37,060 video samples with two dense-pose annotations (IUV equidistant and IUV equivalent); it can capture more subtle details and predict human actions more accurately than previous skeleton-based methods.

PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition

PSUMNet is a novel approach for scalable and efficient pose-based action recognition that outperforms competing methods which use 100%-400% more parameters and generalizes to the SHREC hand gesture dataset with competitive performance.

One-Shot Open-Set Skeleton-Based Action Recognition

A novel model is proposed that addresses the FSOSR problem with a one-shot model augmented with a discriminator that rejects “unknown” actions. This is useful for applications in humanoid robotics because it allows new classes to be added easily and determines whether an input sequence is among those known to the system.

HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling

HuMMan is a large-scale multi-modal 4D human dataset with 1000 human subjects, 400k sequences, and 60M frames that voices the need for further study of challenges such as fine-grained action recognition, dynamic human mesh reconstruction, and textured mesh reconstruction.

References

Showing 1-10 of 20 references

Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition

This work proposes an efficient but strong baseline based on Graph Convolutional Network (GCN), where three main improvements are aggregated, i.e., early fused Multiple Input Branches (MIB), Residual GCN (ResGCN) with bottleneck structure and Part-wise Attention (PartAtt) block.

NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding

This work introduces a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and contains more than 114 thousand video samples and 8 million frames, and investigates a novel one-shot 3D activity recognition problem on this dataset.

NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis

A large-scale dataset for RGB+D human action recognition, collected from 40 distinct subjects with more than 56 thousand video samples and 4 million frames, is introduced. A new recurrent neural network structure is proposed to model the long-term temporal correlation of the features for each body part and to utilize them for better action classification.

IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos

The main idea is to let the pose stream decide how much and which appearance information is used in integration based on whether the given pose information is reliable or not, and show that the proposed IntegralAction achieves highly robust performance across in-context and out-of-context action video datasets.

Mining actionlet ensemble for action recognition with depth cameras

An actionlet ensemble model is learnt to represent each action and to capture the intra-class variance, and novel features that are suitable for depth data are proposed.

Quo Vadis, Skeleton Action Recognition ?

The results from benchmarking the top performers of NTU-120 on Skeletics-152 reveal the challenges and domain gap induced by actions 'in the wild', and proposes new frontiers for human action recognition.

Monocular Expressive Body Regression through Body-Driven Attention

This work introduces ExPose (EXpressive POse and Shape rEgression), which directly regresses the body, face, and hands in SMPL-X format from an RGB image, estimating expressive 3D humans more accurately than existing optimization methods at a small fraction of the computational cost.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics; a new two-stream Inflated 3D ConvNet based on 2D ConvNet inflation is introduced.

A Survey on 3D Skeleton-Based Action Recognition Using Learning Method

This survey highlights the necessity of action recognition and the significance of 3D skeleton data, and gives an overall discussion of deep learning-based action recognition using 3D skeleton data.

Expressive Body Capture: 3D Hands, Face, and Body From a Single Image

This work uses the new method, SMPLify-X, to fit SMPL-X to both controlled images and images in the wild, and evaluates 3D accuracy on a new curated dataset comprising 100 images with pseudo ground-truth.