One-shot action recognition in challenging therapy scenarios

@article{Sabater2021OneshotAR,
  title={One-shot action recognition in challenging therapy scenarios},
  author={Alberto Sabater and Laura Santos and Jos{\'e} Santos-Victor and Alexandre Bernardino and Luis Montesano and Ana Cristina Murillo},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year={2021},
  pages={2771-2779}
}
One-shot action recognition aims to recognize new action categories from a single reference example, typically referred to as the anchor example. This work presents a novel approach for one-shot action recognition in the wild that computes motion representations robust to variable kinematic conditions. One-shot action recognition is then performed by comparing the anchor and target motion representations. We also develop a set of complementary steps that boost the action recognition performance in…
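
The matching step lends itself to a compact illustration. Below is a minimal sketch of one-shot classification by nearest-anchor matching with cosine similarity; the 128-dimensional random vectors stand in for embeddings that a trained motion encoder would produce, and the function names are hypothetical, not the paper's own.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two motion-representation vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def one_shot_classify(target_emb: np.ndarray, anchor_embs: dict) -> str:
    """Label the target with the class of its most similar anchor.

    anchor_embs maps class label -> embedding of the single anchor example.
    """
    return max(anchor_embs, key=lambda c: cosine_similarity(target_emb, anchor_embs[c]))

# Toy usage with random stand-in vectors; a real system would obtain
# these from a trained motion encoder.
rng = np.random.default_rng(0)
anchors = {"wave": rng.normal(size=128), "clap": rng.normal(size=128)}
target = anchors["wave"] + 0.1 * rng.normal(size=128)
print(one_shot_classify(target, anchors))  # -> "wave"
```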

Citations

Domain and View-Point Agnostic Hand Action Recognition

This work introduces a novel skeleton-based hand motion representation model that is agnostic to the application domain and camera viewpoint, and achieves performance comparable to intra-domain state-of-the-art methods.

Delving Deep into One-Shot Skeleton-based Action Recognition with Diverse Occlusions

Trans4SOAR is introduced: a new transformer-based model that leverages three data streams and a mixed attention fusion mechanism to alleviate the adverse effects caused by occlusions, and yields state-of-the-art results on standard SOAR without occlusions.
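
As a rough illustration of fusing multiple skeleton-derived streams with attention (the three streams and the fusion below are simplified assumptions, not Trans4SOAR's actual architecture), one might write:

```python
import torch
import torch.nn as nn

# Three hypothetical token streams (e.g. joints, velocities, bones),
# concatenated and mixed with self-attention across streams.
dim = 64
fuse = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
streams = [torch.randn(2, 50, dim) for _ in range(3)]  # (batch, tokens, dim) each
tokens = torch.cat(streams, dim=1)                     # (batch, 150, dim)
fused, _ = fuse(tokens, tokens, tokens)                # cross-stream mixing
clip_repr = fused.mean(dim=1)                          # (batch, dim) clip embedding
```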

Learning Spatial-Preserved Skeleton Representations for Few-Shot Action Recognition

This work proposes a novel spatial matching strategy consisting of spatial disentanglement and spatial activation that can be effectively inserted into existing temporal alignment frameworks, achieving considerable performance improvements as well as inherent explainability.

MotionBERT: Unified Pretraining for Human Motion Analysis

MotionBERT is a unified pretraining framework for human motion analysis that tackles 3D pose estimation, skeleton-based action recognition, and mesh recovery. It achieves state-of-the-art performance on all three downstream tasks by simply tuning the pretrained motion encoder with 1-2 linear layers, demonstrating the versatility of the learned motion representations.
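
A minimal PyTorch sketch of that tuning recipe, assuming a pretrained encoder that maps a motion sequence to a fixed-size embedding; `DummyEncoder` and all dimensions are stand-ins, not MotionBERT's actual interface.

```python
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Stand-in for a pretrained motion encoder (naive temporal mean-pool)."""
    def forward(self, motion):                 # motion: (batch, frames, feat)
        return motion.mean(dim=1)              # (batch, feat)

class ActionHead(nn.Module):
    """Frozen pretrained encoder plus a small 2-layer head, mirroring the
    'tune with 1-2 linear layers' recipe."""
    def __init__(self, encoder: nn.Module, emb_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():    # keep pretrained weights fixed
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(emb_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, num_classes),
        )

    def forward(self, motion):
        with torch.no_grad():
            z = self.encoder(motion)           # frozen features
        return self.head(z)

model = ActionHead(DummyEncoder(), emb_dim=17 * 3, num_classes=60)
logits = model(torch.randn(4, 30, 17 * 3))     # (batch=4, classes=60)
```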

ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers

Inspired by the recent success of feature-enhancement methods in semi-supervised learning, ProFormer is introduced: an improved training strategy that applies soft attention to iteratively estimated action-category prototypes, which are used to augment the embeddings and compute an auxiliary consistency loss.
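
One plausible reading of the prototype-based augmentation and consistency loss, sketched in PyTorch; the function names, temperature, and KL-based consistency term are assumptions, not ProFormer's verbatim formulation.

```python
import torch
import torch.nn.functional as F

def prototype_augment(z, prototypes, tau=0.1):
    """Soft attention over class prototypes, used to augment embeddings.

    z:          (batch, dim) sample embeddings
    prototypes: (classes, dim) current per-class prototype estimates
    """
    attn = F.softmax(z @ prototypes.T / tau, dim=-1)  # (batch, classes)
    z_aug = z + attn @ prototypes                     # mix prototype info in
    return z_aug, attn

def consistency_loss(logits_plain, logits_aug):
    """Auxiliary loss: predictions from plain and augmented embeddings
    should agree (KL between the two class distributions)."""
    p = F.log_softmax(logits_aug, dim=-1)
    q = F.softmax(logits_plain, dim=-1)
    return F.kl_div(p, q, reduction="batchmean")
```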

References

Showing 10 of 26 references.

One-Shot Learning for Real-Time Action Recognition

The main contribution of the paper is a real-time system for one-shot action modeling; the paper also highlights the effectiveness of sparse coding techniques for representing 3D actions.
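
For a sense of how sparse coding can represent actions, here is a small sketch using scikit-learn's `DictionaryLearning`; the pose descriptors are random stand-ins, and a real-time system would precompute the dictionary offline and only encode at test time.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Learn a dictionary over per-frame pose descriptors and represent each
# frame by its sparse codes (illustrative dimensions, random data).
X = np.random.randn(200, 60)                   # 200 frames of 60-D descriptors
dico = DictionaryLearning(n_components=32, alpha=1.0, max_iter=20)
codes = dico.fit_transform(X)                  # sparse codes, shape (200, 32)

# A clip-level action representation can then be e.g. the pooled codes.
clip_repr = codes.mean(axis=0)                 # (32,)
```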

Skeleton based Zero Shot Action Recognition in Joint Pose-Language Semantic Space

This work presents a body-pose-based zero-shot action recognition network and demonstrates how its joint pose-language semantic space encodes knowledge that allows the model to correctly predict actions not seen during training.

Interpretable 3D Human Action Analysis with Temporal Convolutional Networks

Tae Soo Kim and A. Reiter. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
This work proposes to use a new class of models known as Temporal Convolutional Neural Networks (TCN) for 3D human action recognition, and aims to take a step towards a spatio-temporal model that is easier to understand, explain and interpret.
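
A minimal temporal convolutional classifier over skeleton sequences, in the spirit of the TCN models described above; the layer sizes, the 25-joint skeleton, and the 60-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn

# 1D convolutions along time; channels are the flattened joint coordinates.
tcn = nn.Sequential(
    nn.Conv1d(in_channels=25 * 3, out_channels=64, kernel_size=9, padding=4),
    nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=9, padding=4),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                   # pool over time
    nn.Flatten(),
    nn.Linear(64, 60),                         # e.g. 60 action classes
)

x = torch.randn(8, 25 * 3, 100)                # (batch, joints*3, frames)
print(tcn(x).shape)                            # torch.Size([8, 60])
```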

View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition

This work proposes a novel view adaptation scheme, which automatically determines the virtual observation viewpoints over the course of an action in a learning-based, data-driven manner, together with a two-stream scheme (VA-fusion) that fuses the scores of the two networks to produce the final prediction, obtaining enhanced performance.
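
A simplified sketch of the view-adaptation idea: regress a virtual viewpoint from the sequence and re-observe the skeleton from it. Only a yaw angle from a crude global summary is predicted here, which is far simpler than the paper's full rotation-and-translation scheme.

```python
import torch
import torch.nn as nn

class ViewAdapt(nn.Module):
    """Predict a virtual-viewpoint rotation (yaw only, as a simplification)
    and rotate the skeleton sequence into that view."""
    def __init__(self):
        super().__init__()
        self.regress = nn.Linear(3, 1)          # yaw from a crude global summary

    def forward(self, joints):                  # joints: (batch, frames, J, 3)
        summary = joints.mean(dim=(1, 2))       # (batch, 3)
        theta = self.regress(summary).squeeze(-1)
        c, s = torch.cos(theta), torch.sin(theta)
        zero, one = torch.zeros_like(c), torch.ones_like(c)
        rot = torch.stack([
            torch.stack([c, zero, s], dim=-1),
            torch.stack([zero, one, zero], dim=-1),
            torch.stack([-s, zero, c], dim=-1),
        ], dim=-2)                               # (batch, 3, 3) yaw matrices
        return torch.einsum("bij,bfkj->bfki", rot, joints)

rotated = ViewAdapt()(torch.randn(2, 20, 25, 3))  # same shape as input
```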

Make Skeleton-based Action Recognition Model Smaller, Faster and Better

This work proposes a Double-feature Double-motion Network (DD-Net) for skeleton-based action recognition that runs extremely fast while achieving state-of-the-art performance on the SHREC and JHMDB datasets.
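
The "double-motion" part can be pictured as frame differences at two temporal scales; the sketch below is a simplified reading with hypothetical names (DD-Net additionally uses joint-distance features as its second input).

```python
import numpy as np

def double_motion_features(joints: np.ndarray):
    """Motion cues at two temporal scales from a joint sequence.

    joints: (frames, J, 3) 3D coordinates.
    Returns slow motion (1-frame differences) and fast motion
    (2-frame differences), zero-padded back to the input length.
    """
    slow = np.diff(joints, n=1, axis=0)              # (frames-1, J, 3)
    fast = joints[2:] - joints[:-2]                  # (frames-2, J, 3)
    pad = lambda m, k: np.concatenate([m, np.zeros((k, *m.shape[1:]))], axis=0)
    return pad(slow, 1), pad(fast, 2)

slow, fast = double_motion_features(np.random.randn(32, 25, 3))
```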

Action2Vec: A Crossmodal Embedding Approach to Action Learning

A novel cross-modal embedding space for actions, named Action2Vec, which combines linguistic cues from class labels with spatio-temporal features derived from video clips, and is the first to be thoroughly evaluated with respect to its distributional semantics.
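
A common recipe for learning such a joint space is a hinge-based ranking loss over matched video/label pairs; the sketch below follows that recipe and is not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def crossmodal_ranking_loss(video_emb, text_emb, margin=0.2):
    """Pull each clip toward its own label embedding, push it from others.

    video_emb, text_emb: (batch, dim); row i of each is a matched pair.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sims = v @ t.T                               # (batch, batch) similarities
    pos = sims.diag().unsqueeze(1)               # matched-pair similarity
    mask = 1.0 - torch.eye(sims.size(0))         # ignore the positive itself
    return (F.relu(margin + sims - pos) * mask).mean()

loss = crossmodal_ranking_loss(torch.randn(8, 256), torch.randn(8, 256))
```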

Global Context-Aware Attention LSTM Networks for 3D Action Recognition

This work proposes a new class of LSTM network, Global Context-Aware Attention LSTM (GCA-LSTM), for 3D action recognition, which is able to selectively focus on the informative joints in the action sequence with the assistance of global contextual information.
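
The selective-focus mechanism can be sketched as attention over per-joint features driven by a global context vector; the bilinear scoring and the mean-pooled first-pass context below are illustrative assumptions, not GCA-LSTM's exact recurrent formulation.

```python
import torch
import torch.nn.functional as F

def joint_attention(joint_feats, global_ctx, W):
    """Re-weight per-joint features by relevance to a global context vector,
    so informative joints dominate the pooled representation.

    joint_feats: (batch, joints, dim); global_ctx: (batch, dim); W: (dim, dim)
    """
    scores = torch.einsum("bjd,de,be->bj", joint_feats, W, global_ctx)
    attn = F.softmax(scores, dim=-1)             # attention over joints
    return (attn.unsqueeze(-1) * joint_feats).sum(dim=1)  # (batch, dim)

feats = torch.randn(4, 25, 64)
ctx = feats.mean(dim=1)                          # crude first-pass global context
pooled = joint_attention(feats, ctx, torch.randn(64, 64))
```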

NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding

This work introduces a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and contains more than 114 thousand video samples and 8 million frames, and investigates a novel one-shot 3D activity recognition problem on this dataset.

On Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks

This work provides a simple, universal spatial modeling method that is orthogonal to RNN model enhancements, and selects a set of simple geometric features, motivated by the evolution of previous work, that outperform other features and achieve state-of-the-art results on four datasets.
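
Pairwise joint distances are a representative member of the simple geometric feature families compared in such work (the paper's best performers involve joint-line distances); a minimal sketch:

```python
import numpy as np

def joint_joint_distances(pose: np.ndarray) -> np.ndarray:
    """All pairwise joint distances for one frame.

    pose: (J, 3) joint coordinates; returns the flattened upper triangle
    of the J x J distance matrix, i.e. J*(J-1)/2 values.
    """
    diff = pose[:, None, :] - pose[None, :, :]   # (J, J, 3) displacement
    d = np.linalg.norm(diff, axis=-1)            # (J, J) distances
    iu = np.triu_indices(len(pose), k=1)
    return d[iu]

feats = joint_joint_distances(np.random.randn(25, 3))  # (300,)
```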

Spatio-Temporal Phrases for Activity Recognition

This paper proposes an approach that efficiently identifies both local and long-range motion interactions; it can capture, for example, the combination of one person's hand movement and another person's foot response, whose local features are far apart both spatially and temporally.