Action2Motion: Conditioned Generation of 3D Human Motions

@article{Guo2020Action2MotionCG,
  title={Action2Motion: Conditioned Generation of 3D Human Motions},
  author={Chuan Guo and Xinxin Zuo and Sen Wang and Shihao Zou and Qingyao Sun and Annan Deng and Minglun Gong and Li Cheng},
  journal={Proceedings of the 28th ACM International Conference on Multimedia},
  year={2020}
}
  • Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, Li Cheng
  • Published 30 July 2020
  • Computer Science
  • Proceedings of the 28th ACM International Conference on Multimedia
Action recognition is a relatively established task: given an input sequence of human motion, the goal is to predict its action category. This paper, on the other hand, considers a relatively new problem, which can be thought of as the inverse of action recognition: given a prescribed action type, we aim to generate plausible human motion sequences in 3D. Importantly, the set of generated motions is expected to maintain its diversity so as to explore the entire action-conditioned…
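Although the abstract is truncated here, the setup it describes (sample diverse 3D motion sequences given an action label) is concrete enough to sketch. Below is a minimal, illustrative PyTorch sketch of an action-conditioned temporal VAE, not the paper's exact architecture (Action2Motion itself builds a recurrent VAE over a Lie-algebra pose representation); every module size and name is an assumption.

```python
# Illustrative sketch only: an action label conditions both the encoder and
# the decoder of a temporal VAE, so sampling different latents yields
# diverse motions for the same action. All sizes/names are assumptions.
import torch
import torch.nn as nn

class ActionConditionedVAE(nn.Module):
    def __init__(self, pose_dim=72, num_actions=12, hidden=128, latent=32):
        super().__init__()
        self.encoder = nn.GRU(pose_dim + num_actions, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.decoder = nn.GRU(latent + num_actions, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, poses, action_onehot):
        # poses: (B, T, pose_dim); action_onehot: (B, num_actions)
        T = poses.size(1)
        a = action_onehot.unsqueeze(1).expand(-1, T, -1)
        _, h = self.encoder(torch.cat([poses, a], dim=-1))
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        zin = torch.cat([z.unsqueeze(1).expand(-1, T, -1), a], dim=-1)
        out, _ = self.decoder(zin)
        return self.to_pose(out), mu, logvar
```

At test time, holding the action label fixed while sampling z ~ N(0, I) is what produces the diversity the abstract calls for.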
Action-Conditioned 3D Human Motion Synthesis with Transformer VAE
TLDR
This work designs a Transformer-based architecture, ACTOR, for encoding and decoding a sequence of parametric SMPL human body models estimated from action recognition datasets, and learns an action-aware latent representation for human motions by training a generative variational autoencoder (VAE).
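As a rough illustration of the distribution-token trick this TLDR alludes to, the sketch below prepends learnable per-action tokens to the embedded pose sequence and reads the latent Gaussian parameters off the Transformer outputs at those positions. The dimensions, the omitted positional encoding, and all names are simplifying assumptions, not ACTOR's actual implementation.

```python
# Hedged sketch of a Transformer-VAE encoder with per-action (mu, logvar)
# tokens; positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class TransformerVAEEncoder(nn.Module):
    def __init__(self, pose_dim=72, num_actions=12, d_model=256, latent=256):
        super().__init__()
        self.embed = nn.Linear(pose_dim, d_model)
        # one (mu, logvar) token pair per action class
        self.mu_token = nn.Parameter(torch.randn(num_actions, d_model))
        self.logvar_token = nn.Parameter(torch.randn(num_actions, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_latent = nn.Linear(d_model, latent)

    def forward(self, poses, action_ids):
        # poses: (B, T, pose_dim); action_ids: (B,) long tensor
        x = self.embed(poses)
        mu_tok = self.mu_token[action_ids].unsqueeze(1)      # (B, 1, d_model)
        lv_tok = self.logvar_token[action_ids].unsqueeze(1)  # (B, 1, d_model)
        out = self.encoder(torch.cat([mu_tok, lv_tok, x], dim=1))
        return self.to_latent(out[:, 0]), self.to_latent(out[:, 1])
```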
Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction
TLDR
The task of action-driven stochastic human motion prediction is introduced, which aims to predict multiple plausible future motions given a sequence of action labels and a short motion history, and a VAE-based model conditioned on both the observed motion and the action label sequence is designed.
Scene-aware Generative Network for Human Motion Synthesis
TLDR
This paper proposes a new framework, which factorizes the distribution of human motions into a distribution of movement trajectories conditioned on scenes and that of body pose dynamics conditioned on both scenes and trajectories, and derives a GAN-based learning approach.
Implicit Neural Representations for Variable Length Human Motion Generation
TLDR
It is shown that variable-length motions generated by the proposed action-conditional human motion generation method are better, in terms of realism and diversity, than fixed-length motions generated by the state-of-the-art method.
HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical VAE
TLDR
This paper proposes the Hierarchical Transformer Dynamical Variational Autoencoder (HiT-DVAE), which implements auto-regressive generation with transformer-like attention mechanisms, enabling the generative model to learn a more complex and time-varying latent space as well as diverse and realistic human motions.
A Unified 3D Human Motion Synthesis Model via Conditional Variational Auto-Encoder
TLDR
A unified framework based on a Conditional Variational Auto-Encoder (CVAE) is proposed, where an arbitrary input is treated as a masked motion series and a parametric distribution over the missing regions is estimated from the input conditions.
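Under this masked-motion view, synthesis, prediction, and completion all reduce to filling in masked frames. The sketch below is a hedged illustration of that reduction, with the network emitting a per-frame Gaussian over the missing poses; the architecture is an illustrative stand-in, not the paper's CVAE.

```python
# Illustrative sketch: a visibility mask turns generation, prediction, and
# completion into one fill-in-the-frames task. All sizes are assumptions.
import torch
import torch.nn as nn

class MaskedMotionModel(nn.Module):
    def __init__(self, pose_dim=72, hidden=128):
        super().__init__()
        # +1 input channel carries the visibility mask
        self.rnn = nn.GRU(pose_dim + 1, hidden, batch_first=True,
                          bidirectional=True)
        self.to_mu = nn.Linear(2 * hidden, pose_dim)
        self.to_logvar = nn.Linear(2 * hidden, pose_dim)

    def forward(self, poses, mask):
        # poses: (B, T, pose_dim); mask: (B, T, 1), 1 = observed, 0 = missing
        x = torch.cat([poses * mask, mask], dim=-1)
        h, _ = self.rnn(x)
        return self.to_mu(h), self.to_logvar(h)  # per-frame Gaussian over poses
```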
TEMOS: Generating diverse human motions from textual descriptions
TLDR
This work proposes TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data, in combination with a text encoder that produces distribution parameters compatible with the VAE latent space.
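To make the cross-modal pairing concrete, here is a hedged sketch of a text branch that outputs Gaussian parameters in the motion VAE's latent space, so a motion can be decoded from text alone at test time. A GRU over token embeddings stands in for the pretrained language model TEMOS actually uses, and all sizes are assumptions.

```python
# Hedged sketch: text -> Gaussian in the motion VAE's latent space.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, latent=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.to_mu = nn.Linear(d_model, latent)
        self.to_logvar = nn.Linear(d_model, latent)

    def forward(self, token_ids):
        # token_ids: (B, L) integer tokens of the description
        _, h = self.rnn(self.embed(token_ids))
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # z feeds the motion decoder; a KL term aligns the two branches
        return z, mu, logvar
```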
ActFormer: A GAN Transformer Framework towards General Action-Conditioned 3D Human Motion Generation
TLDR
This work presents a GAN Transformer framework for general action-conditioned 3D human motion generation, including not only single-person actions but also multi-person interactive actions, and demonstrates adaptability to various human motion representations.
BABEL: Bodies, Action and Behavior with English Labels
TLDR
BABEL is presented, a large dataset with language labels describing the actions being performed in mocap sequences, which can serve as a useful benchmark for progress in 3D action recognition.
Towards Diverse and Natural Scene-aware 3D Human Motion Synthesis
TLDR
A hierarchical framework is proposed, with each sub-module responsible for modeling one aspect of the diversity of scene-aware human motions, and the results show that the proposed framework remarkably outperforms previous methods in terms of diversity and naturalness.
...

References

SHOWING 1-10 OF 38 REFERENCES
MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics
TLDR
This work presents a novel Motion Transformation Variational Auto-Encoder (MT-VAE) for motion sequence generation that jointly learns a feature embedding for motion modes and a feature transformation representing the transition from one motion mode to the next.
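The distinctive point is that the latent code parameterizes a transformation between mode embeddings rather than a motion directly. A minimal sketch of that idea, with all sizes and names assumed:

```python
# Illustrative sketch: z drives a transformation of the current motion-mode
# embedding into the next one, rather than decoding a motion directly.
import torch
import torch.nn as nn

class MotionTransformation(nn.Module):
    def __init__(self, embed_dim=128, latent=32):
        super().__init__()
        # maps (current-mode embedding, z) -> additive change of embedding
        self.transform = nn.Sequential(
            nn.Linear(embed_dim + latent, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, current_embedding, z):
        delta = self.transform(torch.cat([current_embedding, z], dim=-1))
        return current_embedding + delta  # embedding of the next motion mode
```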
Towards Natural and Accurate Future Motion Prediction of Humans and Animals
TLDR
A hierarchical recurrent network structure is developed to simultaneously encode local contexts of individual frames and global contexts of the sequence, achieving more natural and accurate predictions than state-of-the-art methods.
Adversarial Geometry-Aware Human Motion Prediction
TLDR
This work proposes a novel frame-wise geodesic loss as a geometrically meaningful, more precise distance measurement and presents a new learning procedure to simultaneously validate the sequence-level plausibility of the prediction and its coherence with the input sequence by introducing two global recurrent discriminators.
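The frame-wise geodesic loss itself is easy to state: the geodesic distance on SO(3) between two rotations is the angle of the relative rotation, recoverable from its trace. A small PyTorch sketch (shapes are assumptions):

```python
# Geodesic distance on SO(3): the rotation angle of R_pred^T R_gt,
# recovered from its trace via arccos((tr - 1) / 2).
import torch

def geodesic_loss(R_pred, R_gt, eps=1e-7):
    # R_pred, R_gt: (..., 3, 3) rotation matrices
    R_rel = R_pred.transpose(-1, -2) @ R_gt
    trace = R_rel[..., 0, 0] + R_rel[..., 1, 1] + R_rel[..., 2, 2]
    cos = ((trace - 1.0) / 2.0).clamp(-1.0 + eps, 1.0 - eps)
    return torch.acos(cos).mean()  # mean rotation angle in radians
```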
VIBE: Video Inference for Human Body Pose and Shape Estimation
TLDR
This work defines a novel temporal network architecture with a self-attention mechanism and shows that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels.
Language2Pose: Natural Language Grounded Pose Forecasting
TLDR
This paper introduces a neural architecture called Joint Language-to-Pose (or JL2P), which learns a joint embedding of language and pose and evaluates the proposed model on a publicly available corpus of 3D pose data and human-annotated sentences.
View invariant human action recognition using histograms of 3D joints
TLDR
This paper presents a novel approach for human action recognition with histograms of 3D joint locations (HOJ3D) as a compact representation of postures and achieves superior results on a challenging 3D action dataset.
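A rough sketch of the HOJ3D idea: express joints in spherical coordinates about a reference joint and accumulate an azimuth/elevation histogram as a compact, view-robust posture descriptor. The bin counts and the reference-frame choice below are assumptions (the paper aligns a coordinate frame to the torso).

```python
# Illustrative posture descriptor: spherical-coordinate histogram of joints
# around a reference joint. Bin counts and frame choice are assumptions.
import numpy as np

def hoj3d_descriptor(joints, root, n_azimuth=12, n_elevation=6):
    # joints: (J, 3) joint positions; root: (3,) reference joint
    rel = joints - root
    r = np.linalg.norm(rel, axis=1) + 1e-8
    azimuth = np.arctan2(rel[:, 1], rel[:, 0])             # [-pi, pi]
    elevation = np.arcsin(np.clip(rel[:, 2] / r, -1, 1))   # [-pi/2, pi/2]
    hist, _, _ = np.histogram2d(
        azimuth, elevation,
        bins=[n_azimuth, n_elevation],
        range=[[-np.pi, np.pi], [-np.pi / 2, np.pi / 2]],
    )
    return hist.ravel() / hist.sum()  # normalized posture histogram
```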
Deep Video Generation, Prediction and Completion of Human Action Sequences
TLDR
This paper proposes a general, two-stage deep framework to generate human action videos with no constraints or an arbitrary number of constraints, uniformly addressing three problems: video generation given no input frames, video prediction given the first few frames, and video completion given the first and last frames.
Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations
TLDR
A novel approach to human action recognition from 3D skeleton sequences extracted from depth data that uses the covariance matrix of skeleton joint locations over time as a discriminative sequence descriptor, encoding the relationship between joint movement and time.
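The descriptor itself is compact enough to sketch: stack the joint coordinates of each frame into a vector, take the covariance of those vectors over time, and keep the upper triangle (symmetry makes the rest redundant). The paper's temporal hierarchy over sub-windows is omitted in this sketch.

```python
# Illustrative covariance descriptor over a skeleton sequence; the paper's
# temporal hierarchy over sub-windows is omitted here.
import numpy as np

def covariance_descriptor(sequence):
    # sequence: (T, J, 3) joint locations over T frames
    feats = sequence.reshape(sequence.shape[0], -1)  # (T, 3J) per-frame vectors
    cov = np.cov(feats, rowvar=False)                # (3J, 3J) covariance
    iu = np.triu_indices(cov.shape[0])
    return cov[iu]                                   # upper triangle, fixed size
```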
Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group
TLDR
A new skeletal representation that explicitly models the 3D geometric relationships between various body parts using rotations and translations in 3D space is proposed and outperforms various state-of-the-art skeleton-based human action recognition approaches.
...