Neural monocular 3D human motion capture with physical awareness

@article{shimada_physionical_tog,
  title={Neural monocular 3D human motion capture with physical awareness},
  author={Soshi Shimada and Vladislav Golyanik and Weipeng Xu and Patrick Pérez and Christian Theobalt},
  journal={ACM Transactions on Graphics (TOG)},
  pages={1--15}
}
We present a new trainable system for physically plausible markerless 3D human motion capture that achieves state-of-the-art results in a broad range of challenging scenarios. Unlike most neural methods for human motion capture, our approach, which we dub "physionical", is aware of physical and environmental constraints. It combines, in a fully differentiable way, several key innovations: 1) a proportional-derivative controller, with gains predicted by a neural network, that reduces…
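The abstract refers to a proportional-derivative (PD) controller whose gains are predicted by a neural network. A minimal sketch of the underlying PD control law, with illustrative hand-picked gains standing in for the network's predictions (the gain values and joint dimensions here are hypothetical, not from the paper):

```python
import numpy as np

def pd_torques(q, q_dot, q_target, kp, kd):
    """Per-joint PD control torques: tau = kp * (q_target - q) - kd * q_dot.
    In the paper, kp and kd would be predicted per frame by a neural network;
    here they are fixed, illustrative values."""
    return kp * (q_target - q) - kd * q_dot

# Toy usage with 3 joints (all values hypothetical).
q        = np.array([0.1, -0.2,  0.0])   # current joint angles
q_dot    = np.array([0.0,  0.5, -0.1])   # current joint velocities
q_target = np.array([0.0,  0.0,  0.0])   # target joint angles
kp       = np.array([300.0, 300.0, 200.0])  # proportional gains
kd       = np.array([20.0,  20.0,  10.0])   # derivative gains

tau = pd_torques(q, q_dot, q_target, kp, kd)
```

The proportional term pulls each joint toward its target pose while the derivative term damps the motion; making both gain vectors network outputs lets the system modulate stiffness per joint and per frame.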


Gravity-Aware Monocular 3D Human-Object Reconstruction
GraviCap is a new approach for joint markerless 3D human motion capture and object trajectory estimation from monocular RGB videos; by exploiting gravity as a constraint on object motion, it recovers absolute scale, object trajectories, human bone lengths in meters, and the ground plane's orientation.
Exploring Versatile Prior for Human Motion via Motion Frequency Guidance
This paper designs a framework to learn a versatile motion prior that models the inherent probability distribution of human motions, and proposes a global orientation normalization to remove redundant environment information from the original motion data space.
D-Grasp: Physically Plausible Dynamic Grasp Synthesis for Hand-Object Interactions
This work proposes a novel method that frames the dynamic grasp synthesis task in the reinforcement learning framework and leverages a physics simulation, both to learn and to evaluate such dynamic interactions.


Joint 3D Human Motion Capture and Physical Analysis from Monocular Videos
This work proposes an algorithm combining monocular 3D pose estimation with physics-based modeling to introduce a statistical framework for fast and robust 3D motion analysis from 2D video-data.
Video-based 3D motion capture through biped control
This work estimates human motion from monocular video by recovering three-dimensional controllers capable of implicitly simulating the observed human behavior and replaying it in other environments and under physical perturbations, using a state-space biped controller with a balance feedback mechanism.
MonoPerfCap: Human Performance Capture from Monocular Video
This work presents the first marker-less approach for temporally coherent 3D performance capture of a human with general clothing from monocular video, significantly outperforming previous monocular methods in accuracy, robustness, and the scene complexity it can handle.
Contact and Human Dynamics from Monocular Video
A physics-based method for inferring 3D human motion from video sequences that takes initial 2D and 3D pose estimates as input and produces motions that are significantly more realistic than those from purely kinematic methods, substantially improving quantitative measures of both kinematics and dynamic plausibility.
MotioNet: 3D Human Motion Reconstruction from Monocular Video with Skeleton Consistency
MotioNet is a deep neural network that reconstructs the motion of a 3D human skeleton directly from monocular video; it is the first data-driven approach that directly outputs a kinematic skeleton, a complete and commonly used motion representation.
Learning 3D Human Pose from Structure and Motion
This work proposes two anatomically inspired loss functions and uses them with a weakly-supervised learning framework to jointly learn from large-scale in-the-wild 2D and indoor/synthetic 3D data and presents a simple temporal network that exploits temporal and structural cues present in predicted pose sequences to temporally harmonize the pose estimations.
VIBE: Video Inference for Human Body Pose and Shape Estimation
This work defines a novel temporal network architecture with a self-attention mechanism and shows that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels.
Real-time Physics-based Motion Capture with Sparse Sensors
This work proposes a framework for real-time tracking of humans using sparse multi-modal sensor sets, including data obtained from optical markers and inertial measurement units, and shows that the system can track and simulate a wide range of dynamic movements including bipedal gait, ballistic movements such as jumping, and interaction with the environment.
Structure from Articulated Motion: Accurate and Stable Monocular 3D Reconstruction without Training Data
This paper introduces a model-based method called Structure from Articulated Motion (SfAM), which can recover multiple object and motion types without training on extensive data collections, offering a new perspective on monocular 3D recovery of articulated structures, including human motion capture.