Motion Capture from Internet Videos

Junting Dong, Qing Shuai, Y. Zhang, Xian Liu, Xiaowei Zhou, Hujun Bao
Recent advances in image-based human pose estimation make it possible to capture 3D human motion from a single RGB video. However, the inherent depth ambiguity and self-occlusion in a single view prohibit the recovery of as high-quality motion as multi-view reconstruction. While multi-view videos are not common, the videos of a celebrity performing a specific action are usually abundant on the Internet. Even if these videos were recorded at different time instances, they would encode the same…

Learning Motion Priors for 4D Human Body Capture in 3D Scenes
Recovering high-quality 3D human motion in complex scenes from monocular videos is important for many applications, ranging from AR/VR to robotics. However, capturing realistic human-scene…
Human Mesh Recovery from Multiple Shots
This work builds on the insight that, while shot changes of the same scene introduce a discontinuity between frames, the 3D structure of the scene still changes smoothly. This allows frames before and after a shot change to be treated as a multi-view signal that provides strong cues for recovering the 3D state of the actors.
Consistent 3D Human Shape from Repeatable Action
We introduce a novel method for reconstructing the 3D human body from a video of a person in action. Our method recovers a single clothed body model that can explain all frames in the input. Our…
Reconstructing 3D Human Pose by Watching Humans in the Mirror
An optimization-based approach is developed that exploits mirror symmetry constraints for accurate 3D pose reconstruction and provides a method to estimate the surface normal of the mirror from vanishing points in the single image.
Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization
A multi-level feature optimization framework is proposed to improve the generalization and temporal modeling ability of learned video representations, together with a simple temporal modeling module built on multi-level features to enhance motion pattern learning.
Graph Matching for Marker Labeling and Missing Marker Reconstruction With Bone Constraint by LSTM in Optical Motion Capture
A novel graph matching method is employed to determine the connection relationship of the scattered motion data for a single frame, and a new motion data preprocessing method that considers the bone length constraint is proposed, based on the variation in the relative position of adjacent markers.
A-NeRF: Surface-free Human 3D Pose Refinement via Neural Rendering
Fig. 1. Our A-NeRF test-time optimization for monocular 3D human pose estimation jointly learns a volumetric body model of the user that can be animated and works with diverse body shapes (left),…
Animatable Neural Radiance Fields for Human Body Modeling
This paper addresses the challenge of reconstructing an animatable human model from a multi-view video by introducing neural blend weight fields to produce the deformation fields and shows that this approach significantly outperforms recent human synthesis methods.
Learning Compositional Representation for 4D Captures with Neural ODE
This paper introduces a compositional representation for 4D captures, i.e., a deforming 3D object over a temporal span, that disentangles shape, initial state, and motion, and proposes an Identity Exchange Training (IET) strategy to encourage the network to learn to decouple each component effectively.
Learning Transferable Kinematic Dictionary for 3D Human Pose and Shape Reconstruction
Estimating 3D human pose and shape from a single image is highly under-constrained. To address this ambiguity, we propose a novel prior, namely kinematic dictionary, which explicitly regularizes the…


Outdoor Human Motion Capture by Simultaneous Optimization of Pose and Camera Parameters
The approach can track multiple people even in front of cluttered and non-static backgrounds, using unsynchronized cameras with varying image quality and frame rate. It can be adopted in many practical applications to replace complex and expensive motion capture studios with a few consumer-grade cameras, even in uncontrolled outdoor scenes.
Learning 3D Human Dynamics From Video
The approach is designed so it can learn from videos with 2D pose annotations in a semi-supervised manner and obtain state-of-the-art performance on the 3D prediction task without any fine-tuning.
Towards Accurate Marker-Less Human Shape and Pose Estimation over Time
  • Yinghao Huang
  • Computer Science
  • 2017 International Conference on 3D Vision (3DV)
  • 2017
This work presents a fully automatic method that, given multi-view videos, estimates 3D human pose and body shape. It takes the recently proposed SMPLify method as its base and extends it in several ways.
Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video
This paper addresses the challenge of 3D full-body human pose estimation from a monocular image sequence with a novel approach that integrates a sparsity-driven 3D geometric prior and temporal smoothness and outperforms a publicly available 2D pose estimation baseline on the challenging PennAction dataset.
Monocular Total Capture: Posing Face, Body, and Hands in the Wild
This work presents the first method to capture the 3D total motion of a target person from a monocular view input, and leverages a 3D deformable human model to reconstruct total body pose from the CNN outputs with the aid of the pose and shape prior in the model.
Outdoor Markerless Motion Capture with Sparse Handheld Video Cameras
This work builds on generative motion capture methods and proposes a novel model-view consistency that considers both foreground and background in the tracking stage, which outperforms several alternative methods on various examples demonstrated in the paper.
Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras
This approach unites a discriminative image-based joint detection method with a model-based generative motion tracking algorithm through a combined pose optimization energy, which enables tracking of full articulated joint angles at state-of-the-art accuracy and temporal stability with a very low number of cameras.
3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce…
Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes: The Importance of Multiple Scene Constraints
This paper leverages state-of-the-art deep multi-task neural networks and parametric human and scene modeling towards a fully automatic monocular visual sensing system for multiple interacting people, which infers the 2D and 3D pose and shape of multiple people from a single image.
Time-Contrastive Networks: Self-Supervised Learning from Video
A self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints is proposed. It is demonstrated that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm.