Recent Advances of Monocular 2D and 3D Human Pose Estimation: A Deep Learning Perspective

@article{Liu2022RecentAO,
  title={Recent Advances of Monocular 2D and 3D Human Pose Estimation: A Deep Learning Perspective},
  author={Wu Liu and Qian Bao and Yu Sun and Tao Mei},
  journal={ACM Computing Surveys (CSUR)},
  year={2022}
}
Estimating human pose from a monocular camera is an emerging research topic in the computer vision community with many applications. Recently, benefiting from deep learning, a significant body of research has advanced monocular human pose estimation in both 2D and 3D. Although some works have summarized the different approaches, it remains challenging for researchers to gain an in-depth view of how these approaches work from 2D…
Recovering 3D Human Mesh from Monocular Images: A Survey
TLDR
This is the first survey to focus on the task of monocular 3D human mesh recovery. It starts with an introduction of body models, then elaborates on recovery frameworks and training objectives with in-depth analyses of their strengths and weaknesses.
CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation
TLDR
A pseudo-ground-truth annotator based on CLIFF is proposed, which provides high-quality 3D annotations for in-the-wild 2D datasets and offers crucial full supervision for regression-based methods.
Monocular, One-stage, Regression of Multiple 3D People
TLDR
ROMP is the first real-time implementation of monocular multi-person 3D mesh regression, and achieves superior performance on the challenging multi-person benchmarks, including 3DPW and CMU Panoptic.
DeciWatch: A Simple Baseline for 10x Efficient 2D and 3D Pose Estimation
TLDR
Comprehensive experimental results on three video-based human pose estimation, body mesh recovery tasks and efficient labeling in videos with four datasets validate the efficiency and effectiveness of DeciWatch.
Adaptive Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation
TLDR
A unified framework to adaptively handle varying view numbers and video length without camera calibration in 3D Human Pose Estimation (HPE), which obtains competitive results and generalizes well to dynamic capture with an arbitrary number of unseen views.
Learning Human Kinematics by Modeling Temporal Correlations between Joints for Video-based Human Pose Estimation
TLDR
A plug-and-play kinematics modeling module (KMM) based on the domain-cross attention mechanism to model the temporal correlation between joints across different frames explicitly and a network based on this method for obtaining the initial positions of joints by combining pose features and initial locations of joints.
Gait Recognition in the Wild with Dense 3D Representations and A Benchmark
TLDR
This paper proposes a novel framework to explore the 3D Skinned Multi-Person Linear (SMPL) model of the human body for gait recognition, named SMPLGait, and provides 3D SMPL models recovered from video frames which can provide dense 3D information of body shape, viewpoint, and dynamics.
PosePipe: Open-Source Human Pose Estimation Pipeline for Clinical Research
TLDR
A human pose estimation pipeline is developed that facilitates running state-of-the-art algorithms on data acquired in a clinical context and analyzing large numbers of videos of human movement, ranging from gait laboratory analyses, to clinic and therapy visits, to people in the community.
3DMesh-GAR: 3D Human Body Mesh-Based Method for Group Activity Recognition
TLDR
3DMesh-GAR is proposed, a novel approach to 3D human body Mesh-based Group Activity Recognition, which relies on a body center heatmap, camera map, and mesh parameter map instead of the complex and noisy 3D skeleton of each person of the input frames.
Neural Architecture Search for Joint Human Parsing and Pose Estimation
TLDR
This work proposes to search for an efficient network architecture (NPPNet) to tackle two tasks at the same time and embed NAS units in both multi-scale feature interaction and high-level feature fusion to establish optimal connections between two tasks.

References

Human Pose Estimation from Monocular Images: A Comprehensive Survey
TLDR
A comprehensive survey of human pose estimation from monocular images is carried out including milestone works and recent advancements and splits the problem into several modules: feature extraction and description, human body models, and modeling methods.
Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video
TLDR
This paper addresses the challenge of 3D full-body human pose estimation from a monocular image sequence with a novel approach that integrates a sparsity-driven 3D geometric prior and temporal smoothness and outperforms a publicly available 2D pose estimation baseline on the challenging PennAction dataset.
Learning Monocular 3D Human Pose Estimation from Multi-view Images
TLDR
This paper trains the system to predict the same pose in all views, and proposes a method to estimate camera pose jointly with human pose, which lets us utilize multiview footage where calibration is difficult, e.g., for pan-tilt or moving handheld cameras.
Monocular human pose estimation: A survey of deep learning-based methods
A Simple Yet Effective Baseline for 3d Human Pose Estimation
TLDR
The results indicate that a large portion of the error of modern deep 3d pose estimation systems stems from their visual analysis, and suggests directions to further advance the state of the art in 3d human pose estimation.
Multiview-Consistent Semi-Supervised Learning for 3D Human Pose Estimation
TLDR
This work proposes Multiview-Consistent Semi Supervised Learning (MCSS) framework that utilizes similarity in pose information from unannotated, uncalibrated but synchronized multi-view videos of human motions as additional weak supervision signal to guide 3D human pose regression.
Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes: The Importance of Multiple Scene Constraints
TLDR
This paper leverages state-of-the-art deep multi-task neural networks and parametric human and scene modeling, towards a fully automatic monocular visual sensing system for multiple interacting people, which infers the 2d and 3d pose and shape of multiple people from a single image.
Learning 3D Human Pose from Structure and Motion
TLDR
This work proposes two anatomically inspired loss functions and uses them with a weakly-supervised learning framework to jointly learn from large-scale in-the-wild 2D and indoor/synthetic 3D data and presents a simple temporal network that exploits temporal and structural cues present in predicted pose sequences to temporally harmonize the pose estimations.
Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis
TLDR
This work proposes a self-supervised learning framework to disentangle variations from unlabeled video frames, and demonstrates state-of-the-art weakly-supervised 3D pose estimation performance on both Human3.6M and MPI-INF-3DHP datasets.