Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation

Jin-Hui Park, Kimathi Kaai, Saad Hossain, Norikatsu Sumi, Sirisha Rambhatla, Paul W. Fieguth
Egocentric 3D human pose estimation (HPE) from images is challenging due to severe self-occlusions and strong distortion introduced by the fish-eye view from the head-mounted camera. Although existing works use intermediate heatmap-based representations to counter distortion with some success, addressing self-occlusion remains an open problem. In this work, we leverage information from past frames to guide our self-attention-based 3D HPE procedure, Ego-STAN. Specifically, we build a…
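
The idea of letting the current frame attend to past frames can be illustrated with plain scaled dot-product attention. The sketch below is not the Ego-STAN architecture, whose details are not given in this summary. It assumes one hypothetical feature vector per frame, uses the features directly as keys and values, and has no learned projections; the function name `attend_over_frames` and the example numbers are illustrative only.

```python
import math


def attend_over_frames(query, frame_features):
    """Scaled dot-product attention of one query vector over per-frame features.

    Returns (weights, context): the weights sum to 1 across frames, and the
    context is the attention-weighted average of the frame features.
    """
    dim = len(query)
    # Similarity of the query to each frame, scaled by sqrt(dim).
    scores = [
        sum(q * k for q, k in zip(query, feat)) / math.sqrt(dim)
        for feat in frame_features
    ]
    # Numerically stable softmax over the time axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of frame features (values are the features themselves here).
    context = [
        sum(w * feat[i] for w, feat in zip(weights, frame_features))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical 4-dimensional features for three frames (two past, one current).
frames = [
    [0.1, 0.0, 0.2, 0.1],  # frame t-2
    [0.9, 0.8, 0.1, 0.0],  # frame t-1
    [1.0, 0.9, 0.0, 0.1],  # frame t (current)
]
weights, context = attend_over_frames(frames[-1], frames)
```

With these made-up features, the current frame puts most of its weight on itself and on the similar frame t-1. In principle, this is how a past frame in which a joint was still visible could contribute to the estimate for a frame in which it is occluded.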


xR-EgoPose: Egocentric 3D Human Pose From an HMD Camera
A new solution to egocentric 3D body pose estimation from monocular images captured by a downward-looking fish-eye camera installed on the rim of a head-mounted virtual reality device, using a new encoder-decoder architecture with a novel dual-branch decoder designed specifically to account for the varying uncertainty in the 2D joint locations.
Mo2Cap2: Real-time Mobile 3D Motion Capture with a Cap-mounted Fisheye Camera
This work proposes the first real-time system for the egocentric estimation of 3D human body pose in a wide range of unconstrained everyday activities and achieves lower 3D joint error as well as better 2D overlay than the existing baselines.
3D Human Pose Estimation with Spatial and Temporal Transformers
PoseFormer is presented, a purely transformer-based approach for 3D human pose estimation in videos without convolutional architectures involved, designed to comprehensively model the human joint relations within each frame as well as the temporal correlations across frames.
Estimating Egocentric 3D Human Pose in Global Space
To achieve accurate and temporally stable global poses, a spatio-temporal optimization is performed over a sequence of frames by minimizing heatmap reprojection errors and enforcing local and global body motion priors learned from a mocap dataset.
Ordinal Depth Supervision for 3D Human Pose Estimation
This work proposes to use a weaker supervision signal provided by the ordinal depths of human joints, which achieves new state-of-the-art performance on the relevant benchmarks and validates the effectiveness of ordinal depth supervision for 3D human pose.
HEMlets Pose: Learning Part-Centric Heatmap Triplets for Accurate 3D Human Pose Estimation
This work attempts to address the uncertainty of lifting the detected 2D joints to the 3D space by introducing an intermediate state, Part-Centric Heatmap Triplets (HEMlets), which shortens the gap between the 2D observation and the 3D interpretation.
A Simple Yet Effective Baseline for 3d Human Pose Estimation
The results indicate that a large portion of the error of modern deep 3D pose estimation systems stems from their visual analysis, and suggest directions to further advance the state of the art in 3D human pose estimation.
Automatic Calibration of the Fisheye Camera for Egocentric 3D Human Pose Estimation from a Single Image
This paper proposes a method that first estimates 3D joint locations of a human in camera coordinates, then uses the automatic calibration to further regularize the 3D predictions and achieves state-of-the-art performance.
Structured Prediction of 3D Human Pose with Deep Neural Networks
This paper introduces a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images that relies on an overcomplete autoencoder to learn a high-dimensional latent pose representation and account for joint dependencies.
3D Human Pose Estimation = 2D Pose Estimation + Matching
This work explores 3D human pose estimation from a single RGB image using a simple architecture that reasons through intermediate 2D pose predictions, and demonstrates how this architecture is straightforward to implement with off-the-shelf 2D pose estimation systems and 3D mocap libraries.