MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, Junsong Yuan
Recent transformer-based solutions estimate 3D human pose from a 2D keypoint sequence by considering the body joints across all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the solid inter-frame correspondence of each joint, leading to insufficient learning of spatial-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder…
Meta Agent Teaming Active Learning for Pose Estimation
A novel Meta Agent Teaming Active Learning (MATAL) framework actively selects and labels informative images for effective learning, saving around 40% of labeling effort on average compared to state-of-the-art active learning frameworks.


Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation
An improved Transformer-based architecture is proposed for 3D human pose estimation in videos; it lifts a sequence of 2D joint locations to a 3D pose and achieves state-of-the-art results with far fewer parameters.
Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction
An attention mechanism is designed to adaptively identify significant frames and tensor outputs from each deep neural network layer, leading to more accurate 3D human pose estimation from monocular video.
3D Human Pose Estimation with Spatial and Temporal Transformers
PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos with no convolutional architectures involved, is designed to comprehensively model the human joint relations within each frame as well as the temporal correlations across frames.
Exploiting Spatial-Temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks
A novel graph-based method tackles 3D human body and 3D hand pose estimation from a short sequence of 2D joint detections, explicitly incorporating domain knowledge about human hand (body) configurations into the graph convolutional operations to meet the specific demands of 3D pose estimation.
3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce
Exploiting Temporal Information for 3D Human Pose Estimation
A sequence-to-sequence network is designed from layer-normalized LSTM units, with shortcut connections from the input to the output on the decoder side and a temporal smoothness constraint imposed during training. This helps the network recover temporally consistent 3D poses over a sequence of images even when the 2D pose detector fails.
Anatomy-Aware 3D Human Pose Estimation With Bone-Based Pose Decomposition
This work proposes a new solution to 3D human pose estimation in videos that draws inspiration from human skeleton anatomy and decomposes the task into bone direction prediction and bone length prediction, from which the 3D joint locations can be completely derived.
Ordinal Depth Supervision for 3D Human Pose Estimation
This work proposes to use a weaker supervision signal provided by the ordinal depths of human joints. It achieves new state-of-the-art performance on the relevant benchmarks and validates the effectiveness of ordinal depth supervision for 3D human pose estimation.
Context Modeling in 3D Human Pose Estimation: A Unified Perspective
This work proposes ContextPose, which is based on an attention mechanism that allows soft limb-length constraints to be enforced in a deep network. It effectively reduces the chance of absurd 3D pose estimates with incorrect limb lengths and achieves state-of-the-art results on two benchmark datasets.
Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation
This work proposes a hop-aware hierarchical channel-squeezing fusion layer that effectively extracts relevant information from neighboring nodes while suppressing undesired noise in GNN learning, together with a temporal-aware dynamic graph construction procedure that is robust and effective for 3D pose estimation.
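
The bone-based decomposition summarized in the Anatomy-Aware entry above states that 3D joint locations can be completely derived from bone directions and bone lengths. The following is a minimal sketch of that geometric relationship only, not the paper's network or training procedure. The five-joint skeleton `PARENTS`, the function names, and the assumptions that every parent precedes its children and that no bone has zero length are all hypothetical choices made for illustration.

```python
import math

# Hypothetical 5-joint skeleton: PARENTS[j] is the parent of joint j, -1 marks the root.
# Assumption: every parent index is smaller than its child's index.
PARENTS = [-1, 0, 1, 0, 3]


def decompose(joints, parents):
    """Split 3D joint positions into per-bone lengths and unit directions."""
    lengths, directions = {}, {}
    for j, p in enumerate(parents):
        if p < 0:
            continue  # the root joint has no incoming bone
        bone = [c - pc for c, pc in zip(joints[j], joints[p])]
        length = math.sqrt(sum(b * b for b in bone))
        lengths[j] = length
        # Assumption: length > 0, otherwise the direction is undefined.
        directions[j] = [b / length for b in bone]
    return lengths, directions


def reconstruct(root, lengths, directions, parents):
    """Rebuild joint positions by walking the skeleton from the root outward."""
    joints = [None] * len(parents)
    for j, p in enumerate(parents):
        if p < 0:
            joints[j] = list(root)
        else:
            joints[j] = [pc + lengths[j] * d
                         for pc, d in zip(joints[p], directions[j])]
    return joints


if __name__ == "__main__":
    pose = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 2.0],
            [3.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
    bone_lengths, bone_dirs = decompose(pose, PARENTS)
    print(reconstruct(pose[0], bone_lengths, bone_dirs, PARENTS))
```

Given a root position, each joint is its parent's position plus the bone length times the bone's unit direction, so predicting the two quantities separately loses no information about the pose.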