MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

@article{Zhang2022MixSTESM,
  title={MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video},
  author={Jinlu Zhang and Zhigang Tu and Jianyu Yang and Yujin Chen and Junsong Yuan},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.00859}
}
Recent transformer-based solutions have been introduced to estimate 3D human pose from 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, the previous methods cannot efficiently model the solid inter-frame correspondence of each joint, leading to insufficient learning of spatial-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder…
Meta Agent Teaming Active Learning for Pose Estimation
TLDR
A novel Meta Agent Teaming Active Learning (MATAL) framework that actively selects and labels informative images for effective learning, saving around 40% of labeling effort on average compared to state-of-the-art active learning frameworks.

References

Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation
TLDR
An improved Transformer-based architecture is proposed for 3D human pose estimation in videos to lift a sequence of 2D joint locations to a 3D pose, and achieves state-of-the-art results with much fewer parameters.
Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction
TLDR
An attentional mechanism is designed to adaptively identify significant frames and tensor outputs from each deep neural net layer, leading to a more optimal estimation of 3D human pose from a monocular video.
3D Human Pose Estimation with Spatial and Temporal Transformers
TLDR
PoseFormer is presented, a purely transformer-based approach for 3D human pose estimation in videos without convolutional architectures involved, designed to comprehensively model the human joint relations within each frame as well as the temporal correlations across frames.
Exploiting Spatial-Temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks
TLDR
A novel graph-based method to tackle the problem of 3D human body and 3D hand pose estimation from a short sequence of 2D joint detections, where domain knowledge about the human hand (body) configurations is explicitly incorporated into the graph convolutional operations to meet the specific demand of the 3D pose estimation.
Trajectory Space Factorization for Deep Video-Based 3D Human Pose Estimation
TLDR
A deep learning-based framework that utilizes matrix factorization for sequential 3D human pose estimation and demonstrates its effectiveness on long sequences by achieving state-of-the-art performance on multiple benchmark datasets.
3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce
Exploiting Temporal Information for 3D Human Pose Estimation
TLDR
A sequence-to-sequence network is designed from layer-normalized LSTM units, with shortcut connections from the input to the output on the decoder side and a temporal smoothness constraint imposed during training, which helps the network recover temporally consistent 3D poses over a sequence of images even when the 2D pose detector fails.
Anatomy-Aware 3D Human Pose Estimation With Bone-Based Pose Decomposition
TLDR
This work proposes a new solution to 3D human pose estimation in videos by drawing inspiration from the human skeleton anatomy and decomposes the task into bone direction prediction and bone length prediction, from which the 3D joint locations can be completely derived.
Ordinal Depth Supervision for 3D Human Pose Estimation
TLDR
This work proposes to use a weaker supervision signal provided by the ordinal depths of human joints, which achieves new state-of-the-art performance on the relevant benchmarks and validates the effectiveness of ordinal depth supervision for 3D human pose.
TransPose: Towards Explainable Human Pose Estimation by Transformer
TLDR
An explainable model named TransPose is constructed based on the Transformer architecture and low-level convolutional blocks; it achieves state-of-the-art performance on the COCO dataset while being more interpretable, lightweight, and efficient than mainstream fully convolutional architectures.