Skeletor: Skeletal Transformers for Robust Body-Pose Estimation

  • Tao Jiang, Necati Cihan Camgoz, R. Bowden
  • Published 23 April 2021
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Predicting 3D human pose from a single monoscopic video can be highly challenging due to factors such as low resolution, motion blur and occlusion, in addition to the fundamental ambiguity in estimating 3D from 2D. Approaches that directly regress the 3D pose from independent images can be particularly susceptible to these factors and result in jitter, noise and/or inconsistencies in skeletal estimation. Much of which can be overcome if the temporal evolution of the scene and skeleton are taken…

Leveraging MoCap Data for Human Mesh Recovery
Fine-tuning image-based models with synthetic renderings from MoCap data is found to increase their performance by providing them with a wider variety of poses, textures and backgrounds, and simply fine-tuning the batch normalization layers of the model is shown to be enough to achieve large gains.
A Survey on Vision Transformer
This paper reviews vision transformer models by categorizing them by task and analyzing their advantages and disadvantages, and takes a brief look at the self-attention mechanism in computer vision, as it is the base component of the transformer.
BBC-Oxford British Sign Language Dataset
This work describes several strengths and limitations of the data from the perspectives of machine learning and linguistics, notes sources of bias present in the dataset, and discusses potential applications of BOBSL in the context of sign language technology.


A Simple Yet Effective Baseline for 3d Human Pose Estimation
The results indicate that a large portion of the error of modern deep 3D pose estimation systems stems from their visual analysis, and suggest directions to further advance the state of the art in 3D human pose estimation.
Weakly-Supervised 3D Pose Estimation from a Single Image using Multi-View Consistency
We present a novel data-driven regularizer for weakly-supervised learning of 3D human pose estimation that eliminates the drift problem that affects existing approaches. We do this by moving the…
OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields
OpenPose is released, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints, and the first combined body and foot keypoint detector, based on an internal annotated foot dataset.
Monocular Total Capture: Posing Face, Body, and Hands in the Wild
This work presents the first method to capture the 3D total motion of a target person from a monocular view input, and leverages a 3D deformable human model to reconstruct total body pose from the CNN outputs with the aid of the pose and shape prior in the model.
RMPE: Regional Multi-person Pose Estimation
This paper proposes a novel regional multi-person pose estimation (RMPE) framework to facilitate pose estimation in the presence of inaccurate human bounding boxes, and achieves 76.7 mAP on the MPII (multi-person) dataset.
Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields
We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn…
Pose Machines: Articulated Pose Estimation via Inference Machines
This paper builds upon the inference machine framework and presents a method for articulated human pose estimation that incorporates rich spatial interactions among multiple parts and information across parts of different scales, and that outperforms the state of the art on these benchmarks.
Convolutional Pose Machines
This work designs a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference in structured prediction tasks such as articulated pose estimation.
Expressive Body Capture: 3D Hands, Face, and Body From a Single Image
This work uses the new method, SMPLify-X, to fit SMPL-X to both controlled images and images in the wild, and evaluates 3D accuracy on a new curated dataset comprising 100 images with pseudo ground-truth.
Deep High-Resolution Representation Learning for Human Pose Estimation
This paper proposes a network that maintains high-resolution representations through the whole process of human pose estimation and empirically demonstrates the effectiveness of the network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.