Corpus ID: 229297513

Human Mesh Recovery from Multiple Shots

@article{Pavlakos2020HumanMR,
  title={Human Mesh Recovery from Multiple Shots},
  author={Georgios Pavlakos and Jitendra Malik and Angjoo Kanazawa},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.09843}
}
Videos from edited media like movies are a useful yet under-explored source of information. The rich variety of appearance and interactions between humans, depicted over a large temporal context in these films, could be a valuable source of data. However, this richness comes at the expense of fundamental challenges, such as abrupt shot changes and close-up shots of actors with heavy truncation, which limit the applicability of existing human 3D understanding methods. In this paper, we…
Playing for 3D Human Recovery
TLDR
This work contributes GTA-Human, a mega-scale and highly diverse 3D human dataset generated with the GTAV game engine, and systematically investigates the performance of various methods under a wide spectrum of real-world variations, e.g. camera angles, poses, and occlusions.
HPOF: 3D Human Pose Recovery from Monocular Video with Optical Flow
TLDR
HPOF is introduced, a novel deep neural network that reconstructs 3D human motion from a monocular video; it not only improves the accuracy of the 3D poses but also ensures a realistic body structure throughout the video.
Recovering 3D Human Mesh from Monocular Images: A Survey
TLDR
This is the first survey to focus on the task of monocular 3D human mesh recovery; it starts with an introduction of body models and then elaborates on recovery frameworks and training objectives, providing in-depth analyses of their strengths and weaknesses.
Leveraging MoCap Data for Human Mesh Recovery
TLDR
It is found that fine-tuning image-based models with synthetic renderings from MoCap data can increase their performance by providing them with a wider variety of poses, textures, and backgrounds, and it is shown that simply fine-tuning the batch-normalization layers of the model is enough to achieve large gains.
FLEX: Parameter-free Multi-view 3D Human Motion Reconstruction
TLDR
This work introduces FLEX (Free muLti-view rEconstruXion), an end-to-end parameter-free multi-view model that outperforms state-of-the-art methods that are not parameter-free; in the absence of camera parameters it outperforms them by a large margin, while obtaining comparable results when camera parameters are available.
Learning Where to Cut from Edited Videos
TLDR
This work validates, with a user study, that there is indeed a consensus among human viewers about good and bad cut moments, and proposes a contrastive learning framework to train a 3D ResNet model to predict good regions to cut.
FLEX: Extrinsic Parameter-free Multi-view 3D Human Motion Reconstruction
TLDR
FLEX (Free muLti-view rEconstruXion) is an end-to-end extrinsic-parameter-free multi-view model that outperforms state-of-the-art methods that are not ep-free; in the absence of camera parameters it outperforms them by a large margin, while obtaining comparable results when camera parameters are available.
MovieCuts: A New Dataset and Benchmark for Cut Type Recognition
TLDR
The cut type recognition task, which requires modeling multi-modal information, is introduced, and a large-scale dataset called MovieCuts is constructed, containing more than 170K video clips labeled among ten cut types.
Tracking People with 3D Representations
TLDR
A method is developed which, in addition to extracting the 3D geometry of a person as a SMPL mesh, also extracts appearance as a texture map on the triangles of the mesh; this serves as a 3D representation of appearance that is robust to viewpoint and pose changes.
Mesh Graphormer
TLDR
Experimental results show that the proposed method, Mesh Graphormer, significantly outperforms the previous state-of-the-art methods on multiple benchmarks, including Human3.6M, 3DPW, and FreiHAND datasets.

References

Showing 1-10 of 54 references
Learning 3D Human Dynamics From Video
TLDR
The approach is designed so that it can learn from videos with 2D pose annotations in a semi-supervised manner, and it obtains state-of-the-art performance on the 3D prediction task without any fine-tuning.
Motion Capture from Internet Videos
TLDR
This work proposes a novel optimization-based framework and experimentally demonstrates its ability to recover much more precise and detailed motion from multiple videos than monocular motion capture methods.
Human Mesh Recovery from Monocular Images via a Skeleton-disentangled Representation
TLDR
A skeleton-disentangled framework is proposed that divides this task into multiple levels of spatial and temporal granularity in a decoupled manner, together with an effective and pluggable "disentangling the skeleton from the details" (DSD) module.
Full-Body Awareness from Partial Observations
TLDR
A simple but highly effective self-training framework is proposed that adapts human 3D mesh recovery systems to consumer videos, and its application to two recent systems is demonstrated.
Person Search in Videos with One Portrait Through Visual and Temporal Links
TLDR
A novel framework is proposed which takes into account the identity invariance along a tracklet, allowing person identities to be propagated via both visual and temporal links; it remarkably outperforms mainstream person re-identification methods.
Exploiting Temporal Context for 3D Human Pose Estimation in the Wild
TLDR
A bundle-adjustment-based algorithm is presented for recovering accurate 3D human poses and meshes from monocular videos, and it is shown that retraining a single-frame 3D pose estimator on this data improves accuracy on both real-world and mocap data, as evaluated on the 3DPW and HumanEva datasets.
Delving Deep Into Hybrid Annotations for 3D Human Recovery in the Wild
TLDR
This work focuses on the challenging task of in-the-wild 3D human recovery from single images when paired 3D annotations are not fully available, and shows that incorporating dense correspondence into in-the-wild 3D human recovery is promising and competitive owing to its high efficiency and relatively low annotation cost.
TexturePose: Supervising Human Mesh Estimation With Texture Consistency
TLDR
This work proposes a natural form of supervision that capitalizes on the appearance constancy of a person across different frames (or viewpoints), and achieves state-of-the-art results among model-based pose estimation approaches on different benchmarks.
Self-supervised Learning of Motion Capture
TLDR
This work proposes a learning-based motion capture model that optimizes neural network weights to predict 3D shape and skeleton configurations given a monocular RGB video, and shows that the proposed model improves with experience and converges to low-error solutions where previous optimization methods fail.
A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation
  • Anyi Rao, Linning Xu, Dahua Lin
  • Computer Science
    2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
TLDR
This work builds MovieScenes, a large-scale video dataset containing 21K annotated scene segments from 150 movies, and proposes a local-to-global scene segmentation framework that integrates multi-modal information across three levels: clip, segment, and movie.