Holistic++ Scene Understanding: Single-View 3D Holistic Scene Parsing and Human Pose Estimation With Human-Object Interaction and Physical Commonsense

Yixin Chen, Siyuan Huang, Tao Yuan, Siyuan Qi, Yixin Zhu, Song-Chun Zhu
2019 IEEE/CVF International Conference on Computer Vision (ICCV)
We propose a new 3D holistic++ scene understanding problem, which jointly tackles two tasks from a single-view image: (i) holistic scene parsing and reconstruction---3D estimations of object bounding boxes, camera pose, and room layout, and (ii) 3D human pose estimation. The intuition behind this is to leverage the coupled nature of these two tasks to improve the granularity and performance of scene understanding. We propose to exploit two critical and essential connections between these two tasks…


3DP3: 3D Scene Perception via Probabilistic Programming
3DP3 enables scene understanding that is aware of 3D shape, occlusion, and contact structure and is more accurate at 6DoF object pose estimation from real images than deep learning baselines.
Holistic 3D Human and Scene Mesh Estimation from Single View Images
  • Zhenzhen Weng, S. Yeung
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
This work proposes a holistically trainable model that perceives the 3D scene from a single RGB image, estimates the camera pose and room layout, and reconstructs both human body and object meshes. It is the first model to output both object and human predictions at the mesh level, and it performs joint optimization over the scene and human poses.
Towards High-Fidelity Single-view Holistic Reconstruction of Indoor Scenes
This work proposes an instance-aligned implicit function (InstPIFu) for detailed object reconstruction in holistic 3D indoor scenes, covering both the room background and indoor objects from single-view images, and outperforms existing approaches on both background and foreground object reconstruction.
Single-Shot Scene Reconstruction
A novel scene reconstruction method to infer a fully editable and re-renderable model of a 3D road scene from a single image; it is shown that this reconstruction can be used in an analysis-by-synthesis setting via differentiable rendering.
CHORE: Contact, Human and Object REconstruction from a single RGB image
This work introduces CHORE, a novel method that learns to jointly reconstruct human and object from a single image, significantly outperforming the SOTA, and proposes a simple yet effective depth-aware scaling that allows more efficient shape learning on real data.
Populating 3D Scenes by Learning Human-Scene Interaction
A novel Human-Scene Interaction model that encodes proximal relationships, called POSA for "Pose with prOximitieS and contActs", and shows that its learned representation of body-scene interaction supports monocular human pose estimation that is consistent with a 3D scene, improving on the state of the art.
DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization
A novel method for panoramic 3D scene understanding which recovers the 3D room layout and the shape, pose, position, and semantic category for each object from a single full-view panorama image is proposed.
D3D-HOI: Dynamic 3D Human-Object Interactions from Videos
This work introduces D3D-HOI, a dataset of monocular videos with ground-truth annotations of 3D object pose, shape, and part motion during human-object interactions, demonstrating that human-object relations can significantly reduce the ambiguity of articulated object reconstruction from challenging real-world videos.
Generating 3D People in Scenes Without People
The approach is able to synthesize realistic and expressive 3D human bodies that naturally interact with the 3D environment, which will be useful for numerous applications, e.g., generating training data for human pose estimation, video games, and VR/AR.
Reconstructing Interactive 3D Scenes by Panoptic Mapping and CAD Model Alignments
This paper reconstructs an interactive scene from an RGB-D data stream, capturing the semantics and geometry of objects and layouts with a 3D volumetric panoptic mapping module, and object affordances and contextual relations by reasoning over physical common sense among objects, organized in a graph-based scene representation.


Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes: The Importance of Multiple Scene Constraints
This paper leverages state-of-the-art deep multi-task neural networks and parametric human and scene modeling toward a fully automatic monocular visual sensing system for multiple interacting people, inferring the 2D and 3D pose and shape of multiple people from a single image.
Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image
A Holistic Scene Grammar (HSG) is introduced to represent the 3D scene structure, which characterizes a joint distribution over the functional and geometric space of indoor scenes, and significantly outperforms prior methods on 3D layout estimation, 3D object detection, and holistic scene understanding.
Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation
An end-to-end model is proposed that simultaneously solves all three tasks in real time given only a single RGB image, significantly outperforming prior approaches on 3D object detection, 3D layout estimation, 3D camera pose estimation, and holistic scene understanding.
Semantic Scene Completion from a Single Depth Image
The semantic scene completion network (SSCNet) is introduced, an end-to-end 3D convolutional network that takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum.
Complete 3D Scene Parsing from Single RGBD Image
This paper aims to predict the full 3D parse of both visible and occluded portions of the scene from one RGBD image, and proposes a retrieval scheme that uses convolutional neural networks to classify regions and retrieve objects with similar shapes.
Human-Centric Indoor Scene Synthesis Using Stochastic Grammar
We present a human-centric method to sample and synthesize 3D room layouts and 2D images thereof, yielding large-scale 2D/3D image data with perfect per-pixel ground truth. An attributed spatial…
Action-driven 3D indoor scene evolution
We introduce a framework for action-driven evolution of 3D indoor scenes, where the goal is to simulate how scenes are altered by human actions, and specifically by object placements necessitated by…
Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics
An approach to scene understanding that reasons about the physical stability of objects from point clouds, using a novel disconnectivity graph (DG) to represent the energy landscape and a Swendsen-Wang Cut (MCMC) method for optimization.
SceneGrok: inferring action maps in 3D environments
This paper uses RGB-D sensors to capture dense 3D reconstructions of real-world scenes, and trains a classifier which can transfer interaction knowledge to unobserved 3D scenes and demonstrates prediction of action maps in both 3D scans and virtual scenes.
SUN RGB-D: A RGB-D scene understanding benchmark suite
This paper introduces an RGB-D benchmark suite aimed at advancing the state of the art in all major scene understanding tasks, presenting a dataset that enables training data-hungry algorithms for scene-understanding tasks, evaluating them with meaningful 3D metrics, avoiding overfitting to a small test set, and studying cross-sensor bias.