Holistic 3D Scene Understanding from a Single Image with Implicit Representation

@article{Zhang2021Holistic3S,
  title={Holistic 3D Scene Understanding from a Single Image with Implicit Representation},
  author={Cheng Zhang and Zhaopeng Cui and Yinda Zhang and Bing Zeng and Marc Pollefeys and Shuaicheng Liu},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={8829-8838}
}
We present a new pipeline for holistic 3D scene understanding from a single image, which predicts object shapes, object poses, and scene layout. As this is a highly ill-posed problem, existing methods usually suffer from inaccurate estimation of both shapes and layout, especially for cluttered scenes, due to heavy occlusion between objects. We propose to utilize the latest deep implicit representation to solve this challenge. We not only propose an image-based local structured implicit… 
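The deep implicit representation referred to here models a shape as a learned function mapping a 3D query point (plus a latent code) to an occupancy value, with the surface recovered as a level set. The sketch below is illustrative only: an analytic ellipsoid stands in for the learned decoder, and all names are invented for this example.

```python
import numpy as np

def occupancy(points, latent):
    """Toy stand-in for a learned implicit decoder f(x, z) -> [0, 1].

    Here the 'shape' encoded by `latent` is just an ellipsoid whose
    semi-axes are the latent entries; a real system would run an MLP.
    """
    return (np.sum((points / latent) ** 2, axis=-1) <= 1.0).astype(float)

# Dense grid of query points; the surface is the 0.5 level set.
grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 64)] * 3, indexing="ij"), -1)
occ = occupancy(grid.reshape(-1, 3), latent=np.array([0.8, 0.5, 0.3]))
inside = occ.reshape(64, 64, 64)
print("occupied voxels:", int(inside.sum()))  # marching cubes would extract the mesh
```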
From Points to Multi-Object 3D Reconstruction
TLDR
A key-point detector is proposed that localizes objects as center points and directly predicts all object properties, including 9-DoF bounding boxes and 3D shapes, in a single forward pass, enabling lightweight reconstruction of realistic and visually pleasing shapes based on CAD models.
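For context, the 9 DoF are 3 for translation, 3 for rotation, and 3 for metric scale. The sketch below shows one plausible way to decode such a box from per-object regression outputs; the function name, Euler-angle convention, and intrinsics are assumptions, not the paper's code.

```python
import numpy as np

def decode_9dof_box(center_2d, depth, angles, scales, K):
    """Decode a 9-DoF box: 3 translation + 3 rotation + 3 scale params.

    center_2d: detected object center in pixels, depth: predicted z,
    angles: predicted Euler angles (rad), scales: box extents,
    K: 3x3 camera intrinsics. Layout is illustrative, not the paper's.
    """
    # Back-project the 2D center to a 3D translation.
    t = depth * np.linalg.inv(K) @ np.array([center_2d[0], center_2d[1], 1.0])
    # Euler angles -> rotation matrix (ZYX order, one common convention).
    cx, cy, cz = np.cos(angles); sx, sy, sz = np.sin(angles)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx, t, np.asarray(scales)

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R, t, s = decode_9dof_box((350, 260), 2.4, np.array([0.0, 0.3, 0.0]), (0.6, 0.4, 0.9), K)
print(t, s)
```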
Joint 3D Reconstruction of Human and Objects in a Dynamic Scene using Monocular Video
DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization
TLDR
A novel method is proposed for panoramic 3D scene understanding that recovers the 3D room layout and the shape, pose, position, and semantic category of each object from a single full-view panorama image.
Neural Rendering in a Room: Amodal 3D Understanding and Free-Viewpoint Rendering for the Closed Scene Composed of Pre-Captured Objects
TLDR
The experiments demonstrate that the two-stage design achieves robust 3D scene understanding and outperforms competing methods by a large margin, and it is shown that the realistic free-viewpoint rendering enables various applications, including scene touring and editing.
RangeUDF: Semantic Surface Reconstruction from 3D Point Clouds
TLDR
The key to the approach is a range-aware unsigned distance function together with a surface-oriented semantic segmentation module; it demonstrates superior generalization across multiple unseen datasets, which is nearly impossible for existing approaches.
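An unsigned distance function (UDF) stores only the magnitude of the distance to the surface, so surface points can be recovered by sliding queries down the UDF gradient. A toy sketch follows, with an analytic sphere UDF standing in for the learned network; everything here is illustrative, not RangeUDF's implementation.

```python
import numpy as np

def udf(p):
    """Toy unsigned distance to a unit sphere; a learned network in practice."""
    return np.abs(np.linalg.norm(p, axis=-1) - 1.0)

def project_to_surface(points, steps=10, eps=1e-4):
    """Move query points onto the zero level set via gradient descent on the UDF."""
    p = points.copy()
    for _ in range(steps):
        # Finite-difference gradient of the UDF at p.
        grad = np.stack([(udf(p + eps * np.eye(3)[i]) - udf(p - eps * np.eye(3)[i]))
                         / (2 * eps) for i in range(3)], axis=-1)
        p -= udf(p)[..., None] * grad  # step by the distance, along -gradient
    return p

pts = np.random.randn(5, 3) * 0.5
print(udf(project_to_surface(pts)))  # ~0 after projection
```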
Point Scene Understanding via Disentangled Instance Mesh Reconstruction
TLDR
This work proposes a DIMR framework that leverages a mesh-aware latent code space to disentangle the processes of shape completion and mesh generation, relieving the ambiguity caused by the incomplete point observations.
GPV-Pose: Category-level Object Pose Estimation via Geometry-guided Point-wise Voting
TLDR
GPV-Pose is proposed, a novel framework for robust category-level pose estimation that harnesses geometric insights to enhance the learning of category-level pose-sensitive features and introduces a decoupled confidence-driven rotation representation, which allows geometry-aware recovery of the associated rotation matrix.
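One common way to realize a decoupled rotation representation is to predict rotation axes separately and re-orthogonalize them into a valid matrix. The sketch below shows plain Gram-Schmidt assembly and omits GPV-Pose's learned per-axis confidences; names and weighting are assumptions.

```python
import numpy as np

def rotation_from_two_axes(a1, a2):
    """Assemble a rotation matrix from two (noisy) predicted axes.

    Gram-Schmidt orthogonalization in the spirit of decoupled / 6D
    rotation representations; GPV-Pose additionally weights the axes
    by learned confidences, which is omitted here.
    """
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - (b1 @ a2) * b1          # remove the component along b1
    b2 /= np.linalg.norm(b2)
    b3 = np.cross(b1, b2)             # right-handed third axis
    return np.stack([b1, b2, b3], axis=1)

R = rotation_from_two_axes(np.array([0.9, 0.1, 0.0]), np.array([0.1, 1.0, 0.2]))
print(np.round(R.T @ R, 6))  # identity => valid rotation
```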
Neural Message Passing for Objective-Based Uncertainty Quantification and Optimal Experimental Design
TLDR
This work demonstrates for the first time that one can design accurate surrogate models for efficient objective-UQ via MOCU based on a data-driven approach, and adopts a neural message passing model for surrogate modeling that incorporates a novel axiomatic constraint loss that penalizes an increase in the estimated system uncertainty.
Human-Aware Object Placement for Visual Environment Reconstruction
TLDR
This work demonstrates that human-scene interactions (HSIs) can be leveraged to improve the 3D reconstruction of a scene from a monocular RGB video, and shows that the scene reconstruction can be used to refine the initial 3D human pose and shape (HPS) estimation.
CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation
TLDR
This paper presents a simple one-stage approach to predict both the 3D shape and estimate the 6D pose and size jointly in a bounding-box-free manner, which significantly outperforms all shape completion and categorical 6D pose and size estimation baselines on the multi-object ShapeNet and NOCS datasets respectively.
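A bounding-box-free, single-shot pipeline of this kind typically reads objects straight off a center heatmap and gathers a per-pixel shape/pose embedding at each peak. A simplified sketch follows; the heatmap, embedding size, and function names are placeholders, not CenterSnap's code.

```python
import numpy as np

def extract_centers(heatmap, codes, thresh=0.5):
    """Pick local maxima of a center heatmap and gather the shape/pose
    code predicted at each peak (single-shot style, simplified)."""
    H, W = heatmap.shape
    peaks = []
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            patch = heatmap[y - 1:y + 2, x - 1:x + 2]
            if heatmap[y, x] >= thresh and heatmap[y, x] == patch.max():
                peaks.append(((y, x), codes[y, x]))
    return peaks

heat = np.zeros((32, 32)); heat[10, 12] = 0.9; heat[20, 5] = 0.7
latents = np.random.randn(32, 32, 64)  # per-pixel shape+pose embedding
for (y, x), z in extract_centers(heat, latents):
    print("object at", (y, x), "code dim", z.shape)
```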
...

References

SHOWING 1-10 OF 59 REFERENCES
Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes From a Single Image
TLDR
This paper proposes an end-to-end solution to jointly reconstruct room layout, object bounding boxes and meshes from a single image, and argues that understanding the context of each component can assist the task of parsing the others, which enables joint understanding and reconstruction.
SceneCAD: Predicting Object Alignments and Layouts in RGB-D Scans
TLDR
A message-passing graph neural network is proposed to model the inter-relationships between objects and layout, guiding the generation of a globally consistent object alignment in a scene by considering the global scene layout.
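A single round of the object/layout message passing described here could look like the generic GNN update below; the feature sizes, aggregation scheme, and weights are assumptions for illustration, not SceneCAD's architecture.

```python
import numpy as np

def message_passing_step(node_feats, edges, W_msg, W_upd):
    """One synchronous message-passing round over a scene graph.

    node_feats: (N, D) features for object and layout nodes,
    edges: list of (src, dst) pairs, W_msg/W_upd: learned weights
    (random here). Mean-aggregates neighbor messages, then updates.
    """
    N, D = node_feats.shape
    agg = np.zeros_like(node_feats)
    count = np.zeros((N, 1))
    for s, d in edges:
        agg[d] += node_feats[s] @ W_msg   # message from s to d
        count[d] += 1
    agg /= np.maximum(count, 1)
    return np.tanh(node_feats @ W_upd + agg)

feats = np.random.randn(4, 16)            # 3 object nodes + 1 layout node
edges = [(0, 3), (1, 3), (2, 3), (3, 0), (3, 1), (3, 2)]
feats = message_passing_step(feats, edges, np.random.randn(16, 16) * 0.1,
                             np.random.randn(16, 16) * 0.1)
print(feats.shape)
```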
Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation
TLDR
An end-to-end model is proposed that simultaneously solves all three tasks in real time given only a single RGB image and significantly outperforms prior approaches on 3D object detection, 3D layout estimation, 3D camera pose estimation, and holistic scene understanding.
Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image
TLDR
A Holistic Scene Grammar (HSG) is introduced to represent the 3D scene structure, which characterizes a joint distribution over the functional and geometric space of indoor scenes, and significantly outperforms prior methods on 3D layout estimation, 3D object detection, and holistic scene understanding.
Holistic++ Scene Understanding: Single-View 3D Holistic Scene Parsing and Human Pose Estimation With Human-Object Interaction and Physical Commonsense
We propose a new 3D holistic++ scene understanding problem, which jointly tackles two tasks from a single-view image: (i) holistic scene parsing and reconstruction, i.e., 3D estimations of object bounding…
SeeThrough: Finding Chairs in Heavily Occluded Indoor Scene Images
TLDR
This work uses a neural network trained on real indoor annotated images to extract 2D keypoints, and solves a global selection problem among 3D candidates using pairwise co-occurrence statistics discovered from a large 3D scene database.
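The global selection problem mentioned here scores candidate sets by unary detection evidence plus pairwise co-occurrence terms. A brute-force miniature of that objective follows; the paper operates at a much larger scale with statistics mined from a 3D scene database.

```python
import numpy as np
from itertools import combinations

def select_candidates(unary, pairwise, k):
    """Pick k candidates maximizing unary detection scores plus
    pairwise co-occurrence scores (brute force; illustrative only)."""
    n = len(unary)
    best, best_score = None, -np.inf
    for subset in combinations(range(n), k):
        score = sum(unary[i] for i in subset)
        score += sum(pairwise[i, j] for i, j in combinations(subset, 2))
        if score > best_score:
            best, best_score = subset, score
    return best, best_score

unary = np.array([0.9, 0.2, 0.7, 0.4])     # per-candidate 3D evidence
pairwise = np.random.rand(4, 4) * 0.1      # co-occurrence statistics
print(select_candidates(unary, pairwise, 2))
```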
Understanding Indoor Scenes Using 3D Geometric Phrases
TLDR
A hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification is presented.
Local Deep Implicit Functions for 3D Shape
TLDR
Local Deep Implicit Functions (LDIF), a 3D shape representation that decomposes space into a structured set of learned implicit functions that provides higher surface reconstruction accuracy than the state-of-the-art (OccNet), while requiring fewer than 1% of the network parameters.
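LDIF's decomposition can be sketched as a sum of Gaussian-windowed local implicit functions, one per shape element. Below, tiny analytic functions stand in for the learned per-element decoders; this is an illustrative sketch, not the released implementation.

```python
import numpy as np

def ldif_value(x, centers, radii, local_fns):
    """Evaluate an LDIF-style function: a sum of Gaussian-windowed
    local implicit functions, one per shape element."""
    total = 0.0
    for c, r, f in zip(centers, radii, local_fns):
        w = np.exp(-np.sum((x - c) ** 2) / (2 * r ** 2))  # Gaussian support
        total += w * f(x - c)                             # local decoder
    return total

# Two elements; the real LDIF learns ~32 elements plus tiny per-element nets.
centers = [np.zeros(3), np.array([1.0, 0, 0])]
radii = [0.5, 0.3]
local_fns = [lambda p: 0.4 - np.linalg.norm(p)] * 2  # toy local SDFs
print(ldif_value(np.array([0.1, 0.0, 0.0]), centers, radii, local_fns))
```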
Learning Efficient Point Cloud Generation for Dense 3D Object Reconstruction
TLDR
This paper uses 2D convolutional operations to predict the 3D structure from multiple viewpoints and jointly apply geometric reasoning with 2D projection optimization, and introduces the pseudo-renderer, a differentiable module to approximate the true rendering operation, to synthesize novel depth maps for optimization.
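The pseudo-renderer mentioned here approximates true rendering by projecting predicted points into a depth buffer. The toy version below uses a hard per-pixel minimum, whereas a differentiable module would relax it; all names and shapes are assumptions.

```python
import numpy as np

def pseudo_render_depth(points, K, H, W):
    """Project a point cloud to a depth map, keeping the nearest
    point per pixel (z-buffering). Simplified, non-differentiable."""
    depth = np.full((H, W), np.inf)
    uvw = (K @ points.T).T                     # project with intrinsics
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)
    for (u, v), z in zip(uv, points[:, 2]):
        if 0 <= v < H and 0 <= u < W:
            depth[v, u] = min(depth[v, u], z)  # keep closest surface
    return depth

K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
pts = np.random.rand(500, 3) + np.array([0, 0, 2.0])  # points in front of camera
d = pseudo_render_depth(pts, K, 64, 64)
print(np.isfinite(d).sum(), "pixels covered")
```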
DOPS: Learning to Detect 3D Objects and Predict Their 3D Shapes
TLDR
The core novelty of the DOPS method is a fast, single-pass architecture that both detects objects in 3D and estimates their shapes, and is able to extract shapes without access to ground-truth shape information in the target dataset.
...