Corpus ID: 242757047

Panoptic 3D Scene Reconstruction From a Single RGB Image

Manuel Dahnert, Ji Hou, Matthias Nießner, Angela Dai

Understanding 3D scenes from a single image is fundamental to a wide variety of tasks, such as robotics, motion planning, and augmented reality. Existing works on 3D perception from a single RGB image tend to focus on geometric reconstruction only, or on geometric reconstruction with semantic or instance segmentation. Inspired by 2D panoptic segmentation, we propose to unify the tasks of geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation into the joint task of panoptic 3D scene reconstruction.
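To illustrate what unifying these outputs means in practice (this is a generic sketch of panoptic voxel labeling, not the authors' implementation; all class ids and grid sizes below are hypothetical), a panoptic 3D label can be stored per voxel as a fused (semantic class, instance id) pair, where "stuff" classes share a single id and "things" keep per-instance ids:

```python
import numpy as np

# Hypothetical toy setup: a 4x4x4 voxel grid with per-voxel semantic
# labels and instance ids. The class ids and stuff/thing split are
# illustrative only, not taken from the paper.
STUFF_CLASSES = {1}       # e.g. 1 = wall ("stuff": no instances)
THING_CLASSES = {2, 3}    # e.g. 2 = chair, 3 = table ("things")

semantics = np.zeros((4, 4, 4), dtype=np.int64)  # 0 = empty space
instances = np.zeros((4, 4, 4), dtype=np.int64)  # 0 = no instance

semantics[0] = 1                 # one wall slab (stuff)
semantics[2, 1:3, 1:3] = 2       # a chair occupying a few voxels
instances[2, 1:3, 1:3] = 1       # ...as instance 1

def to_panoptic(semantics, instances, num_instances=256):
    """Fuse semantic and instance grids into one panoptic id per voxel.

    Stuff voxels get instance id 0, so each stuff class maps to a single
    panoptic id; thing voxels keep their instance id.
    """
    inst = np.where(np.isin(semantics, list(THING_CLASSES)), instances, 0)
    return semantics * num_instances + inst

panoptic = to_panoptic(semantics, instances)
```

Decoding is the inverse: `panoptic // 256` recovers the semantic class and `panoptic % 256` the instance id, so one integer grid carries geometry (non-zero = occupied), semantics, and instances at once.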


Learning 3D Scene Priors with 2D Supervision

This work proposes a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth, and achieves state-of-the-art results in scene synthesis against baselines that require 3D supervision.

Panoptic Lifting for 3D Scene Understanding with Neural Fields

We propose Panoptic Lifting, a novel approach for learning panoptic 3D volumetric representations from images of in-the-wild scenes. Once trained, our model can render color images together with 3D-consistent panoptic segmentations.

SceneRF: Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields

SceneRF, a self-supervised monocular scene reconstruction method based on neural radiance fields (NeRF) learned from multiple posed image sequences, is proposed; new geometry constraints and a novel probabilistic sampling strategy are introduced to improve geometry prediction.

MonoScene: Monocular 3D Semantic Scene Completion

Experiments show the MonoScene framework outperforms the literature on all metrics and datasets while hallucinating plausible scenery even beyond the camera field of view; the framework introduces a 3D context relation prior to enforce spatio-semantic consistency.

Neural RGB-D Surface Reconstruction

This work proposes to represent the surface with an implicit function (a truncated signed distance function), shows how to incorporate this representation into the NeRF framework, and extends it to use depth measurements from a commodity RGB-D sensor such as a Kinect.
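For readers unfamiliar with the representation: a truncated signed distance function (TSDF) clips the signed distance to the nearest surface to a fixed band, which is what makes it well suited to volumetric fusion and learning. A minimal sketch of the standard TSDF math (generic, not this paper's code; the 5 cm truncation band is an illustrative default):

```python
import numpy as np

def tsdf(signed_distance, truncation=0.05):
    """Truncated signed distance: clamp d to [-t, t], normalized to [-1, 1].

    Positive values lie in front of the surface, negative values behind it;
    the zero level set is the surface itself. Points farther than the
    truncation band from the surface saturate at +/-1.
    """
    return np.clip(signed_distance / truncation, -1.0, 1.0)

# Distances (in meters) of a few sample points to the nearest surface.
d = np.array([0.20, 0.02, 0.0, -0.02, -0.20])
values = tsdf(d)  # far points saturate at +/-1, the surface maps to 0
```

The surface is then recovered as the zero crossing of the stored values, e.g. via marching cubes.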

AutoRF: Learning 3D Object Radiance Fields from Single View Observations

It is shown that the AutoRF method generalizes well to unseen objects, even across different datasets of challenging real-world street scenes such as nuScenes, KITTI, and Mapillary Metropolis.

Joint stereo 3D object detection and implicit surface reconstruction

This approach features a new instance-level network that explicitly models the unseen surface hallucination problem using point-based representations and uses a new geometric representation for orientation refinement.

Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation

Panoptic Neural Fields is presented, an object-aware neural scene representation that decomposes a scene into a set of objects (things) and background (stuff), and can be smaller and faster than previous object-aware approaches while still leveraging category-specific priors incorporated via meta-learned initialization.

Neural rendering in a room

A novel solution that mimics human amodal perception capability, based on a new paradigm of amodal 3D scene understanding with neural rendering for a closed scene, exploiting compositional neural rendering techniques for data augmentation during offline training.

3D Multi-Object Tracking with Differentiable Pose Estimation

A graph-based, fully end-to-end-learnable approach for joint 3D multi-object tracking and reconstruction from RGB-D sequences in indoor environments that improves the accumulated MOTA score for all test sequences by 24.8% over existing state-of-the-art methods.

Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image

A Holistic Scene Grammar (HSG) is introduced to represent the 3D scene structure, characterizing a joint distribution over the functional and geometric space of indoor scenes; the approach significantly outperforms prior methods on 3D layout estimation, 3D object detection, and holistic scene understanding.

3D Scene Reconstruction With Multi-Layer Depth and Epipolar Transformers

To improve the accuracy of view-centered representations for complex scenes, this work introduces a novel "Epipolar Feature Transformer" that transfers convolutional network features from an input view to other virtual camera viewpoints, and thus better covers the 3D scene geometry.

3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans

3D-SIS is introduced, a novel neural network architecture for 3D semantic instance segmentation in commodity RGB-D scans that leverages high-resolution RGB input by associating 2D images with the volumetric grid based on the pose alignment of the 3D reconstruction.

CoReNet: Coherent 3D scene reconstruction from a single RGB image

The model is adapted to address the harder task of reconstructing multiple objects from a single image, producing a coherent reconstruction, where all objects live in a single consistent 3D coordinate frame relative to the camera and they do not intersect in 3D space.

3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation

3DMV is presented, a novel method for 3D semantic scene segmentation of RGB-D scans in indoor environments using a joint 3D-multi-view prediction network that achieves significantly better results than existing baselines.

3D Scene Reconstruction from a Single Viewport

A novel approach is proposed to infer volumetric reconstructions from a single viewport, based only on an RGB image and a reconstructed normal image, together with a novel loss shaping technique for 3D data that guides the learning process towards regions where free and occupied space lie close to each other.

Semantic Scene Completion from a Single Depth Image

The semantic scene completion network (SSCNet) is introduced, an end-to-end 3D convolutional network that takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum.

Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve

Mask2CAD is presented, which jointly detects objects in real-world images and for each detected object, optimizes for the most similar CAD model and its pose, and constructs a joint embedding space between the detected regions of an image corresponding to an object and 3D CAD models, enabling retrieval of CAD models for an input RGB image.
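The retrieval step in such joint-embedding approaches reduces to nearest-neighbor search between image-region embeddings and CAD-model embeddings. A schematic sketch (not Mask2CAD's actual code; the cosine-similarity metric and toy 2-D embeddings are illustrative assumptions):

```python
import numpy as np

def retrieve_cad(region_emb, cad_embs):
    """Return the index of the CAD embedding closest to the region embedding.

    Uses cosine similarity on L2-normalized embeddings, a common choice for
    joint embedding spaces (illustrative, not necessarily the paper's metric).
    """
    r = region_emb / np.linalg.norm(region_emb)
    c = cad_embs / np.linalg.norm(cad_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ r))

# Toy embeddings: three CAD models and one detected image region.
cads = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
region = np.array([0.9, 0.1])
best = retrieve_cad(region, cads)  # nearest model by cosine similarity
```

In a full pipeline the retrieved model would then be posed into the scene by the predicted rotation and translation.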

RevealNet: Seeing Behind Objects in RGB-D Scans

RevealNet is a new data-driven approach that jointly detects object instances and predicts their complete geometry, which enables a semantically meaningful decomposition of a scanned scene into individual, complete 3D objects, including hidden and unobserved object parts.

PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points

Detecting 3D objects from a single RGB image is intrinsically ambiguous, thus requiring appropriate prior knowledge and intermediate representations as constraints to reduce the uncertainties.