DeepContext: Context-Encoding Neural Pathways for 3D Holistic Scene Understanding

  @inproceedings{Zhang2017DeepContext,
    title={DeepContext: Context-Encoding Neural Pathways for 3D Holistic Scene Understanding},
    author={Yinda Zhang and Mingru Bai and Pushmeet Kohli and Shahram Izadi and Jianxiong Xiao},
    booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
    year={2017}
  }
3D context has been shown to be extremely important for scene understanding, yet very little research has been done on integrating context information with deep neural network architectures. This paper presents an approach to embed 3D context into the topology of a neural network trained to perform holistic scene understanding. Given a depth image depicting a 3D scene, our network aligns the observed scene with a predefined 3D scene template, and then reasons about the existence and location of… 
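The abstract describes a two-stage pipeline: rigidly align the observed scene with a predefined 3D scene template, then reason about which template objects exist and where. As a rough illustration of that idea (not the paper's implementation; the Kabsch alignment, the anchor-point correspondences, and the distance-threshold existence test below are all assumptions for the sketch), the alignment step can be solved in closed form and the existence check reduced to a per-slot proximity test:

```python
import numpy as np

def align_to_template(scene_pts, template_pts):
    """Rigidly align scene anchor points (N x 3) to corresponding template
    anchors via the Kabsch/Procrustes solution: returns R, t such that
    R @ p + t maps a scene point p into template coordinates."""
    sc = scene_pts.mean(axis=0)
    tc = template_pts.mean(axis=0)
    # Cross-covariance of the centered correspondences.
    H = (scene_pts - sc).T @ (template_pts - tc)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tc - R @ sc
    return R, t

def slot_predictions(aligned_pts, slot_centers, radius=0.5):
    """Toy stand-in for the network's per-slot reasoning: mark a template
    object slot as 'present' if any aligned scene point lies within
    `radius` of the slot center."""
    return [bool((np.linalg.norm(aligned_pts - c, axis=1) < radius).any())
            for c in slot_centers]
```

In DeepContext itself the alignment and the per-slot existence/location estimates are produced by the learned network; the closed-form version above only conveys the geometric structure of the problem.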


DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization
A novel method for panoramic 3D scene understanding which recovers the 3D room layout and the shape, pose, position, and semantic category for each object from a single full-view panorama image is proposed.
Physically-Based Rendering for Indoor Scene Understanding Using Convolutional Neural Networks
This work introduces a large-scale synthetic dataset with 500K physically-based rendered images from 45K realistic 3D indoor scenes and shows that pretraining with this new synthetic dataset can improve results beyond the current state of the art on all three computer vision tasks.
Hand3D: Hand Pose Estimation using 3D Neural Network
A novel 3D neural network architecture for 3D hand pose estimation from a single depth image that converts the depth map to a 3D volumetric representation and feeds it into a 3D convolutional neural network to directly produce the pose in 3D, requiring no further processing.
Scene Structure Inference through Scene Map Estimation
This paper proposes the concept of a scene map, a coarse scene representation, which describes the locations of the objects present in the scene from a top-down view (i.e., as they are positioned on the floor), as well as a pipeline to extract such a map from a single RGB image.
Image Annotation based on Deep Hierarchical Context Networks
DHCN is introduced: a novel Deep Hierarchical Context Network that leverages different sources of context, including geometric and semantic relationships, and solves the representation learning problem by training an underlying deep network whose parameters correspond to the most influential bi-level contextual relationships.
Complete 3D Scene Parsing from an RGBD Image
This paper's representation encodes the layout of orthogonal walls and the extent of objects, modeled with CAD-like 3D shapes, and proposes a retrieval scheme that uses convolutional neural networks to classify regions and retrieve objects with similar shapes.
Cuboids Revisited: Learning Robust 3D Shape Fitting to Single RGB Images
This work proposes a robust estimator for primitive fitting, which can meaningfully abstract real-world environments using cuboids and does not require labour-intensive labels, such as cuboid annotations, for training.
Hierarchy Denoising Recursive Autoencoders for 3D Scene Layout Prediction
A variational denoising recursive autoencoder that generates and iteratively refines a hierarchical representation of 3D object layouts, interleaving bottom-up encoding for context aggregation and top-down decoding for propagation.
3DRM: Pair-wise relation module for 3D object detection


SceneNet: Understanding Real World Indoor Scenes With Synthetic Data
This work focuses its attention on depth based semantic per-pixel labelling as a scene understanding problem and shows the potential of computer graphics to generate virtually unlimited labelled data from synthetic 3D scenes by carefully synthesizing training data with appropriate noise models.
Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views
A scalable and overfit-resistant image synthesis pipeline, together with a novel CNN specifically tailored for the viewpoint estimation task, is proposed; the combination significantly outperforms state-of-the-art methods on the PASCAL 3D+ benchmark.
PanoContext: A Whole-Room 3D Context Model for Panoramic Scene Understanding
Experiments show that solely based on 3D context without any image region category classifier, the proposed whole-room context model can achieve a comparable performance with the state-of-the-art object detector, demonstrating that when the FOV is large, context is as powerful as object appearance.
Holistic Scene Understanding for 3D Object Detection with RGBD Cameras
A holistic approach that exploits 2D segmentation, 3D geometry, as well as contextual relations between scenes and objects, and develops a conditional random field to integrate information from different sources to classify the cuboids is proposed.
3D ShapeNets for 2.5D Object Recognition and Next-Best-View Prediction
This work proposes to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid, using a Convolutional Deep Belief Network, and naturally supports object recognition from 2.5D depth maps as well as view planning for object recognition.
3D ShapeNets: A deep representation for volumetric shapes
This work proposes to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid, using a Convolutional Deep Belief Network, and shows that this 3D deep representation enables significant performance improvement over the state of the art in a variety of tasks.
SUN RGB-D: A RGB-D scene understanding benchmark suite
This paper introduces an RGB-D benchmark suite for advancing the state of the art in all major scene understanding tasks, and presents a dataset that enables training data-hungry algorithms for scene-understanding tasks, evaluating them with meaningful 3D metrics, avoiding overfitting to a small testing set, and studying cross-sensor bias.
Toward Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
This work proposes Feedback Enabled Cascaded Classification Models (FE-CCM), a two-layer cascade of classifiers that jointly optimizes all the subtasks while requiring only a “black box” interface to the original classifier for each subtask.
Joint embeddings of shapes and images via CNN image purification
A joint embedding space populated by both 3D shapes and 2D images of objects, where the distances between embedded entities reflect similarity between the underlying objects, which facilitates comparison between entities of either form, and allows for cross-modality retrieval.
Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images
  • S. Song, Jianxiong Xiao
  • Computer Science
    2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2016
This work proposes the first 3D Region Proposal Network (RPN) to learn objectness from geometric shapes and the first joint Object Recognition Network (ORN) to extract geometric features in 3D and color features in 2D.