Corpus ID: 250627427

Spatially Invariant Unsupervised 3D Object-Centric Learning and Scene Decomposition

Authors: Tianyu Wang, Miaomiao Liu, Kee Siong Ng
We tackle the problem of object-centric learning on point clouds, which is crucial for high-level relational reasoning and scalable machine intelligence. In particular, we introduce a framework, SPAIR3D, to factorize a 3D point cloud into a spatial mixture model where each component corresponds to one object. To model the spatial mixture model on point clouds, we derive the Chamfer Mixture Loss, which fits naturally into our variational training pipeline. Moreover, we adopt an object…
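The Chamfer Mixture Loss named in the abstract builds on the standard symmetric Chamfer distance between point sets. A minimal NumPy sketch of that building block follows; the function name is illustrative, and the paper's mixture weighting over object components is not reproduced here:

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(axis=-1)
    # Average nearest-neighbour distance in both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Two points one unit apart: distance 1.0 in each direction, 2.0 in total.
print(chamfer_distance(np.zeros((1, 3)), np.array([[1.0, 0.0, 0.0]])))  # → 2.0
```

Because each point is matched to its nearest neighbour in the other set, the loss needs no point-to-point correspondence, which is why it is a natural reconstruction objective for unordered point clouds.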

Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation

We present ObSuRF, a method which turns a single image of a scene into a 3D model represented as a set of Neural Radiance Fields (NeRFs), with each NeRF corresponding to a different object.

ROOTS: Object-Centric Representation and Rendering of 3D Scenes

A probabilistic generative model for learning to build modular and compositional 3D object models from partial observations of a multi-object scene is proposed and it is demonstrated that the learned representation permits object-wise manipulation and novel scene generation, and generalizes to various settings.

SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition

A generative latent variable model, called SPACE, is proposed that provides a unified probabilistic modeling framework that combines the best of spatial-attention and scene-mixture approaches and resolves the scalability problems of previous methods.

Unsupervised object-centric video generation and decomposition in 3D

This work proposes to model a video as the view seen while moving through a scene with multiple 3D objects and a 3D background. It evaluates the method on depth prediction and 3D object detection and shows that it outperforms prior approaches even on 2D instance segmentation and tracking.

Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

The Multi-View and Multi-Object Network (MulMON) is proposed: a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views. MulMON resolves spatial ambiguities better than single-view methods, learning more accurate and disentangled object representations.

PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation

This paper presents PointGroup, a new end-to-end bottom-up architecture for 3D instance segmentation that better groups points by exploring the void space between objects. A two-branch network extracts point features and predicts semantic labels and offsets, shifting each point towards its respective instance centroid.
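The centroid-offset idea in the summary above can be sketched without the network: assuming per-point offset predictions are given, points are shifted toward their instance centroids and then grouped. The greedy radius heuristic and all names below are illustrative, not PointGroup's actual dual-set grouping algorithm:

```python
import numpy as np

def group_shifted_points(points: np.ndarray, offsets: np.ndarray,
                         radius: float = 0.5) -> np.ndarray:
    """Greedy radius grouping of points after shifting by predicted offsets."""
    shifted = points + offsets  # move each point toward its instance centroid
    labels = np.full(len(points), -1, dtype=int)
    next_id = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        # Group all still-unlabelled points whose shifted positions are nearby.
        dist = np.linalg.norm(shifted - shifted[i], axis=1)
        labels[(labels == -1) & (dist < radius)] = next_id
        next_id += 1
    return labels
```

With perfect offsets, every point of an instance collapses onto the same centroid, so points of one object group together even when objects nearly touch in the original coordinates.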

PointConv: Deep Convolutional Networks on 3D Point Clouds

The dynamic filter is extended to a new convolution operation, named PointConv, which can be applied to point clouds to build deep convolutional networks and achieves state-of-the-art results on challenging 3D semantic segmentation benchmarks.
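A dynamic-filter point convolution of this kind can be sketched in a few lines: a small MLP on relative neighbour coordinates generates per-neighbour filter weights, which then reweight the neighbour features. The function and weight names are illustrative, and PointConv's inverse-density reweighting is omitted:

```python
import numpy as np

def dynamic_point_conv(center, neighbors, feats, w1, w2):
    """One dynamic-filter point convolution step.

    center:    (3,)   query point
    neighbors: (K, 3) neighbouring points
    feats:     (K, C) neighbour features
    w1, w2:    MLP weights mapping relative coords (3,) -> (C,) filter weights
    """
    rel = neighbors - center          # (K, 3) translation-invariant coordinates
    hidden = np.maximum(rel @ w1, 0)  # (K, H) ReLU hidden layer
    weights = hidden @ w2             # (K, C) dynamically generated filter
    return (weights * feats).sum(axis=0)  # (C,) aggregated feature at the center
```

Generating the filter from relative coordinates is what lets the operation handle irregular, unordered point neighbourhoods, where a fixed convolution kernel grid does not apply.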

Learning Representations and Generative Models for 3D Point Clouds

A deep AutoEncoder network with state-of-the-art reconstruction quality and generalization ability is introduced; its learned representations outperform existing methods on 3D recognition tasks and enable shape editing via simple algebraic manipulations.

Joint 2D-3D-Semantic Data for Indoor Scene Understanding

A dataset of large-scale indoor spaces that provides a variety of mutually registered modalities from 2D, 2.5D and 3D domains, with instance-level semantic and geometric annotations, enables development of joint and cross-modal learning models and potentially unsupervised approaches utilizing the regularities present in large-scale indoor spaces.

Exploiting Spatial Invariance for Scalable Unsupervised Object Tracking

An architecture that scales well to the large-scene, many-object setting by employing spatially invariant computations (convolutions and spatial attention) and representations (a spatially local object specification scheme) is proposed.