Corpus ID: 250627427

Spatially Invariant Unsupervised 3D Object-Centric Learning and Scene Decomposition

@inproceedings{Wang2021SpatiallyIU,
  title={Spatially Invariant Unsupervised 3D Object-Centric Learning and Scene Decomposition},
  author={Tianyu Wang and Miaomiao Liu and Kee Siong Ng},
  year={2021}
}
We tackle the problem of object-centric learning on point clouds, which is crucial for high-level relational reasoning and scalable machine intelligence. In particular, we introduce a framework, SPAIR3D, to factorize a 3D point cloud into a spatial mixture model where each component corresponds to one object. To model the spatial mixture model on point clouds, we derive the Chamfer Mixture Loss, which fits naturally into our variational training pipeline. Moreover, we adopt an object…
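The abstract names the Chamfer Mixture Loss but the excerpt cuts off before defining it. As a rough illustration only, the sketch below shows one plausible way a Chamfer-style reconstruction term could be combined with per-point mixture weights; the function name, tensor shapes, and the exact weighting are assumptions, not the paper's formulation.

```python
import torch

def chamfer_mixture_loss(scene_pts, comp_pts, log_pi):
    """Hypothetical Chamfer-style mixture loss (a sketch, not the
    paper's exact Chamfer Mixture Loss).

    scene_pts: (N, 3)    input point cloud
    comp_pts:  (K, M, 3) point set decoded by each of K components
    log_pi:    (N, K)    per-point log mixing weights
    """
    K = comp_pts.shape[0]
    # Distances from every scene point to every component's points: (K, N, M).
    d = torch.cdist(scene_pts.unsqueeze(0).expand(K, -1, -1), comp_pts)

    # Forward term: each scene point is explained by its nearest decoded
    # point in each component, averaged under the soft assignments.
    fwd = (d.min(dim=2).values.T * log_pi.softmax(dim=1)).sum(dim=1).mean()

    # Backward term: every decoded point must lie near some scene point,
    # discouraging components from hallucinating geometry.
    bwd = d.min(dim=1).values.mean()
    return fwd + bwd
```

Both directions of the Chamfer distance appear because a one-sided match would let a component either ignore parts of the scene or scatter points into empty space.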


References

Showing 1-10 of 37 references

Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation

We present ObSuRF, a method which turns a single image of a scene into a 3D model represented as a set of Neural Radiance Fields (NeRFs), with each NeRF corresponding to a different object.

ROOTS: Object-Centric Representation and Rendering of 3D Scenes

TLDR
A probabilistic generative model is proposed that learns to build modular, compositional 3D object models from partial observations of a multi-object scene; the learned representation permits object-wise manipulation and novel scene generation, and generalizes to various settings.

SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition

TLDR
A generative latent variable model, called SPACE, is proposed that provides a unified probabilistic modeling framework combining the best of spatial-attention and scene-mixture approaches while resolving the scalability problems of previous methods.

Unsupervised object-centric video generation and decomposition in 3D

TLDR
This work proposes to model a video as the view seen while moving through a scene with multiple 3D objects and a 3D background; the method is evaluated on depth prediction and 3D object detection, and is shown to outperform prior methods even on 2D instance segmentation and tracking.

Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

TLDR
The Multi-View and Multi-Object Network (MulMON) is proposed, a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views; it resolves spatial ambiguities better than single-view methods, learning more accurate and disentangled object representations.

PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation

TLDR
This paper presents PointGroup, a new end-to-end bottom-up architecture for 3D instance segmentation: a two-branch network extracts point features and predicts semantic labels and offsets that shift each point towards its instance centroid, and points are then grouped by exploiting the void space between objects.
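To make the offset-and-group idea concrete, here is a toy sketch (assumed, not PointGroup's actual dual-set clustering): points are shifted by their predicted offsets so that instances contract toward their centroids, and a simple greedy region growing then groups nearby shifted points.

```python
import numpy as np

def group_by_shifted_points(pts, offsets, radius=0.05):
    """Toy grouping in the spirit of PointGroup (illustrative only):
    pts, offsets are (N, 3); returns an (N,) array of instance ids."""
    shifted = pts + offsets                 # contract points toward centroids
    n = len(shifted)
    labels = np.full(n, -1, dtype=int)
    cluster_id = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = cluster_id
        frontier = [seed]
        while frontier:                     # greedy region growing
            i = frontier.pop()
            near = np.linalg.norm(shifted - shifted[i], axis=1) < radius
            for j in np.flatnonzero(near & (labels == -1)):
                labels[j] = cluster_id
                frontier.append(j)
        cluster_id += 1
    return labels
```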

PointConv: Deep Convolutional Networks on 3D Point Clouds

TLDR
The dynamic filter is extended to a new convolution operation, named PointConv, which can be applied on point clouds to build deep convolutional networks and achieves state-of-the-art results on challenging semantic segmentation benchmarks for 3D point clouds.
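A minimal sketch of the PointConv idea, under the assumption of a single center point and a fixed neighborhood (the paper's inverse-density weighting and efficient reformulation are omitted): an MLP generates convolution weights from relative neighbor coordinates, and neighbor features are aggregated with those learned weights.

```python
import torch
import torch.nn as nn

class PointConvSketch(nn.Module):
    """Simplified PointConv-style layer (a sketch, not the official
    implementation): weights are a learned function of relative
    coordinates, mimicking a continuous convolution kernel."""
    def __init__(self, in_ch, out_ch, hidden=32):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, in_ch * out_ch),
        )
        self.in_ch, self.out_ch = in_ch, out_ch

    def forward(self, center, neighbors, feats):
        # center: (3,), neighbors: (K, 3), feats: (K, in_ch)
        rel = neighbors - center                 # kernel input: relative coords
        w = self.weight_net(rel).view(-1, self.in_ch, self.out_ch)
        # Sum over neighbors and input channels, like a discrete convolution.
        return torch.einsum('ki,kio->o', feats, w)
```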

Learning Representations and Generative Models for 3D Point Clouds

TLDR
A deep autoencoder network with state-of-the-art reconstruction quality and generalization ability is introduced; its representations outperform existing methods on 3D recognition tasks and enable shape editing via simple algebraic manipulations.
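A bare-bones sketch of such a point cloud autoencoder (illustrative; the paper's actual architecture is deeper): a PointNet-style encoder with max pooling yields a permutation-invariant code, and an MLP decoder emits a fixed-size point set, trained with a Chamfer-type loss like the one sketched above.

```python
import torch
import torch.nn as nn

class PointCloudAE(nn.Module):
    """Minimal point cloud autoencoder sketch (not the paper's exact model)."""
    def __init__(self, n_out=1024, latent=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                 nn.Linear(256, n_out * 3))
        self.n_out = n_out

    def forward(self, pts):                      # pts: (B, N, 3)
        z = self.enc(pts).max(dim=1).values      # permutation-invariant code
        recon = self.dec(z).view(-1, self.n_out, 3)
        return recon, z
```

Simple algebraic manipulations in the latent space, such as interpolating between two codes z, then correspond to shape edits after decoding.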

Joint 2D-3D-Semantic Data for Indoor Scene Understanding

TLDR
A dataset of large-scale indoor spaces that provides a variety of mutually registered modalities from 2D, 2.5D, and 3D domains, with instance-level semantic and geometric annotations, enables the development of joint and cross-modal learning models and, potentially, unsupervised approaches utilizing the regularities present in large-scale indoor spaces.

Exploiting Spatial Invariance for Scalable Unsupervised Object Tracking

TLDR
An architecture is proposed that scales well to the large-scene, many-object setting by employing spatially invariant computations (convolutions and spatial attention) and representations (a spatially local object specification scheme).