Robust and Controllable Object-Centric Learning through Energy-based Models

Ruixiang Zhang, Tong Che, B. Ivanovic, Renhao Wang, Marco Pavone, Yoshua Bengio, Liam Paull

Humans are remarkably good at understanding and reasoning about complex visual scenes. The capability to decompose low-level observations into discrete objects allows us to build a grounded abstract representation and identify the compositional structure of the world. Accordingly, it is a crucial step for machine learning models to be capable of inferring objects and their properties from visual scenes without explicit supervision. However, existing works on object-centric representation…



Conditional Object-Centric Learning from Video

Exploiting the temporal dynamics of video data in the form of optical flow, and conditioning the model on simple object location cues, enables segmenting and tracking objects in significantly more realistic synthetic data; this could pave the way for a range of weakly-supervised approaches and allow more effective interaction with trained models.

Multi-Object Representation Learning with Iterative Variational Inference

This work argues for the importance of learning to segment and represent objects jointly, and demonstrates that, starting from the simple assumption that a scene is composed of multiple entities, it is possible to learn to segment images into interpretable objects with disentangled representations.

Generalization and Robustness Implications in Object-Centric Learning

This paper trains state-of-the-art unsupervised models on five common multi-object datasets, evaluates segmentation accuracy and downstream object-property prediction, and finds object-centric representations to be generally useful for downstream tasks and robust to shifts in the data distribution.

GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations

Generative latent-variable models are emerging as promising tools in robotics and reinforcement learning. Yet, even though tasks in these domains typically involve distinct objects, most…

Attend, Infer, Repeat: Fast Scene Understanding with Generative Models

We present a framework for efficient inference in structured image models that explicitly reason about objects. We achieve this by performing probabilistic inference using a recurrent neural network…

Object-Centric Learning with Slot Attention

An architectural component is presented that interfaces with perceptual representations, such as the output of a convolutional neural network, and produces a set of task-dependent abstract representations. These representations are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention.
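The competitive binding procedure described above can be sketched in a few lines of NumPy. This is a minimal, hypothetical illustration of the idea, not the paper's implementation: random matrices stand in for learned projections, and the learned GRU/MLP slot update is replaced by a plain weighted mean. The key step is that attention is normalized over slots rather than over inputs, so slots compete for input features.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, dim=16, seed=0):
    """Sketch of slot attention: slots compete for input features.

    inputs: (n, d) array of perceptual features (e.g. CNN outputs).
    Returns final slots (num_slots, dim) and attention (n, num_slots).
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    # Random projections stand in for learned weights (hypothetical).
    Wk = rng.normal(size=(d, dim)) / np.sqrt(d)
    Wv = rng.normal(size=(d, dim)) / np.sqrt(d)
    Wq = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    slots = rng.normal(size=(num_slots, dim))      # random slot initialization
    k, v = inputs @ Wk, inputs @ Wv
    attn = None
    for _ in range(iters):
        q = slots @ Wq
        logits = k @ q.T / np.sqrt(dim)            # (n, num_slots)
        attn = softmax(logits, axis=1)             # normalize over SLOTS: competition
        weights = attn / attn.sum(axis=0, keepdims=True)
        slots = weights.T @ v                      # weighted-mean update (GRU omitted)
    return slots, attn

feats = np.random.default_rng(1).normal(size=(64, 16))
slots, attn = slot_attention(feats, num_slots=4, dim=16)
```

Because the softmax runs over the slot axis, each input location must distribute its attention mass across slots, which is what drives slots to specialize on different objects.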

GENESIS-V2: Inferring Unordered Object Representations without Iterative Refinement

This work proposes an embedding-based approach in which pixel embeddings are clustered in a differentiable fashion using a stochastic stick-breaking process. The resulting model, GENESIS-V2, can infer a variable number of object representations without using RNNs or iterative refinement.
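The stick-breaking idea can be illustrated with a small NumPy sketch. This is a simplified, hypothetical version of the mechanism, assuming per-step attention logits over pixels are already available: each step claims a fraction of the remaining "stick" (the not-yet-explained scope), and the final step absorbs whatever scope is left, so the resulting masks form a valid per-pixel partition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stick_breaking_masks(attn_logits):
    """attn_logits: (K, N) logits over N pixels for K sequential steps.

    Returns (K, N) soft masks that are non-negative and sum to 1
    over K for every pixel (a differentiable soft partition).
    """
    K, N = attn_logits.shape
    alphas = sigmoid(attn_logits)       # fraction of remaining stick claimed
    scope = np.ones(N)                  # unexplained portion of each pixel
    masks = []
    for k in range(K - 1):
        masks.append(scope * alphas[k])
        scope = scope * (1.0 - alphas[k])
    masks.append(scope)                 # last step takes the leftover stick
    return np.stack(masks)

logits = np.random.default_rng(0).normal(size=(5, 100))
masks = stick_breaking_masks(logits)
```

Since every operation here is smooth in the logits, gradients flow through the mask construction, which is what allows the clustering to be trained end to end.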

SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition

A generative latent variable model, called SPACE, is proposed that provides a unified probabilistic modeling framework that combines the best of spatial-attention and scene-mixture approaches and resolves the scalability problems of previous methods.

Towards Self-Supervised Learning of Global and Object-Centric Representations

This work shows that contrastive losses equipped with matching can be applied directly in a latent space, avoiding pixel-based reconstruction, and discusses key aspects of learning structured object-centric representations with self-supervision.
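A latent-space contrastive loss of this kind can be sketched with a standard InfoNCE objective. This is a generic illustration under the assumption that a matching step has already paired latents row-wise (z_a[i] with z_b[i]); it is not the paper's exact loss. Matched pairs on the diagonal act as positives, and all other rows serve as negatives, with no pixel reconstruction involved.

```python
import numpy as np

def info_nce(z_a, z_b, temp=0.1):
    """InfoNCE in latent space: rows of z_a and z_b are matched pairs.

    z_a, z_b: (B, D) latent codes, assumed already matched row-wise.
    Returns the mean negative log-likelihood of the positive pairs.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)   # cosine similarity
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temp                              # (B, B) similarities
    m = logits.max(axis=1, keepdims=True)                    # stable log-softmax
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    idx = np.arange(len(z_a))
    return -log_probs[idx, idx].mean()                       # diagonal = positives

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
loss_pos = info_nce(z, z + 0.01 * rng.normal(size=(8, 32)))  # near-identical views
loss_rand = info_nce(z, rng.normal(size=(8, 32)))            # unrelated latents
```

As expected, the loss is small when matched latents agree and large when they are unrelated, which is the signal that makes reconstruction-free training possible.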

Online Object Representations with Contrastive Learning

A self-supervised approach for learning representations of objects from monocular videos is proposed; the authors find that, given a limited set of objects, object correspondences naturally emerge under contrastive learning without requiring explicit positive pairs.