Unsupervised Image Representation Learning with Deep Latent Particles

Tal Daniel and Aviv Tamar. International Conference on Machine Learning.
We propose a new representation of visual data that disentangles object position from appearance. Our method, termed Deep Latent Particles (DLP), decomposes the visual input into low-dimensional latent “particles”, where each particle is described by its spatial location and features of its surrounding region. To drive learning of such representations, we follow a VAE-based approach and introduce a prior for particle positions based on a spatial-softmax architecture, and a modification of the… 
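The spatial-softmax operation mentioned in the abstract can be illustrated with a minimal sketch. This is not the DLP implementation: it only shows the generic operation commonly used in keypoint work, where each heatmap is normalized with a softmax over all pixels and the keypoint is the expected pixel coordinate. The function name `spatial_softmax_keypoints`, the `temperature` parameter, and the [-1, 1] coordinate convention are assumptions made for illustration.

```python
import math


def spatial_softmax_keypoints(heatmaps, temperature=1.0):
    """Turn K heatmaps (nested lists, each H x W) into K (x, y) keypoints.

    Hypothetical helper for illustration only. Coordinates are normalized
    to [-1, 1], with x running left to right and y running top to bottom.
    Assumes H > 1 and W > 1.
    """
    keypoints = []
    for hm in heatmaps:
        h, w = len(hm), len(hm[0])
        # Subtract the maximum before exponentiating for numerical stability.
        peak = max(max(row) for row in hm)
        weights = [[math.exp((v - peak) / temperature) for v in row] for row in hm]
        total = sum(sum(row) for row in weights)
        # Normalized pixel-centre coordinates along each axis.
        xs = [-1.0 + 2.0 * j / (w - 1) for j in range(w)]
        ys = [-1.0 + 2.0 * i / (h - 1) for i in range(h)]
        # Expected coordinate under the softmax distribution over pixels.
        exp_x = sum(weights[i][j] * xs[j] for i in range(h) for j in range(w)) / total
        exp_y = sum(weights[i][j] * ys[i] for i in range(h) for j in range(w)) / total
        keypoints.append((exp_x, exp_y))
    return keypoints


if __name__ == "__main__":
    # One heatmap with a strong activation in the top-right corner.
    heatmap = [[0.0] * 5 for _ in range(5)]
    heatmap[0][4] = 50.0
    print(spatial_softmax_keypoints([heatmap]))
```

Because the output is an expectation rather than an argmax, the keypoint location stays differentiable with respect to the heatmap values, which is what makes this operation usable inside an end-to-end trained model.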



Unsupervised Discovery of Object Landmarks as Structural Representations

This paper proposes an autoencoding formulation to discover landmarks as explicit structural representations, which naturally creates an unsupervised, perceptible interface to manipulate object shapes and decode images with controllable structures.

Unsupervised Learning of Object Keypoints for Perception and Control

This work introduces Transporter, a neural network architecture for discovering concise geometric object representations in terms of keypoints or image-space coordinates, which helps track objects and object parts across long time horizons more accurately than recent similar methods.

Multi-Object Representation Learning with Iterative Variational Inference

This work argues for the importance of learning to segment and represent objects jointly, and demonstrates that, starting from the simple assumption that a scene is composed of multiple entities, it is possible to learn to segment images into interpretable objects with disentangled representations.

Unsupervised Learning of Object Landmarks through Conditional Image Generation

This work proposes a method for learning landmark detectors for visual objects (such as the eyes and the nose in a face) without any manual supervision and introduces a tight bottleneck in the geometry-extraction process that selects and distils geometry-related features.

Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos

The proposed factorization results in landmarks that are focused on the foreground object of interest when measured against ground-truth foreground masks, and the rendered background quality is improved as ill-suited landmarks are no longer forced to model this content.

GENESIS-V2: Inferring Unordered Object Representations without Iterative Refinement

This work proposes an embedding-based approach in which embeddings of pixels are clustered in a differentiable fashion using a stochastic stick-breaking process to develop a new model, GENESIS-V2, which can infer a variable number of object representations without using RNNs or iterative refinement.

Unsupervised Learning of Object Landmarks by Factorized Spatial Embeddings

This paper proposes a novel unsupervised approach that can discover and learn landmarks in object categories, thus characterizing their structure, and shows that the learned landmarks establish meaningful correspondences between different object instances in a category without having to impose this requirement explicitly.

Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects

SQAIR is an interpretable deep generative model for image sequences that can reliably discover and track objects through the sequence; it can also conditionally generate future frames, thereby simulating expected motion of objects.

Unsupervised Learning of Object Structure and Dynamics from Videos

This work adopts a keypoint-based image representation and learns a stochastic dynamics model of the keypoints, which outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.

Unsupervised learning of object frames by dense equivariant image labelling

A new approach is proposed that, given a large number of images of an object and no other supervision, can extract a dense object-centric coordinate frame that is invariant to deformations of the images and comes with a dense equivariant labelling neural network that can map image pixels to their corresponding object coordinates.