• Corpus ID: 235670279

Benchmarking Unsupervised Object Representations for Video Sequences

  title={Benchmarking Unsupervised Object Representations for Video Sequences},
  author={Marissa A. Weis and Kashyap Chitta and Yash Sharma and Wieland Brendel and Matthias Bethge and Andreas Geiger and Alexander S. Ecker},
  journal={J. Mach. Learn. Res.},
Perceiving the world in terms of objects and tracking them through time is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, since these models were evaluated on different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation and tracking of objects. To close this gap, we design a… 
Compositional Scene Representation Learning via Reconstruction: A Survey
This survey first outlines the current progress on this research topic, including development history and categorizations of existing methods from the perspectives of modeling of visual scenes and inference of scene representations; then provides benchmarks, including an open source toolbox to reproduce the benchmark experiments.
Unsupervised Image Decomposition with Phase-Correlation Networks
The Phase-Correlation Decomposition Network (PCDNet), a novel model that decomposes a scene into its object components, which are represented as transformed versions of a set of learned object prototypes, is proposed.
Unsupervised Learning of Compositional Scene Representations from Multiple Unspecified Viewpoints
A novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision is considered, and a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoints-dependent part is proposed to solve this problem.


SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition
A generative latent variable model, called SPACE, is proposed that provides a unified probabilistic modeling framework that combines the best of spatial-attention and scene-mixture approaches and resolves the scalability problems of previous methods.
Multi-Object Representation Learning with Iterative Variational Inference
This work argues for the importance of learning to segment and represent objects jointly, and demonstrates that, starting from the simple assumption that a scene is composed of multiple entities, it is possible to learn to segment images into interpretable objects with disentangled representations.
MONet: Unsupervised Scene Decomposition and Representation
The Multi-Object Network (MONet) is developed, which is capable of learning to decompose and represent challenging 3D scenes into semantically meaningful components, such as objects and background elements.
Unsupervised object-centric video generation and decomposition in 3D
This work proposes to model a video as the view seen while moving through a scene with multiple 3D objects and a 3D background, and evaluates its method on depth-prediction and 3D object detection and shows it out-performs them even on 2D instance segmentation and tracking.
Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers
A Tracking-by-Animation framework, where a differentiable neural model first tracks objects from input frames and then animates these objects into reconstructed frames to achieve both label-free and end-to-end learning of MOT.
Tagger: Deep Unsupervised Perceptual Grouping
This work presents a framework for efficient perceptual inference that explicitly reasons about the segmentation of its inputs and features and greatly improves on the semi-supervised result of a baseline Ladder network on the authors' dataset, indicating that segmentation can also improve sample efficiency.
GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations
Generative latent-variable models are emerging as promising tools in robotics and reinforcement learning. Yet, even though tasks in these domains typically involve distinct objects, most
Spatially Invariant Unsupervised Object Detection with Convolutional Neural Networks
A neural network architecture that is able to discover and detect objects in large, many-object scenes, and has a significant ability to generalize to images that are larger and contain more objects than images encountered during training; and that it has enough accuracy to facilitate non-trivial downstream processing.
Towards causal generative scene models via competition of experts
This work presents an alternative approach which uses an inductive bias encouraging modularity by training an ensemble of generative models (experts) and allows for controllable sampling of individual objects and recombination of experts in physically plausible ways.
Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects
SQAIR is an interpretable deep generative model for image sequences that can reliably discover and track objects through the sequence; it can also conditionally generate future frames, thereby simulating expected motion of objects.