Corpus ID: 238743890

Unsupervised Object Learning via Common Fate

Matthias Tangemann, Steffen Schneider, Julius von Kügelgen, Francesco Locatello, Peter Gehler, Thomas Brox, Matthias Kümmerer, Matthias Bethge, and Bernhard Schölkopf
Learning generative object models from unlabelled videos is a long-standing problem and is required for causal scene modeling. We decompose this problem into three easier subtasks, and provide candidate solutions for each of them. Inspired by the Common Fate Principle of Gestalt Psychology, we first extract (noisy) masks of moving objects via unsupervised motion segmentation. Second, generative models are trained on the masks of the background and the moving objects, respectively. Third…
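The common-fate cue in the first subtask can be illustrated with a minimal sketch (assuming a dense optical-flow field is already available from an off-the-shelf estimator; the threshold `tau` and the function name are illustrative, not from the paper): pixels whose flow magnitude exceeds a threshold are grouped into a candidate moving-object mask.

```python
import numpy as np

def common_fate_mask(flow: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Binary mask of pixels moving faster than `tau` pixels/frame.

    flow: (H, W, 2) array of per-pixel (dx, dy) optical flow.
    Returns an (H, W) boolean mask -- a crude stand-in for the
    unsupervised motion segmentation stage described above.
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    return magnitude > tau

# Toy example: a static background with one moving square.
flow = np.zeros((8, 8, 2))
flow[2:5, 2:5] = (3.0, 0.0)  # the "object" moves 3 px to the right
mask = common_fate_mask(flow, tau=1.0)
```

In the full pipeline such masks would be noisy, which is why the subsequent generative-modeling stages must tolerate imperfect segmentations.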

Unsupervised Segmentation in Real-World Images via Spelke Object Inference

This work shows how to learn static grouping priors from motion self-supervision, building on the cognitive science notion of Spelke Objects: groupings of stuff that move together, and introduces Excitatory-Inhibitory Segment Extraction Network (EISEN), which learns from optical flow estimates to extract pairwise affinity graphs for static scenes.
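The pairwise-affinity idea can be sketched as follows (a simplification under assumed notation, not EISEN's actual network): two pixels receive high affinity when their flow vectors are similar, i.e. when they "move together" in the Spelke-object sense.

```python
import numpy as np

def flow_affinity(flow: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Dense pairwise affinity matrix from per-pixel optical flow.

    flow: (N, 2) array of flattened flow vectors.
    Returns an (N, N) matrix with entries exp(-||f_i - f_j||^2 / (2 sigma^2)):
    pixels with near-identical motion get affinity close to 1.
    """
    diff = flow[:, None, :] - flow[None, :, :]  # (N, N, 2) pairwise differences
    dist2 = np.sum(diff ** 2, axis=-1)          # squared flow distance
    return np.exp(-dist2 / (2.0 * sigma ** 2))

# Two pixels moving together, one static pixel.
flows = np.array([[2.0, 0.0], [2.0, 0.0], [0.0, 0.0]])
A = flow_affinity(flows)
```

A static grouping model trained against such motion-derived affinities can then segment scenes where nothing is currently moving.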

Boosting Object Representation Learning via Motion and Object Continuity

This work proposes to exploit object motion and continuity, i.e., the fact that objects do not pop in and out of existence, and shows clear benefits of integrating motion and object continuity for downstream tasks, moving beyond object representation learning based only on reconstruction.

Discovering Objects that Can Move

This paper simplifies the recent auto-encoder based frameworks for unsupervised object discovery, and augments the resulting model with a weak learning signal from general motion segmentation algorithms, which is enough for it to generalize to segmenting both moving and static instances of dynamic objects.

Self-supervised Amodal Video Object Segmentation

A novel self-supervised learning paradigm that efficiently utilizes the visible object parts as supervision to guide training on videos, achieving state-of-the-art performance on the synthetic amodal segmentation benchmark FISHBOWL and the real-world benchmark KINS-Video-Car.

TAP-Vid: A Benchmark for Tracking Any Point in a Video

A novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video, and proposes a simple end-to-end point tracking model, TAP-Net, which outperforms all prior methods on the authors' benchmark when trained on synthetic data.

Bridging the Gap to Real-World Object-Centric Learning

DINOSAUR is the first unsupervised object-centric model that scales to real-world datasets such as COCO and PASCAL VOC, and shows competitive performance compared to more involved pipelines from the computer vision literature.

Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos

This paper proposes STEVE, an unsupervised model for object-centric learning in videos that uses a transformer-based image decoder conditioned on slots and the learning objective is simply to reconstruct the observation.

Towards causal generative scene models via competition of experts

This work presents an alternative approach which uses an inductive bias encouraging modularity by training an ensemble of generative models (experts) and allows for controllable sampling of individual objects and recombination of experts in physically plausible ways.

Multi-Object Representation Learning with Iterative Variational Inference

This work argues for the importance of learning to segment and represent objects jointly, and demonstrates that, starting from the simple assumption that a scene is composed of multiple entities, it is possible to learn to segment images into interpretable objects with disentangled representations.

SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition

A generative latent variable model, called SPACE, is proposed that provides a unified probabilistic modeling framework that combines the best of spatial-attention and scene-mixture approaches and resolves the scalability problems of previous methods.

Generative Modeling of Infinite Occluded Objects for Compositional Scene Representation

A deep generative model which explicitly models object occlusions for compositional scene representation is presented; it outperforms two state-of-the-art methods when object occlusions exist.
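At its core, explicit occlusion modeling amounts to ordered alpha compositing: nearer objects overwrite farther ones where their masks overlap. A minimal sketch of that compositing step (an illustration of the general technique, not this paper's model; all names are hypothetical):

```python
import numpy as np

def composite(background, layers):
    """Composite object layers over a background, back-to-front.

    background: (H, W) image.
    layers: list of (appearance, mask) pairs ordered back-to-front;
            appearance is (H, W), mask is (H, W) in [0, 1].
    Where a nearer layer's mask is 1 it occludes everything behind it.
    """
    canvas = background.copy()
    for appearance, mask in layers:
        canvas = mask * appearance + (1.0 - mask) * canvas
    return canvas

bg = np.zeros((4, 4))
# Far object: value 0.5, occupying the top-left 2x2 block.
far = (np.full((4, 4), 0.5), np.pad(np.ones((2, 2)), ((0, 2), (0, 2))))
# Near object: value 1.0, occupying the central 2x2 block.
near = (np.ones((4, 4)), np.pad(np.ones((2, 2)), ((1, 1), (1, 1))))
img = composite(bg, [far, near])
```

Because compositing order determines which object wins at overlapping pixels, a generative model that infers per-object masks and a depth ordering can represent occlusions explicitly rather than blending appearances.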

Unsupervised Moving Object Detection via Contextual Information Separation

An adversarial contextual model for detecting moving objects in images is presented, which can be thought of as a generalization of classical variational generative region-based segmentation, but in a way that avoids explicit regularization or solution of partial differential equations at run-time.

GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields

The key hypothesis is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis, and a fast and realistic image synthesis model is proposed on this basis.

Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects

SQAIR is an interpretable deep generative model for image sequences that can reliably discover and track objects through the sequence; it can also conditionally generate future frames, thereby simulating expected motion of objects.

Learning a Generative Model of Images by Factoring Appearance and Shape

This work introduces a basic model, the masked RBM, which explicitly models occlusion boundaries in image patches by factoring the appearance of any patch region from its shape, and proposes a generative model of larger images using a field of such RBMs.

Picture: A probabilistic programming language for scene perception

Picture is presented, a probabilistic programming language for scene understanding that allows researchers to express complex generative vision models, while automatically solving them using fast general-purpose inference machinery.

Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation

This work addresses the unsupervised learning of several interconnected problems in low-level vision: single view depth prediction, camera motion estimation, optical flow, and segmentation of a video into the static scene and moving regions with Competitive Collaboration, a framework that facilitates the coordinated training of multiple specialized neural networks to solve complex problems.