Corpus ID: 231741196

Self-Supervised Equivariant Scene Synthesis from Video

@article{Resnick2021SelfSupervisedES,
  title={Self-Supervised Equivariant Scene Synthesis from Video},
  author={Cinjon Resnick and Or Litany and Cosmas Hei{\ss} and H. Larochelle and Joan Bruna and Kyunghyun Cho},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.00863}
}
We propose a self-supervised framework to learn scene representations from video that are automatically delineated into background, characters, and their animations. Our method capitalizes on moving characters being equivariant with respect to their transformation across frames and the background being constant with respect to that same transformation. After training, we can manipulate image encodings in real time to create unseen combinations of the delineated components. As far as we know, we… 
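
The core constraint described above can be illustrated with a short sketch. The encoder, decoder, and transformation operator below are hypothetical stand-ins, and the losses are a minimal reading of the equivariance (character) and invariance (background) idea in the abstract, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def scene_losses(encoder, decoder, apply_transform, frame_t, frame_t1, tau):
    """Hypothetical sketch of the equivariance/invariance constraints.

    encoder(frame) -> (bg_code, char_code)   # delineated scene codes
    apply_transform(char_code, tau)          # acts on the character code
    tau                                      # transformation relating frame_t to frame_t1
    """
    bg_t, char_t = encoder(frame_t)
    bg_t1, char_t1 = encoder(frame_t1)

    # Characters are equivariant: transforming the character code should
    # match the character code of the next frame.
    equiv_loss = F.mse_loss(apply_transform(char_t, tau), char_t1)

    # The background is constant (invariant) under the same transformation.
    inv_loss = F.mse_loss(bg_t, bg_t1)

    # Reconstruction: decoding the manipulated codes should give the next frame.
    recon = decoder(bg_t, apply_transform(char_t, tau))
    recon_loss = F.mse_loss(recon, frame_t1)

    return equiv_loss + inv_loss + recon_loss
```

After training under such constraints, manipulating the character code (or swapping the background code) at encode time is what enables the unseen combinations mentioned in the abstract.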

References

SHOWING 1-10 OF 27 REFERENCES

First Order Motion Model for Image Animation

This framework decouples appearance and motion information using a self-supervised formulation and uses a representation consisting of a set of learned keypoints, along with their local affine transformations, to support complex motions.
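
As a rough illustration of the keypoint-plus-local-affine representation, the sketch below builds one candidate warp per keypoint from a first-order (Taylor) expansion around that keypoint. The function and variable names are mine, and details of the published model (keypoint heatmaps, dense motion aggregation, occlusion masks) are omitted.

```python
import torch

def first_order_motion(grid, kp_driving, kp_source, jacobians):
    """Approximate dense motion near each keypoint with a local affine map.

    grid:        (H, W, 2) pixel coordinates in the driving frame
    kp_driving:  (K, 2) keypoint locations in the driving frame
    kp_source:   (K, 2) corresponding keypoint locations in the source frame
    jacobians:   (K, 2, 2) local affine (Jacobian) term per keypoint
    Returns (K, H, W, 2): one candidate warp toward the source per keypoint.
    """
    diff = grid[None] - kp_driving[:, None, None]            # (K, H, W, 2)
    affine = torch.einsum('kij,khwj->khwi', jacobians, diff)  # local affine term
    return kp_source[:, None, None] + affine                  # (K, H, W, 2)
```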

Generating Videos with Scene Dynamics

A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed; it can generate tiny videos up to a second long at full frame rate, outperforming simple baselines.
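
The foreground/background untangling boils down to a simple composition: a moving foreground and a static background blended by a mask. The sketch below is only a schematic of that composition step (names are mine), not the full two-stream generator.

```python
import torch

def compose_video(foreground, mask, background):
    """Blend a moving foreground with a static background over time.

    foreground: (B, C, T, H, W) foreground stream
    mask:       (B, 1, T, H, W) values in [0, 1] selecting foreground pixels
    background: (B, C, H, W)    static background image
    """
    background = background.unsqueeze(2)  # add a time axis so it broadcasts over T
    return mask * foreground + (1.0 - mask) * background
```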

Animating Arbitrary Objects via Deep Motion Transfer

This paper introduces a novel deep learning framework for image animation that generates a video in which the target object is animated according to a driving sequence, using a deep architecture that decouples appearance and motion information.

Motion-supervised Co-Part Segmentation

This work proposes a self-supervised deep learning method for co-part segmentation that develops the idea that motion information inferred from videos can be leveraged to discover meaningful object parts.

DwNet: Dense warp-based network for pose-guided human video generation

This paper focuses on human motion transfer: generating a video that depicts a particular subject, observed in a single image, performing a series of motions exemplified by an auxiliary (driving) video.

Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks

A novel approach that models future frames probabilistically is proposed, namely a Cross Convolutional Network for synthesizing future frames; this network encodes image and motion information as feature maps and convolutional kernels, respectively.
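
The phrase "motion information as convolutional kernels" can be made concrete with a small sketch: per-sample kernels produced by a motion branch are convolved with per-sample image feature maps. The grouped-convolution trick below is one way to express this in PyTorch; it is an illustration under my own naming, not the authors' code.

```python
import torch
import torch.nn.functional as F

def cross_convolve(feature_maps, kernels):
    """Convolve each sample's feature maps with its own motion-derived kernels.

    feature_maps: (B, C, H, W) image-dependent features
    kernels:      (B, C, k, k) motion-dependent kernels, one per channel
    Assumes an odd kernel size k so the spatial size is preserved.
    """
    B, C, H, W = feature_maps.shape
    k = kernels.shape[-1]
    # Fold the batch into the channel axis so each sample gets its own
    # depthwise kernels via grouped convolution.
    x = feature_maps.reshape(1, B * C, H, W)
    w = kernels.reshape(B * C, 1, k, k)
    out = F.conv2d(x, w, padding=k // 2, groups=B * C)
    return out.reshape(B, C, H, W)
```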

Learning a Generative Model of Images by Factoring Appearance and Shape

This work introduces a basic model, the masked RBM, which explicitly models occlusion boundaries in image patches by factoring the appearance of any patch region from its shape, and proposes a generative model of larger images using a field of such RBMs.

Unsupervised Learning of Disentangled Representations from Video

We present a new model, DrNET, that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary component and a temporally varying component.

Decomposing Motion and Content for Natural Video Sequence Prediction

To the best of the authors' knowledge, this is the first end-to-end trainable network architecture with motion and content separation that models the spatiotemporal dynamics for pixel-level future prediction in natural videos.

MoCoGAN: Decomposing Motion and Content for Video Generation

This work introduces a novel adversarial learning scheme utilizing both image and video discriminators and shows that MoCoGAN allows one to generate videos with the same content but different motion, as well as videos with different content and the same motion.
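
A schematic of the content/motion decomposition: one content code is held fixed across a clip while a recurrent network produces a per-frame motion code, so swapping either one yields the "same content, different motion" behavior described above. The sketch below uses hypothetical dimensions and a GRU purely as an illustration, not the released MoCoGAN code.

```python
import torch
import torch.nn as nn

class LatentSampler(nn.Module):
    """Illustrative content/motion latent sampler for a video generator."""

    def __init__(self, content_dim=50, motion_dim=10):
        super().__init__()
        self.content_dim = content_dim
        self.motion_dim = motion_dim
        self.rnn = nn.GRU(motion_dim, motion_dim, batch_first=True)

    def forward(self, batch, frames):
        # One content code per video, repeated across all of its frames.
        content = torch.randn(batch, self.content_dim)
        content = content.unsqueeze(1).expand(batch, frames, self.content_dim)
        # A trajectory of motion codes: per-frame noise passed through a GRU.
        noise = torch.randn(batch, frames, self.motion_dim)
        motion, _ = self.rnn(noise)
        # Per-frame generator input: fixed content plus evolving motion.
        return torch.cat([content, motion], dim=-1)
```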