V3GAN: Decomposing Background, Foreground and Motion for Video Generation

@inproceedings{Keshari2022V3GANDB,
  title={V3GAN: Decomposing Background, Foreground and Motion for Video Generation},
  author={Arti Keshari and Sonam Gupta and Sukhendu Das},
  booktitle={BMVC},
  year={2022}
}
Video generation is a challenging task that requires modeling plausible spatial and temporal dynamics in a video. Inspired by how humans perceive a video by grouping a scene into moving and stationary components, we propose a method that decomposes the task of video generation into the synthesis of foreground, background and motion. Foreground and background together describe the appearance, whereas motion specifies how the foreground moves in a video over time. We propose V3GAN, a novel three… 
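
As a rough intuition for this decomposition (not the V3GAN architecture itself, whose description is truncated above), the sketch below composites a video from a stationary background, a masked foreground sprite and a per-frame motion trajectory; all names, shapes and the blending rule are illustrative assumptions.

# Minimal, illustrative sketch (not the V3GAN architecture): composite a video
# from a static background, a foreground sprite with an alpha mask, and a
# per-frame motion trajectory. All names and shapes here are assumptions.
import numpy as np

def composite_video(background, foreground, mask, offsets):
    """background: (H, W, 3); foreground: (h, w, 3); mask: (h, w, 1);
    offsets: (T, 2) integer (dy, dx) positions of the sprite per frame."""
    H, W, _ = background.shape
    h, w, _ = foreground.shape
    frames = []
    for dy, dx in offsets:
        frame = background.copy()
        y0, x0 = int(dy), int(dx)
        y1, x1 = min(y0 + h, H), min(x0 + w, W)
        fg = foreground[: y1 - y0, : x1 - x0]
        m = mask[: y1 - y0, : x1 - x0]
        # alpha-blend: moving foreground over the stationary background
        frame[y0:y1, x0:x1] = m * fg + (1.0 - m) * frame[y0:y1, x0:x1]
        frames.append(frame)
    return np.stack(frames)  # (T, H, W, 3)

# toy usage: a 4-frame clip of an 8x8 white square sliding across a gray scene
bg = np.full((64, 64, 3), 0.3)
fg = np.ones((8, 8, 3))
m = np.ones((8, 8, 1))
video = composite_video(bg, fg, m, offsets=[(28, 4), (28, 12), (28, 20), (28, 28)])
print(video.shape)  # (4, 64, 64, 3)

This makes the division of labour concrete: appearance lives in the background and the masked foreground, while motion only decides where the foreground is placed in each frame.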

RV-GAN: Recurrent GAN for Unconditional Video Generation

Detailed quantitative and qualitative analysis shows that RV-GAN outperforms state-of-the-art methods by a significant margin on the Moving MNIST, MUG, Weizmann and UCF101 datasets, and can be easily adapted to other tasks such as class-conditional video synthesis and text-to-video synthesis.

References

Showing 1-10 of 35 references

MoCoGAN: Decomposing Motion and Content for Video Generation

This work introduces a novel adversarial learning scheme utilizing both image and video discriminators, and shows that MoCoGAN allows one to generate videos with the same content but different motion, as well as videos with different content and the same motion.
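
The summary above describes a split of the video latent into content and motion. Below is a minimal sketch of that idea, assuming one content code shared by all frames and a per-frame motion code produced by a simple recurrence; the dimensions and the recurrence are stand-ins, not the paper's model.

# Sketch of a content/motion latent split: one content code shared by all
# frames, a per-frame motion code produced by a simple recurrence.
# Dimensions and the recurrence itself are assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, dim_content, dim_motion = 16, 50, 10

z_content = rng.normal(size=dim_content)          # fixed for the whole clip
h = np.zeros(dim_motion)                          # recurrent motion state
W = rng.normal(size=(dim_motion, dim_motion))
U = rng.normal(size=(dim_motion, dim_motion))

frame_latents = []
for t in range(T):
    eps = rng.normal(size=dim_motion)             # per-frame noise
    h = np.tanh(W @ h + U @ eps)                  # stand-in for the paper's RNN
    frame_latents.append(np.concatenate([z_content, h]))
frame_latents = np.stack(frame_latents)           # (T, dim_content + dim_motion)

# An image discriminator would score individual frames generated from each row,
# while a video discriminator would score the whole sequence jointly.
print(frame_latents.shape)  # (16, 60)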

ImaGINator: Conditional Spatio-Temporal GAN for Video Generation

A novel conditional GAN architecture, namely ImaGINator, is introduced which, given a single image, a condition (the label of a facial expression or action) and noise, decomposes appearance and motion in both latent and high-level feature spaces to generate realistic videos.

Generating Videos with Scene Dynamics

A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from its background is proposed, and is shown to generate tiny videos of up to a second at full frame rate better than simple baselines.
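
The foreground/background untangling described here is commonly realised as a mask-based two-stream composition; the sketch below illustrates that idea under assumed tensor shapes, with a spatio-temporal foreground stream blended over a static background replicated across time.

# Sketch of a mask-based foreground/background composition: a spatio-temporal
# foreground with a soft mask is blended over a static background broadcast
# across time. Tensor shapes are illustrative assumptions.
import numpy as np

T, H, W = 32, 64, 64
rng = np.random.default_rng(0)

foreground = rng.uniform(size=(T, H, W, 3))   # output of a 3D (spatio-temporal) stream
mask = rng.uniform(size=(T, H, W, 1))         # soft mask in [0, 1], same stream
background = rng.uniform(size=(H, W, 3))      # output of a 2D (static) stream

video = mask * foreground + (1.0 - mask) * background[None]  # broadcast over T
print(video.shape)  # (32, 64, 64, 3)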

Unsupervised object-centric video generation and decomposition in 3D

This work proposes to model a video as the view seen while moving through a scene with multiple 3D objects and a 3D background, and evaluates the method on depth prediction and 3D object detection, showing that it outperforms competing approaches even on 2D instance segmentation and tracking.

Deep multi-scale video prediction beyond mean square error

This work trains a convolutional network to generate future frames given an input sequence and proposes three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function.
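
Of the three strategies listed, the image gradient difference loss is the most self-contained; a minimal sketch of such a loss is shown below, penalising mismatches between the spatial gradients of predicted and target frames rather than raw pixel differences (the exponent and normalization are assumptions).

# Minimal sketch of an image gradient difference loss: compare spatial
# gradients of prediction and target instead of raw pixels. The exponent
# alpha and the lack of normalization are assumptions.
import numpy as np

def gradient_difference_loss(pred, target, alpha=1.0):
    """pred, target: (H, W) or (H, W, C) arrays."""
    dy_p, dy_t = np.abs(np.diff(pred, axis=0)), np.abs(np.diff(target, axis=0))
    dx_p, dx_t = np.abs(np.diff(pred, axis=1)), np.abs(np.diff(target, axis=1))
    return (np.abs(dy_t - dy_p) ** alpha).sum() + (np.abs(dx_t - dx_p) ** alpha).sum()

# a blurry prediction has weaker gradients than the sharp target, so the loss is non-zero
target = np.zeros((8, 8)); target[:, 4:] = 1.0                        # sharp vertical edge
pred = np.clip(np.linspace(-1, 2, 8), 0, 1)[None].repeat(8, axis=0)   # blurred edge
print(gradient_difference_loss(pred, target))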

Depth-Aware Video Frame Interpolation

A video frame interpolation method is proposed which explicitly detects occlusion by exploiting depth information, and a depth-aware flow projection layer is developed to synthesize intermediate flows that preferentially sample closer objects over farther ones.
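
The depth-aware weighting can be illustrated with a small sketch: when several candidate flows land on the same target pixel, inverse-depth weights let closer objects dominate. This shows only the weighting idea, not the paper's actual projection layer.

# Sketch of depth-weighted flow aggregation: candidate flows hitting the same
# pixel are averaged with inverse-depth weights, so nearer objects dominate.
# An illustration of the weighting only; details are assumptions.
import numpy as np

def depth_weighted_flow(candidate_flows, depths, eps=1e-6):
    """candidate_flows: (N, 2) flows hitting one pixel; depths: (N,)."""
    weights = 1.0 / (np.asarray(depths) + eps)
    weights /= weights.sum()
    return (weights[:, None] * np.asarray(candidate_flows)).sum(axis=0)

# two flows collide: a near object (depth 1) and a far one (depth 10)
print(depth_weighted_flow([[2.0, 0.0], [-4.0, 0.0]], [1.0, 10.0]))
# the result (~[1.45, 0]) is dominated by the closer object's flow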

Video Frame Synthesis Using Deep Voxel Flow

This work addresses the problem of synthesizing new video frames in an existing video, either in between existing frames (interpolation) or subsequent to them (extrapolation), by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, a mechanism called deep voxel flow.
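
A minimal sketch of flow-based frame synthesis in this spirit is given below: pixel values for the new frame are pulled from the two existing frames along a per-pixel flow and blended. Nearest-neighbour sampling is used for brevity, and the flow, blend weights and shapes are assumptions rather than the paper's trilinear voxel-flow layer.

# Sketch of flow-based frame synthesis: sample pixel values from two existing
# frames along a per-pixel flow and blend them. Nearest-neighbour sampling for
# brevity; the flow/blend values and shapes are assumptions.
import numpy as np

def synthesize_frame(frame0, frame1, flow, blend):
    """frame0, frame1: (H, W, 3); flow: (H, W, 2) as (dy, dx); blend: (H, W, 1) in [0, 1]."""
    H, W, _ = frame0.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # sample frame0 backward along the flow and frame1 forward along it
    y0 = np.clip(np.rint(ys - flow[..., 0]).astype(int), 0, H - 1)
    x0 = np.clip(np.rint(xs - flow[..., 1]).astype(int), 0, W - 1)
    y1 = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, H - 1)
    x1 = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, W - 1)
    warped0, warped1 = frame0[y0, x0], frame1[y1, x1]
    return (1.0 - blend) * warped0 + blend * warped1

rng = np.random.default_rng(0)
f0, f1 = rng.uniform(size=(64, 64, 3)), rng.uniform(size=(64, 64, 3))
flow = np.zeros((64, 64, 2)); flow[..., 1] = 1.5        # uniform horizontal motion
mid = synthesize_frame(f0, f1, 0.5 * flow, np.full((64, 64, 1), 0.5))
print(mid.shape)  # (64, 64, 3)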

Predicting Future Frames Using Retrospective Cycle GAN

Y. Kwon and M. Park, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
This paper proposes a unified generative adversarial network for predicting accurate and temporally consistent future frames over time, even in a challenging environment, and employs two discriminators not only to identify fake frames but also to distinguish sequences containing fake frames from real sequences.

Temporal Generative Adversarial Nets with Singular Value Clipping

A generative model that can learn a semantic representation of unlabeled videos and is capable of generating videos is proposed, along with a novel method to train it stably in an end-to-end manner.
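
Singular value clipping, as named in the title, can be sketched directly: after each update the singular values of a weight matrix are clamped to at most 1 so the corresponding linear map stays 1-Lipschitz. How and where the paper applies this is not described here, so the snippet is only an assumption-laden illustration.

# Sketch of singular value clipping: clamp the singular values of a weight
# matrix to at most 1 so the corresponding linear map stays 1-Lipschitz.
# Where and how often this is applied during training is an assumption.
import numpy as np

def clip_singular_values(weight, max_sv=1.0):
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return (u * np.minimum(s, max_sv)) @ vt

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32)) * 3.0
w_clipped = clip_singular_values(w)
print(np.linalg.svd(w, compute_uv=False).max(),          # > 1 before clipping
      np.linalg.svd(w_clipped, compute_uv=False).max())  # <= 1 after clipping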

Flow-Grounded Spatial-Temporal Video Prediction from Still Images

This work formulates the multi-frame prediction task as a multiple-time-step flow (multi-flow) prediction phase followed by a flow-to-frame synthesis phase, which prevents the model from directly operating on the high-dimensional pixel space of the frame sequence and is demonstrated to be more effective, producing better and more diverse results.