TwoStreamVAN: Improving Motion Modeling in Video Generation

@article{TwoStreamVAN,
  title={TwoStreamVAN: Improving Motion Modeling in Video Generation},
  author={Ximeng Sun and Huijuan Xu and Kate Saenko},
  journal={2020 IEEE Winter Conference on Applications of Computer Vision (WACV)},
}
Video generation is an inherently challenging task, as it requires modeling realistic temporal dynamics as well as spatial content. Existing methods entangle the two intrinsically different tasks of motion and content creation in a single generator network, but this approach struggles to simultaneously generate plausible motion and content. To improve motion modeling in the video generation task, we propose a two-stream model that disentangles motion generation from content generation, called a Two…
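The disentanglement described above can be sketched at the tensor level: one stream generates per-frame spatial content from a video-level code, while a separate stream produces per-frame motion that modulates it. The sketch below uses fixed random projections as stand-ins for the learned generators; all names and shapes are hypothetical, not the actual TwoStreamVAN architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 8, 16, 16, 3          # frames, height, width, channels
dim_c, dim_m = 32, 16              # content / motion code sizes (assumed)

# Separate latent codes: one content code per video, one motion code per frame.
z_content = rng.standard_normal(dim_c)
z_motion = rng.standard_normal((T, dim_m))

# Hypothetical "generators": random linear maps standing in for learned
# convolutional networks, just to show the shapes and the fusion.
W_c = rng.standard_normal((dim_c, H * W * C))
W_m = rng.standard_normal((dim_m, H * W * C))

base = np.tanh(z_content @ W_c).reshape(H, W, C)        # static content frame
deltas = np.tanh(z_motion @ W_m).reshape(T, H, W, C)    # per-frame motion

# Fuse the two streams: content provides appearance, motion perturbs it.
video = np.clip(base[None] + 0.1 * deltas, -1, 1)
assert video.shape == (T, H, W, C)
```

Because the content code is shared across frames while the motion code varies, each stream can specialize in the task it is responsible for.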
Adversarial Video Generation on Complex Datasets
This work shows that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity and fidelity than previous work.
Adversarial Self-Defense for Cycle-Consistent GANs
Making the translation model more robust to the self-adversarial attack increases its generation quality and reconstruction reliability and makes the model less sensitive to low-amplitude perturbations.
Disentangled Unsupervised Image Translation via Restricted Information Flow
This paper proposes a new method that does not rely on such inductive architectural biases, and infers which attributes are domain-specific from data by constraining information flow through the network using translation honesty losses and a penalty on the capacity of domain-specific embedding.
Evaluation of Correctness in Unsupervised Many-to-Many Image Translation
An extensive study of how well existing state-of-the-art UMMI2I translation methods preserve domain-invariant and manipulate domain-specific attributes, with a discussion of the trade-offs shared by all methods and of how different architectural choices affect various aspects of semantic correctness.
Generative Adversarial Networks in Human Emotion Synthesis: A Review
A comprehensive survey of recent advances in human emotion synthesis by studying available databases, advantages, and disadvantages of the generative models along with the related training strategies considering two principal human communication modalities, namely audio and video.


MoCoGAN: Decomposing Motion and Content for Video Generation
This work introduces a novel adversarial learning scheme utilizing both image and video discriminators and shows that MoCoGAN allows one to generate videos with the same content but different motion, as well as videos with different content and the same motion.
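The content/motion factorization in MoCoGAN can be illustrated at the latent-code level: a content code is sampled once per video, while motion codes form a trajectory, one per frame. The random-walk trajectory below is a hypothetical stand-in for MoCoGAN's learned recurrent motion generator.

```python
import numpy as np

rng = np.random.default_rng(1)
T, dim_c, dim_m = 6, 8, 4   # frames, content dim, motion dim (assumed)

def motion_trajectory(seed, T, dim_m):
    # Stand-in for the recurrent motion generator: a normalized random walk
    # over motion codes (the paper uses a learned RNN driven by noise).
    r = np.random.default_rng(seed)
    eps = r.standard_normal((T, dim_m))
    return np.cumsum(eps, axis=0) / np.sqrt(np.arange(1, T + 1))[:, None]

z_content = rng.standard_normal(dim_c)       # fixed for the whole video
z_motion = motion_trajectory(2, T, dim_m)    # one motion code per frame

# Per-frame generator input: [content ; motion_t]
frames_in = np.concatenate([np.tile(z_content, (T, 1)), z_motion], axis=1)
assert frames_in.shape == (T, dim_c + dim_m)

# "Same content, different motion": reuse z_content with a new trajectory.
frames_in2 = np.concatenate(
    [np.tile(z_content, (T, 1)), motion_trajectory(3, T, dim_m)], axis=1)
```

Swapping only the motion trajectory while keeping the content code fixed is exactly the manipulation the summary describes.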
Decomposing Motion and Content for Natural Video Sequence Prediction
To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks
A novel approach that models future frames in a probabilistic manner is proposed, namely a Cross Convolutional Network to aid in synthesizing future frames; this network structure encodes image and motion information as feature maps and convolutional kernels, respectively.
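The core operation of the summary above, a "cross convolution", applies a per-sample, network-predicted kernel to an image feature map. A minimal sketch, with plain arrays standing in for the outputs of the learned image and motion encoders:

```python
import numpy as np

def cross_conv2d(feature_map, kernel):
    # Apply a predicted kernel to a feature map (valid correlation, no
    # padding). In the paper the feature map encodes the image and the
    # kernel encodes the motion; here both are illustrative arrays.
    H, W = feature_map.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)
    return out

fmap = np.arange(25, dtype=float).reshape(5, 5)
# A kernel with a single off-center weight encodes a one-pixel translation.
shift = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])
out = cross_conv2d(fmap, shift)  # each output pixel picks a shifted neighbor
assert out.shape == (3, 3)
```

Because motion lives in the kernel rather than the feature map, different sampled kernels yield different plausible futures for the same image.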
Generating Videos with Scene Dynamics
A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed that can generate tiny videos up to a second at full frame rate better than simple baselines.
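The foreground/background untangling described above composites a moving foreground onto a static background through a per-pixel mask. A minimal sketch with random arrays standing in for the network's three outputs (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 4, 8, 8   # frames, height, width (assumed)

foreground = rng.uniform(-1, 1, (T, H, W))   # moving foreground stream
background = rng.uniform(-1, 1, (H, W))      # static background stream
mask = rng.uniform(0, 1, (T, H, W))          # per-pixel foreground mask

# Composition: m * f + (1 - m) * b, background shared across all frames.
video = mask * foreground + (1 - mask) * background[None]
assert video.shape == (T, H, W)
```

Sharing one background frame across time is what forces the model to explain all motion through the foreground stream and the mask.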
Probabilistic Video Generation using Holistic Attribute Control
Improves video generation consistency through temporally-conditional sampling, and quality by structuring the latent space with attribute controls, ensuring that attributes can be both inferred and conditioned on during learning/generation.
Deep multi-scale video prediction beyond mean square error
This work trains a convolutional network to generate future frames given an input sequence and proposes three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function.
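The image gradient difference loss mentioned above penalizes the mismatch between the spatial gradients of the predicted and target frames, which sharpens predictions relative to plain MSE. A minimal NumPy sketch of the loss as described by Mathieu et al.:

```python
import numpy as np

def gradient_difference_loss(y_true, y_pred, alpha=1.0):
    # Compare absolute spatial gradients of target and prediction in both
    # directions, then penalize their difference raised to the power alpha.
    gt_dy = np.abs(np.diff(y_true, axis=0))   # vertical gradients
    gt_dx = np.abs(np.diff(y_true, axis=1))   # horizontal gradients
    pr_dy = np.abs(np.diff(y_pred, axis=0))
    pr_dx = np.abs(np.diff(y_pred, axis=1))
    return (np.abs(gt_dy - pr_dy) ** alpha).sum() + \
           (np.abs(gt_dx - pr_dx) ** alpha).sum()

target = np.ones((4, 4))
assert gradient_difference_loss(target, target) == 0.0  # identical frames

blurry = np.ones((4, 4))
blurry[0, 0] = 0.0                                      # one wrong pixel
assert gradient_difference_loss(target, blurry) > 0.0   # edges mismatch
```

Unlike MSE, the loss is zero only when the prediction reproduces the target's edges, so averaging over futures (which blurs edges) is penalized.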
Temporal Generative Adversarial Nets with Singular Value Clipping
A generative model which can learn a semantic representation of unlabeled videos, and is capable of generating videos, is proposed, along with a novel method to train it stably in an end-to-end manner.
Stochastic Variational Video Prediction
This paper develops a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables, and is the first to provide effective stochastic multi-frame prediction for real-world video.
Stochastic Video Generation with a Learned Prior
An unsupervised video generation model that learns a prior model of uncertainty in a given environment and generates video frames by drawing samples from this prior and combining them with a deterministic estimate of the future frame.
Actions as Space-Time Shapes
The method is fast, does not require video alignment, and is applicable in many scenarios where the background is known, and the robustness of the method is demonstrated to partial occlusions, nonrigid deformations, significant changes in scale and viewpoint, high irregularities in the performance of an action, and low-quality video.