Corpus ID: 12712095

Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks

Tianfan Xue, Jiajun Wu, Katherine L. Bouman, Bill Freeman
We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods, which have tackled this problem in a deterministic or non-parametric way, we propose a novel approach that models future frames in a probabilistic manner. Our probabilistic model makes it possible for us to sample and synthesize many possible future frames from a single input image. Future frame synthesis is challenging, as it involves low- and high-level image… 
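The cross convolutional layer described here convolves image features with motion kernels decoded from a sampled latent, so each draw from the latent yields a different plausible future. A minimal NumPy sketch of that per-sample convolution step (the single-channel feature map, the identity "no motion" kernel, and the `same` zero-padded correlation are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

def cross_convolve(feature_map, kernel):
    """Convolve a single-channel feature map with a kernel produced by
    another branch (e.g. decoded from a sampled motion latent).
    Uses 'same' zero-padded correlation; purely illustrative."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(feature_map, ((ph, ph), (pw, pw)))
    h, w = feature_map.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# An identity kernel leaves the feature map unchanged,
# mimicking a "no motion" latent sample.
fmap = np.arange(16.0).reshape(4, 4)
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
out = cross_convolve(fmap, identity)
```

Sampling a different latent would decode a different kernel, and hence a different synthesized frame, which is what makes the model's output probabilistic rather than a single deterministic prediction.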


Dynamic Visual Sequence Prediction with Motion Flow Networks

This work proposes two novel network structures to synthesize realistic object movement under weak supervision (without pre-computed dense motion fields), performing better on synthetic as well as real-world human body movement sequences.

Animating Landscape: Self-Supervised Learning of Decoupled Motion and Appearance for Single-Image Video Synthesis

This paper proposes a method that can create a high-resolution, long-term animation using convolutional neural networks (CNNs) from a single landscape image, focusing mainly on skies and water.

ImaGINator: Conditional Spatio-Temporal GAN for Video Generation

A novel conditional GAN architecture, ImaGINator, which, given a single image, a condition (a facial expression or action label), and noise, decomposes appearance and motion in both latent and high-level feature spaces to generate realistic videos.

Disentangling Content and Motion for Text-Based Neural Video Manipulation

This paper introduces a new method called DiCoMoGAN for manipulating videos with natural language, aiming to perform local and semantic edits on a video clip to alter the appearances of an object of interest.

Predicting Diverse Future Frames With Local Transformation-Guided Masking

A novel video prediction system that focuses on regions of interest (ROIs) rather than on entire frames and learns frame evolutions at the transformation level rather than at the pixel level, which enables the system to generate high-quality long-term future frames without severely amplified signal loss.

Video Frame Synthesis Using Deep Voxel Flow

This work addresses the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation), by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which is called deep voxel flow.
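Deep voxel flow synthesizes a new frame by copying pixel values from existing frames along a learned flow field rather than hallucinating pixels directly. A hedged NumPy sketch of the core sampling step, simplified here from trilinear voxel-flow sampling to bilinear, single-frame warping (the function name and border clamping are my own choices):

```python
import numpy as np

def warp_bilinear(img, flow):
    """Sample img at positions displaced by flow (H, W, 2), using
    bilinear interpolation with border clamping. A simplified,
    single-frame stand-in for voxel-flow sampling."""
    h, w = img.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Target sampling coordinates, clamped to the image borders.
    fy = np.clip(ys + flow[..., 0], 0, h - 1)
    fx = np.clip(xs + flow[..., 1], 0, w - 1)
    y0 = np.floor(fy).astype(int)
    x0 = np.floor(fx).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = fy - y0
    wx = fx - x0
    # Blend the four neighboring pixels.
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy

img = np.arange(9.0).reshape(3, 3)
same = warp_bilinear(img, np.zeros((3, 3, 2)))  # zero flow -> identity
```

Because every output pixel is an interpolated copy of input pixels, the operation is differentiable in the flow field, which is what lets the flow be learned end-to-end from reconstruction loss.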

Motion Selective Prediction for Video Frame Synthesis

This work introduces a model that learns from the first frames of a given video and extends their content and motion, e.g., to double the video's length, and proposes a dual network that can flexibly use both dynamic and static convolutional motion kernels to predict future frames.

Generating Future Frames with Mask-Guided Prediction

Current approaches in video prediction tend to hallucinate future frames directly or learn a global motion transformation from the entire scene. However, it is difficult for these methods without…

TwoStreamVAN: Improving Motion Modeling in Video Generation

A two-stream model that disentangles motion generation from content generation, the Two-Stream Variational Adversarial Network (TwoStreamVAN), is proposed; it creates clear and consistent motion and thus yields photorealistic videos.

Deep multi-scale video prediction beyond mean square error

This work trains a convolutional network to generate future frames given an input sequence and proposes three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function.

View Synthesis by Appearance Flow

This work addresses the problem of novel view synthesis: given an input image, synthesizing new images of the same object or scene observed from arbitrary viewpoints and shows that for both objects and scenes, this approach is able to synthesize novel views of higher perceptual quality than previous CNN-based techniques.

Dense Optical Flow Prediction from a Static Image

This work presents a convolutional neural network (CNN) based approach for motion prediction that outperform all previous approaches by large margins and can predict future optical flow on a diverse set of scenarios.

Generating Videos with Scene Dynamics

A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed; it can generate tiny videos up to a second long at full frame rate, outperforming simple baselines.

Deep Visual Analogy-Making

A novel deep network trained end-to-end to perform visual analogy making, which is the task of transforming a query image according to an example pair of related images, is developed.

Anticipating Visual Representations from Unlabeled Video

This work presents a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects, applying recognition algorithms to the predicted representation to anticipate objects and actions.

An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders

A conditional variational autoencoder is proposed for predicting the dense trajectory of pixels in a scene—what will move in the scene, where it will travel, and how it will deform over the course of one second.

Action-Conditional Video Prediction using Deep Networks in Atari Games

This paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs and proposes and evaluates two deep neural network architectures that consist of encoding, action-conditional transformation, and decoding layers based on convolutional neural networks and recurrent neural networks.

Unsupervised Learning of Video Representations using LSTMs

This work uses Long Short-Term Memory (LSTM) networks to learn representations of video sequences and evaluates the representations by fine-tuning them for a supervised learning problem: human action recognition on the UCF-101 and HMDB-51 datasets.

Dynamic Filter Networks

The Dynamic Filter Network is introduced, where filters are generated dynamically conditioned on an input, and it is shown that this architecture is a powerful one, with increased flexibility thanks to its adaptive nature, yet without an excessive increase in the number of model parameters.
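In a dynamic filter network, the filter weights themselves are the output of a small filter-generating network conditioned on the input, rather than being fixed after training. A toy NumPy sketch where the "generating network" is just a softmaxed linear map, and the matrix `W` stands in for learned parameters (both deliberate simplifications):

```python
import numpy as np

def generate_filter(x, W):
    """Filter-generating branch: map the input signal x to a
    normalized (softmax) dynamic filter. W plays the role of the
    learned filter-generating network; here it is arbitrary."""
    logits = W @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()

def dynamic_filtering(x, W):
    """Apply the input-conditioned filter to x itself via
    'same' zero-padded 1-D correlation."""
    f = generate_filter(x, W)
    k = len(f)
    p = k // 2
    xp = np.pad(x, p)
    return np.array([np.dot(xp[i:i + k], f) for i in range(len(x))])

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W = rng.standard_normal((3, 8))  # generates a length-3 filter
y = dynamic_filtering(x, W)
```

The key property is that a different input produces a different filter from the same small set of parameters, which is how the architecture gains adaptivity without a large increase in parameter count.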