Corpus ID: 54458806

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, Sylvain Gelly
Recent advances in deep generative models have led to remarkable progress in synthesizing high-quality images. […] To this end we propose Fréchet Video Distance (FVD), a new metric for generative models of video based on FID, and StarCraft 2 Videos (SCV), a collection of progressively harder datasets that challenge the capabilities of the current iteration of generative models for video. We conduct a large-scale human study, which confirms that FVD correlates well with qualitative human…
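FVD, like FID, fits a multivariate Gaussian to feature activations of real and of generated videos and reports the Fréchet distance between the two fits. A minimal numpy sketch of that distance, assuming the features have already been extracted by a pretrained video network (the arrays below are random stand-ins, not real activations):

```python
import numpy as np

def _sqrtm(a):
    """Principal square root of a (diagonalizable) matrix via eigendecomposition."""
    w, v = np.linalg.eig(a)
    return (v * np.sqrt(w.astype(complex))) @ np.linalg.inv(v)

def frechet_distance(feats_real, feats_fake):
    """Fréchet distance between Gaussians fit to two feature arrays of shape (N, D)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = _sqrtm(cov_r @ cov_f).real  # imaginary parts are numerical noise
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 8))  # stand-in features; FVD uses a pretrained video net
print(frechet_distance(feats, feats))        # ~0 for identical feature sets
print(frechet_distance(feats, feats + 1.0))  # ~8 (means shifted by 1 in 8 dims)
```

The matrix square root is the only delicate step; here it is computed by eigendecomposition, which is adequate for well-conditioned covariance estimates.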

Efficient Video Generation on Complex Datasets

This work shows that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity than previous work.

StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN

This work presents a novel approach to the video synthesis problem that helps to greatly improve visual quality and drastically reduce the amount of training data and resources necessary for generating videos.

Paying Attention to Video Generation

This work proposes a novel Attention-based Discretized Autoencoder (ADAE) which learns a finite-sized codebook that serves as a basis for latent space representations of frames, to be modelled by the sequential model.
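A discretized autoencoder of this kind quantizes each encoder output to its nearest entry in a finite learned codebook, so the sequential model only has to predict discrete code indices. A minimal nearest-neighbour lookup with a random stand-in codebook (the attention mechanism and training loop are omitted):

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector (N, D) to the index and value of its
    nearest codebook entry (K, D) under Euclidean distance."""
    # Pairwise squared distances via broadcasting: result has shape (N, K).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))  # K=16 entries of dimension 4
latents = codebook[[3, 7]] + 0.01    # latents slightly perturbed from entries 3 and 7
idx, quantized = quantize(latents, codebook)
print(idx)  # [3 7]
```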

Adversarial Video Generation on Complex Datasets

This work shows that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity and fidelity than previous work.

Transformation-based Adversarial Video Prediction on Large-Scale Data

This work proposes a novel recurrent unit which transforms its past hidden state according to predicted motion-like features and refines it to handle dis-occlusions, scene changes, and other complex behavior, and shows that this unit consistently outperforms previous designs.

Generating Long Videos of Dynamic Scenes

A video generation model is presented that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time; it prioritizes the time axis by redesigning the temporal latent representation and learns long-term consistency from data by training on longer videos.

Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths

Lightweight video diffusion models are introduced that synthesize high-fidelity, arbitrarily long videos from pure noise, outperforming previous 3D pixel-space methods under a limited computational budget.


This work shows that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity and fidelity than previous work, and proposes a model that scales to longer and higher resolution videos by leveraging a computationally efficient decomposition of its discriminator.

Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks

An INR-based video generator that improves the motion dynamics by manipulating the space and time coordinates differently and a motion discriminator that efficiently identifies the unnatural motions without observing the entire long frame sequences are introduced.

Towards Smooth Video Composition

This work investigates modeling temporal relations for composing videos of arbitrary length, from a few frames to infinitely many, using generative adversarial networks (GANs), and develops a novel B-spline based motion representation that ensures temporal smoothness and enables infinite-length video generation.
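The B-spline idea can be illustrated in isolation: a uniform cubic B-spline turns a short sequence of latent "motion codes" into a C²-continuous trajectory over time. A hedged sketch with random stand-in codes (`cubic_bspline` and the 2-D codes are illustrative, not the paper's actual representation):

```python
import numpy as np

def cubic_bspline(ctrl, t):
    """Evaluate a uniform cubic B-spline over control points ctrl (M, D)
    at parameter t in [0, M-3); returns a C^2-continuous point in R^D."""
    seg = min(int(t), len(ctrl) - 4)  # which 4-point window t falls in
    u = t - seg                       # local parameter in [0, 1)
    p0, p1, p2, p3 = ctrl[seg:seg + 4]
    b0 = (1 - u) ** 3 / 6
    b1 = (3 * u**3 - 6 * u**2 + 4) / 6
    b2 = (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6
    b3 = u**3 / 6                     # the four basis weights sum to 1
    return b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3

# Smoothly interpolate 2-D "motion codes" across 50 frames.
rng = np.random.default_rng(0)
codes = rng.normal(size=(8, 2))
frames = [cubic_bspline(codes, t) for t in np.linspace(0, 4.999, 50)]
```

Because the basis functions form a partition of unity, constant control codes yield a constant trajectory, and neighbouring frames vary smoothly.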

Generating Videos with Scene Dynamics

A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed; it can generate tiny videos, up to a second long at full frame rate, better than simple baselines.

Video-to-Video Synthesis

This paper proposes a novel video-to-video synthesis approach under the generative adversarial learning framework, capable of synthesizing 2K resolution videos of street scenes up to 30 seconds long, which significantly advances the state-of-the-art of video synthesis.

Deep multi-scale video prediction beyond mean square error

This work trains a convolutional network to generate future frames given an input sequence and proposes three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function.

Large Scale GAN Training for High Fidelity Natural Image Synthesis

It is found that applying orthogonal regularization to the generator renders it amenable to a simple "truncation trick," allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the generator's input.
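The truncation trick itself is simple to sketch: latent components are resampled until they fall within a threshold, shrinking the variance of the generator's input. A minimal, hedged version (the threshold and shapes are illustrative):

```python
import numpy as np

def truncated_normal(shape, threshold, rng):
    """Sample z ~ N(0, I), resampling any component whose magnitude
    exceeds `threshold` (rejection-sampling form of the truncation trick)."""
    z = rng.normal(size=shape)
    mask = np.abs(z) > threshold
    while mask.any():
        z[mask] = rng.normal(size=int(mask.sum()))
        mask = np.abs(z) > threshold
    return z

rng = np.random.default_rng(0)
z = truncated_normal((16, 128), threshold=0.5, rng=rng)
# Smaller thresholds shrink the variance of the generator's input,
# trading sample variety for fidelity.
print(np.abs(z).max() <= 0.5)  # True
```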

Stochastic Video Generation with a Learned Prior

An unsupervised video generation model is presented that learns a prior model of uncertainty in a given environment and generates video frames by drawing samples from this prior and combining them with a deterministic estimate of the future frame.

Hierarchical Long-term Video Prediction without Supervision

This work develops a novel training method that jointly trains the encoder, the predictor, and the decoder without high-level supervision, and improves upon this by using an adversarial loss in the feature space to train the predictor.

Stochastic Variational Video Prediction

This paper develops a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables, and is the first to provide effective stochastic multi-frame prediction for real-world video.

MoCoGAN: Decomposing Motion and Content for Video Generation

This work introduces a novel adversarial learning scheme utilizing both image and video discriminators and shows that MoCoGAN allows one to generate videos with same content but different motion as well as videos with different content and same motion.

Temporal Generative Adversarial Nets with Singular Value Clipping

A generative model is proposed that can learn a semantic representation of unlabeled videos and is capable of generating videos, along with a novel method to train it stably in an end-to-end manner.

A note on the evaluation of generative models

This article reviews mostly known but often underappreciated properties relating to the evaluation and interpretation of generative models with a focus on image models and shows that three of the currently most commonly used criteria---average log-likelihood, Parzen window estimates, and visual fidelity of samples---are largely independent of each other when the data is high-dimensional.
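The Parzen window criterion the note critiques is a kernel density estimate fit to model samples; the average log-likelihood it assigns to held-out data can be sketched as follows (pure numpy, with a numerically stable log-sum-exp; the function name and bandwidth are illustrative):

```python
import numpy as np

def parzen_log_likelihood(samples, data, sigma):
    """Average log-likelihood of `data` (N, D) under a Parzen window
    (isotropic Gaussian KDE) fit to model `samples` (M, D) with bandwidth sigma."""
    n, d = data.shape
    # Squared distances between every data point and every sample: shape (N, M).
    d2 = ((data[:, None, :] - samples[None, :, :]) ** 2).sum(axis=-1)
    log_k = -d2 / (2 * sigma**2)
    log_norm = -0.5 * d * np.log(2 * np.pi * sigma**2)
    # log mean_m N(x | s_m, sigma^2 I), via log-sum-exp for stability.
    m = log_k.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(log_k - m).sum(axis=1))
    return float((lse - np.log(samples.shape[0]) + log_norm).mean())

# All samples and data at the origin in 1-D: recovers log N(0 | 0, 1).
print(parzen_log_likelihood(np.zeros((5, 1)), np.zeros((3, 1)), sigma=1.0))  # ~ -0.919
```

As the article notes, such estimates depend heavily on the bandwidth and sample count in high dimensions, which is precisely why they can disagree with visual fidelity.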