Corpus ID: 231719572

Playable Video Generation

@inproceedings{Menapace2021PlayableVG,
  title={Playable Video Generation},
  author={Willi Menapace and St{\'e}phane Lathuili{\`e}re and S. Tulyakov and Aliaksandr Siarohin and E. Ricci},
  booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}
This paper introduces the unsupervised learning problem of playable video generation (PVG). In PVG, we aim to allow a user to control the generated video by selecting a discrete action at every time step, as in a video game. The difficulty of the task lies both in learning semantically consistent actions and in generating realistic videos conditioned on the user input. We propose a novel framework for PVG that is trained in a self-supervised manner on a large dataset of unlabelled videos.
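The interaction loop the abstract describes — a user picks one discrete action per time step and the model generates the next frame — can be sketched as follows. This is an illustrative toy only: `generate_next_frame`, `play`, and `N_ACTIONS` are hypothetical placeholders standing in for the paper's learned conditional generator, not its actual architecture.

```python
# Minimal sketch of the playable-video-generation loop, assuming a
# small discrete action space and a learned generator of the form
# G(frame_t, action_t) -> frame_{t+1}.

N_ACTIONS = 3  # size of the discrete action space (assumed)

def generate_next_frame(prev_frame, action):
    """Placeholder for the learned conditional generator.

    A real model would render pixels; here each frame is just a dict
    recording the time step and the action that produced it.
    """
    return {"t": prev_frame["t"] + 1, "action": action}

def play(initial_frame, actions):
    """Roll out a video: at every step the user selects a discrete action."""
    frames = [initial_frame]
    for a in actions:
        assert 0 <= a < N_ACTIONS, "actions are discrete labels"
        frames.append(generate_next_frame(frames[-1], a))
    return frames

# The user "plays" three steps: actions 0, 2, then 1.
video = play({"t": 0, "action": None}, actions=[0, 2, 1])
```

The point of the sketch is the interface, not the model: the generator is conditioned only on the previous frame and the chosen action label, which is what makes the learned actions feel like game controls.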

Figures and Tables from this paper

StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN
This paper presents a novel approach to the video synthesis problem that helps to greatly improve visual quality and drastically reduce the amount of training data and resources necessary for generating video content.
CCVS: Context-aware Controllable Video Synthesis
This paper introduces a self-supervised learning approach to the synthesis of new video clips from old ones, with several new key elements for improved spatial resolution and realism.
Strumming to the Beat: Audio-Conditioned Contrastive Video Textures
A non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning, which outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audiovisual cues in order to synthesize videos that synchronize well with an audio signal.
Motion Representations for Articulated Animation
We propose novel motion representations for animating articulated objects consisting of distinct parts. In a completely unsupervised manner, our method identifies object parts and tracks them.


MoCoGAN: Decomposing Motion and Content for Video Generation
This work introduces a novel adversarial learning scheme utilizing both image and video discriminators and shows that MoCoGAN allows one to generate videos with the same content but different motion, as well as videos with different content and the same motion.
ImaGINator: Conditional Spatio-Temporal GAN for Video Generation
A novel conditional GAN architecture, namely ImaGINator, which given a single image, a condition (label of a facial expression or action) and noise, decomposes appearance and motion in both latent and high level feature spaces, generating realistic videos.
Action-Conditional Video Prediction using Deep Networks in Atari Games
This paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs and proposes and evaluates two deep neural network architectures that consist of encoding, action-conditional transformation, and decoding layers based on convolutional neural networks and recurrent neural networks.
Efficient Video Generation on Complex Datasets
This work shows that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity than previous work.
Towards Accurate Generative Models of Video: A New Metric & Challenges
A large-scale human study is contributed, which confirms that FVD correlates well with qualitative human judgment of generated videos, and provides initial benchmark results on SCV.
Deep multi-scale video prediction beyond mean square error
This work trains a convolutional network to generate future frames given an input sequence and proposes three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function.
Predicting Future Frames Using Retrospective Cycle GAN
Y. Kwon and M. Park, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This paper proposes a unified generative adversarial network for predicting accurate and temporally consistent future frames over time, even in a challenging environment, and employs two discriminators not only to identify fake frames but also to distinguish fake contained image sequences from the real sequence.
Everybody Dance Now
This paper presents a simple method for “do as I do” motion transfer: given a source video of a person dancing, we can transfer that performance to a novel (amateur) target after only a few minutes of the target subject performing standard moves.
Scaling Autoregressive Video Models
It is shown that conceptually simple autoregressive video generation models based on a three-dimensional self-attention mechanism achieve competitive results across multiple metrics on popular benchmark datasets, for which they produce continuations of high fidelity and realism.
Anticipating the future by watching unlabeled video
A large-scale framework that capitalizes on temporal structure in unlabeled video to learn to anticipate both actions and objects in the future, and suggests that learning with unlabeled videos significantly helps forecast actions and anticipate objects.