Diffusion Models for Video Prediction and Infilling

Tobias Hoppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, Andrea Dittadi
Predicting and anticipating future outcomes, and reasoning about missing information in a sequence, are key abilities for agents to make intelligent decisions. This requires strong, temporally coherent generative capabilities. Diffusion models have lately shown huge success in several generative tasks, but have not been extensively explored in the video domain. We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions, and introduces a…
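The random-masking idea behind RaMViD can be sketched as follows: at training time, a random subset of frames is kept clean as conditioning while the remaining frames are noised, so a single objective covers prediction (condition on a prefix), infilling (condition on both ends), and unconditional generation (condition on nothing). The snippet below is a minimal illustration under simplified assumptions (a single scalar noise level `t` stands in for a full diffusion schedule; the function names are hypothetical), not the authors' implementation.

```python
import numpy as np

def random_frame_mask(num_frames, rng):
    """Sample a random conditioning mask over frames.

    True  = frame is kept clean and used as conditioning;
    False = frame is noised and must be generated.
    """
    num_cond = rng.integers(0, num_frames)  # how many frames to condition on
    idx = rng.choice(num_frames, size=num_cond, replace=False)
    mask = np.zeros(num_frames, dtype=bool)
    mask[idx] = True
    return mask

def noised_training_input(video, mask, t, rng):
    """Add noise only to the non-conditioning frames.

    `video` has shape (T, H, W, C); `t` in (0, 1] controls the noise
    level (a stand-in for the usual alpha-bar schedule).
    """
    noise = rng.standard_normal(video.shape)
    noisy = np.sqrt(1.0 - t) * video + np.sqrt(t) * noise
    # Broadcast the per-frame mask over the spatial/channel axes.
    return np.where(mask[:, None, None, None], video, noisy)

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 4, 4, 3))
mask = random_frame_mask(8, rng)
x = noised_training_input(video, mask, t=0.5, rng=rng)
# Conditioning frames pass through unchanged.
assert np.allclose(x[mask], video[mask])
```

Each resampled mask turns the same denoising loss into a different completion task, which is what lets one model serve both prediction and infilling.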

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Multimodal Masked Video Generation (MMVG) is proposed: by applying the corresponding masking conditions, a single MMVG model can address all three cases of text-guided video completion (TVC), namely video prediction, rewind, and infilling.

Diffusion Models in Vision: A Survey

A multi-perspective categorization of diffusion models applied in computer vision is introduced, alongside a discussion of their relation to other deep generative models, including variational auto-encoders, generative adversarial networks, energy-based models, autoregressive models, and normalizing flows.

Randomized Conditional Flow Matching for Video Prediction

A novel generative model for video prediction based on latent flow matching that efficiently and effectively takes the past into account by conditioning at inference time only on a small random set of past frames at each integration step of the learned flow.

SinFusion: Training Diffusion Models on a Single Image or Video

The authors' image/video-specific diffusion model (SinFusion) learns the appearance and dynamics of the single image or video, while utilizing the conditioning capabilities of diffusion models, and shows comparable performance and capabilities to previous single-image models in various image manipulation tasks.

Listen, denoise, action! Audio-driven motion synthesis with diffusion models

Diffusion models are shown to be an excellent model for synthesising human motion that co-occurs with audio, for example co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description.


Dynamic Latent Hierarchy

Dynamic Latent Hierarchy is introduced: a deep hierarchical latent model that represents videos as a hierarchy of latent states evolving over separate, fluid timescales, enabling it to better represent stochasticity and to dynamically adjust its hierarchical and temporal structure.

Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance

An alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent spaces of these models, are provided.

Temporally Consistent Video Transformer for Long-Term Video Prediction

This work presents Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics video prediction model that learns compressed representations to efficiently condition on long videos of hundreds of frames during both training and generation using a MaskGit prior for dynamics prediction.

Phenaki: Variable Length Video Generation From Open Domain Textual Description

A new model for learning video representations is proposed that compresses a video into a small set of discrete tokens, which results in better spatio-temporal consistency; joint training on a large corpus of image-text pairs together with a smaller number of video-text examples can yield generalization beyond what is available in the video datasets.

A Survey on Generative Diffusion Model

A diverse range of advanced techniques to speed up diffusion models is presented, covering training schedules, training-free sampling, mixed modeling, and the unification of score-based and diffusion models.

MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation

This work devises a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks, using a probabilistic conditional score-based denoising diffusion model conditioned on past and/or future frames.

Clockwork Variational Autoencoders

This work introduces the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences in which higher levels tick at slower intervals, and confirms that slower levels learn to represent objects that change more slowly in the video, while faster levels learn to represent faster objects.
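The "higher levels tick at slower intervals" scheme can be illustrated with a toy tick schedule. This is a sketch under the simplifying assumption of fixed power-of-two periods per level; the function name and periods are illustrative, not CW-VAE's actual configuration.

```python
def clockwork_ticks(num_levels, num_steps, base_period=2):
    """List which hierarchy levels update at each timestep.

    Level 0 ticks every step; level k ticks every base_period**k
    steps, so higher levels evolve on slower timescales.
    """
    schedule = []
    for step in range(num_steps):
        active = [k for k in range(num_levels)
                  if step % (base_period ** k) == 0]
        schedule.append(active)
    return schedule

for step, active in enumerate(clockwork_ticks(3, 8)):
    print(step, active)
```

With three levels over eight steps, level 0 is active at every step, level 1 at even steps, and level 2 only at steps 0 and 4, which is how the slower levels end up modeling slowly changing content.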

Video Diffusion Models

A diffusion model for video generation is proposed that shows very promising initial results, and a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods is introduced.

FitVid: Overfitting in Pixel-Level Video Prediction

This paper argues that the inefficient use of parameters in current video models is the main reason for underfitting, and introduces a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks while having a parameter count similar to current state-of-the-art models.

Generating Videos with Scene Dynamics

A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed; it can generate short videos of up to a second at full frame rate, better than simple baselines.

Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction

This work introduces Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder and can improve performance monotonically by simply adding more modules.

Stochastic Variational Video Prediction

This paper develops a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables, and is the first to provide effective stochastic multi-frame prediction for real-world video.

A Review on Deep Learning Techniques for Video Prediction

This work provides a review of deep learning methods for prediction in video sequences, carefully analyzing existing video prediction models organized according to a proposed taxonomy and highlighting their contributions and significance in the field.

Probabilistic Future Prediction for Video Scene Understanding

This work is the first to jointly predict ego-motion, static scene, and the motion of dynamic agents in a probabilistic manner, which allows sampling consistent, highly probable futures from a compact latent space.

Transformation-based Adversarial Video Prediction on Large-Scale Data

This work proposes a novel recurrent unit which transforms its past hidden state according to predicted motion-like features and refines it to handle dis-occlusions, scene changes, and other complex behavior, and shows that this recurrent unit consistently outperforms previous designs.