Phenaki: Variable Length Video Generation From Open Domain Textual Description

  title={Phenaki: Variable Length Video Generation From Open Domain Textual Description},
  author={Ruben Villegas and Mohammad Babaeizadeh and Pieter-Jan Kindermans and Hernan Moraldo and Han Zhang and Mohammad Taghi Saffar and Santiago Castro and Julius Kunze and Dumitru Erhan},
We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, the limited quantity of high-quality text-video data, and the variable length of videos. To address these issues, we introduce a new model for learning video representations that compresses a video into a small set of discrete tokens. This tokenizer uses causal attention in time, which allows it to work…
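The "causal attention in time" mentioned in the abstract can be illustrated with a small attention mask: tokens within a frame attend to each other freely, while attention across frames only looks backward. A minimal numpy sketch of this idea, with illustrative names and layout (not Phenaki's actual implementation):

```python
import numpy as np

def causal_time_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean attention mask: True where attention is allowed.

    Tokens are ordered frame by frame. A token in frame t may attend to
    every token in frames <= t: full spatial attention within a frame,
    causal attention across time. Shapes and names are assumptions for
    illustration, not taken from the Phenaki implementation.
    """
    n = num_frames * tokens_per_frame
    frame_idx = np.arange(n) // tokens_per_frame   # frame index of each token
    # allowed[i, j] is True iff token j's frame is not later than token i's
    return frame_idx[:, None] >= frame_idx[None, :]

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over the last axis with disallowed positions zeroed out."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)
```

With 3 frames of 2 tokens each, `causal_time_mask(3, 2)` yields a 6x6 block-lower-triangular mask: frame-0 tokens see only frame 0, frame-2 tokens see everything. This is what lets such a tokenizer encode videos of variable length frame by frame.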

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Multimodal Masked Video Generation (MMVG) is proposed; by applying the corresponding masking conditions, a single MMVG model can address all three cases of text-guided video completion (TVC): video prediction, rewind, and infilling.

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

A new text-to-video (T2V) generation setting in which only one text-video pair is provided is introduced, along with Tune-A-Video, which combines a tailored spatio-temporal attention mechanism with an efficient one-shot tuning strategy.

MAGVIT: Masked Generative Video Transformer

A 3D tokenizer that quantizes a video into spatio-temporal visual tokens is introduced, together with an embedding method for masked video token modeling that facilitates multi-task learning.

Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion

A cascading latent diffusion approach that can generate multiple minutes of high-quality stereo music at 48kHz from textual descriptions is developed, targeting real-time generation on a single consumer GPU.

Video-P2P: Video Editing with Cross-attention Control

This paper proposes to first tune a Text-to-Set (T2S) model to complete an approximate inversion and then optimize a shared unconditional embedding to achieve accurate video inversion with a small memory cost.

Structure and Content-Guided Video Synthesis with Diffusion Models

This work presents a structure- and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output. The model is trained jointly on images and videos, and exposes explicit control of temporal consistency through a novel guidance method.

Text-To-4D Dynamic Scene Generation

The approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model.

SinFusion: Training Diffusion Models on a Single Image or Video

This paper trains a diffusion model that learns the appearance and dynamics of a single image or video while utilizing the conditioning capabilities of diffusion models, and can solve a wide array of image- and video-specific manipulation tasks.

Image-and-Language Understanding from Pixels Only

This work explores an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks, and exploits the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without modifications.

OhMG: Zero-shot Open-vocabulary Human Motion Generation

Extensive experiments show that the proposed controllable and flexible motion generation framework generates more text-consistent poses and motions than various baselines across a range of metrics.

GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

GODIVA is proposed, an open-domain text-to-video pretrained model that can generate videos from text in an auto-regressive manner using a three-dimensional sparse attention mechanism, along with a new metric called Relative Matching to automatically evaluate video generation quality.

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

This work presents CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2, and proposes a multi-frame-rate hierarchical training strategy to better align text and video clips.

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge and explores and highlights limitations of the models.

Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis

This work introduces Text-Filter conditioning Generative Adversarial Network (TFGAN), a conditional GAN model with a novel multi-scale text-conditioning scheme that improves text-video associations and combines the proposed conditioning scheme with a deep GAN architecture.

Learning Audio-Video Modalities from Image Captions

A new video mining pipeline is proposed that transfers captions from image captioning datasets to video clips with no additional manual effort, and it is shown that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning.

VideoGPT: Video Generation using VQ-VAE and Transformers

Despite its simplicity of formulation and ease of training, the proposed architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural videos from UCF-101 and the Tumblr GIF dataset (TGIF).
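
The VQ-VAE tokenization used by VideoGPT boils down to a nearest-neighbor codebook lookup at its quantization step. A minimal numpy sketch of generic vector quantization (not VideoGPT's actual code; names are illustrative):

```python
import numpy as np

def vq_lookup(z: np.ndarray, codebook: np.ndarray):
    """Nearest-neighbor codebook lookup, the quantization step of a VQ-VAE.

    z: (n, d) continuous encoder outputs; codebook: (k, d) learned codes.
    Returns the discrete token index of the nearest code for each latent,
    plus the quantized vectors. Generic VQ sketch for illustration only.
    """
    # squared Euclidean distance from every latent to every code: (n, k)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)        # index of nearest code per latent
    return idx, codebook[idx]      # discrete tokens and their embeddings
```

The resulting integer indices are what an autoregressive transformer like VideoGPT's prior models, one token at a time.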

NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc., and it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks.

ViViT: A Video Vision Transformer

This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

An end-to-end trainable model designed to take advantage of both large-scale image and video captioning datasets is presented, yielding state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo, and LSMDC.

Video (language) modeling: a baseline for generative models of natural videos

For the first time, it is shown that a strong baseline model for unsupervised feature learning using video data can predict non-trivial motions over short video sequences.