Phenaki: Variable Length Video Generation From Open Domain Textual Description
@article{Villegas2022PhenakiVL, title={Phenaki: Variable Length Video Generation From Open Domain Textual Description}, author={Ruben Villegas and Mohammad Babaeizadeh and Pieter-Jan Kindermans and Hernan Moraldo and Han Zhang and Mohammad Taghi Saffar and Santiago Castro and Julius Kunze and D. Erhan}, journal={ArXiv}, year={2022}, volume={abs/2210.02399} }
We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, the limited quantities of high-quality text-video data, and the variable length of videos. To address these issues, we introduce a new model for learning video representations that compresses the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work…
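The key architectural idea in the abstract is a video tokenizer whose attention is causal along the time axis, so each frame is encoded without looking at future frames, which is what makes variable-length videos tractable. Below is a minimal, hypothetical sketch of causal temporal attention over a grid of per-frame tokens; the function name, the shapes, the single attention head, and the omission of learned query/key/value projections are illustrative assumptions, not the paper's actual tokenizer implementation.

```python
# A minimal sketch (not the authors' code) of causal attention over the time
# axis of a video token grid. Learned projections and multi-head splitting are
# omitted for brevity; shapes and names are illustrative assumptions.
import torch
import torch.nn.functional as F

def causal_temporal_attention(x):
    """x: (batch, frames, tokens_per_frame, dim) video token embeddings."""
    b, t, n, d = x.shape
    # Fold the spatial tokens into the batch so attention runs over frames only.
    qkv = x.permute(0, 2, 1, 3).reshape(b * n, t, d)        # (b*n, t, d)
    scores = qkv @ qkv.transpose(-2, -1) / d ** 0.5         # (b*n, t, t)
    # Causal mask: a frame attends only to itself and to earlier frames,
    # so already-encoded frames never depend on frames that arrive later.
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    out = F.softmax(scores, dim=-1) @ qkv                   # (b*n, t, d)
    return out.reshape(b, n, t, d).permute(0, 2, 1, 3)      # (b, t, n, d)

# Toy usage: 2 videos, 8 frames, 16 tokens per frame, 64-dim embeddings.
tokens = torch.randn(2, 8, 16, 64)
print(causal_temporal_attention(tokens).shape)  # torch.Size([2, 8, 16, 64])
```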
27 Citations
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
- Computer Science, ArXiv
- 2022
Multimodal Masked Video Generation (MMVG) is proposed: by applying the corresponding masking conditions, a single MMVG model can address all three cases of text-guided video completion (TVC), namely video prediction, rewind, and infilling.
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
- Computer Science, ArXiv
- 2022
A new T2V generation setting, in which only one text-video pair is presented, is introduced, together with Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy.
MAGVIT: Masked Generative Video Transformer
- Computer Science, ArXiv
- 2022
A 3D tokenizer to quantize a video into spatial-temporal visual tokens and an embedding method for masked video token modeling to facilitate multi-task learning are introduced.
Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion
- Computer Science
- 2023
A cascading latent diffusion approach is developed that can generate multiple minutes of high-quality stereo music at 48 kHz from textual descriptions, targeting real-time generation on a single consumer GPU.
Video-P2P: Video Editing with Cross-attention Control
- Computer Science, ArXiv
- 2023
This paper proposes to first tune a Text-to-Set (T2S) model to complete an approximate inversion and then optimize a shared unconditional embedding to achieve accurate video inversion with a small memory cost.
Structure and Content-Guided Video Synthesis with Diffusion Models
- Computer Science
- 2023
This work presents a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output; the model is trained jointly on images and videos, which also exposes explicit control of temporal consistency through a novel guidance method.
Text-To-4D Dynamic Scene Generation
- Computer Science, ArXiv
- 2023
The approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model.
SinFusion: Training Diffusion Models on a Single Image or Video
- Computer Science, ArXiv
- 2022
This paper trains a diffusion model that learns the appearance and dynamics of the single image or video, while utilizing the conditioning capabilities of diffusion models, and can solve a wide array of image/video-specific manipulation tasks.
Image-and-Language Understanding from Pixels Only
- Computer Science, ArXiv
- 2022
This work explores an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks, and exploits the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without modifications.
OhMG: Zero-shot Open-vocabulary Human Motion Generation
- Computer Science, ArXiv
- 2022
Extensive experiments show that the proposed controllable and flexible motion generation framework generates more text-consistent poses and motions than various baselines across metrics.
References
SHOWING 1-10 OF 62 REFERENCES
GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
- Computer Science, ArXiv
- 2021
GODIVA is proposed, an open-domain text-to-video pretrained model that can generate videos from text in an auto-regressive manner using a three-dimensional sparse attention mechanism, along with a new metric called Relative Matching to automatically evaluate video generation quality.
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
- Computer Science, ArXiv
- 2022
This work presents CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2, and proposes a multi-frame-rate hierarchical training strategy to better align text and video clips.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
- Computer Science, ArXiv
- 2022
The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge and explores and highlights limitations of the models.
Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis
- Computer Science, IJCAI
- 2019
This work introduces Text-Filter conditioning Generative Adversarial Network (TFGAN), a conditional GAN model with a novel multi-scale text-conditioning scheme that improves text-video associations and combines the proposed conditioning scheme with a deep GAN architecture.
Learning Audio-Video Modalities from Image Captions
- Computer Science, ECCV
- 2022
A new video mining pipeline is proposed which involves transferring captions from image captioning datasets to video clips with no additional manual effort, and it is shown that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning.
VideoGPT: Video Generation using VQ-VAE and Transformers
- Computer Science, ArXiv
- 2021
Despite the simplicity in formulation and ease of training, the proposed architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural images from UCF-101 and the Tumblr GIF Dataset (TGIF).
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
- Computer Science, ECCV
- 2022
Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc., and it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks.
ViViT: A Video Vision Transformer
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
An end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets and yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.
Video (language) modeling: a baseline for generative models of natural videos
- Computer Science, ArXiv
- 2014
For the first time, it is shown that a strong baseline model for unsupervised feature learning using video data can predict non-trivial motions over short video sequences.