Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, Mohammad Norouzi
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the… 

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

High-quality results are demonstrated on versatile text-guided image translation tasks, including translating sketches, rough drawings, and animations into realistic images, changing the class and appearance of objects in a given image, and modifying global qualities such as lighting and color.

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge and explores and highlights limitations of the models.

Shifted Diffusion for Text-to-image Generation

Corgi is based on the proposed shifted diffusion model, which achieves better image embedding generation from input text, and achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks.

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

A new approach for “personalization” of text-to-image diffusion models (specializing them to users’ needs) leverages the semantic prior embedded in the model together with a new autogenous class-specific prior-preservation loss, enabling synthesis of the subject in diverse scenes, poses, views, and lighting conditions that do not appear in the reference images.
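The prior-preservation idea above can be sketched as a two-term objective: a reconstruction loss on the few subject images plus a term that keeps the model close to its own pre-trained outputs for the subject's class. This is a minimal toy sketch using plain lists; the weight name `lambda_prior` and the helper functions are illustrative, not DreamBooth's actual API.

```python
# Toy sketch of a DreamBooth-style prior-preservation objective.
# `lambda_prior` and the helpers are illustrative assumptions.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dreambooth_loss(subject_pred, subject_target,
                    prior_pred, prior_target, lambda_prior=1.0):
    """Reconstruction loss on the subject images, plus a class-specific
    prior-preservation term on samples the frozen model generated for the
    subject's class (e.g. generic "dog" images)."""
    recon = mse(subject_pred, subject_target)
    prior = mse(prior_pred, prior_target)
    return recon + lambda_prior * prior
```

With `lambda_prior = 0` the objective collapses to plain fine-tuning, which is the overfitting regime the paper's prior-preservation term is meant to avoid.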

UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance

UPainting is proposed, which combines the power of a large-scale Transformer language model in understanding language with an image-text matching model in capturing cross-modal semantics and style, and greatly outperforms other models in terms of caption similarity and image fidelity in both simple and complex scenes.

Implementing and Experimenting with Diffusion Models for Text-to-Image Generation

This thesis provides additional analyses, deeper than those performed by the authors of DALL-E 2, including ablation studies, and introduces a new guidance method that can be used in conjunction with other guidance methods to improve image quality.

Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation

The proposed Swinv2-Imagen model outperforms the current best generative model, Imagen, on MSCOCO, and ablation experiments reveal that the addition of semantic layouts is effective in improving the model's semantic understanding.

eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

An ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark.

Sketch-Guided Text-to-Image Diffusion Models

This work introduces a universal approach to guide a pretrained text-to-image diffusion model with a spatial map from another domain (e.g., sketch) during inference time, revealing a robust and expressive way to generate images that follow the guidance of a sketch of arbitrary style or domain.

Re-Imagen: Retrieval-Augmented Text-to-Image Generator

The Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities, is presented.

Palette: Image-to-Image Diffusion Models

A unified framework for image-to-image translation based on conditional diffusion models is developed and it is shown that a generalist, multi-task diffusion model performs as well or better than task-specific specialist counterparts.

Towards Language-Free Training for Text-to-Image Generation

The first work to train text-to-image generation models without any text data is proposed, which leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text-conditioning is seamlessly alleviated via generating text features from image features.

DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models

A novel DiffusionCLIP is presented which performs text-driven image manipulation with diffusion models using a Contrastive Language–Image Pre-training (CLIP) loss, and has performance comparable to that of modern GAN-based image processing methods on in-domain and out-of-domain image processing tasks.

Improving Text-to-Image Synthesis Using Contrastive Learning

This work employs the contrastive learning method to enhance the consistency between the generated images from the captions related to the same image and boosts the FID significantly over AttnGAN and DM-GAN on datasets CUB and COCO.
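The consistency objective described above is typically realized as an InfoNCE-style contrastive loss: embeddings of captions describing the same image are pulled together while captions of other images are pushed away. The sketch below is a generic toy version with illustrative embeddings and temperature; the paper's exact formulation may differ.

```python
import math

# Toy InfoNCE-style contrastive loss: the anchor caption embedding is
# attracted to the positive (a caption of the same image) and repelled
# from negatives (captions of other images). Vectors and temperature
# are illustrative assumptions.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log( exp(sim(a,p)/t) / sum_j exp(sim(a,x_j)/t) ), computed stably."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max before exponentiating for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

When the positive matches the anchor and the negatives are dissimilar, the loss approaches zero, which is the alignment the method exploits to boost FID.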

Hierarchical Text-Conditional Image Generation with CLIP Latents

This work proposes a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding, and shows that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
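The two-stage structure (prior, then decoder) can be sketched as a simple function composition. `toy_prior` and `toy_decoder` below are stand-in functions operating on flat lists, not the real diffusion models; only the dataflow reflects the paper.

```python
# Minimal sketch of the unCLIP two-stage pipeline. The two "models" here
# are toy stand-ins; only the prior -> decoder composition is the point.

def toy_prior(text_embedding):
    # Stage 1: map a caption embedding to a CLIP-like image embedding.
    return [0.5 * t for t in text_embedding]

def toy_decoder(image_embedding):
    # Stage 2: generate "pixels" conditioned on the image embedding,
    # clamped to [0, 1] like normalized intensities.
    return [max(0.0, min(1.0, e + 0.1)) for e in image_embedding]

def generate(text_embedding):
    # Explicitly materializing the intermediate image embedding is what
    # the paper credits for improved sample diversity.
    return toy_decoder(toy_prior(text_embedding))
```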

DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-To-Image Synthesis

  • Minfeng Zhu, P. Pan, Wei Chen, Yi Yang
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
The proposed DM-GAN model introduces a dynamic memory module to refine fuzzy image contents when the initial images are not well generated, and performs favorably against state-of-the-art approaches.

CoCa: Contrastive Captioners are Image-Text Foundation Models

A minimalist design is presented to pretrain an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM.
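The joint objective can be sketched as a weighted sum of the two terms: a contrastive alignment loss plus the average negative log-likelihood of the ground-truth caption tokens. The weight name `caption_weight` and the toy inputs are illustrative assumptions, not CoCa's actual hyperparameters.

```python
import math

# Toy sketch of a CoCa-style combined objective: contrastive term plus a
# captioning (cross-entropy) term. `caption_weight` is an assumed name.

def coca_loss(contrastive_loss, token_log_probs, caption_weight=1.0):
    """Total loss = contrastive + weight * mean negative log-likelihood
    of the ground-truth caption tokens under the decoder."""
    caption_loss = -sum(token_log_probs) / len(token_log_probs)
    return contrastive_loss + caption_weight * caption_loss
```

Setting `caption_weight = 0` recovers a purely contrastive (CLIP-like) objective, while dropping the contrastive term recovers a purely generative (SimVLM-like) one, which is the "subsuming" claim in the summary above.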

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers

It is shown that recent text-to-image generative transformer models perform better in recognizing and counting objects than recognizing colors and understanding spatial relations, while there exists a large gap between the model performances and upper bound accuracy on all skills.

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

This work explores diffusion models for the problem of text-conditional image synthesis and compares two different guidance strategies: CLIP guidance and classifier-free guidance, finding that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
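The classifier-free guidance strategy preferred in that comparison combines the model's conditional and unconditional noise predictions with a guidance scale w. The sketch below uses toy vectors rather than real U-Net outputs, but the combination rule is the standard one.

```python
# Classifier-free guidance: extrapolate from the unconditional prediction
# toward the text-conditional one. Inputs here are toy vectors, not real
# denoiser outputs.

def classifier_free_guidance(eps_uncond, eps_cond, w):
    """eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# w = 1 recovers the plain conditional prediction, w = 0 the unconditional;
# w > 1 pushes samples harder toward the text condition.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 0.0], w=3.0)
```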

Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

A novel text-to-image method is introduced that addresses these gaps by enabling a simple control mechanism complementary to text in the form of a scene, and by introducing elements that substantially improve the tokenization process through domain-specific knowledge over key image regions (faces and salient objects).