Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
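The two-stage pipeline described above can be sketched in a few lines. This is a minimal illustration only: random linear maps stand in for the trained prior and decoder networks, and the dimensions are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and random weights; a real system uses trained networks.
D_TEXT, D_IMG, D_PIX = 8, 8, 16
W_prior = rng.normal(size=(D_IMG, D_TEXT))
W_dec = rng.normal(size=(D_PIX, D_IMG))

def prior(text_emb):
    """Stage 1: map a CLIP text embedding to a CLIP image embedding."""
    z = W_prior @ text_emb
    return z / np.linalg.norm(z)  # CLIP embeddings are L2-normalized

def decoder(img_emb):
    """Stage 2: generate pixels conditioned on the image embedding."""
    return W_dec @ img_emb

text_emb = rng.normal(size=D_TEXT)
image = decoder(prior(text_emb))
print(image.shape)  # (16,)
```

The key design point is the explicit intermediate: the prior samples an image embedding first, so the decoder can trade diversity (many embeddings per caption) against fidelity.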

CLIPVG: Text-Guided Image Manipulation Using Differentiable Vector Graphics

It is demonstrated that CLIPVG not only achieves state-of-the-art performance in both semantic correctness and synthesis quality, but also supports various applications far beyond the capability of all existing methods.

Progressive Text-to-Image Generation

This paper presents a progressive model for high-fidelity text-to-image generation that achieves better FID scores than the previous VQ-AR method across a wide variety of categories.

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge, while also exploring and highlighting limitations of such models.

Retrieval-Augmented Diffusion Models

This work proposes to complement the diffusion model with a retrieval-based approach and to introduce an explicit memory in the form of an external database to achieve highly competitive performance on tasks for which it has not been explicitly trained.

Progressive Denoising Model for Fine-Grained Text-to-Image Generation

This paper presents a progressive denoising model for high-fidelity text-to-image generation that achieves a better trade-off between generation quality and speed.

clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP

We introduce a new method to efficiently create text-to-image models from a pretrained CLIP and StyleGAN. It enables text-driven sampling with an existing generative model without any external data or…

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

High-quality results are demonstrated on versatile text-guided image translation tasks, including translating sketches, rough drawings and animations into realistic images, changing the class and appearance of objects in a given image, and modifying global qualities such as lighting and color.

Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces

Paella is introduced, a novel text-to-image model requiring fewer than 10 steps to sample high-fidelity images, using a speed-optimized architecture that samples a single image in under 500 ms while having 573M parameters.

Implementing and Experimenting with Diffusion Models for Text-to-Image Generation

This thesis provides additional and deeper analyses than those performed by the authors of DALL-E 2, including ablation studies, and introduces a new guidance method that can be used in conjunction with other guidance methods to improve image quality.

The Biased Artist: Exploiting Cultural Biases via Homoglyphs in Text-Guided Image Generation Models

It is demonstrated that common multimodal models implicitly learned cultural biases that can be triggered and injected into the generated images by simply replacing single characters in the textual description with visually similar non-Latin characters.

Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search

This research work presents CLIP-GLaSS, a novel zero-shot framework to generate an image corresponding to a given caption, based on the CLIP neural network, which provides similar embeddings for an image and its descriptive caption.

StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

Leveraging the semantic power of large scale Contrastive-Language-Image-Pretraining (CLIP) models, this work presents a text-driven method that allows shifting a generative model to new domains, without having to collect even a single image.

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

This work explores leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort.

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

This work explores diffusion models for the problem of text-conditional image synthesis and compares two different guidance strategies: CLIP guidance and classifier-free guidance, finding that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
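The classifier-free guidance strategy that GLIDE's human evaluators preferred has a simple form: the sampler extrapolates from the unconditional noise prediction toward the text-conditional one. A minimal sketch, using toy arrays in place of real model outputs:

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Combine conditional and unconditional noise predictions.

    scale = 1 recovers plain conditional sampling; scale > 1 pushes
    samples further toward the caption at some cost in diversity.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])  # toy conditional prediction
eps_u = np.array([0.0, 1.0])  # toy unconditional prediction
print(classifier_free_guidance(eps_c, eps_u, 3.0))  # [3. 4.]
```

Unlike CLIP guidance, this needs no external classifier: the same diffusion model is trained with captions randomly dropped, so it can produce both predictions.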

Vector Quantized Diffusion Model for Text-to-Image Synthesis

  • Shuyang Gu, Dong Chen, B. Guo
  • Computer Science
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2022
This method is based on a vector quantized variational autoencoder whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM), and it is found that this latent-space method is well-suited for text-to-image generation tasks.

Towards Language-Free Training for Text-to-Image Generation

The first work to train text-to-image generation models without any text data is proposed, which leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text-conditioning is seamlessly alleviated by generating text features from image features.

Cross-Modal Contrastive Learning for Text-to-Image Generation

The Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses the challenge of text-to-image synthesis systems by maximizing the mutual information between image and text via multiple contrastive losses which capture inter-modality and intra-modality correspondences.
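The contrastive objective behind XMC-GAN (and CLIP itself) can be illustrated with a symmetric InfoNCE-style loss, where matched (image, text) pairs sit on the diagonal of a similarity matrix. A hedged NumPy sketch with random embeddings standing in for encoder outputs:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: pull matched pairs together, push others apart."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) cosine-similarity matrix
    labels = np.arange(len(img))        # correct match is the diagonal

    def xent(l):
        # row-wise cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
loss = contrastive_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(float(loss) > 0.0)  # True
```

Minimizing this loss maximizes a lower bound on the mutual information between the two modalities, which is exactly the mechanism the abstract describes.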

High-Resolution Image Synthesis with Latent Diffusion Models

These latent diffusion models achieve new state of the art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
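The computational saving in latent diffusion comes from running the denoising loop in a compressed latent space rather than pixel space. A toy sketch, where pooling/upsampling stand in for the trained VAE and a trivial shrink step stands in for the U-Net denoiser:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(pixels):
    """Stand-in VAE encoder: pool a 64x64 image down to an 8x8 latent."""
    return pixels.reshape(8, 8, 8, 8).mean(axis=(1, 3))

def denoise_step(z, t):
    """Stand-in for one reverse-diffusion step; a trained U-Net would
    predict and remove the noise here."""
    return z * (1.0 - 1.0 / t)

def decode(latent):
    """Stand-in VAE decoder: upsample the 8x8 latent back to 64x64."""
    return np.kron(latent, np.ones((8, 8)))

# The diffusion loop touches only the 8x8 latent, not the 64x64 pixels:
# that 64x compression is the source of the reduced compute cost.
z = encode(rng.normal(size=(64, 64)))
for t in range(50, 1, -1):
    z = denoise_step(z, t)
pixels = decode(z)
print(pixels.shape)  # (64, 64)
```

In the real model the latent is produced by a pretrained autoencoder and the denoiser is conditioned on text via cross-attention; only the shapes and control flow are faithful here.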

Diffusion Autoencoders: Toward a Meaningful and Decodable Representation

This paper explores the possibility of using DPMs for representation learning, seeking to extract a meaningful and decodable representation of an input image via autoencoding; the method can encode any image into a two-part latent code, allowing near-exact reconstruction.

Learning Visual Representations with Caption Annotations

It is argued that captioned images are easily crawlable and can be exploited to supervise the training of visual representations. The proposed hybrid models, with dedicated visual and textual encoders, show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks.