Corpus ID: 244714856

Vector Quantized Diffusion Model for Text-to-Image Synthesis

@article{Gu2021VectorQD,
  title={Vector Quantized Diffusion Model for Text-to-Image Synthesis},
  author={Shuyang Gu and Dong Chen and Jianmin Bao and Fang Wen and Bo Zhang and Dongdong Chen and Lu Yuan and Baining Guo},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.14822}
}
We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias of existing methods but also allows us to incorporate a…
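The mask-and-replace corruption described in the abstract operates on discrete VQ-VAE token indices rather than pixels. Below is a minimal sketch of one forward step, assuming an absorbing [MASK] token, a per-step mask probability gamma_t, and a uniform-replace probability beta_t; these names are illustrative stand-ins for the transition matrices the paper defines, not its exact implementation.

```python
import torch

def mask_and_replace_step(tokens, vocab_size, mask_id, gamma_t, beta_t):
    """One illustrative forward-diffusion step on a batch of VQ token indices.

    Each not-yet-masked token is, independently:
      - replaced by [MASK] with probability gamma_t,
      - resampled uniformly from the codebook with probability beta_t,
      - kept unchanged otherwise.
    [MASK] is absorbing: once masked, a token stays masked.
    """
    u = torch.rand(tokens.shape)                          # one uniform draw per token
    random_tokens = torch.randint_like(tokens, high=vocab_size)

    not_masked = tokens != mask_id
    to_mask = not_masked & (u < gamma_t)
    to_replace = not_masked & (u >= gamma_t) & (u < gamma_t + beta_t)

    out = tokens.clone()
    out[to_mask] = mask_id
    out[to_replace] = random_tokens[to_replace]
    return out
```

Applied repeatedly, this chain converges to all-[MASK] sequences; the conditional denoiser is then trained to invert it given the text embedding.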
KNN-Diffusion: Image Generation via Large-Scale Retrieval
TLDR
This work shows how large-scale retrieval methods, in particular efficient K-Nearest-Neighbors (KNN) search, can be used to train a model that adapts to new samples, achieving state-of-the-art results both in human evaluations and on perceptual scores.
Improved Vector Quantized Diffusion Models
TLDR
A high-quality inference strategy to alleviate the joint distribution issue in VQ-Diffusion is presented, and a more general and effective implementation of classifier-free guidance sampling for discrete denoising diffusion models is proposed (a minimal sketch of this guidance rule appears after the citation list below).
Discrete Contrastive Diffusion for Cross-Modal and Conditional Generation
TLDR
This work introduces a Conditional Discrete Contrastive Diffusion (CDCD) loss, designs two contrastive diffusion mechanisms to effectively incorporate it into the denoising process, and formulates CDCD by connecting it with conventional variational objectives.
StudioGAN: A Taxonomy and Benchmark of GANs for Image Synthesis
  • Minguk Kang, Joonghyuk Shin, Jaesik Park
  • Computer Science
  • 2022
Generative Adversarial Networks (GANs) are among the state-of-the-art generative models for realistic image synthesis. While training and evaluating GANs becomes increasingly important, the current GAN…
Lossy Compression with Gaussian Diffusion
We describe a novel lossy compression approach called DiffC which is based on unconditional diffusion generative models. Unlike modern compression schemes, which rely on transform coding and…
Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer
TLDR
An effective image generation framework, Draft-and-Revise with Contextual RQ-Transformer, is proposed to exploit global contexts during the generation process, improving the quality of generated images while keeping them globally consistent.
Fast Unsupervised Brain Anomaly Detection and Segmentation with Diffusion Models
TLDR
This work proposes a method based on diffusion models to detect and segment anomalies in brain imaging by training the model on healthy data and then exploring its diffusion and reverse steps across the Markov chain; it achieves competitive performance compared with autoregressive approaches.
Blended Latent Diffusion
TLDR
Applications of the method include adding a new object to the masked area guided by the text prompt, altering a part of an existing object, generating text, and producing multiple predictions for the same text prompt.
DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder
TLDR
This work proposes DiVAE, a VQ-VAE architecture with a denoising diffusion decoder, as the reconstructing component in image synthesis; it explores how to feed image embeddings into the diffusion model for strong performance and finds that a simple modification of the diffusion UNet achieves it.
Text2Human: Text-Driven Controllable Human Image Generation
TLDR
The proposed Text2Human framework can generate more diverse and realistic human images than state-of-the-art methods, and predicting finer-level indices refines the quality of clothing textures.
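For context on the classifier-free guidance sampling mentioned in the Improved Vector Quantized Diffusion Models entry above, here is a minimal sketch of how that guidance rule is typically carried over to a discrete denoiser's per-token logits. The `model`, `cond`, and `null_cond` names are hypothetical placeholders, not an API from any of the papers listed here.

```python
import torch

def guided_logits(model, x_t, t, cond, null_cond, w):
    """Classifier-free guidance in logit space:
    logits = (1 + w) * conditional - w * unconditional.
    `null_cond` stands for an empty or learned "no text" condition.
    """
    cond_logits = model(x_t, t, cond)          # [batch, seq_len, vocab]
    uncond_logits = model(x_t, t, null_cond)   # same shape
    return (1 + w) * cond_logits - w * uncond_logits
```

Sampling then takes a softmax over the guided logits at each denoising step; w = 0 recovers ordinary conditional sampling.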

References

Showing 1-10 of 76 references
CogView: Mastering Text-to-Image Generation via Transformers
TLDR
This work proposes CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance text-to-image generation in the general domain; it achieves state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and the recent similar work DALL-E.
Diffusion Models Beat GANs on Image Synthesis
TLDR
It is shown that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models, and that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256 × 256 and 3.85 on ImageNet 512 × 512.
Zero-Shot Text-to-Image Generation
TLDR
This work describes a simple approach based on a transformer that autoregressively models the text and image tokens as a single stream of data that is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Neural Discrete Representation Learning
TLDR
Pairing these representations with an autoregressive prior, the model can generate high-quality images, videos, and speech, as well as perform high-quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
Microsoft COCO: Common Objects in Context
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding.
ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis
TLDR
The resulting autoregressive ImageBART model can solve free-form image inpainting and, in the case of conditional models, local, text-guided image modification without requiring mask-specific training and can take unrestricted, user-provided masks into account to perform local image editing.
Argmax Flows and Multinomial Diffusion: Towards Non-Autoregressive Language Models
TLDR
This paper introduces two new classes of generative models for categorical data such as language or image segmentation: Argmax Flows and Multinomial Diffusion.
Taming Transformers for High-Resolution Image Synthesis
TLDR
It is demonstrated how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.
Denoising Diffusion Probabilistic Models
TLDR
High quality image synthesis results are presented using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics, which naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
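As a companion to the Denoising Diffusion Probabilistic Models entry, here is a minimal sketch of the closed-form forward (noising) sample that the Gaussian chain admits. The shapes assume image batches of form [B, C, H, W], and `alpha_bar` is the cumulative product of (1 - beta_s) over the noise schedule; the function name is illustrative.

```python
import torch

def q_sample(x0, t, alpha_bar):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    noise = torch.randn_like(x0)
    mean_coef = alpha_bar[t].sqrt().view(-1, 1, 1, 1)     # broadcast over C, H, W
    std = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return mean_coef * x0 + std * noise, noise
```

Training a DDPM then reduces to predicting `noise` from the corrupted `x_t` and timestep `t`; the progressive lossy decompression view mentioned in the TLDR follows from transmitting the chain's increments step by step.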