Retrieval-Augmented Diffusion Models

@article{Blattmann2022RetrievalAugmentedDM,
  title={Retrieval-Augmented Diffusion Models},
  author={Andreas Blattmann and Robin Rombach and Kaan Oktay and Bj{\"o}rn Ommer},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.11824}
}
Generative image synthesis with diffusion models has recently achieved excellent visual quality in several tasks such as text-based or class-conditional image synthesis. Much of this success is due to a dramatic increase in the computational capacity invested in training these models. This work presents an alternative approach: inspired by its successful application in natural language processing, we propose to complement the diffusion model with a retrieval-based approach and to introduce an… 
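The core idea, conditioning a comparatively small diffusion backbone on nearest neighbours pulled from an external image database at sampling time, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `RetrievalDatabase` and `ConditionedDenoiser` classes, the embedding sizes, and the random data standing in for CLIP image embeddings are all assumptions.

```python
# Minimal sketch of retrieval-augmented conditioning for a diffusion model.
# All names and shapes are illustrative; random vectors stand in for CLIP
# image embeddings and a toy cross-attention block stands in for the denoiser.
import numpy as np
import torch
import torch.nn as nn

class RetrievalDatabase:
    """Toy external memory of image embeddings."""
    def __init__(self, embeddings: np.ndarray):
        # Normalise so that a dot product equals cosine similarity.
        self.embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    def retrieve(self, query: np.ndarray, k: int = 4) -> np.ndarray:
        q = query / np.linalg.norm(query)
        scores = self.embeddings @ q           # cosine similarities
        idx = np.argsort(-scores)[:k]          # k nearest neighbours
        return self.embeddings[idx]

class ConditionedDenoiser(nn.Module):
    """Stand-in denoiser that cross-attends over the retrieved neighbours."""
    def __init__(self, latent_dim=64, ctx_dim=512, heads=4):
        super().__init__()
        self.to_ctx = nn.Linear(ctx_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, z_t, neighbours):
        # z_t: (B, T, latent_dim) noisy latents; neighbours: (B, k, ctx_dim)
        ctx = self.to_ctx(neighbours)
        h, _ = self.attn(z_t, ctx, ctx)        # attend over the retrieved set
        return self.out(h)                     # predicted noise / denoised latent

# Toy usage with random data standing in for CLIP image embeddings.
db = RetrievalDatabase(np.random.randn(10_000, 512).astype(np.float32))
query = np.random.randn(512).astype(np.float32)           # embedding of the query
neighbours = torch.from_numpy(db.retrieve(query, k=4)).unsqueeze(0)
z_t = torch.randn(1, 16, 64)                               # noisy latent tokens
eps = ConditionedDenoiser()(z_t, neighbours)
print(eps.shape)  # torch.Size([1, 16, 64])
```

Because the conditioning information lives in the external database rather than in the network weights, the neighbours the model draws on can be changed after training.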

Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models

This note presents an alternative approach based on retrieval-augmented diffusion models (RDMs), which provide a novel way to prompt a general model after training and thereby specify a particular visual style.

Variational Distribution Learning for Unsupervised Text-to-Image Generation

This work employs a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space and, consequently, works well on zero-shot recognition tasks.
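For context, the zero-shot recognition behaviour this sentence refers to can be illustrated with the openai/CLIP package: images and text prompts are embedded into the same space and compared by cosine similarity. The checkpoint name, image path, and label prompts below are illustrative assumptions.

```python
# Zero-shot classification with CLIP's joint image-text embedding space.
# Assumes `pip install git+https://github.com/openai/CLIP.git` and an example image.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical path
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    # Normalise and compare in the joint embedding space.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print({lbl: float(p) for lbl, p in zip(labels, probs[0])})
```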

Learning Customized Visual Models with Retrieval-Augmented Knowledge

This work retrieves the most relevant image-text pairs from the web-scale database as external knowledge, and proposes REACT, REtrieval-Augmented CusTomization, a framework to acquire the relevant web knowledge to build customized visual models for target domains.

Re-Imagen: Retrieval-Augmented Text-to-Image Generator

The Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities, is presented.

SpaText: Spatio-Textual Representation for Controllable Image Generation

This work presents SpaText - a new method for text-to-image generation using open-vocabulary scene control, based on a novel CLIP-based spatio-textual representation, and shows its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.

Few-Shot Diffusion Models

Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs, is introduced, and it is shown how conditioning the model on patch-based input set information improves training convergence.

Self-Guided Diffusion Models

This paper eliminates the need for image-annotation pairs for guidance by leveraging the flexibility of self-supervision signals to design a framework for self-guided diffusion models, and generates visually diverse yet semantically consistent images, without the need of any class, box, or segment label annotation.

Late-Constraint Diffusion Guidance for Controllable Image Synthesis

The introduced late-constraint strategy is equipped with a timestep resampling method and an early stopping technique, which boost the quality of the synthesized images while complying with the guidance.

Detector Guidance for Multi-Object Text-to-Image Generation

Detector Guidance is introduced, which integrates a latent object detection model to separate different objects during the generation process; it yields an 8-22% advantage in preventing the amalgamation of conflicting concepts and ensures that each object occupies its own region, without any human involvement or additional iterations.

Retrieval-Augmented Multimodal Language Modeling

Retrieval-Augmented CM3 is the first multimodal model that can retrieve and generate mixtures of text and images and exhibits novel capabilities such as knowledge-intensive image generation and multi-modal in-context learning.

KNN-Diffusion: Image Generation via Large-Scale Retrieval

This work proposes using large-scale retrieval methods, in particular, efficient k-Nearest-Neighbors (kNN), which offers novel capabilities: training a substantially smaller and more efficient text-to-image diffusion model without any text, generating out-of-distribution images by simply swapping the retrieval database at inference time, and performing text-driven local semantic manipulations while preserving object identity.
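A hedged sketch of the retrieval step described here, assuming the faiss library for efficient kNN search; the two databases are random stand-ins for image embeddings, and swapping which index is queried at inference time is what changes the output distribution without retraining.

```python
# kNN retrieval over interchangeable embedding databases (faiss assumed installed).
import numpy as np
import faiss

d, k = 512, 10
photos_db = np.random.randn(100_000, d).astype("float32")    # hypothetical "photos" database
sketches_db = np.random.randn(100_000, d).astype("float32")  # hypothetical "sketches" database

def build_index(embeddings: np.ndarray) -> faiss.Index:
    faiss.normalize_L2(embeddings)
    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine after L2 norm
    index.add(embeddings)
    return index

query = np.random.randn(1, d).astype("float32")  # embedding of the query prompt
faiss.normalize_L2(query)

for name, db in [("photos", photos_db), ("sketches", sketches_db)]:
    scores, ids = build_index(db).search(query, k)
    print(name, ids[0][:3], scores[0][:3])  # neighbours that would condition generation
```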

High-Resolution Image Synthesis with Latent Diffusion Models

These latent diffusion models achieve new state of the art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
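As a usage-level illustration (not part of the cited paper), pretrained models of this latent diffusion family can be sampled through the Hugging Face diffusers library; the checkpoint identifier and prompt below are assumptions.

```python
# Sampling a pretrained latent diffusion model via Hugging Face diffusers.
# Assumes `pip install diffusers transformers accelerate` and a CUDA GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Diffusion runs in the autoencoder's latent space; the VAE decoder maps the
# denoised latent back to pixels, which is what keeps the compute low.
image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```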

Hierarchical Text-Conditional Image Generation with CLIP Latents

It is shown that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity, and the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion.

Vector Quantized Diffusion Model for Text-to-Image Synthesis

    Shuyang Gu, Dong Chen, B. Guo • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This method is based on a vector quantized variational autoencoder whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM), and it is found that this latent-space method is well-suited for text-to-image generation tasks.

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

This work explores diffusion models for the problem of text-conditional image synthesis and compares two different guidance strategies: CLIP guidance and classifier-free guidance, finding that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
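The classifier-free guidance strategy favoured in that comparison amounts to extrapolating between conditional and unconditional noise predictions. The sketch below uses a dummy denoiser rather than GLIDE's network, and the guidance scale is an assumed value.

```python
# Classifier-free guidance: query the denoiser with and without the condition
# and extrapolate the two predictions. The denoiser here is a toy stand-in.
import torch

def classifier_free_guidance(denoiser, z_t, t, cond, guidance_scale=3.0):
    eps_cond = denoiser(z_t, t, cond)      # text-conditional prediction
    eps_uncond = denoiser(z_t, t, None)    # unconditional prediction (null condition)
    # Push the prediction away from the unconditional one, toward the condition.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with a dummy denoiser.
dummy = lambda z, t, c: torch.zeros_like(z) if c is None else 0.1 * z
z_t = torch.randn(1, 3, 64, 64)
eps = classifier_free_guidance(dummy, z_t, t=torch.tensor([500]), cond="a corgi")
print(eps.shape)  # torch.Size([1, 3, 64, 64])
```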

Variational Diffusion Models

A family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks is introduced; it is shown how to use these models as part of a bits-back compression scheme, and lossless compression rates close to the theoretical optimum are demonstrated.

D2C: Diffusion-Denoising Models for Few-shot Conditional Generation

D2C uses a learned diffusion-based prior over the latent representations to improve generation and contrastive self-supervised learning to improve representation quality, and achieves superior performance over state-of-the-art VAEs and diffusion models.

Retrieval Augmented Classification for Long-Tail Visual Recognition

    Alex Long, Wei Yin, A. Hengel • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Retrieval Augmented Classification is introduced, a generic approach to augmenting standard image classification pipelines with an explicit retrieval module; applied to the problem of long-tail classification, it achieves high accuracy on tail classes.

ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

The resulting autoregressive ImageBART model can solve free-form image inpainting and, in the case of conditional models, local, text-guided image modification without requiring mask-specific training and can take unrestricted, user-provided masks into account to perform local image editing.

RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval

This work aims to synthesize images from scene descriptions with retrieved patches as reference, using a differentiable retrieval module that makes the entire pipeline end-to-end trainable and enables the learning of better feature embeddings for retrieval.
...