Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation

  • Yiyang Ma, Huan Yang, Wenjing Wang, Jianlong Fu, Jiaying Liu
Language-guided image generation has achieved great success by using diffusion models. However, text alone can be too coarse to describe highly specific subjects, such as a particular dog or a certain car, which makes pure text-to-image generation insufficiently accurate to satisfy user requirements. In this work, we present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint text and images containing specified subjects as input sequences and generates customized…

FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

FastComposer enables efficient, personalized, multi-subject text-to-image generation without fine-tuning, and proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation.
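The delayed-conditioning idea above can be sketched in a few lines. The function name, the string stand-ins for embeddings, and the 0.3 switch-over fraction are illustrative assumptions, not FastComposer's actual API:

```python
# Toy sketch of delayed subject conditioning (illustrative names/values):
# early denoising steps use text-only conditioning so layout stays
# editable; later steps switch to subject-augmented conditioning so the
# subject's identity is preserved.

def choose_conditioning(step, total_steps, text_emb, subject_emb, alpha=0.3):
    """Return the conditioning to use at a given denoising step.

    alpha is the fraction of steps that stay text-only (assumed value).
    """
    if step < alpha * total_steps:
        return text_emb
    return subject_emb

# Usage with labeled stand-ins for real embeddings:
schedule = [choose_conditioning(t, 50, "text", "text+subject")
            for t in range(50)]
```

With 50 steps and alpha=0.3, the first 15 steps are text-only and the remaining 35 carry subject identity.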

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

InstantBooth is proposed, a novel approach built upon pre-trained text-to-image models that enables instant text-guided image personalization without any test-time finetuning and generates competitive results on unseen concepts in terms of language-image alignment, image fidelity, and identity preservation, while being 100 times faster.

DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation

Different from previous works that utilize an entangled embedding to denoise each image, DisenBooth instead utilizes disentangled embeddings to respectively preserve the subject identity and capture the identity-irrelevant information.
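The disentangling described above can be illustrated with toy vectors (this is a sketch of the idea, not DisenBooth's code): a subject-specific identity embedding is shared across all training images of the subject, each image gets its own identity-irrelevant embedding, and the denoiser is conditioned on their combination.

```python
# Illustrative sketch (not DisenBooth's implementation): the identity
# embedding is shared by every image of the subject, while per-image
# embeddings absorb identity-irrelevant factors (pose, background, ...).

def combined_embedding(identity_emb, irrelevant_emb):
    # condition the denoiser on the sum of both parts
    return [a + b for a, b in zip(identity_emb, irrelevant_emb)]

identity = [1.0, 0.0]                  # shared: encodes the subject itself
per_image = [[0.1, 0.2], [0.3, -0.1]]  # one per training image
conds = [combined_embedding(identity, e) for e in per_image]
```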

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint…

Visual Chain-of-Thought Diffusion Models

This work proposes to close the gap between conditional and unconditional models using a two-stage sampling procedure, and shows that leveraging the power of conditional diffusion models on the unconditional generation task improves FID by 25-50% compared to standard unconditional generation.
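The two-stage procedure can be sketched with toy stand-ins for the real models (the function names and the 4-dimensional "embedding" are illustrative assumptions): stage one draws an auxiliary semantic representation, stage two runs a conditional sampler on it, so a strong conditional model serves the unconditional task.

```python
import random

# Hedged sketch of two-stage sampling (toy stand-ins, not the paper's code).

def sample_representation(rng):
    # stage 1: stand-in "prior" over auxiliary semantic representations
    return [rng.random() for _ in range(4)]

def sample_image_given(embedding, rng):
    # stage 2: stand-in conditional model decoding the representation
    return [e + 0.01 * rng.random() for e in embedding]

def two_stage_sample(seed=0):
    rng = random.Random(seed)
    z = sample_representation(rng)
    return sample_image_given(z, rng)
```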

Solving Diffusion ODEs with Optimal Boundary Conditions for Better Image Super-Resolution

  • Yiyang Ma, Huan Yang, Wenhan Yang, Jianlong Fu, Jiaying Liu
  • Computer Science
  • 2023
The quality of SR results sampled by the proposed method with fewer steps outperforms that of results sampled with randomness by current methods from the same pretrained diffusion-based SR model, which means that the sampling method ``boosts'' current diffusion-based SR models without any additional training.

A Unified Prompt-Guided In-Context Inpainting Framework for Reference-based Image Manipulations

A unified Prompt-Guided In-Context inpainting (PGIC) framework is introduced, which leverages large-scale T2I models to re-formulate and solve reference-guided image manipulations to achieve significantly better performance while requiring less computation compared to other fine-tuning based approaches.

Towards Open-World Text-Guided Face Image Generation and Manipulation

A unified framework for both face image generation and manipulation that produces diverse and high-quality images at an unprecedented resolution of 1024 × 1024 from multimodal inputs, including both image and text, and supports open-world scenarios without any re-training, fine-tuning, or post-processing.

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

This work presents a new approach for "personalization" of text-to-image diffusion models, and applies it to several previously-unassailable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, all while preserving the subject's key features.

Hierarchical Text-Conditional Image Generation with CLIP Latents

It is shown that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity, and the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion.
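The zero-shot manipulation enabled by a joint embedding space can be illustrated with toy vectors (this is the general direction-arithmetic idea, not CLIP itself; all vectors below are made-up stand-ins): shift an image embedding along the direction between a source and a target text embedding.

```python
# Toy illustration of language-guided editing in a joint embedding space:
# move the image embedding along the text-derived edit direction.

def edit(image_emb, src_text_emb, tgt_text_emb, strength=1.0):
    direction = [t - s for s, t in zip(src_text_emb, tgt_text_emb)]
    return [i + strength * d for i, d in zip(image_emb, direction)]

# e.g. move an image toward a target description (stand-in 2-D vectors)
edited = edit([1.0, 0.0], [0.0, 0.0], [0.0, 1.0])
```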

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

This work improves the compositional skills of T2I models, specifically more accurate attribute binding and better image compositions, by incorporating linguistic structures into the diffusion guidance process, based on the controllable properties of manipulating cross-attention layers in diffusion-based T2I models.
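One way to picture the cross-attention manipulation is the following toy sketch (illustrative, not the paper's implementation): each noun phrase is encoded separately into its own value set, cross-attention runs once per set, and the outputs are averaged so attributes stay bound to their phrases.

```python
import math

# Toy sketch of structured cross-attention guidance (illustrative only).

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(query, keys, values):
    # standard attention: softmax over query-key scores, weighted values
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(dim)]

def structured_attention(query, keys, value_sets):
    # one value set per noun phrase; average the per-phrase outputs
    outs = [cross_attention(query, keys, vs) for vs in value_sets]
    return [sum(o[i] for o in outs) / len(outs)
            for i in range(len(outs[0]))]
```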

DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis

This work proposes a novel simplified text-to-image backbone able to synthesize high-quality images directly with a single pair of generator and discriminator, a novel regularization method called Matching-Aware zero-centered Gradient Penalty, and a novel fusion module that exploits the semantics of text descriptions effectively and fuses text and image features deeply during the generation process.

DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-To-Image Synthesis

The proposed DM-GAN model introduces a dynamic memory module to refine fuzzy image contents when the initial images are not well generated, and performs favorably against state-of-the-art approaches.

AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation

A Prompt-based Cross-Modal Generation Framework (PCM-Frame) is proposed to leverage two powerful pre-trained models, CLIP and StyleGAN, and a user study demonstrates its superiority over competing methods of text-to-image translation with complicated semantics.

Controllable Text-to-Image Generation

A novel controllable text-to-image generative adversarial network (ControlGAN) is proposed, which can effectively synthesise high-quality images and also control parts of the image generation according to natural language descriptions.

MirrorGAN: Learning Text-To-Image Generation by Redescription

Thorough experiments on two public benchmark datasets demonstrate the superiority of MirrorGAN over other representative state-of-the-art methods.

High-Resolution Image Synthesis with Latent Diffusion Models

These latent diffusion models achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis, and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
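The core latent-diffusion pipeline can be sketched with toy stand-ins (illustrative functions, not the CompVis implementation): encode once into a compact latent space with a pretrained autoencoder, run the expensive iterative denoising loop there, and decode once at the end.

```python
# Minimal sketch of the latent-diffusion idea (toy stand-ins only).

def encode(image):
    # stand-in VAE encoder: compresses a 4-D "image" to a 1-D latent
    return [sum(image) / len(image)]

def decode(latent):
    # stand-in VAE decoder: expands the latent back to 4-D
    return [latent[0]] * 4

def denoise(latent, steps=10):
    # toy "denoiser": the iterative loop runs in latent space, which is
    # where the real computational savings come from
    z = list(latent)
    for _ in range(steps):
        z = [0.9 * v for v in z]
    return z

generated = decode(denoise(encode([4.0, 4.0, 4.0, 4.0])))
```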