eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

by Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis, demonstrating complex text comprehension and outstanding zero-shot generalization. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the… 
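The ensemble-of-experts idea behind eDiff-I — routing each denoising step to an expert specialized for a particular noise interval, since synthesis behavior changes over the sampling trajectory — can be sketched in a toy form. Everything here (the `make_expert` scaling rule, the `route` threshold, the linear schedule) is an illustrative assumption, not eDiff-I's actual architecture:

```python
import random

# Toy "experts": each a denoiser specialized for a noise interval.
# Here each expert just shrinks its input; real experts are neural nets.
def make_expert(strength):
    def expert(x, t):
        # Pretend to remove a fraction of the noise proportional to strength.
        return [v * (1.0 - strength * t) for v in x]
    return expert

# Illustrative ensemble: one expert for high noise (early steps, where
# generation relies most heavily on the text prompt) and one for low noise.
experts = {"high_noise": make_expert(0.3), "low_noise": make_expert(0.1)}

def route(t, split=0.5):
    """Pick the expert for normalized noise level t in [0, 1]."""
    return experts["high_noise"] if t > split else experts["low_noise"]

def sample(x, steps=10):
    """Iteratively 'denoise' x from t=1 down to t=0, switching experts."""
    for i in range(steps, 0, -1):
        t = i / steps
        x = route(t)(x, t)
    return x

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(4)]  # start from random noise
out = sample(x)
```

The point of the sketch is only the routing structure: the set of experts shares one sampling loop, and the active expert is a function of the current noise level.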

Sketch-Guided Text-to-Image Diffusion Models

This work introduces a universal approach to guiding a pretrained text-to-image diffusion model with a spatial map from another domain (e.g., a sketch) at inference time, revealing a robust and expressive way to generate images that follow the guidance of a sketch of arbitrary style or domain.

GLIGEN: Open-Set Grounded Text-to-Image Generation

This work proposes GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pretrained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs.

Will Large-scale Generative Models Corrupt Future Datasets?

This paper empirically studies how generated images affect the quality of future datasets and the performance of computer vision models, for better or worse, by simulating contamination: it generates ImageNet-scale and COCO-scale datasets with a state-of-the-art generative model and evaluates models trained on the "contaminated" datasets across various tasks.

Magic3D: High-Resolution Text-to-3D Content Creation

The method, dubbed Magic3D, can create high-quality 3D mesh models in 40 minutes, 2× faster than DreamFusion (which reportedly takes 1.5 hours on average), while also achieving higher resolution, and it provides users with new ways to control 3D synthesis.

Listen, denoise, action! Audio-driven motion synthesis with diffusion models

Diffusion models are shown to be an excellent fit for synthesising human motion that co-occurs with audio, such as co-speech gesticulation: motion is complex and highly ambiguous given audio, calling for a probabilistic description.

SceneComposer: Any-Level Semantic Image Synthesis

Experimental results show that the proposed method can generate high-quality images following the layout at given precision, and compares favorably against existing methods.

Rickrolling the Artist: Injecting Invisible Backdoors into Text-Guided Image Generation Models

Backdoor attacks against text-guided generative models are introduced and it is demonstrated that their text encoders pose a major tampering risk.

EDICT: Exact Diffusion Inversion via Coupled Transformations

Exact Diffusion Inversion via Coupled Transformations (EDICT), an inversion method that draws inspiration from affine coupling layers, enables mathematically exact inversion of real and model-generated images by maintaining two coupled noise vectors which are used to invert each other in an alternating fashion.
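EDICT's exact invertibility rests on alternating affine-coupling-style updates of two coupled sequences: each sub-step is affine in the variable being updated, so it can be solved backwards in closed form regardless of what the network outputs. A minimal numerical sketch of that structure (the coefficients `a`, `b` and the mixing function `f` are illustrative stand-ins, not EDICT's actual DDIM-based update):

```python
def f(z):
    # Stand-in for the (non-invertible) network output; any function works,
    # because invertibility comes from the coupling structure, not from f.
    return [v * v for v in z]

def couple(x, y, a=0.9, b=0.1):
    """One forward coupled step: update x using y, then y using the new x."""
    x_new = [a * xi + b * fi for xi, fi in zip(x, f(y))]
    y_new = [a * yi + b * fi for yi, fi in zip(y, f(x_new))]
    return x_new, y_new

def uncouple(x_new, y_new, a=0.9, b=0.1):
    """Exact inverse: undo the sub-steps in reverse order."""
    y = [(yi - b * fi) / a for yi, fi in zip(y_new, f(x_new))]
    x = [(xi - b * fi) / a for xi, fi in zip(x_new, f(y))]
    return x, y

x0, y0 = [0.5, -1.0], [0.25, 0.75]
xs, ys = x0, y0
for _ in range(5):
    xs, ys = couple(xs, ys)
for _ in range(5):
    xs, ys = uncouple(xs, ys)
# xs, ys now match x0, y0 up to floating-point error
```

The round trip is exact by construction, which is what lets EDICT invert real and model-generated images without the approximation error of naive DDIM inversion.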

Journey to the BAOAB-limit: finding effective MCMC samplers for score-based models

This work explores MCMC sampling algorithms that operate at a single noise level, yet synthesize images with acceptable sample quality, and begins to approach competitive sample quality without using scores at large noise levels.
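The BAOAB scheme named in the title is a standard splitting of underdamped Langevin dynamics (B: momentum half-kick from the force, A: position half-drift, O: exact Ornstein-Uhlenbeck friction-plus-noise step). A minimal generic sketch targeting a 1-D standard Gaussian, where the score is simply -x — the step size and friction constant are illustrative choices, not the paper's tuned values:

```python
import math, random

def baoab_step(x, v, grad_log_p, h=0.1, gamma=1.0):
    """One BAOAB update; grad_log_p is the score (force) of the target."""
    v += 0.5 * h * grad_log_p(x)           # B: half kick
    x += 0.5 * h * v                       # A: half drift
    c = math.exp(-gamma * h)               # O: exact OU friction + noise
    v = c * v + math.sqrt(1.0 - c * c) * random.gauss(0.0, 1.0)
    x += 0.5 * h * v                       # A: half drift
    v += 0.5 * h * grad_log_p(x)           # B: half kick
    return x, v

random.seed(0)
score = lambda x: -x                       # score of a standard normal
x, v = 0.0, 0.0
samples = []
for i in range(60000):
    x, v = baoab_step(x, v, score)
    if i >= 10000:                          # discard burn-in
        samples.append(x)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Run long enough, the chain's samples approximate the target's mean (0) and variance (1); the paper's contribution is studying such samplers when the score is only available at a single noise level, which this generic sketch does not model.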

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

This paper explores an alternative method for 3D object generation that produces 3D models in only 1-2 minutes on a single GPU and is one to two orders of magnitude faster to sample from, offering a practical trade-off for some use cases.