StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation

Umut Kocasari, Alara Dirik, Mert Tiftikci, Pinar Yanardag. WACV 2022.

Discovering meaningful directions in the latent space of GANs to manipulate semantic attributes typically requires large amounts of labeled data. Recent work aims to overcome this limitation by leveraging the power of Contrastive Language-Image Pre-training (CLIP), a joint text-image model. While promising, these methods require several hours of preprocessing or training to achieve the desired manipulations. In this paper, we present StyleMC, a fast and efficient method for text-driven image…

Bridging CLIP and StyleGAN through Latent Alignment for Image Editing

By bridging CLIP and StyleGAN through Latent Alignment (CSLA), this paper mines diverse manipulation directions without any inference-time optimization, and can achieve GAN inversion, text-to-image generation, and text-driven image manipulation.

clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP

We introduce a new method to efficiently create text-to-image models from a pretrained CLIP and StyleGAN. It enables text-driven sampling with an existing generative model without any external data or…

Rank in Style: A Ranking-based Approach to Find Interpretable Directions

A method is proposed for automatically determining the most relevant and successful text-based edits for a pre-trained StyleGAN model, using a ranking method that identifies such edits from a list of keywords.

StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets

The final model, StyleGAN-XL, sets a new state of the art on large-scale image synthesis and is the first to generate images at a resolution of 1024×1024 at such a dataset scale.

CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Image Manipulation

This paper introduces CLIP projection-augmentation embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation and quantitatively and qualitatively demonstrates that PAE facilitates a more disentangled, interpretable, and controllable image manipulation with state-of-the-art quality and accuracy.

Text and Image Guided 3D Avatar Generation and Manipulation

This work proposes a novel 3D manipulation method that can manipulate both the shape and texture of a model using text- or image-based prompts such as 'a young face' or 'a surprised face', leveraging the power of the Contrastive Language-Image Pre-training (CLIP) model and a pre-trained 3D GAN model designed to generate face avatars in order to manipulate meshes.

Referring Object Manipulation of Natural Images with Conditional Classifier-Free Guidance

This work proposes a conditional classifier-free guidance scheme to better guide the diffusion process along the direction from the referring expression to the target prompt, and shows that the proposed framework can serve as a simple but strong baseline for referring object manipulation.

StyleGAN-Human: A Data-Centric Odyssey of Human Generation

This work takes a data-centric perspective and investigates multiple critical aspects of "data engineering", which it believes would complement current practice and improve generation quality for rare face poses compared to the long-tailed counterpart.

PaintInStyle: One-Shot Discovery of Interpretable Directions by Painting

This work proposes a framework that finds a specific manipulation direction using only a single simple sketch drawn on an image and performs image manipulations comparable with state-of-the-art methods.

ClipFace: Text-guided Editing of Textured 3D Morphable Models

A neural network is proposed that predicts both texture and expression latent codes of the morphable model of faces, to enable high-quality texture generation for 3D faces by adversarial self-supervised training, guided by differentiable rendering against collections of real RGB images.

TediGAN: Text-Guided Diverse Image Generation and Manipulation

This work proposes TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions, and proposes the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation map, sketch, and textual descriptions.

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

This work explores leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort.
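The text-direction idea shared by StyleCLIP (and several entries above, such as StyleMC) can be sketched in a few lines: encode a source and a target prompt, take the normalized difference as an edit direction, and move a latent/style code along it with some strength. The sketch below is illustrative only — it uses NumPy with random placeholder vectors in place of real CLIP embeddings and StyleGAN style codes, and it conflates CLIP space and style space for simplicity (the actual methods map the CLIP-space direction into StyleGAN's latent or style space).

```python
import numpy as np

# Hypothetical stand-ins for CLIP text embeddings. In the actual methods these
# would come from a CLIP text encoder (e.g. for "a face" and "a smiling face");
# here they are random vectors purely for illustration.
rng = np.random.default_rng(0)
dim = 512
e_source = rng.standard_normal(dim)  # embedding of the neutral prompt
e_target = rng.standard_normal(dim)  # embedding of the target prompt

# The global edit direction is the normalized difference of the embeddings.
delta = e_target - e_source
delta /= np.linalg.norm(delta)

# A hypothetical style code; real style codes live in StyleGAN's W/S space.
s = rng.standard_normal(dim)

# Apply the edit by moving the code along the direction with strength alpha.
alpha = 5.0
s_edit = s + alpha * delta

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The edited code is more aligned with the text direction than the original.
print(cos(s, delta), cos(s_edit, delta))
```

The strength `alpha` plays the same role as the manipulation-strength knob exposed by these methods: larger values push the attribute further at the risk of entangling unrelated attributes.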

Analyzing and Improving the Image Quality of StyleGAN

This work redesigns the generator normalization, revisits progressive growing, and regularizes the generator to encourage good conditioning in the mapping from latent codes to images, thereby redefining the state of the art in unconditional image modeling.

ManiGAN: Text-Guided Image Manipulation

A novel generative adversarial network (ManiGAN) is proposed, containing two key components: a text-image affine combination module (ACM) and a detail correction module (DCM); it selects image regions relevant to the given text and then correlates those regions with corresponding semantic words for effective manipulation.

Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation

A new word-level discriminator is proposed that provides the generator with fine-grained training feedback at the word level, facilitating the training of a lightweight generator that has a small number of parameters but can still correctly focus on specific visual attributes of an image and edit them without affecting other content not described in the text.

Designing an encoder for StyleGAN image manipulation

This paper carefully studies the latent space of StyleGAN, the state-of-the-art unconditional generator, and suggests two principles for designing encoders in a manner that allows one to control the proximity of the inversions to regions that StyleGAN was originally trained on.

Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation

We present a generic image-to-image translation framework, pixel2style2pixel (pSp). Our pSp framework is based on a novel encoder network that directly generates a series of style vectors which are…

Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language

The text-adaptive generative adversarial network (TAGAN) is proposed to generate semantically manipulated images while preserving text-irrelevant contents of the original image.

StyleSpace Analysis: Disentangled Controls for StyleGAN Image Generation

The latent style space of StyleGAN2, a state-of-the-art architecture for image generation, is explored, and StyleSpace, the space of channel-wise style parameters, is shown to be significantly more disentangled than the other intermediate latent spaces explored by previous works.

High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs

A new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs) is presented, which significantly outperforms existing methods, advancing both the quality and the resolution of deep image synthesis and editing.