• Corpus ID: 236447387

Segmentation in Style: Unsupervised Semantic Image Segmentation with Stylegan and CLIP

Daniil Pakhomov, Sanchit Hira, Narayani Wagle, Kemar E. Green, Nassir Navab
We introduce a method that automatically segments images into semantically meaningful regions without human supervision. The derived regions are consistent across different images and coincide with human-defined semantic classes on some datasets. The method is particularly useful in cases where labelling and defining semantic regions is challenging for humans. In our work, we use a pretrained StyleGAN2 [8] generative model: clustering in the feature space of the generative…
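The abstract is cut off, so the exact pipeline is not visible here. As a rough illustration of what "clustering in the feature space" of a generator can look like, the sketch below runs plain k-means over per-pixel feature vectors pooled from several images, so that cluster ids are shared across images. Everything in it is an assumption made for illustration: the synthetic array stands in for real StyleGAN2 activations, and the names `kmeans`, `assign`, and `cluster_feature_maps` are hypothetical, not taken from the paper.

```python
import numpy as np


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on an (N, C) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (astype makes a copy).
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(np.float64)
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(points, centroids)


def assign(points, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per point."""
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def cluster_feature_maps(features, k):
    """Jointly cluster per-pixel features of shape (N, C, H, W) into k regions.

    Returns an (N, H, W) integer array of cluster ids shared across images.
    """
    n, c, h, w = features.shape
    # Move channels last so each row of `pixels` is one pixel's feature vector.
    pixels = features.transpose(0, 2, 3, 1).reshape(-1, c)
    _, labels = kmeans(pixels, k)
    return labels.reshape(n, h, w)


# Hypothetical stand-in for generator activations: two synthetic "regions".
rng = np.random.default_rng(1)
features = rng.normal(0.0, 0.1, size=(4, 8, 16, 16))
features[:, :, :, 8:] += 5.0  # right half of every map differs from the left
masks = cluster_feature_maps(features, k=2)
print(masks.shape)
```

Because all images are clustered jointly, the same cluster id refers to the same kind of region in every mask, which is one simple way to obtain regions that are consistent across images.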
CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
This work presents a simple yet effective method for zero-shot text-to-shape generation based on a two-stage training process, which depends only on an unlabelled shape dataset and a pre-trained image-text network such as CLIP.
CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP
CLOOB and CLIP are compared, after training on the Conceptual Captions and YFCC datasets, with respect to their zero-shot transfer learning performance on other datasets; CLOOB consistently outperforms CLIP across all considered architectures and datasets.
LAFITE: Towards Language-Free Training for Text-to-Image Generation
This work proposes the first method to train text-to-image generation models without any text data; it leverages the well-aligned multimodal semantic space of the powerful pre-trained CLIP model and can also be applied to fine-tuning pretrained models, which saves both training time and cost.


DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort
This work introduces DatasetGAN, an automatic procedure for generating massive datasets of high-quality semantically segmented images with minimal human effort; it is on par with fully supervised methods, which in some cases require as much as 100x more annotated data than this method.
Semantic Segmentation with Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization
This paper proposes a novel framework for discriminative pixel-level tasks using a generative model of both images and labels that captures the joint image-label distribution and is trained efficiently using a large set of unlabeled images supplemented with only a few labeled ones.
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
This work addresses the task of semantic image segmentation with deep learning, proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, and improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
The Cityscapes Dataset for Semantic Urban Scene Understanding
This work introduces Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling, and exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
This work explores leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models to develop a text-based interface for StyleGAN image manipulation that does not require manual effort.
Pyramid Scene Parsing Network
This paper exploits the capability of global context information by different-region-based context aggregation through the pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet) to produce good quality results on the scene parsing task.
Analyzing and Improving the Image Quality of StyleGAN
This work redesigns the generator normalization, revisits progressive growing, and regularizes the generator to encourage good conditioning in the mapping from latent codes to images, and thereby redefines the state of the art in unconditional image modeling.
MaskGAN: Towards Diverse and Interactive Facial Image Manipulation
This work proposes a novel framework termed MaskGAN, enabling diverse and interactive face manipulation, and finds that semantic masks serve as a suitable intermediate representation for flexible face manipulation with fidelity preservation.
A Style-Based Generator Architecture for Generative Adversarial Networks
Tero Karras, S. Laine, Timo Aila. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
An alternative generator architecture for generative adversarial networks is proposed, borrowing from style transfer literature, that improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation.