Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

Ryan Burgert, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo
Recent diffusion-based generative models combined with vision-language models are capable of creating realistic images from natural language prompts. While these models are trained on large internet-scale datasets, such pre-trained models receive no direct supervision for semantic localization or grounding. Most current approaches to localization or grounding rely on human-annotated localization information in the form of bounding boxes or segmentation masks. The exceptions are a few…

Language-driven Semantic Segmentation

It is demonstrated that the LSeg approach achieves highly competitive zero-shot performance compared to existing zero- and few-shot semantic segmentation methods, and even matches the accuracy of traditional segmentation algorithms when a fixed label set is provided.

LAION-5B: An open large-scale dataset for training next generation image-text models

This work presents LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English text; it shows successful replication and fine-tuning of foundational models such as CLIP, GLIDE and Stable Diffusion using the dataset, and discusses further experiments enabled by an openly available dataset of this scale.

CRIS: CLIP-Driven Referring Image Segmentation

This paper designs a vision-language decoder that propagates fine-grained semantic information from textual representations to each pixel-level activation, promoting consistency between the two modalities, and introduces text-to-pixel contrastive learning.

Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples

This paper proposes a novel approach for creating semantic segmentation masks for every object, without the need for training segmentation networks or seeing any segmentation mask, and is shown quantitatively and qualitatively to outperform methods that use a similar amount of supervision.

Segmentation from Natural Language Expressions

An end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information is proposed; it can produce high-quality segmentation output from the natural language expression and outperforms baseline methods by a large margin.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.

Perceptual Grouping in Vision-Language Models

This work examines how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery, and proposes a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.

Dynamic Multimodal Instance Segmentation guided by natural language queries

The problem of segmenting an object given a natural language expression that describes it is addressed, and a novel method is proposed that integrates linguistic and visual information along the channel dimension and exploits the intermediate features generated when downsampling the image, so that detailed segmentations can be obtained.

Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals

A two-step framework is presented that adopts a predetermined mid-level prior in a contrastive optimization objective to learn pixel embeddings; the work argues for the importance of a prior that contains information about objects or their parts, and discusses several possibilities for obtaining such a prior in an unsupervised manner.

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
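The caption-matching pre-training task described above is typically implemented as a symmetric contrastive loss over the batch similarity matrix: each image's embedding should score highest against its own caption, and vice versa. The sketch below is illustrative only (the function name, temperature value, and NumPy formulation are assumptions, not CLIP's actual code):

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def cross_entropy(l):
        # The correct match for row i is column i (the diagonal).
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Matched pairs should yield a lower loss than randomly paired embeddings, which is what makes the objective a usable training signal at scale.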