• Corpus ID: 247026092

GroupViT: Semantic Segmentation Emerges from Text Supervision

@inproceedings{xu2022groupvit,
  title={GroupViT: Semantic Segmentation Emerges from Text Supervision},
  author={Jiarui Xu and Shalini De Mello and Sifei Liu and Wonmin Byeon and Thomas Breuel and Jan Kautz and X. Wang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}
Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. In end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring the grouping mechanism back into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping…
Weakly-supervised segmentation of referring expressions
Text grounded semantic SEGmentation (TSEG) is proposed, which learns segmentation masks directly from image-level referring expressions without pixel-level annotations and demonstrates promising results for weakly-supervised referring expression segmentation on the challenging PhraseCut and RefCOCO datasets.
SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
SAVi++, an object-centric video model trained to predict depth signals from a slot-based video representation, is introduced; it can segment complex dynamic scenes recorded with moving cameras, containing both static and moving objects of diverse appearance on naturalistic backgrounds, without the need for segmentation supervision.
ReCo: Retrieve and Co-segment for Zero-shot Transfer
This work leverages the retrieval abilities of one language-image pretrained model, CLIP, to dynamically curate training sets from unlabelled images for arbitrary collections of concept names, and leverages the robust correspondences offered by modern image representations to co-segment entities among the resulting collections.
Clustering as Attention: Unified Image Segmentation with Hierarchical Clustering
This work proposes a hierarchical clustering-based image segmentation scheme for deep neural networks, called HCFormer, which removes the pixel decoder from conventional segmentation models and simplifies the segmentation pipeline, resulting in improved segmentation accuracies and interpretability.
The Pascal Visual Object Classes (VOC) Challenge
The state-of-the-art in evaluated methods for both classification and detection is reviewed, along with whether the methods are statistically different, what they are learning from the images, and what the methods find easy or confusing.
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
The results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
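The pre-training task summarized above pairs each image with its caption via a symmetric contrastive objective. As a rough illustration only (not CLIP's actual implementation; the function name and the temperature value are assumptions for the sketch), the loss over a batch of paired embeddings might look like:

```python
import numpy as np

def contrastive_pair_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) embedding pairs.

    image_emb, text_emb: (N, D) arrays where row i of each comes from the same pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix

    def cross_entropy_diag(l):
        # Cross-entropy with the matching pair (the diagonal) as the target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -log_probs[idx, idx].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

With perfectly aligned, mutually orthogonal embeddings the loss approaches zero, while mismatched pairs drive it up; this is the "predict which caption goes with which image" task in loss form.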
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
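The "adaptive estimates of lower-order moments" described above are exponential moving averages of the gradient and its square, with a bias correction for their zero initialization. A minimal single-step sketch (default hyperparameters follow the paper; the function name is ours):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)             # bias correction: moments start at 0
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

For example, minimizing f(x) = x² with gradient 2x: initialize m = v = 0, then call `adam_step` in a loop with increasing `t`; the per-coordinate step size stays roughly bounded by `lr` regardless of the gradient's scale.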
YFCC100M: the new data in multimedia research
This publicly available curated dataset of almost 100 million photos and videos is free and legal for all.
A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model
This paper rejects the prevalent one-stage FCN-based framework and advocates a two-stage semantic segmentation framework, with the first stage extracting generalizable mask proposals and the second stage leveraging an image-based CLIP model to perform zero-shot classification on the masked image crops generated in the first stage.
Open-Vocabulary Image Segmentation
This work is the first to perform zero-shot transfer on holdout segmentation datasets, and finds the mask representations are the key to support learning from captions, making it possible to scale up the dataset and vocabulary sizes.
Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples
This paper proposes a novel approach for creating semantic segmentation masks for every object, without the need for training segmentation networks or seeing any segmentation mask, and is shown quantitatively and qualitatively to outperform methods that use a similar amount of supervision.
DenseCLIP: Extract Free Dense Labels from CLIP
The finding suggests that DenseCLIP, which extracts dense labels from Contrastive Language-Image Pre-training (CLIP) models, can serve as a new reliable source of supervision for dense prediction tasks, achieving annotation-free semantic segmentation.