Open Vocabulary Scene Parsing

@inproceedings{Zhao2017OpenVS,
  title={Open Vocabulary Scene Parsing},
  author={Hang Zhao and Xavier Puig and Bolei Zhou and Sanja Fidler and Antonio Torralba},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={2021-2029}
}
Recognizing arbitrary objects in the wild has been a challenging problem due to the limitations of existing classification models and datasets. In this paper, we propose a new task that aims at parsing scenes with a large and open vocabulary, and we explore several evaluation metrics for this problem. Our approach is a framework that jointly embeds image pixels and word concepts, where the word concepts are connected by semantic relations. We validate the open vocabulary prediction ability of our…
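
As a rough illustration of the joint pixel/word-concept embedding idea, the sketch below embeds every pixel and every vocabulary concept into a shared space and labels each pixel by its most similar concept. It is a minimal toy under my own assumptions (a one-layer backbone, cosine-similarity scoring, arbitrary sizes), not the authors' implementation, which also exploits the semantic relations between concepts.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 300):
        super().__init__()
        # Toy backbone mapping an image to one embedding per pixel;
        # a real system would use a full segmentation network here.
        self.backbone = nn.Conv2d(3, embed_dim, kernel_size=3, padding=1)
        # One learnable vector per word concept in the open vocabulary.
        self.concepts = nn.Embedding(vocab_size, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(image)                                    # (B, D, H, W)
        pixels = F.normalize(feats.flatten(2).transpose(1, 2), dim=-1)  # (B, H*W, D)
        concepts = F.normalize(self.concepts.weight, dim=-1)            # (V, D)
        return pixels @ concepts.t()                                    # (B, H*W, V) cosine scores

model = JointEmbedding(vocab_size=1000)
scores = model(torch.randn(1, 3, 32, 32))
labels = scores.argmax(dim=-1)   # per-pixel concept index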

Citations

Semantic Understanding of Scenes Through the ADE20K Dataset
TLDR
This work presents a densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts, and shows that the networks trained on this dataset are able to segment a wide variety of scenes and objects.
Semantic combined network for zero-shot scene parsing
TLDR
A novel framework called semantic combined network (SCN) is proposed, which aims at learning a scene parsing model only from images of the seen classes while targeting the unseen ones; it performs well in both the zero-shot scene parsing (ZSSP) and generalised ZSSP settings on top of several state-of-the-art scene parsing architectures.
Open-Vocabulary Image Segmentation
TLDR
This work is the first to perform zero-shot transfer on holdout segmentation datasets, and finds that mask representations are the key to supporting learning from captions, making it possible to scale up the dataset and vocabulary sizes.
ConceptMask: Large-Scale Segmentation from Semantic Concepts
TLDR
This work formulates semantic segmentation as the problem of image segmentation given a semantic concept, and proposes a novel system that can potentially handle an unlimited number of concepts, including objects, parts, stuff, and attributes.
Zero-Shot Semantic Segmentation via Variational Mapping
TLDR
This paper introduces zero-shot semantic segmentation, in which a model must segment target classes it has never seen during training, and proposes variational mapping, which facilitates effective learning by mapping class label embedding vectors from the semantic space to the visual space.
Scene Parsing with Deep Features and Per-Exemplar Detectors
TLDR
An approach that combines a convolutional neural network with per-exemplar detectors for scene parsing and integrates multiple cues into a Markov Random Field (MRF) framework is presented.
Open-Vocabulary Object Detection via Vision and Language Knowledge Distillation
TLDR
The ViLD method distills the knowledge from a pretrained open-vocabulary image classification model (the teacher) into a two-stage detector (the student): the teacher model encodes category texts and the image regions of object proposals, and the resulting detector can transfer directly to other datasets without finetuning.
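
As a toy sketch of the distillation step the summary above describes, the loss below pulls the detector's region embeddings towards the teacher's embeddings of the same proposals; the choice of an L1 distance and all names here are illustrative assumptions, not the paper's exact formulation.

import torch

def distill_region_embeddings(region_embs: torch.Tensor,
                              teacher_embs: torch.Tensor) -> torch.Tensor:
    # region_embs: (R, D) student embeddings for R object proposals.
    # teacher_embs: (R, D) embeddings of the same cropped proposals from a
    # pretrained open-vocabulary image classifier (the teacher).
    # Aligning the two lets category *text* embeddings classify regions at
    # test time, including categories the detector was never trained on.
    return (region_embs - teacher_embs).abs().mean()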
PromptDet: Expand Your Detector Vocabulary with Uncurated Images
TLDR
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories using zero manual annotations; extensive experiments on the challenging LVIS and MS-COCO datasets validate the necessity of the proposed components.
Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation
TLDR
This work addresses generalized zero-shot semantic segmentation (GZS3), predicting pixel-wise semantic labels for seen and unseen classes, by exploiting semantic prototypes as a nearest-neighbor (NN) classifier, and introduces boundary-aware regression and semantic consistency losses to learn discriminative features.
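
A rough sketch of the nearest-neighbour step mentioned above: pixel features are assigned to the closest class prototype in the joint embedding space (the shapes, the Euclidean metric, and all names are my assumptions).

import torch

def nn_classify(pixel_feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    # pixel_feats: (N, D) per-pixel features; prototypes: (C, D), one
    # semantic prototype per class, covering seen and unseen classes.
    dists = torch.cdist(pixel_feats, prototypes)   # (N, C) pairwise distances
    return dists.argmin(dim=-1)                    # nearest prototype per pixel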
The devil is in the labels: Semantic segmentation from sentences
TLDR
An approach to semantic segmentation is presented that achieves state-of-the-art supervised performance when applied in a zero-shot setting and enables impressive performance improvements in downstream applications, including depth estimation and instance segmentation.
...

References

Showing 1-10 of 38 references
Semantic Understanding of Scenes Through the ADE20K Dataset
TLDR
This work presents a densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts, and shows that the networks trained on this dataset are able to segment a wide variety of scenes and objects.
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
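
The ranking objective behind DeViSE can be sketched as a hinge rank loss that scores the true label's word vector above every wrong label's by a margin; the margin value and shapes below are illustrative assumptions.

import torch

def hinge_rank_loss(img_emb: torch.Tensor, label_vec: torch.Tensor,
                    wrong_vecs: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    # img_emb: (D,) image embedding projected into the word-vector space.
    # label_vec: (D,) word vector of the true label.
    # wrong_vecs: (K, D) word vectors of contrastive (wrong) labels.
    pos = label_vec @ img_emb        # similarity to the true label
    neg = wrong_vecs @ img_emb       # (K,) similarities to wrong labels
    return torch.clamp(margin - pos + neg, min=0).sum()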
Order-Embeddings of Images and Language
TLDR
A general method for learning ordered representations is introduced, and it is shown that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.
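
The ordered representations mentioned above can be sketched with this paper's order-violation penalty: a positive pair (x, y), with x more specific than y, should satisfy y <= x in every coordinate, and violations are penalised. The margin value and names below are my notation, not the paper's code.

import torch
import torch.nn.functional as F

def order_penalty(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Zero exactly when y <= x holds coordinate-wise; grows with the violation.
    return F.relu(y - x).pow(2).sum(dim=-1)

def order_loss(pos_x, pos_y, neg_x, neg_y, alpha: float = 1.0) -> torch.Tensor:
    # Max-margin objective: drive the penalty to zero on true (e.g. hypernym)
    # pairs and above the margin `alpha` on corrupted negative pairs.
    return order_penalty(pos_x, pos_y).sum() + \
           F.relu(alpha - order_penalty(neg_x, neg_y)).sum()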
Predicting Deep Zero-Shot Convolutional Neural Networks Using Textual Descriptions
TLDR
A new model is presented that can classify unseen categories from their textual description; it takes advantage of the architecture of CNNs and learns features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches.
The Cityscapes Dataset
TLDR
This paper aims to provide a large and diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations in addition to a larger set of weakly annotated frames, an order of magnitude larger than similar previous attempts.
Zero-Shot Learning by Convex Combination of Semantic Embeddings
TLDR
A simple method is proposed for constructing an image embedding system from any existing image classifier and a semantic word embedding model that contains the n class labels in its vocabulary; it outperforms state-of-the-art methods on the ImageNet zero-shot learning task.
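
The convex combination itself is simple enough to sketch: the test image's embedding is the probability-weighted mean of the word vectors of its top-T seen classes, and unseen labels are then ranked by cosine similarity (T and all names are illustrative assumptions).

import torch

def convex_combination(probs: torch.Tensor, seen_vecs: torch.Tensor,
                       T: int = 10) -> torch.Tensor:
    # probs: (n,) classifier probabilities over the n seen classes.
    # seen_vecs: (n, D) word embeddings of the seen class labels.
    w, idx = probs.topk(T)            # most confident seen classes
    w = w / w.sum()                   # renormalise into convex weights
    return w @ seen_vecs[idx]         # (D,) predicted image embedding

def predict_unseen(img_vec: torch.Tensor, unseen_vecs: torch.Tensor) -> int:
    # Rank unseen labels by cosine similarity to the combined embedding.
    sims = torch.nn.functional.cosine_similarity(unseen_vecs, img_vec.unsqueeze(0))
    return int(sims.argmax())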
Deep Metric Learning Using Triplet Network
TLDR
This paper proposes the triplet network model, which aims to learn useful representations by distance comparisons, and demonstrates using various datasets that this model learns a better representation than that of its immediate competitor, the Siamese network.
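
A minimal form of the distance-comparison objective described above (the margin value and the embedding network are left as assumptions):

import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    # Pull the anchor towards the positive example and push it away from the
    # negative one until the two distances differ by at least `margin`.
    d_pos = F.pairwise_distance(anchor, positive)   # (B,)
    d_neg = F.pairwise_distance(anchor, negative)   # (B,)
    return F.relu(d_pos - d_neg + margin).mean()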
Zero-Shot Learning Through Cross-Modal Transfer
TLDR
This work introduces a model that can recognize objects in images even if no training data is available for the object class, and uses novelty detection methods to differentiate unseen classes from seen classes.
Attribute-Based Classification for Zero-Shot Visual Object Categorization
We study the problem of object recognition for categories for which we have no training examples, a task also called zero-data or zero-shot learning. This situation has hardly been studied in…
Deep Visual-Semantic Alignments for Generating Image Descriptions
A. Karpathy, Li Fei-Fei · IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
TLDR
A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.
...