Weakly-supervised segmentation of referring expressions

Robin Strudel, Ivan Laptev, Cordelia Schmid

Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions. In this work we address image segmentation from referring expressions, a problem that has so far only been addressed in a fully-supervised setting. A fully-supervised setup, however, requires pixel-wise supervision and is hard to scale given the expense of manual annotation. We therefore introduce a new task of weakly-supervised image segmentation from referring expressions and…


Learning Pixel-Level Semantic Affinity with Image-Level Supervision for Weakly Supervised Semantic Segmentation

  • Jiwoon Ahn, Suha Kwak
  • Computer Science
    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
On the PASCAL VOC 2012 dataset, a DNN trained with segmentation labels generated by the method outperforms previous models trained with the same level of supervision, and is even competitive with those relying on stronger supervision.

Single-Stage Semantic Segmentation From Image Labels

  • Nikita Araslanov, S. Roth
  • Computer Science
    2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
This work develops a segmentation-based network model and a self-supervised training scheme that learns semantic masks from image-level annotations in a single stage. Despite its simplicity, the method achieves results competitive with significantly more complex pipelines, substantially outperforming earlier single-stage methods.

Weakly Supervised Learning of Instance Segmentation With Inter-Pixel Relations

IRNet is proposed, which estimates rough areas of individual instances and detects boundaries between different object classes, enabling instance labels to be assigned to the seeds and propagated within the boundaries so that the entire areas of instances can be estimated accurately.

Railroad is not a Train: Saliency as Pseudo-pixel Supervision for Weakly Supervised Semantic Segmentation

A novel framework, Explicit Pseudo-pixel Supervision (EPS), is proposed that learns from pixel-level feedback by combining two weak supervision signals: the image-level label provides the object identity via the localization map, while the saliency map from an off-the-shelf saliency detection model offers rich boundary information.

Associating Inter-image Salient Instances for Weakly Supervised Semantic Segmentation

This paper uses an instance-level salient object detector to automatically generate salient instances (candidate objects) for training images, and proposes a graph-partitioning-based clustering algorithm that outperforms state-of-the-art weakly supervised alternatives by a large margin.

Discriminative Region Suppression for Weakly-Supervised Semantic Segmentation

This work introduces a discriminative region suppression (DRS) module, a simple yet effective method to expand object activation regions, together with an additional learning strategy, named localization map refinement learning, that self-enhances localization maps.

Segmenter: Transformer for Semantic Segmentation

This paper introduces Segmenter, a transformer model for semantic segmentation that outperforms the state of the art on both ADE20K and Pascal Context datasets and is competitive on Cityscapes.

Weakly-and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation

Expectation-Maximization (EM) methods are developed for training semantic image segmentation models under weakly supervised and semi-supervised settings, and extensive experimental evaluation shows that the proposed techniques learn models delivering competitive results on the challenging PASCAL VOC 2012 image segmentation benchmark while requiring significantly less annotation effort.

Cross-Modal Self-Attention Network for Referring Image Segmentation

A cross-modal self-attention (CMSA) module is proposed that effectively captures the long-range dependencies between linguistic and visual features, together with a gated multi-level fusion module that selectively integrates self-attentive cross-modal features corresponding to different levels in the image.

Weakly-Supervised Visual Grounding of Phrases with Linguistic Structures

An end-to-end model is introduced that learns visual groundings of phrases with two carefully designed loss functions: one ensures complementarity among the attention masks corresponding to sibling noun phrases, and the other enforces compositionality of attention masks between children and parent phrases.