Corpus ID: 236987044

Vision-Language Transformer and Query Generation for Referring Segmentation

Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang
In this work, we address the challenging task of referring segmentation. The query expression in referring segmentation typically indicates the target object by describing its relationship with others. Therefore, to find the target one among all instances in the image, the model must have a holistic understanding of the whole image. To achieve this, we reformulate referring segmentation as a direct attention problem: finding the region in the image where the query language expression is most… 
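The "direct attention" reformulation described above can be illustrated with a minimal NumPy sketch: language-derived query vectors attend over flattened image features, and the attention mass per spatial location serves as a soft response map. The function name, shapes, and query count here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def cross_attention_mask(lang_queries, img_feats, h, w):
    """Score each image location by its attention to language queries.

    lang_queries: (Q, D) query vectors derived from the expression (assumed given).
    img_feats:    (H*W, D) flattened visual feature map.
    Returns an (h, w) map of average attention mass per location.
    """
    d = lang_queries.shape[-1]
    # Scaled dot-product scores between each query and each location.
    scores = lang_queries @ img_feats.T / np.sqrt(d)          # (Q, H*W)
    # Softmax over spatial locations (numerically stabilized).
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # Average over queries and reshape back to the spatial grid.
    return attn.mean(axis=0).reshape(h, w)

rng = np.random.default_rng(0)
mask = cross_attention_mask(rng.normal(size=(3, 16)),
                            rng.normal(size=(8 * 8, 16)), 8, 8)
```

In the actual model, such a response map would be refined by a decoder into a binary segmentation; this sketch only shows the core query-to-image attention step.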


Language as Queries for Referring Video Object Segmentation
This work proposes a simple and unified framework built upon Transformer, termed ReferFormer, which significantly outperforms previous methods by a large margin and greatly simplifies the pipeline into an end-to-end framework.
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
This work shows that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network, and surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
CRIS: CLIP-Driven Referring Image Segmentation
This paper designs a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities and presents text-to-pixel contrastive learning.
MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation
MaIL is proposed, a more concise encoder-decoder pipeline with a Mask-Image-Language trimodal encoder that unifies uni-modal feature extractors and their fusion model into a deep modality interaction encoder, facilitating sufficient feature interaction across different modalities.
Interaction via Bi-directional Graph of Semantic Region Affinity for Scene Parsing
In this work, we address the challenging problem of scene parsing. It is well known that pixels in an image are highly correlated with each other, especially those from the same semantic…
Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters
This work argues that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal, and hypothesizes that using language to also condition the bottom-up processing from pixels to high-level features can benefit overall performance.
Knowledge-aware Deep Framework for Collaborative Skin Lesion Segmentation and Melanoma Recognition
A novel knowledge-aware deep framework is proposed that incorporates clinical knowledge into the collaborative learning of two important melanoma diagnosis tasks, i.e., skin lesion segmentation and melanoma recognition.


Dynamic Multimodal Instance Segmentation guided by natural language queries
The problem of segmenting an object given a natural language expression that describes it is addressed, and a novel method is proposed that integrates linguistic and visual information along the channel dimension, together with the intermediate information generated when downsampling the image, so that detailed segmentations can be obtained.
Key-Word-Aware Network for Referring Expression Image Segmentation
A key-word-aware network is proposed, which contains a query attention model and a key-word-aware visual context model, and which outperforms state-of-the-art methods on two referring expression image segmentation databases.
Cross-Modal Self-Attention Network for Referring Image Segmentation
A cross-modal self-attention (CMSA) module that effectively captures the long-range dependencies between linguistic and visual features, and a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels in the image.
Cascade Grouped Attention Network for Referring Expression Segmentation
A Cascade Grouped Attention Network with two innovative designs, Cascade Grouped Attention (CGA) and an Instance-level Attention (ILA) loss, that achieves the high efficiency of one-stage RES while possessing a reasoning ability comparable to two-stage methods.
Referring Image Segmentation via Cross-Modal Progressive Comprehension
A Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task of referring image segmentation.
MAttNet: Modular Attention Network for Referring Expression Comprehension
This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, allowing flexible adaptation to expressions containing different types of information in an end-to-end framework.
Segmentation from Natural Language Expressions
An end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information is proposed; it produces quality segmentation output from natural language expressions and outperforms baseline methods by a large margin.
Linguistic Structure Guided Context Modeling for Referring Image Segmentation
A "gather-propagate-distribute" scheme is proposed to model multimodal context via cross-modal interaction, and this scheme is implemented as a novel Linguistic Structure guided Context Modeling (LSCM) module.
Bi-Directional Relationship Inferring Network for Referring Image Segmentation
This work proposes a bi-directional relationship inferring network (BRINet) to model the dependencies of cross-modal information and demonstrates that the proposed method outperforms other state-of-the-art methods under different evaluation metrics.
Recurrent Multimodal Interaction for Referring Image Segmentation
It is argued that learning word-to-image interaction is more native in the sense of jointly modeling two modalities for the image segmentation task, and a convolutional multimodal LSTM to encode the sequential interactions between individual words, visual information, and spatial information is proposed.