Recurrent Multimodal Interaction for Referring Image Segmentation

@article{Liu2017RecurrentMI,
  title={Recurrent Multimodal Interaction for Referring Image Segmentation},
  author={Chenxi Liu and Zhe L. Lin and Xiaohui Shen and Jimei Yang and Xin Lu and Alan Loddon Yuille},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1280-1289}
}
In this paper we are interested in the problem of image segmentation given natural language descriptions, i.e. referring expressions. Existing works tackle this problem by first modeling images and sentences independently and then segmenting images by combining these two types of representations. We argue that learning word-to-image interaction is more native in the sense of jointly modeling two modalities for the image segmentation task, and we propose convolutional multimodal LSTM to encode the…
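The abstract above describes encoding word-to-image interaction with a convolutional multimodal LSTM that steps through the words of the expression while carrying visual and spatial information. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: at each word step the word embedding is tiled over the spatial grid, concatenated with visual features and spatial coordinates, and fed to a ConvLSTM cell whose final hidden state is decoded into a mask. All feature dimensions, the 8-channel spatial encoding, and the module names are illustrative assumptions.

```python
# Minimal sketch of a convolutional multimodal LSTM segmenter (illustrative, not the authors' code).
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # A single convolution produces the input, forget, output, and candidate gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class MultimodalLSTMSegmenter(nn.Module):
    def __init__(self, word_dim=300, vis_ch=1000, spatial_ch=8, hid_ch=500):
        super().__init__()
        self.cell = ConvLSTMCell(word_dim + vis_ch + spatial_ch, hid_ch)
        self.classifier = nn.Conv2d(hid_ch, 1, kernel_size=1)  # per-pixel mask logits

    def forward(self, vis_feat, word_embs, spatial):
        # vis_feat: (B, vis_ch, H, W), word_embs: (B, T, word_dim), spatial: (B, spatial_ch, H, W)
        B, _, H, W = vis_feat.shape
        h = vis_feat.new_zeros(B, self.cell.hid_ch, H, W)
        c = torch.zeros_like(h)
        for t in range(word_embs.size(1)):
            w = word_embs[:, t, :, None, None].expand(-1, -1, H, W)  # tile the word over the grid
            h, c = self.cell(torch.cat([w, vis_feat, spatial], dim=1), (h, c))
        return self.classifier(h)  # low-resolution logits, upsampled to image size for evaluation
```

In use, `vis_feat` would come from a CNN backbone and `word_embs` from an embedding of the referring expression, with the mask logits thresholded after upsampling; these wiring details are assumptions of the sketch, not claims about the paper.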
Attentive Excitation and Aggregation for Bilingual Referring Image Segmentation
TLDR
A bilingual referring image segmentation model is proposed that outperforms previous methods on four English and Chinese benchmarks, together with a Cross-Level Attentive Fusion module that fuses multi-level features gated by language information.
Recurrent Instance Segmentation using Sequences of Referring Expressions
TLDR
This work proposes a deep neural network with recurrent layers that output a sequence of binary masks, one for each referring expression provided by the user, and uses off-the-shelf architectures to encode both the image and the referring expressions.
Referring Image Segmentation via Recurrent Refinement Networks
TLDR
Recurrent Refinement Network (RRN) is proposed that takes pyramidal features as input to refine the segmentation mask progressively and outperforms multiple baselines and state-of-the-art models.
Dual Convolutional LSTM Network for Referring Image Segmentation
TLDR
A dual convolutional LSTM (ConvLSTM) network is proposed to tackle referring image segmentation, which is a problem at the intersection of computer vision and natural language understanding.
Multi-granularity Multimodal Feature Interaction for Referring Image Segmentation
TLDR
This work proposes to conduct multi-granularity multimodal feature interaction by introducing a Word-Granularity Feature Modulation (WGFM) module and a Sentence-Granularity Context Extraction (SGCE) module, which can be complementary in feature alignment and obtain a comprehensive understanding of the input image and referring expression.
Comprehensive Multi-Modal Interactions for Referring Image Segmentation
TLDR
This work investigates Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the natural language description, and proposes a novel Hierarchical Cross-Modal Aggregation Module (HCAM), where linguistic features facilitate the exchange of contextual information across the visual hierarchy.
Global Context Enhanced Multi-modal Fusion for Referring Image Segmentation
TLDR
A global fusion network (GFNet) is proposed, composed of a visual-guided global fusion module and a language-guided global fusion module, with a channel-wise self-gate on the concatenated visual-language features; it outperforms state-of-the-art methods in referring image segmentation.
Hierarchical Interaction Network for Video Object Segmentation from Referring Expressions
TLDR
This paper presents an end-to-end hierarchical interaction network (HINet) for the VOSRE problem, which leverages the feature pyramid produced by the visual encoder to generate multiple levels of multi-modal features and extracts signals of moving objects from the optical flow input.
Referring Expression Object Segmentation with Caption-Aware Consistency
TLDR
This work proposes an end-to-end trainable comprehension network that consists of the language and visual encoders to extract feature representations from both domains and introduces the spatial-aware dynamic filters to transfer knowledge from text to image, and effectively capture the spatial information of the specified object.
CRIS: CLIP-Driven Referring Image Segmentation
TLDR
This paper proposes an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS), and designs a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities (see the sketch after this list).
...
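Several of the entries above fuse language into pixel-level visual features; the CRIS summary in particular describes a vision-language decoder that propagates textual semantics to every pixel-level activation. Below is a minimal, hypothetical PyTorch sketch of that general idea, not the CRIS implementation; every shape, dimension, and module name is an assumption made for illustration. Pixel features act as queries attending over text-token features, and a 1x1 convolution maps the fused features to mask logits.

```python
# Hypothetical text-to-pixel cross-attention decoder (not the CRIS implementation).
import torch
import torch.nn as nn


class TextToPixelDecoder(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.to_mask = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, pix_feat, txt_feat):
        # pix_feat: (B, dim, H, W) visual features; txt_feat: (B, T, dim) text-token features
        B, C, H, W = pix_feat.shape
        q = pix_feat.flatten(2).transpose(1, 2)            # (B, H*W, dim) pixel queries
        attended, _ = self.cross_attn(q, txt_feat, txt_feat)
        fused = self.norm(q + attended)                    # residual fusion of text into pixels
        fused = fused.transpose(1, 2).reshape(B, C, H, W)
        return self.to_mask(fused)                         # (B, 1, H, W) mask logits
```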

References

Showing 1-10 of 45 references
Utilizing Large Scale Vision and Text Datasets for Image Segmentation from Referring Expressions
TLDR
This paper explores how existing large-scale vision-only and text-only datasets can be utilized to train models for image segmentation from referring expressions; the proposed method is shown in experiments to help this joint vision-and-language modeling task with vision-only and text-only data and to outperform previous results.
MAT: A Multimodal Attentive Translator for Image Captioning
TLDR
This work presents a sequence-to-sequence recurrent neural network (RNN) model for image caption generation that surpasses the state-of-the-art methods in all metrics following the dataset splits of previous work.
Segmentation from Natural Language Expressions
TLDR
An end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information is proposed that can produce quality segmentation output from the natural language expression, and outperforms baseline methods by a large margin.
Generation and Comprehension of Unambiguous Object Descriptions
TLDR
This work proposes a method that can generate an unambiguous description of a specific object or region in an image and which can also comprehend or interpret such an expression to infer which object is being described, and shows that this method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene.
Modeling Context in Referring Expressions
TLDR
This work focuses on incorporating better measures of visual context into referring expression models and finds that visual comparison to other objects within an image helps improve performance significantly.
Grounding of Textual Phrases in Images by Reconstruction
TLDR
A novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly, and demonstrates the effectiveness on the Flickr 30k Entities and ReferItGame datasets.
Visual7W: Grounded Question Answering in Images
TLDR
This work establishes a semantic link between textual descriptions and image regions via object-level grounding, enabling a new type of QA with visual answers in addition to the textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
TLDR
The m-RNN model directly models the probability distribution of generating a word given previous words and an image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
Fully Convolutional Networks for Semantic Segmentation
TLDR
It is shown that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation.
Long-term recurrent convolutional networks for visual recognition and description
TLDR
A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
...