MAttNet: Modular Attention Network for Referring Expression Comprehension

@inproceedings{Yu2018MAttNetMA,
  title={MAttNet: Modular Attention Network for Referring Expression Comprehension},
  author={Licheng Yu and Zhe L. Lin and Xiaohui Shen and Jimei Yang and Xin Lu and Mohit Bansal and Tamara L. Berg},
  booktitle={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2018},
  pages={1307--1315}
}
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention…
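The decomposition described in the abstract amounts to a weighted combination of per-module matching scores, with the weights predicted from the expression itself. A minimal sketch of that scoring step (the function name, toy scores, and weights below are hypothetical, not the authors' code):

```python
def mattnet_score(subj, loc, rel, weights):
    """Overall match score per candidate region: a weighted sum of the
    subject, location, and relationship module scores. The weights are
    driven by the expression (in the paper they come from a softmax,
    so they sum to 1)."""
    w_subj, w_loc, w_rel = weights
    return [w_subj * s + w_loc * l + w_rel * r
            for s, l, r in zip(subj, loc, rel)]

# Toy example: three candidate regions, an expression mostly about appearance.
subj_scores = [0.9, 0.2, 0.4]   # subject-module scores (hypothetical values)
loc_scores  = [0.1, 0.8, 0.3]   # location-module scores
rel_scores  = [0.2, 0.1, 0.7]   # relationship-module scores
weights = (0.7, 0.2, 0.1)       # expression-driven module weights

scores = mattnet_score(subj_scores, loc_scores, rel_scores, weights)
best_region = max(range(len(scores)), key=scores.__getitem__)
```

An expression like "the red shirt" would push weight onto the subject module, while "the man left of the car" would shift it toward location and relationship, which is what lets the model adapt to different expression types.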
Multi-level expression guided attention network for referring expression comprehension
A novel model, termed the Multi-level Expression Guided Attention network (MEGA-Net), which contains a multi-level visual attention schema guided by expression representations at different levels, generating discriminative region features and helping to locate the related regions accurately.
Neighbourhood Watch: Referring Expression Comprehension via Language-Guided Graph Attention Networks
A graph-based, language-guided attention mechanism that represents inter-object relationships and properties with a flexibility and power impossible with competing approaches, and enables the comprehension decision to be visualizable and explainable.
A multi-scale language embedding network for proposal-free referring expression comprehension
This paper proposes a multi-scale language embedding network for REC that adopts a proposal-free structure, directly feeding fused visual-language features into a detection head to predict the bounding box of the target.
Cascade Grouped Attention Network for Referring Expression Segmentation
A Cascade Grouped Attention Network with two innovative designs: Cascade Grouping Attention (CGA) and Instance-level Attention (ILA) loss, which can achieve the high efficiency of one-stage RES while possessing a strong reasoning ability comparable to the two-stage methods.
Attribute-Guided Attention for Referring Expression Generation and Comprehension
In this work, an attribute-guided attention module is proposed as a bridging part to link the counterparts in visual representation and textual expression in referring expressions.
Referring Expression Object Segmentation with Caption-Aware Consistency
This work proposes an end-to-end trainable comprehension network that consists of language and visual encoders to extract feature representations from both domains, and introduces spatial-aware dynamic filters to transfer knowledge from text to image and effectively capture the spatial information of the specified object.
Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention
This paper presents a proposal-free one-stage (PFOS) model that is able to regress the region-of-interest from the image, based on a textual query, in an end-to-end manner, and achieves state-of-the-art performance on four referring expression datasets with higher efficiency compared to the previous best one-stage and two-stage methods.
Understanding Synonymous Referring Expressions via Contrastive Features
This work develops an end-to-end trainable framework to learn contrastive features on the image and object instance levels, where features extracted from synonymous sentences to describe the same object should be closer to each other after mapping to the visual domain.
Cross-Modal Self-Attention Network for Referring Image Segmentation
A cross-modal self-attention (CMSA) module that effectively captures the long-range dependencies between linguistic and visual features, and a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels in the image.
Vision-Language Transformer and Query Generation for Referring Segmentation
  • Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang
  • ArXiv, 2021
A Transformer with multi-head attention is introduced, and a Query Generation Module is proposed that produces multiple sets of queries with different attention weights, representing the diversified comprehensions of the language expression from different aspects.

References

Showing 1–10 of 37 references
Modeling Relationships in Referential Expressions with Compositional Modular Networks
This paper presents a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene.
Comprehension-Guided Referring Expressions
A comprehension module trained on human-generated expressions serves as a differentiable proxy of human evaluation, providing training signal to the generation module, and it is shown that both approaches lead to improved referring expression generation on multiple benchmark datasets.
Image Captioning and Visual Question Answering Based on Attributes and External Knowledge
A visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions, and allows questions to be asked where the image alone does not contain the information required to select the appropriate answer.
Image Captioning with Semantic Attention
This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
Segmentation from Natural Language Expressions
An end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information is proposed, which can produce quality segmentation output from the natural language expression and outperforms baseline methods by a large margin.
Learning to Reason: End-to-End Module Networks for Visual Question Answering
End-to-End Module Networks are proposed, which learn to reason by directly predicting instance-specific network layouts without the aid of a parser, and achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches.
Learning Two-Branch Neural Networks for Image-Text Matching Tasks
This paper investigates two-branch neural networks for learning the similarity between images and text for image-sentence matching and region-phrase matching, and proposes two network structures that produce different output representations.
Referring Expression Generation and Comprehension via Attributes
The role of attributes is explored by incorporating them into both referring expression generation and comprehension: an attribute learning model is trained from visual objects and their paired descriptions, so expressions are generated driven by both attributes and the previous words.
Modeling Context in Referring Expressions
This work focuses on incorporating better measures of visual context into referring expression models and finds that visual comparison to other objects within an image helps improve performance significantly.
Grounding of Textual Phrases in Images by Reconstruction
A novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly, and demonstrates its effectiveness on the Flickr30k Entities and ReferItGame datasets.