Dynamic Graph Attention for Referring Expression Comprehension

@article{Yang2019DynamicGA,
  title={Dynamic Graph Attention for Referring Expression Comprehension},
  author={Sibei Yang and Guanbin Li and Yizhou Yu},
  journal={2019 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2019},
  pages={4643-4652}
}
Referring expression comprehension aims to locate the object instance described by a natural language referring expression in an image. This task is compositional and inherently requires visual reasoning on top of the relationships among the objects in the image. Meanwhile, the visual reasoning process is guided by the linguistic structure of the referring expression. However, existing approaches treat the objects in isolation or only explore the first-order relationships between objects…
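As a loose illustration of the graph-based reasoning this line of work relies on, below is a minimal PyTorch sketch of one step of language-guided graph attention over detected objects. All shapes, names, and the final scoring scheme are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedGraphAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Scores each directed edge (i, j) conditioned on the expression.
        self.edge_score = nn.Linear(3 * dim, 1)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, obj_feats, lang):
        n, d = obj_feats.shape
        src = obj_feats.unsqueeze(1).expand(n, n, d)   # feature of node i
        dst = obj_feats.unsqueeze(0).expand(n, n, d)   # feature of node j
        ctx = lang.view(1, 1, d).expand(n, n, d)       # expression context
        logits = self.edge_score(torch.cat([src, dst, ctx], dim=-1)).squeeze(-1)
        attn = F.softmax(logits, dim=1)                # attention over neighbours
        messages = attn @ obj_feats                    # aggregate neighbour features
        return self.update(torch.cat([obj_feats, messages], dim=-1))

obj_feats = torch.randn(5, 256)   # 5 detected objects (hypothetical features)
lang = torch.randn(256)           # pooled expression embedding
gat = LanguageGuidedGraphAttention(256)
refined = gat(obj_feats, lang)    # one reasoning step over the object graph
scores = refined @ lang           # score each object against the expression
print(scores.argmax().item())     # index of the predicted referent

Stacking several such steps, each guided by a different part of the expression, is roughly the kind of multi-step reasoning the abstract describes.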
Graph-Structured Referring Expression Reasoning in the Wild
TLDR
A scene graph guided modular network (SGMN) that performs reasoning over a semantic graph and a scene graph with neural modules, under the guidance of the linguistic structure of the expression, and significantly outperforms existing state-of-the-art algorithms on the new Ref-Reasoning dataset.
Modular Graph Attention Network for Complex Visual Relational Reasoning
TLDR
This paper considers the complex referring expression comprehension (c-REF) task, which seeks to localise target objects in an image guided by complex queries, and proposes a novel Modular Graph Attention Network (MGA-Net) that mimics the human language-understanding mechanism.
Multi-level expression guided attention network for referring expression comprehension
TLDR
A novel model, termed Multi-level Expression Guided Attention network (MEGA-Net), which contains a multi-level visual attention scheme guided by expression representations at different levels; this generates discriminative region features and helps locate the related regions accurately.
MutAtt: Visual-Textual Mutual Guidance for Referring Expression Comprehension
TLDR
This paper argues that for REC the referring expression and the target region are semantically correlated, with subject, location and relationship consistency between vision and language, and proposes a novel approach called MutAtt that constructs mutual guidance to enforce vision-language consistency, treating vision and language equally and thus yielding compact information matching.
Understanding Synonymous Referring Expressions via Contrastive Features
TLDR
This work develops an end-to-end trainable framework to learn contrastive features at the image and object-instance levels, where features extracted from synonymous sentences describing the same object should be closer to each other after mapping to the visual domain.
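A minimal sketch of such a contrastive objective, assuming sentence features already mapped into the visual domain; the shapes, the temperature value, and the batch layout are all made up for illustration:

import torch
import torch.nn.functional as F

def contrastive_loss(sent_emb, obj_emb, obj_ids, tau=0.07):
    # sent_emb: (B, D) sentence features mapped into the visual domain
    # obj_emb:  (B, D) features of the object each sentence refers to
    # obj_ids:  (B,) identity of that object, so synonymous sentences share positives
    sent = F.normalize(sent_emb, dim=-1)
    obj = F.normalize(obj_emb, dim=-1)
    logits = sent @ obj.t() / tau                                  # (B, B) similarities
    pos = (obj_ids.unsqueeze(0) == obj_ids.unsqueeze(1)).float()   # positive-pair mask
    log_prob = F.log_softmax(logits, dim=1)
    return -(pos * log_prob).sum(1).div(pos.sum(1)).mean()

sent = torch.randn(8, 256)
obj = torch.randn(8, 256)
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])   # pairs of synonymous sentences
print(contrastive_loss(sent, obj, ids).item())

Because the positive mask groups every sentence referring to the same object, synonymous expressions are pulled toward the same visual feature while others are pushed apart.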
Bottom-Up Shift and Reasoning for Referring Image Segmentation
TLDR
Experimental results demonstrate that the proposed method, consisting of BUS and BIAR modules, can not only consistently surpass all existing state-of-the-art algorithms across common benchmark datasets but also visualize interpretable reasoning steps for stepwise segmentation.
Joint Visual Grounding with Language Scene Graphs
TLDR
The missing-annotation problem is alleviated and joint reasoning is enabled by leveraging the language scene graph, which covers both the labeled referent and unlabeled contexts (other objects, attributes, and relationships).
Cross-Modal Progressive Comprehension for Referring Segmentation
TLDR
A novel and effective Cross-Modal Progressive Comprehension (CMPC) scheme that mimics human behaviors, implemented as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models.
A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension
  • Yue Liao, Si Liu, …, Bo Li
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
TLDR
A novel Real-time Cross-modality Correlation Filtering method (RCCF) that reformulates referring expression comprehension as a correlation filtering process and achieves leading performance on the RefClef, RefCOCO, RefCOCO+ and RefCOCOg benchmarks.
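The correlation-filtering idea can be sketched as projecting the expression embedding into a 1x1 convolution kernel and correlating it with the image feature map, taking the response peak as the target centre. A rough sketch under assumed shapes, not the RCCF release:

import torch
import torch.nn as nn
import torch.nn.functional as F

dim, h, w = 256, 32, 32
feat_map = torch.randn(1, dim, h, w)       # backbone feature map (made-up size)
lang = torch.randn(1, dim)                 # pooled expression embedding

to_kernel = nn.Linear(dim, dim)            # project language into a 1x1 filter
kernel = to_kernel(lang).view(1, dim, 1, 1)
response = F.conv2d(feat_map, kernel)      # (1, 1, h, w) correlation response

peak = response.flatten().argmax().item()
cy, cx = divmod(peak, w)
print(f"predicted centre: ({cx}, {cy})")

The full method also regresses box size and offset from separate heads; only the centre localisation is sketched here.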
A multi-scale language embedding network for proposal-free referring expression comprehension
TLDR
This paper proposes a multi-scale language embedding network for REC that adopts a proposal-free structure, directly feeding fused visual-language features into a detection head to predict the bounding box of the target.
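A proposal-free head of this kind can be sketched, ignoring the multi-scale aspect for brevity, as tiling the expression embedding over the visual feature map, fusing with a 1x1 convolution, and regressing one box directly; all shapes and names below are hypothetical:

import torch
import torch.nn as nn

dim, h, w = 256, 16, 16
feat = torch.randn(1, dim, h, w)                  # visual feature map
lang = torch.randn(1, dim)                        # expression embedding

fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)     # cross-modal fusion
head = nn.Linear(dim, 4)                          # (cx, cy, w, h) regression

lang_map = lang.view(1, dim, 1, 1).expand(1, dim, h, w)
fused = torch.relu(fuse(torch.cat([feat, lang_map], dim=1)))
pooled = fused.mean(dim=(2, 3))                   # global pooling
box = head(pooled).sigmoid()                      # normalised box coordinates
print(box)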

References

Showing 1–10 of 33 references
Neighbourhood Watch: Referring Expression Comprehension via Language-Guided Graph Attention Networks
TLDR
A graph-based, language-guided attention mechanism that represents inter-object relationships and properties with a flexibility and power impossible with competing approaches, and makes the comprehension decision visualizable and explainable.
Referring Expression Generation and Comprehension via Attributes
TLDR
The role of attributes is explored by incorporating them into both referring expression generation and comprehension: an attribute learning model is trained from visual objects and their paired descriptions, so that expressions are generated driven by both attributes and the previous words.
Cross-Modal Relationship Inference for Grounding Referring Expressions
TLDR
A Cross-Modal Relationship Extractor (CMRE) is proposed to adaptively highlight objects and relationships that have connections with a given expression, using a cross-modal attention mechanism, and to represent the extracted information as a language-guided visual relation graph.
Modeling Relationships in Referential Expressions with Compositional Modular Networks
TLDR
This paper presents a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene.
Parallel Attention: A Unified Framework for Visual Object Discovery Through Dialogs and Queries
TLDR
A unified framework, the ParalleL AttentioN (PLAN) network, is proposed to discover the object in an image that is referred to by variable-length natural language expressions, from short phrase queries to long multi-round dialogs.
MAttNet: Modular Attention Network for Referring Expression Comprehension
TLDR
This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, which allows the model to flexibly adapt to expressions containing different types of information in an end-to-end framework.
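The modular scoring idea can be sketched as follows: the expression produces soft weights over subject, location and relationship modules, and each candidate's score is the weighted sum of the three module scores. This is a minimal sketch with stubbed module internals and assumed shapes, not MAttNet's implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularScorer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.module_weights = nn.Linear(dim, 3)   # weights over subj/loc/rel modules
        self.subj = nn.Bilinear(dim, dim, 1)      # subject appearance vs. expression
        self.loc = nn.Bilinear(dim, dim, 1)       # location vs. expression
        self.rel = nn.Bilinear(dim, dim, 1)       # relationship vs. expression

    def forward(self, lang, subj_feat, loc_feat, rel_feat):
        n = subj_feat.size(0)
        lang_n = lang.unsqueeze(0).repeat(n, 1)
        w = F.softmax(self.module_weights(lang), dim=-1)   # (3,) expression-driven weights
        s = torch.stack([
            self.subj(subj_feat, lang_n).squeeze(-1),
            self.loc(loc_feat, lang_n).squeeze(-1),
            self.rel(rel_feat, lang_n).squeeze(-1),
        ], dim=-1)                                         # (N, 3) per-module scores
        return (s * w).sum(-1)                             # (N,) overall candidate scores

scorer = ModularScorer(256)
lang = torch.randn(256)
scores = scorer(lang, torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
print(scores.argmax().item())   # index of the best-scoring candidate

An expression like "the cup on the table" would put most of its weight on the subject and relationship modules, while "the left cup" would shift weight to the location module.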
Comprehension-Guided Referring Expressions
TLDR
A comprehension module trained on human-generated expressions serves as a differentiable proxy of human evaluation, providing a training signal to the generation module; both approaches are shown to lead to improved referring expression generation on multiple benchmark datasets.
CLEVR-Ref+: Diagnosing Visual Reasoning With Referring Expressions
Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current…
Modeling Context in Referring Expressions
TLDR
This work focuses on incorporating better measures of visual context into referring expression models and finds that visual comparison to other objects within an image helps improve performance significantly.
Visual Grounding via Accumulated Attention
TLDR
The A-ATT mechanism can circularly accumulate attention for useful information in the image, query, and objects, while the noise is gradually ignored; experimental results show the superiority of the proposed method in terms of accuracy.