Modeling Relationships in Referential Expressions with Compositional Modular Networks

Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, Kate Saenko. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a black cat entity and its relationship with another table entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of…
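
The abstract's idea of grounding an expression through a (subject, relationship, object) decomposition can be sketched as a pairwise scoring procedure. This is a minimal, hypothetical illustration: `subj_score`, `rel_score`, and `obj_score` stand in for the paper's learned modules, and all names are invented.

```python
from itertools import permutations

def score_pairs(regions, subj_score, rel_score, obj_score):
    """Return the best (subject_region, object_region) pair.

    subj_score / obj_score: region -> float (how well the region matches
    the subject / object phrase); rel_score: (region, region) -> float.
    All three scoring functions here stand in for learned modules.
    """
    best, best_pair = float("-inf"), None
    for s, o in permutations(regions, 2):
        total = subj_score(s) + rel_score(s, o) + obj_score(o)
        if total > best:
            best, best_pair = total, (s, o)
    return best_pair

# Toy usage for "the black cat sitting under the table": the cat region
# should be picked as subject and the table region as object.
regions = ["cat", "table", "dog"]
pair = score_pairs(
    regions,
    subj_score=lambda r: 1.0 if r == "cat" else 0.0,
    rel_score=lambda s, o: 0.5 if (s, o) == ("cat", "table") else 0.0,
    obj_score=lambda r: 1.0 if r == "table" else 0.0,
)
```

The exhaustive loop over ordered region pairs is only for clarity; the actual model scores all pairs jointly with neural modules.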


Referring Relationships

An iterative model is introduced that localizes the two entities in a referring relationship by modeling the predicate connecting them as a shift in attention from one entity to the other. The model not only outperforms existing approaches on three datasets but also produces visually meaningful predicate shifts, making it an instance of an interpretable neural network.
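
The "predicate as attention shift" idea can be illustrated very simply: an attention map over a spatial grid is moved by a per-predicate displacement. This sketch replaces the paper's learned shift with a plain integer offset; the function name and setup are invented for illustration.

```python
import numpy as np

def shift_attention(attn, offset):
    """Shift a 2-D attention map by (dy, dx), zero-filling the border.

    A stand-in for a learned, per-predicate shift: e.g. "under" might
    move attention downward from the subject toward the object.
    """
    shifted = np.zeros_like(attn)
    dy, dx = offset
    h, w = attn.shape
    src = attn[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = src
    return shifted

# Toy usage: attention on the top-left cell, shifted one row down.
attn = np.zeros((3, 3))
attn[0, 0] = 1.0
moved = shift_attention(attn, (1, 0))
```

In the actual model the shift is realized by a learned convolution per predicate rather than a fixed offset.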

Relationship-Embedded Representation Learning for Grounding Referring Expressions

A Cross-Modal Relationship Extractor (CMRE) is proposed to adaptively highlight objects and relationships related to the given expression with a cross-modal attention mechanism, representing the extracted information as a language-guided visual relation graph, together with a Gated Graph Convolutional Network that computes multimodal semantic contexts.

Using Syntax to Ground Referring Expressions in Natural Images

GroundNet, a neural network for referring expression recognition (the task of localizing in an image the object referred to by a natural language expression), is introduced; it is the first model to rely on a syntactic analysis of the input referring expression to inform the structure of its computation graph.

Cross-Modal Relationship Inference for Grounding Referring Expressions

A Cross-Modal Relationship Extractor (CMRE) is proposed to adaptively highlight objects and relationships connected to a given expression via a cross-modal attention mechanism, representing the extracted information as a language-guided visual relation graph.

Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension

A new dataset for visual reasoning in the context of referring expression comprehension is presented, with two main features, along with a novel expression engine that renders various reasoning logics which can be flexibly combined with rich visual properties to generate expressions of varying compositionality.

Knowledge-guided Pairwise Reconstruction Network for Weakly Supervised Referring Expression Grounding

A knowledge-guided pairwise reconstruction network (KPRN) is proposed, which models the relationship between the target entity (subject) and a contextual entity (object) and grounds both entities.

Graph-Structured Referring Expression Reasoning in the Wild

A scene graph guided modular network (SGMN) performs reasoning over a semantic graph and a scene graph with neural modules under the guidance of the expression's linguistic structure, and significantly outperforms existing state-of-the-art algorithms on the new Ref-Reasoning dataset.

Grounding Referring Expressions in Images by Variational Context

A variational Bayesian method is proposed to solve the problem of complex context modeling in referring expression grounding by exploiting the reciprocal relation between the referent and its context, whereby the search space of contexts can be greatly reduced.

Propagating Over Phrase Relations for One-Stage Visual Grounding

A linguistic-structure-guided propagation network for one-stage phrase grounding is proposed that explicitly exploits the linguistic structure of the sentence and performs relational propagation among noun phrases under the guidance of the linguistic relations between them.

MAttNet: Modular Attention Network for Referring Expression Comprehension

This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, which allows the model to flexibly adapt to expressions containing different types of information in an end-to-end framework.
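
MAttNet's final score is, at heart, a language-weighted combination of the three module scores. A minimal sketch of that fusion step, with invented names and hand-set weights standing in for the learned, softmax-normalized module weights:

```python
def matt_score(weights, module_scores):
    """Combine per-module matching scores with expression-derived weights.

    weights, module_scores: dicts keyed by 'subject' / 'location' /
    'relation'. In the real model the weights come from a softmax over
    the expression, so they sum to one.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-6
    return sum(weights[k] * module_scores[k] for k in weights)

# Toy usage: an expression mostly about appearance weights the subject
# module heavily.
score = matt_score(
    {"subject": 0.6, "location": 0.3, "relation": 0.1},
    {"subject": 1.0, "location": 0.0, "relation": 0.5},
)
```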

VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases

This work introduces the problem of visual verification of relation phrases and develops a Visual Knowledge Extraction system called VisKE, which has been used not only to enrich existing textual knowledge bases by improving their recall, but also to augment open-domain question-answer reasoning.

Modeling Context in Referring Expressions

This work focuses on incorporating better measures of visual context into referring expression models and finds that visual comparison to other objects within an image helps improve performance significantly.

Jointly Learning to Parse and Perceive: Connecting Natural Language to the Physical World

This paper introduces Logical Semantics with Perception (LSP), a model for grounded language acquisition that learns to map natural language statements to their referents in a physical environment, and finds that LSP outperforms existing, less expressive models that cannot represent relational language.

Modeling Context Between Objects for Referring Expression Understanding

A technique that integrates context between objects to understand referring expressions is proposed; it uses an LSTM to learn the probability of a referring expression given input features from a region and a context region.

Segmentation from Natural Language Expressions

An end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information is proposed; it produces quality segmentation output from natural language expressions and outperforms baseline methods by a large margin.

Grounding of Textual Phrases in Images by Reconstruction

A novel approach is presented that learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly; its effectiveness is demonstrated on the Flickr30k Entities and ReferItGame datasets.

Visual Relationship Detection with Language Priors

This work proposes a model that can scale to predict thousands of types of relationships from a few examples and improves on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship.

Neural Module Networks

A procedure for constructing and learning neural module networks is described, which composes collections of jointly trained neural "modules" into deep networks for question answering and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.).
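
The core composition step of a neural module network can be illustrated without any learning: a parser-derived layout names which modules to chain together. This sketch is purely illustrative; the layout, module names, and toy "modules" (plain functions instead of neural networks) are all invented.

```python
def run_layout(layout, modules, image):
    """Apply the modules named in `layout`, left to right, to `image`.

    In an actual NMN each module is a small neural network and the
    layout comes from parsing the question; here both are toys.
    """
    out = image
    for name in layout:
        out = modules[name](out)
    return out

# Toy usage for a question like "how many red things are there?":
# the "image" is a list of color labels, find[red] attends to red
# items, and count reduces the attended set to a number.
modules = {
    "find[red]": lambda xs: [x for x in xs if x == "red"],
    "count": len,
}
answer = run_layout(["find[red]", "count"], modules, ["red", "blue", "red"])
```

The point of the design is reuse: the same `find` and `count` modules can be recombined into new layouts for new questions.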

Visual7W: Grounded Question Answering in Images

A semantic link between textual descriptions and image regions via object-level grounding enables a new type of QA with visual answers, in addition to the textual answers used in previous work; a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.

Natural Language Object Retrieval

Experimental results demonstrate that the SCRC model effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large scale vision and language datasets for knowledge transfer.