Modeling Relationships in Referential Expressions with Compositional Modular Networks

Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, Kate Saenko. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a black cat entity and its relationship with another table entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of…
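
The abstract's idea of grounding an expression through a (subject, relationship, object) decomposition can be sketched as a pairwise scoring procedure. This is a minimal, hypothetical illustration: `subj_score`, `rel_score`, and `obj_score` stand in for the paper's learned modules, and all names are invented.

```python
from itertools import permutations

def score_pairs(regions, subj_score, rel_score, obj_score):
    """Return the best (subject_region, object_region) pair.

    subj_score / obj_score: region -> float (how well the region matches
    the subject / object phrase); rel_score: (region, region) -> float.
    All three scoring functions here stand in for learned modules.
    """
    best, best_pair = float("-inf"), None
    for s, o in permutations(regions, 2):
        total = subj_score(s) + rel_score(s, o) + obj_score(o)
        if total > best:
            best, best_pair = total, (s, o)
    return best_pair

# Toy usage for "the black cat sitting under the table": the cat region
# should be picked as subject and the table region as object.
regions = ["cat", "table", "dog"]
pair = score_pairs(
    regions,
    subj_score=lambda r: 1.0 if r == "cat" else 0.0,
    rel_score=lambda s, o: 0.5 if (s, o) == ("cat", "table") else 0.0,
    obj_score=lambda r: 1.0 if r == "table" else 0.0,
)
```

The exhaustive loop over ordered region pairs is only for clarity; the actual model scores all pairs jointly with neural modules.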


Referring Relationships

An iterative model is introduced that localizes the two entities in a referring relationship by modeling the predicate connecting them as a shift in attention from one entity to the other. The model not only outperforms existing approaches on three datasets but also produces visually meaningful predicate shifts, making it an instance of an interpretable neural network.
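
The "predicate as attention shift" idea can be illustrated very simply: an attention map over a spatial grid is moved by a per-predicate displacement. This sketch replaces the paper's learned shift with a plain integer offset; the function name and setup are invented for illustration.

```python
import numpy as np

def shift_attention(attn, offset):
    """Shift a 2-D attention map by (dy, dx), zero-filling the border.

    A stand-in for a learned, per-predicate shift: e.g. "under" might
    move attention downward from the subject toward the object.
    """
    shifted = np.zeros_like(attn)
    dy, dx = offset
    h, w = attn.shape
    src = attn[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = src
    return shifted

# Toy usage: attention on the top-left cell, shifted one row down.
attn = np.zeros((3, 3))
attn[0, 0] = 1.0
moved = shift_attention(attn, (1, 0))
```

In the actual model the shift is realized by a learned convolution per predicate rather than a fixed offset.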

Relationship-Embedded Representation Learning for Grounding Referring Expressions

A Cross-Modal Relationship Extractor (CMRE) is proposed to adaptively highlight objects and relationships related to the given expression with a cross-modal attention mechanism, representing the extracted information as a language-guided visual relation graph, together with a Gated Graph Convolutional Network that computes multimodal semantic contexts.

Using Syntax to Ground Referring Expressions in Natural Images

GroundNet, a neural network for referring expression recognition (the task of localizing in an image the object referred to by a natural language expression), is introduced; it is the first model to rely on a syntactic analysis of the input referring expression to inform the structure of its computation graph.

Cross-Modal Relationship Inference for Grounding Referring Expressions

A Cross-Modal Relationship Extractor (CMRE) is proposed to adaptively highlight objects and relationships connected to a given expression via a cross-modal attention mechanism, representing the extracted information as a language-guided visual relation graph.

Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension

A new dataset for visual reasoning in the context of referring expression comprehension is presented, with two main features, along with a novel expression engine that renders various reasoning logics which can be flexibly combined with rich visual properties to generate expressions of varying compositionality.

Knowledge-guided Pairwise Reconstruction Network for Weakly Supervised Referring Expression Grounding

A knowledge-guided pairwise reconstruction network (KPRN) is proposed, which models the relationship between the target entity (subject) and a contextual entity (object) and grounds both entities.

Graph-Structured Referring Expression Reasoning in the Wild

A scene graph guided modular network (SGMN) performs reasoning over a semantic graph and a scene graph with neural modules under the guidance of the expression's linguistic structure, and significantly outperforms existing state-of-the-art algorithms on the new Ref-Reasoning dataset.

Grounding Referring Expressions in Images by Variational Context

A variational Bayesian method is proposed to solve the problem of complex context modeling in referring expression grounding by exploiting the reciprocal relation between the referent and its context, whereby the search space of contexts can be greatly reduced.

Propagating Over Phrase Relations for One-Stage Visual Grounding

A linguistic-structure-guided propagation network for one-stage phrase grounding is proposed that explicitly exploits the linguistic structure of the sentence and performs relational propagation among noun phrases under the guidance of the linguistic relations between them.

MAttNet: Modular Attention Network for Referring Expression Comprehension

This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, which allows the model to flexibly adapt to expressions containing different types of information in an end-to-end framework.
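
MAttNet's final score is, at heart, a language-weighted combination of the three module scores. A minimal sketch of that fusion step, with invented names and hand-set weights standing in for the learned, softmax-normalized module weights:

```python
def matt_score(weights, module_scores):
    """Combine per-module matching scores with expression-derived weights.

    weights, module_scores: dicts keyed by 'subject' / 'location' /
    'relation'. In the real model the weights come from a softmax over
    the expression, so they sum to one.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-6
    return sum(weights[k] * module_scores[k] for k in weights)

# Toy usage: an expression mostly about appearance weights the subject
# module heavily.
score = matt_score(
    {"subject": 0.6, "location": 0.3, "relation": 0.1},
    {"subject": 1.0, "location": 0.0, "relation": 0.5},
)
```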

VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases

This work introduces the problem of visual verification of relation phrases and develops a Visual Knowledge Extraction system called VisKE, which has been used not only to enrich existing textual knowledge bases by improving their recall, but also to augment open-domain question-answer reasoning.

Modeling Context in Referring Expressions

This work focuses on incorporating better measures of visual context into referring expression models and finds that visual comparison to other objects within an image helps improve performance significantly.

Jointly Learning to Parse and Perceive: Connecting Natural Language to the Physical World

This paper introduces Logical Semantics with Perception (LSP), a model for grounded language acquisition that learns to map natural language statements to their referents in a physical environment, and finds that LSP outperforms existing, less expressive models that cannot represent relational language.

Modeling Context Between Objects for Referring Expression Understanding

A technique that integrates context between objects to understand referring expressions is proposed; it uses an LSTM to learn the probability of a referring expression given input features from a region and a context region.

Segmentation from Natural Language Expressions

An end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information is proposed; it produces quality segmentation output from natural language expressions and outperforms baseline methods by a large margin.

Grounding of Textual Phrases in Images by Reconstruction

A novel approach is presented that learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly; its effectiveness is demonstrated on the Flickr30k Entities and ReferItGame datasets.

Visual Relationship Detection with Language Priors

This work proposes a model that can scale to predict thousands of types of relationships from a few examples and improves on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship.

Neural Module Networks

A procedure for constructing and learning neural module networks is described, which composes collections of jointly trained neural "modules" into deep networks for question answering and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.).
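
The core composition step of a neural module network can be illustrated without any learning: a parser-derived layout names which modules to chain together. This sketch is purely illustrative; the layout, module names, and toy "modules" (plain functions instead of neural networks) are all invented.

```python
def run_layout(layout, modules, image):
    """Apply the modules named in `layout`, left to right, to `image`.

    In an actual NMN each module is a small neural network and the
    layout comes from parsing the question; here both are toys.
    """
    out = image
    for name in layout:
        out = modules[name](out)
    return out

# Toy usage for a question like "how many red things are there?":
# the "image" is a list of color labels, find[red] attends to red
# items, and count reduces the attended set to a number.
modules = {
    "find[red]": lambda xs: [x for x in xs if x == "red"],
    "count": len,
}
answer = run_layout(["find[red]", "count"], modules, ["red", "blue", "red"])
```

The point of the design is reuse: the same `find` and `count` modules can be recombined into new layouts for new questions.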

Visual7W: Grounded Question Answering in Images

A semantic link between textual descriptions and image regions via object-level grounding enables a new type of QA with visual answers, in addition to the textual answers used in previous work; a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.

Natural Language Object Retrieval

Experimental results demonstrate that the SCRC model effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large scale vision and language datasets for knowledge transfer.