Scene Graph Generation by Iterative Message Passing

@article{Xu2017SceneGraph,
  title={Scene Graph Generation by Iterative Message Passing},
  author={Danfei Xu and Yuke Zhu and Christopher Bongsoo Choy and Li Fei-Fei},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017}
}

  • Danfei Xu, Yuke Zhu, Christopher Bongsoo Choy, Li Fei-Fei
  • Published 10 January 2017
  • Computer Science
  • 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Understanding a visual scene goes beyond recognizing individual objects in isolation. […] Key Method: Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. Experiments show that our model significantly outperforms previous methods on the Visual Genome dataset, as well as on support relation inference in the NYU Depth v2 dataset.
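The iterative message passing named in the title can be pictured as alternating updates between node (object) states and edge (relationship) states. The snippet below is a deliberately simplified sketch of that idea, not the paper's architecture: a tanh update stands in for the GRU cells the paper uses, and all function and variable names are illustrative.

```python
import numpy as np

def message_passing(node_feats, edge_feats, edges, n_iters=3):
    """Toy dual message passing between nodes and edges.

    node_feats: (N, D) object states; edge_feats: (E, D) relationship
    states; edges: list of (src, dst) node-index pairs.
    """
    h_node = node_feats.copy()
    h_edge = edge_feats.copy()
    for _ in range(n_iters):
        # Edge update: each edge pools the states of its two endpoints.
        new_edge = np.tanh(h_edge + 0.5 * np.stack(
            [h_node[s] + h_node[d] for s, d in edges]))
        # Node update: each node pools the states of its incident edges.
        new_node = h_node.copy()
        for i in range(len(h_node)):
            incident = [h_edge[k] for k, (s, d) in enumerate(edges)
                        if i in (s, d)]
            if incident:
                new_node[i] = np.tanh(h_node[i] + 0.5 * np.mean(incident, axis=0))
        h_node, h_edge = new_node, new_edge
    return h_node, h_edge
```

The key design point carried over from the paper is the bipartite structure: object predictions are refined by relationship context and vice versa, rather than each being predicted in isolation.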


Iterative Scene Graph Generation with Generative Transformers

This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction, outperforming state-of-the-art SGG approaches while offering competitive performance with unbiased SGG approaches.

Visual Graphs from Motion (VGfM): Scene understanding with object geometry reasoning

This paper proposes a system that first computes the geometrical location of objects in a generic scene and then efficiently constructs scene graphs from video by embedding such geometric reasoning in a new model where geometric and visual features are merged using an RNN framework.

Exploring and Exploiting the Hierarchical Structure of a Scene for Scene Graph Generation

A novel neural network model is used to construct a hierarchical structure whose leaf nodes correspond to objects depicted in the image, and a message is passed along the estimated structure on the fly to maintain global consistency.

Scenes and Surroundings: Scene Graph Generation using Relation Transformer

A novel local-context-aware relation transformer architecture is proposed that also exploits complex global object-to-object and object-to-edge interactions, efficiently capturing dependencies between objects and predicting contextual relationships.

Scene Graph Generation Based on Node-Relation Context Module

A node-relation context module for scene graph generation that uses GRU hidden states of the nodes and the edges to guide the attention of subject and object regions and is competitive with the current methods on Visual Genome dataset.

Scene Graph Generation by Belief RNNs

A novel deep structure-prediction module, Belief RNNs, is introduced that performs learning on large graphs in an efficient and generic way, yielding an end-to-end model that generates a scene graph from a given image.

Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing

A framework for jointly grounding objects that follow certain semantic relationship constraints given in a scene graph, referred to as Visio-Lingual Message Passing Graph Neural Network (VL-MPAG Net), which significantly outperforms the baselines on four public datasets.

Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation

A subgraph-based connection graph is proposed to concisely represent the scene graph during inference, improving the efficiency of scene graph generation; the method outperforms the state-of-the-art in both accuracy and speed.

Unconditional Scene Graph Generation

This work develops a deep auto-regressive model called SceneGraphGen which can directly learn the probability distribution over labelled and directed graphs using a hierarchical recurrent architecture and demonstrates the application of the generated graphs in image synthesis, anomaly detection and scene graph completion.

LinkNet: Relational Embedding for Scene Graph

This paper designs a simple and effective relational embedding module that enables the model to jointly represent connections among all related objects, rather than focus on an object in isolation, and proves its efficacy in scene graph generation.

Characterizing structural relationships in scenes using graph kernels

This paper shows how to represent scenes as graphs that encode models and their semantic relationships, and shows that incorporating structural relationships allows the method to provide a more relevant set of results when compared against previous approaches to model context search.

Semantic Object Parsing with Graph LSTM

The Graph Long Short-Term Memory network is proposed, which is the generalization of LSTM from sequential data or multi-dimensional data to general graph-structured data.
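The generalization described above replaces the single-predecessor recurrence of a chain LSTM with an update that consumes the pooled hidden states of a node's graph neighbors. The following is a minimal sketch of that idea under simplifying assumptions (mean pooling, shared gate weights, illustrative names), not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_lstm_step(x, h, c, neighbors, W):
    """One toy Graph-LSTM update over all nodes.

    x: (N, D) node inputs; h, c: (N, D) hidden and cell states;
    neighbors: adjacency lists; W: dict of (2D, D) gate weights
    for the input/forget/output/candidate gates ('i', 'f', 'o', 'g').
    """
    h_new, c_new = h.copy(), c.copy()
    for v in range(len(x)):
        # Pool neighbor hidden states (the graph generalization of the
        # single previous hidden state in a sequential LSTM).
        h_nb = (np.mean(h[neighbors[v]], axis=0)
                if neighbors[v] else np.zeros_like(h[v]))
        z = np.concatenate([x[v], h_nb])
        i = sigmoid(z @ W['i'])          # input gate
        f = sigmoid(z @ W['f'])          # forget gate
        o = sigmoid(z @ W['o'])          # output gate
        g = np.tanh(z @ W['g'])          # candidate cell state
        c_new[v] = f * c[v] + i * g
        h_new[v] = o * np.tanh(c_new[v])
    return h_new, c_new
```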

Image retrieval using scene graphs

A conditional random field model that reasons about possible groundings of scene graphs to test images and shows that the full model can be used to improve object localization compared to baseline methods and outperforms retrieval methods that use only objects or low-level image features.

Graph-Structured Representations for Visual Question Answering

This paper proposes to build graphs over the scene objects and over the question words, and describes a deep neural network that exploits the structure in these representations, and achieves significant improvements over the state-of-the-art.

Learning Spatial Knowledge for Text to 3D Scene Generation

The main innovation of this work is to show how to augment explicit constraints with learned spatial knowledge to infer missing objects and likely layouts for the objects in the scene.

Indoor Segmentation and Support Inference from RGBD Images

The goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships, to better understand how 3D cues can best inform a structured 3D interpretation.

3D-Based Reasoning with Blocks, Support, and Stability

This work proposes a new approach for parsing RGB-D images using 3D block units for volumetric reasoning, and incorporates the intuition that a good 3D representation of the scene is the one that fits the data well, and is a stable, self-supporting arrangement of objects.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs.

Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

This paper considers fully connected CRF models defined on the complete set of pixels in an image and proposes a highly efficient approximate inference algorithm in which the pairwise edge potentials are defined by a linear combination of Gaussian kernels.
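The core of that inference algorithm is a mean-field update in which the expensive sum over all pixel pairs reduces to Gaussian filtering of the current marginals. The sketch below shows one such update for a single spatial kernel; it follows the structure of the algorithm but uses a dense O(N²) kernel matrix for clarity, whereas a real implementation would use permutohedral-lattice filtering. All names are illustrative.

```python
import numpy as np

def meanfield_step(unary, positions, compat, sigma):
    """One mean-field update for a fully connected CRF, single Gaussian kernel.

    unary: (N, L) negative log unary potentials; positions: (N, d)
    feature coordinates; compat: (L, L) label compatibility mu(l, l');
    sigma: Gaussian kernel bandwidth.
    """
    q = np.exp(-unary)
    q /= q.sum(axis=1, keepdims=True)        # current marginals Q
    # Gaussian edge potentials k(f_i, f_j) over all pairs (dense, for clarity).
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    k = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(k, 0.0)                 # exclude self-interaction
    msg = k @ q                              # message passing = Gaussian filtering of Q
    q_new = np.exp(-unary - msg @ compat.T)  # compatibility transform + unary
    return q_new / q_new.sum(axis=1, keepdims=True)
```

With a Potts compatibility (`compat = 1 - np.eye(L)`), each update pushes nearby points toward agreeing labels while respecting the unary evidence; iterating the step a few times typically converges.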