Learning to Compose Dynamic Tree Structures for Visual Contexts

@article{Tang2019Learning,
  title={Learning to Compose Dynamic Tree Structures for Visual Contexts},
  author={Kaihua Tang and Hanwang Zhang and Baoyuan Wu and Wenhan Luo and W. Liu},
  journal={2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
}

  • Published 5 December 2018
  • Computer Science
We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A. Our visual context tree model, dubbed VCTree, has two key advantages over existing structured object representations including chains and fully-connected graphs: 1) The efficient and expressive binary tree encodes the inherent parallel/hierarchical relationships among objects, e.g., "clothes" and "pants" are…
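The abstract describes composing a binary context tree over the objects in an image. As a hedged illustration of the general idea (not the paper's exact algorithm; all function and class names here are hypothetical), one can build a maximum spanning tree from a pairwise object-relatedness score matrix and then binarize it with the left-child/right-sibling convention:

```python
from dataclasses import dataclass

@dataclass
class Node:
    idx: int
    left: "Node | None" = None   # first child in the binarized tree
    right: "Node | None" = None  # next sibling in the binarized tree

def max_spanning_tree(scores):
    """Prim-style maximum spanning tree over a symmetric score matrix.
    Returns parent[i] for each node (the root has parent -1)."""
    n = len(scores)
    in_tree = [False] * n
    parent = [-1] * n
    in_tree[0] = True  # arbitrary choice of root
    for _ in range(n - 1):
        best, bi, bj = float("-inf"), -1, -1
        for i in range(n):
            if not in_tree[i]:
                continue
            for j in range(n):
                if in_tree[j] or scores[i][j] <= best:
                    continue
                best, bi, bj = scores[i][j], i, j
        parent[bj] = bi
        in_tree[bj] = True
    return parent

def binarize(parent):
    """Left-child / right-sibling binarization of the multi-branch tree."""
    n = len(parent)
    nodes = [Node(i) for i in range(n)]
    children = [[] for _ in range(n)]
    root = None
    for i, p in enumerate(parent):
        if p == -1:
            root = nodes[i]
        else:
            children[p].append(nodes[i])
    for p in range(n):
        for a, b in zip(children[p], children[p][1:]):
            a.right = b  # chain siblings on the right pointer
        if children[p]:
            nodes[p].left = children[p][0]
    return root
```

In the actual model the pairwise scores would be learned from visual features and supervised by the downstream task; here they are just a given matrix.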

Dynamic Scene Graph Generation via Anticipatory Pre-training

A novel anticipatory pre-training paradigm based on Transformers is proposed to explicitly model the temporal correlation of visual relationships across frames and thereby improve dynamic scene graph generation.

RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition

It is shown that modeling an effective message-passing flow through an attention mechanism can be critical to tackling the compositionality and long-tail challenges in VRR.
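The summary above points to an attention-driven message-passing flow. A minimal generic sketch of such a flow — scaled dot-product self-attention with a residual aggregation step; this is only an illustration of the mechanism, not the RelTransformer architecture itself:

```python
import numpy as np

def attention_message_passing(feats, n_rounds=2):
    """Each node aggregates messages from all nodes, weighted by
    scaled dot-product attention, with a residual connection."""
    d = feats.shape[1]
    x = feats
    for _ in range(n_rounds):
        scores = x @ x.T / np.sqrt(d)          # pairwise compatibility
        scores -= scores.max(axis=1, keepdims=True)
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)  # rows sum to 1
        x = x + attn @ x                         # residual message aggregation
    return x
```

A real architecture would add learned query/key/value projections, multiple heads, and normalization; the point here is just that the message weights are computed by attention rather than fixed by a graph.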

One-shot Scene Graph Generation

This paper designs a task named One-Shot Scene Graph Generation, where each relationship triplet comes from only one labeled example, and proposes Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task.

Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation

This work argues that a desirable scene graph should also be hierarchically constructed, introduces a new scheme for modeling scene graphs, and devises a Relation Ranking Module (RRM) that dynamically adjusts relationship rankings by learning to capture humans' subjective perceptual habits from objective entity saliency and size.

Exploring and Exploiting the Hierarchical Structure of a Scene for Scene Graph Generation

A novel neural network model is used to construct a hierarchical structure whose leaf nodes correspond to objects depicted in the image, and a message is passed along the estimated structure on the fly to maintain global consistency.

Context-aware Scene Graph Generation with Seq2Seq Transformers

This work proposes an encoder-decoder model built using Transformers where the encoder captures global context and long range interactions, and introduces a novel reinforcement learning-based training strategy tailored to Seq2Seq scene graph generation.

Joint Modeling of Visual Objects and Relations for Scene Graph Generation

This paper establishes a unified conditional random field (CRF) to model the joint distribution of all the objects and their relations in a scene graph, and proposes an efficient and effective inference algorithm based on mean-field variational inference.
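Mean-field variational inference for a pairwise CRF is a standard routine: each node's belief is repeatedly updated from its unary potential plus messages from its neighbors' current beliefs. A small generic sketch (the potentials here are illustrative, not the paper's exact model):

```python
import numpy as np

def mean_field(unary, pairwise, edges, n_iters=10):
    """Mean-field updates for a pairwise CRF:
    q_i(x) ∝ exp( u_i(x) + Σ_{j∈N(i)} Σ_y P[x,y] q_j(y) )."""
    n, k = unary.shape
    q = np.exp(unary - unary.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)          # initialize from unaries
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    for _ in range(n_iters):
        new_q = np.copy(unary)
        for i in range(n):
            for j in neighbors[i]:
                new_q[i] += pairwise @ q[j]    # expected pairwise potential
        new_q -= new_q.max(axis=1, keepdims=True)
        q = np.exp(new_q)
        q /= q.sum(axis=1, keepdims=True)      # renormalize beliefs
    return q
```

With an agreement-favoring pairwise potential, a confident node pulls its uncertain neighbor toward the same label, which is the qualitative behavior such joint models rely on.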

Learning to Generate Scene Graph from Natural Language Supervision

This paper proposes one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph, and designs a Transformer-based model to predict these "pseudo" labels via a masked token prediction task.

Knowledge-Based Scene Graph Generation with Visual Contextual Dependency

A novel knowledge-based model with adjustable visual contextual dependency is proposed; it obtains better global and contextual information for predicting object relationships, and the visual dependencies can be adjusted through two loss functions.

Counterfactual Critic Multi-Agent Training for Scene Graph Generation

CMAT is a multi-agent policy gradient method that frames objects as cooperative agents, directly maximizes a graph-level metric as the reward, and uses a counterfactual baseline that disentangles each agent's contribution by fixing the predictions of the other agents.
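The counterfactual baseline resembles COMA's: an agent's realized reward is compared against the policy-weighted reward obtained by swapping only that agent's action while the others stay fixed. A hedged sketch (function and parameter names are hypothetical):

```python
def counterfactual_advantage(reward_fn, actions, policies, agent):
    """COMA-style advantage for one agent.

    reward_fn: maps a joint action list to a scalar graph-level reward.
    actions:   the joint action actually taken.
    policies:  per-agent action probabilities, policies[a][k] = π_a(k).
    agent:     index of the agent being credited.
    """
    actual = reward_fn(actions)
    baseline = 0.0
    for alt, prob in enumerate(policies[agent]):
        counterfactual = list(actions)
        counterfactual[agent] = alt          # swap only this agent's action
        baseline += prob * reward_fn(counterfactual)
    return actual - baseline
```

Because the baseline marginalizes out only one agent's action, it isolates that agent's marginal contribution to the shared graph-level reward without biasing the gradient.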

Scene Graph Generation from Objects, Phrases and Caption Regions

This work proposes a novel neural network model, termed Multi-level Scene Description Network (MSDN), to solve the three vision tasks jointly in an end-to-end manner, and shows that joint learning across the three tasks brings mutual improvements over previous models.

Neural Motifs: Scene Graph Parsing with Global Context

This work analyzes the role of motifs, regularly appearing substructures in scene graphs, and introduces Stacked Motif Networks, a new architecture designed to capture higher-order motifs in scene graphs that improves on the previous state of the art by an average of 3.6% relative improvement across evaluation settings.

Scene Graph Generation by Iterative Message Passing

This work explicitly models objects and their relationships using scene graphs, a visually grounded graphical structure of an image, and proposes a novel end-to-end model that generates such a structured scene representation from an input image.

Graph-Structured Representations for Visual Question Answering

This paper proposes to build graphs over the scene objects and over the question words, and describes a deep neural network that exploits the structure in these representations, and achieves significant improvements over the state-of-the-art.

Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation

A subgraph-based connection graph is proposed to concisely represent the scene graph during inference, improving the efficiency of scene graph generation; the method outperforms the state-of-the-art method in both accuracy and speed.

Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships

This work presents the Structure Inference Network (SIN), a detector that incorporates a graphical model into a typical detection framework to infer object states; comprehensive experiments indicate that scene context and object relationships genuinely improve object detection performance, with more desirable and reasonable outputs.

Auto-Encoding Graphical Inductive Bias for Descriptive Image Captioning

This work proposes the Scene Graph Auto-Encoder (SGAE), which incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions, and validates the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark.

Context-Aware Visual Policy Network for Sequence-Level Image Captioning

A Context-Aware Visual Policy network (CAVP) for sequence-level image captioning is proposed that explicitly accounts for the previous visual attentions as context, and then decides whether that context is helpful for generating the current word given the current visual attention.

Iterative Visual Reasoning Beyond Convolutions

Analysis shows that the framework is resilient to missing regions during reasoning and delivers strong performance over plain ConvNets, e.g., achieving an 8.4% absolute improvement on ADE measured by per-class average precision.

Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features

Extensive experiments on two visual relationship benchmarks show that by using the novel Shuffle-Then-Assemble pre-trained features, naive relationship models can be consistently improved and even outperform other state-of-the-art relationship models.