Scene Graph Generation from Objects, Phrases and Region Captions

@article{Li2017SceneGG,
  title={Scene Graph Generation from Objects, Phrases and Region Captions},
  author={Yikang Li and Wanli Ouyang and Bolei Zhou and Kun Wang and Xiaogang Wang},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1270-1279}
}
Object detection, scene graph generation and region captioning, which are three scene understanding tasks at different semantic levels, are tied together: scene graphs are generated on top of objects detected in an image with their pairwise relationships predicted, while region captioning gives a language description of the objects, their attributes, relations and other context information. Key Method: Object, phrase, and caption regions are first aligned with a dynamic graph based on their spatial and…
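A minimal sketch of the spatial part of that dynamic alignment graph, assuming (x1, y1, x2, y2) boxes and an illustrative IoU threshold (this is not the authors' code; the semantic connections mentioned in the abstract would be layered on top):

# Connect object, phrase and caption region proposals whenever their
# boxes overlap enough; edge format is (level_a, index_a, level_b, index_b).
from itertools import product


def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_alignment_graph(object_boxes, phrase_boxes, caption_boxes, thresh=0.5):
    """Link regions across the three semantic levels by spatial overlap."""
    levels = {"object": object_boxes, "phrase": phrase_boxes,
              "caption": caption_boxes}
    names = list(levels)
    edges = []
    for i, level_a in enumerate(names):
        for level_b in names[i + 1:]:
            for (ia, box_a), (ib, box_b) in product(enumerate(levels[level_a]),
                                                    enumerate(levels[level_b])):
                if box_iou(box_a, box_b) >= thresh:
                    edges.append((level_a, ia, level_b, ib))
    return edges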


Scene graph generation by multi-level semantic tasks
TLDR
A Multi-level Semantic Tasks Generation Network (MSTG) leverages mutual connections across object detection, visual relationship detection and image captioning to jointly solve the three vision tasks, improving their accuracy and achieving a more comprehensive and accurate understanding of the scene image.
Attentive Relational Networks for Mapping Images to Scene Graphs
TLDR
A novel Attentive Relational Network, consisting of two key modules on top of an object detection backbone, is proposed to approach this problem; accurate scene graphs are produced by its relation inference module, which recognizes all entities and their corresponding relations.
RelTR: Relation Transformer for Scene Graph Generation
TLDR
Inspired by DETR, which excels at object detection, RelTR is an end-to-end scene graph generation model with an encoder-decoder architecture that predicts a set of relationships directly from visual appearance alone, without combining entities and labeling all possible predicates.
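A rough PyTorch sketch of how such a DETR-style set predictor could look, based only on the summary above (not the released RelTR code); the dimensions, query count and class counts are illustrative assumptions:

import torch
import torch.nn as nn

class TripletDecoder(nn.Module):
    """Learned relationship queries decoded against image features and
    mapped to subject, predicate and object logits."""
    def __init__(self, d_model=256, n_queries=100,
                 n_obj_classes=151, n_pred_classes=51):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.queries = nn.Embedding(n_queries, d_model)  # one query per relationship slot
        self.subj_head = nn.Linear(d_model, n_obj_classes)
        self.pred_head = nn.Linear(d_model, n_pred_classes)
        self.obj_head = nn.Linear(d_model, n_obj_classes)

    def forward(self, image_features):
        # image_features: (batch, num_tokens, d_model) from a CNN/encoder
        b = image_features.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        h = self.decoder(q, image_features)
        return self.subj_head(h), self.pred_head(h), self.obj_head(h)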
Scenes and Surroundings: Scene Graph Generation using Relation Transformer
TLDR
A novel local-context-aware relation transformer architecture is proposed that also exploits complex global object-to-object and object-to-edge interactions, efficiently capturing dependencies between objects and predicting contextual relationships.
Linguistic Structures as Weak Supervision for Visual Scene Graph Generation
  • Keren Ye, Adriana Kovashka
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
TLDR
This work explores how linguistic structures in captions can benefit scene graph generation, presents extensive experimental comparisons against prior methods that leverage instance- and image-level supervision, and ablates the method to show the impact of phrasal and sequential context and of techniques that improve localization.
Relation Regularized Scene Graph Generation
TLDR
A relation regularized network (R2-Net) is proposed that predicts whether there is a relationship between two objects and encodes this relation into object feature refinement for better scene graph generation (SGG).
Structured Neural Motifs: Scene Graph Parsing via Enhanced Context
TLDR
This work proposes the Structured Motif Network (StrcMN), which predicts object labels and pairwise relationships by mining more complete global context features and significantly outperforms previous methods on the VRD and Visual Genome datasets.
LinkNet: Relational Embedding for Scene Graph
TLDR
This paper designs a simple and effective relational embedding module that enables the model to jointly represent connections among all related objects, rather than focusing on each object in isolation, and demonstrates its efficacy in scene graph generation.
Learning to Generate Scene Graph from Natural Language Supervision
TLDR
This paper proposes one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph, and designs a Transformer-based model to predict these "pseudo" labels via a masked token prediction task.
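A toy sketch of a masked token prediction objective over a sequence of object/relation "pseudo" label ids, mirroring the summary above; the encoder, prediction head, vocabulary and mask token id are placeholder assumptions, not the paper's model:

import torch
import torch.nn as nn

def masked_prediction_loss(encoder, head, tokens, mask_id, mask_prob=0.15):
    """tokens: (B, L) label ids; encoder maps (B, L) -> (B, L, D);
    head maps D -> vocabulary logits. Cross-entropy on masked slots only."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    corrupted = tokens.masked_fill(mask, mask_id)   # replace masked ids
    logits = head(encoder(corrupted))               # (B, L, V)
    return nn.functional.cross_entropy(logits[mask], tokens[mask])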
Visual Relationships as Functions: Enabling Few-Shot Scene Graph Prediction
TLDR
This work introduces the first scene graph prediction model that supports few-shot learning of predicates, enabling scene graph approaches to generalize to a set of new predicates.
...
...

References

SHOWING 1-10 OF 53 REFERENCES
Scene Graph Generation by Iterative Message Passing
TLDR
This work explicitly models objects and their relationships using scene graphs, a visually-grounded graphical structure of an image, and proposes a novel end-to-end model that generates such a structured scene representation from an input image.
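As a rough illustration of the iterative message passing idea (a simplified sketch under assumed GRU updates and mean-pooled messages, not the paper's exact formulation), node (object) and edge (relationship) states can repeatedly exchange pooled messages:

import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    def __init__(self, dim=512, steps=2):
        super().__init__()
        self.steps = steps
        self.node_gru = nn.GRUCell(dim, dim)
        self.edge_gru = nn.GRUCell(dim, dim)

    def forward(self, node_h, edge_h, edge_index):
        # node_h: (N, dim), edge_h: (E, dim)
        # edge_index: (E, 2) long tensor of (subject, object) node ids per edge
        for _ in range(self.steps):
            # message to each edge: mean of its two endpoint node states
            msg_to_edge = node_h[edge_index].mean(dim=1)          # (E, dim)
            edge_h = self.edge_gru(msg_to_edge, edge_h)
            # message to each node: mean of states of its incident edges
            msg_to_node = torch.zeros_like(node_h)
            count = torch.zeros(node_h.size(0), 1, device=node_h.device)
            ones = torch.ones(edge_h.size(0), 1, device=node_h.device)
            for k in range(2):
                msg_to_node.index_add_(0, edge_index[:, k], edge_h)
                count.index_add_(0, edge_index[:, k], ones)
            node_h = self.node_gru(msg_to_node / count.clamp(min=1), node_h)
        return node_h, edge_h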
Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection
TLDR
A deep Variation-structured Reinforcement Learning (VRL) framework is proposed to sequentially discover object relationships and attributes in the whole image, with an ambiguity-aware object mining scheme used to resolve semantic ambiguity among object categories that the object detector fails to distinguish.
Visual Translation Embedding Network for Visual Relation Detection
TLDR
This work proposes a novel feature extraction layer that enables object-relation knowledge transfer in a fully convolutional fashion, supporting training and inference in a single forward/backward pass, and presents the first end-to-end relation detection network.
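The translation embedding idea, following the TransE analogy that the title suggests, treats a triplet as plausible when subject + predicate ≈ object in a learned low-dimensional space. A minimal PyTorch sketch of that scoring rule, with assumed feature and embedding dimensions (not the paper's implementation):

import torch
import torch.nn as nn

class TranslationEmbedding(nn.Module):
    def __init__(self, feat_dim=1024, embed_dim=256, n_predicates=70):
        super().__init__()
        self.proj_s = nn.Linear(feat_dim, embed_dim)   # subject projection
        self.proj_o = nn.Linear(feat_dim, embed_dim)   # object projection
        self.predicates = nn.Embedding(n_predicates, embed_dim)

    def forward(self, subj_feat, obj_feat):
        # Score every predicate for each (subject, object) feature pair:
        # a smaller ||s + p - o|| means a more plausible triplet.
        s = self.proj_s(subj_feat).unsqueeze(1)        # (B, 1, D)
        o = self.proj_o(obj_feat).unsqueeze(1)         # (B, 1, D)
        p = self.predicates.weight.unsqueeze(0)        # (1, P, D)
        return -torch.norm(s + p - o, dim=-1)          # (B, P) scores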
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
TLDR
A Fully Convolutional Localization Network (FCLN) architecture is proposed that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization.
ViP-CNN: Visual Phrase Guided Convolutional Neural Network
TLDR
In ViP-CNN, a Phrase-guided Message Passing Structure (PMPS) is presented to establish connections among relationship components and help the model consider the three problems jointly; experimental results show that ViP-CNN outperforms the state-of-the-art method in both speed and accuracy.
Deep Visual-Semantic Alignments for Generating Image Descriptions
  • A. Karpathy, Li Fei-Fei
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2017
TLDR
A model that generates natural language descriptions of images and their regions is presented, based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.
Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues
TLDR
This paper presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues, which produces state-of-the-art performance on phrase localization on the Flickr30k Entities dataset and on visual relationship detection on the Stanford VRD dataset.
Detecting Visual Relationships with Deep Relational Networks
TLDR
The proposed Deep Relational Network is a novel formulation designed specifically for exploiting the statistical dependencies between objects and their relationships, and achieves substantial improvement over the state of the art on two large datasets.
PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN
TLDR
A Parallel, Pairwise Region-based, Fully Convolutional Network (PPR-FCN) for WSVRD uses a parallel FCN architecture that simultaneously performs pair selection and classification of single regions and region pairs for object and relation detection, while sharing almost all computation over the entire image.
Detecting Actions, Poses, and Objects with Relational Phraselets
TLDR
A novel approach to modeling human pose, together with interacting objects, based on compositional models of local visual interactions and their relations is presented, demonstrating that modeling occlusion is crucial for recognizing human-object interactions.
...
...