Scene Graph Generation from Objects, Phrases and Region Captions
@inproceedings{Li2017SceneGG,
  title={Scene Graph Generation from Objects, Phrases and Region Captions},
  author={Yikang Li and Wanli Ouyang and Bolei Zhou and Kun Wang and Xiaogang Wang},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1270-1279}
}
Object detection, scene graph generation and region captioning, which are three scene understanding tasks at different semantic levels, are tied together: scene graphs are generated on top of objects detected in an image with their pairwise relationships predicted, while region captioning gives a language description of the objects, their attributes, relations and other context information.
Key Method: Object, phrase, and caption regions are first aligned with a dynamic graph based on their spatial and…
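As a concrete illustration of the structure described above (a sketch, not code from the paper; all names are hypothetical), a scene graph can be represented as a set of detected object nodes plus subject-predicate-object triplets, from which caption-like phrases fall out directly:

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Minimal scene-graph container: object nodes plus relationship triplets."""
    objects: list = field(default_factory=list)   # detected object labels, e.g. ["man", "horse"]
    triplets: list = field(default_factory=list)  # (subject_idx, predicate, object_idx)

    def add_object(self, label: str) -> int:
        """Register a detected object; return its node index."""
        self.objects.append(label)
        return len(self.objects) - 1

    def relate(self, subj: int, predicate: str, obj: int) -> None:
        """Record a pairwise relationship between two object nodes."""
        self.triplets.append((subj, predicate, obj))

    def phrases(self) -> list:
        """Render each triplet as a short phrase, the raw material for region captions."""
        return [f"{self.objects[s]} {p} {self.objects[o]}" for s, p, o in self.triplets]

# Build a tiny graph for an image of a man riding a horse while wearing a hat.
g = SceneGraph()
man, horse, hat = g.add_object("man"), g.add_object("horse"), g.add_object("hat")
g.relate(man, "riding", horse)
g.relate(man, "wearing", hat)
print(g.phrases())  # ['man riding horse', 'man wearing hat']
```

This mirrors the layering the abstract describes: objects are the nodes, pairwise relationships are the edges, and phrases over those edges are what a region captioner elaborates into full descriptions.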
307 Citations
Scene graph generation by multi-level semantic tasks
- Computer Science, Appl. Intell.
- 2021
A Multi-level Semantic Tasks Generation Network (MSTG) leverages mutual connections across object detection, visual relationship detection and image captioning to solve the three vision tasks jointly, improving their accuracy and achieving a more comprehensive and accurate understanding of the scene image.
Attentive Relational Networks for Mapping Images to Scene Graphs
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
A novel Attentive Relational Network, consisting of two key modules on top of an object detection backbone, is proposed to approach this problem; accurate scene graphs are produced by its relation inference module, which recognizes all entities and their corresponding relations.
RelTR: Relation Transformer for Scene Graph Generation
- Computer Science, ArXiv
- 2022
Inspired by DETR, which excels in object detection, an end-to-end scene graph generation model, RelTR, is proposed; it has an encoder-decoder architecture and predicts a set of relationships directly, using only visual appearance, without combining entities and labeling all possible predicates.
Scenes and Surroundings: Scene Graph Generation using Relation Transformer
- Computer Science, ArXiv
- 2021
A novel local-context-aware relation transformer architecture is proposed that also exploits complex global object-to-object and object-to-edge interactions, efficiently capturing dependencies between objects and predicting contextual relationships.
Linguistic Structures as Weak Supervision for Visual Scene Graph Generation
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This work explores how linguistic structures in captions can benefit scene graph generation, shows extensive experimental comparisons against prior methods that leverage instance- and image-level supervision, and ablates the method to show the impact of leveraging phrasal and sequential context, as well as of techniques to improve localization.
Relation Regularized Scene Graph Generation
- Computer Science, IEEE Transactions on Cybernetics
- 2021
A relation regularized network (R2-Net) is proposed, which can predict whether there is a relationship between two objects and encodes this relation into object feature refinement for better scene graph generation.
Structured Neural Motifs: Scene Graph Parsing via Enhanced Context
- Computer Science, MMM
- 2020
This work proposes the Structured Motif Network (StrcMN), which predicts object labels and pairwise relationships by mining more complete global context features, and significantly outperforms previous methods on the VRD and Visual Genome datasets.
LinkNet: Relational Embedding for Scene Graph
- Computer Science, NeurIPS
- 2018
This paper designs a simple and effective relational embedding module that enables the model to jointly represent connections among all related objects, rather than focusing on each object in isolation, and proves its efficacy in scene graph generation.
Learning to Generate Scene Graph from Natural Language Supervision
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This paper proposes one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph, and designs a Transformer-based model to predict these "pseudo" labels via a masked token prediction task.
Visual Relationships as Functions:Enabling Few-Shot Scene Graph Prediction
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
- 2019
This work introduces the first scene graph prediction model that supports few-shot learning of predicates, enabling scene graph approaches to generalize to a set of new predicates.
References
SHOWING 1-10 OF 53 REFERENCES
Scene Graph Generation by Iterative Message Passing
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This work explicitly models objects and their relationships using scene graphs, a visually grounded graphical structure of an image, and proposes a novel end-to-end model that generates such a structured scene representation from an input image.
Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
A deep Variation-structured Reinforcement Learning (VRL) framework is proposed to sequentially discover object relationships and attributes in the whole image, and an ambiguity-aware object mining scheme is used to resolve semantic ambiguity among object categories that the object detector fails to distinguish.
Visual Translation Embedding Network for Visual Relation Detection
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This work proposes a novel feature extraction layer that enables object-relation knowledge transfer in a fully convolutional fashion, supporting training and inference in a single forward/backward pass, and proposes the first end-to-end relation detection network.
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A Fully Convolutional Localization Network (FCLN) architecture is proposed that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization.
ViP-CNN: Visual Phrase Guided Convolutional Neural Network
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
In ViP-CNN, a Phrase-guided Message Passing Structure (PMPS) is presented to establish connections among relationship components and help the model consider the three problems jointly. Experimental results show that ViP-CNN outperforms the state-of-the-art method in both speed and accuracy.
Deep Visual-Semantic Alignments for Generating Image Descriptions
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2017
A model is presented that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.
Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues
- Computer Science, 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
This paper presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues, which achieves state-of-the-art performance on phrase localization on the Flickr30k Entities dataset and on visual relationship detection on the Stanford VRD dataset.
Detecting Visual Relationships with Deep Relational Networks
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
The proposed Deep Relational Network is a novel formulation designed specifically for exploiting the statistical dependencies between objects and their relationships and achieves substantial improvement over state-of-the-art on two large data sets.
PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN
- Computer Science, 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
A Parallel, Pairwise Region-based, Fully Convolutional Network (PPR-FCN) for weakly supervised visual relation detection (WSVRD) uses a parallel FCN architecture that simultaneously performs pair selection and classification of single regions and region pairs for object and relation detection, while sharing almost all computation over the entire image.
Detecting Actions, Poses, and Objects with Relational Phraselets
- Computer Science, ECCV
- 2012
A novel approach to modeling human pose, together with interacting objects, is presented, based on compositional models of local visual interactions and their relations, demonstrating that modeling occlusion is crucial for recognizing human-object interactions.