• Corpus ID: 59553542

Rethinking Visual Relationships for High-level Image Understanding

  title={Rethinking Visual Relationships for High-level Image Understanding},
  author={Yuanzhi Liang and Yalong Bai and Wei Zhang and Xueming Qian and Li Zhu and Tao Mei},
Relationships, as the bond of isolated entities in images, reflect the interaction between objects and lead to a semantic understanding of scenes. [...] Key Method To encourage further development in relationships, we propose a novel method to mine more valuable relationships by automatically filtering out visually-irrelevant relationships. Then, we construct a new scene graph dataset named Visually-Relevant Relationships Dataset (VrR-VG) from Visual Genome. We evaluate several existing methods in scene graph…Expand
Attention-Translation-Relation Network for Scalable Scene Graph Generation
A three-stage pipeline that employs Multi-Head Attention driven by language and spatial features, Translation Embeddings and Multi-Tasking to detect an interacting pair of objects is proposed, which is able to maximize the visual features' interpretability and capture the nature of datasets of different scales.
An Empirical Study on Leveraging Scene Graphs for Visual Question Answering
This paper investigates the use of scene graphs derived from images for Visual QA by adapting the recently proposed graph network (GN) to encode the scene graph and perform structured reasoning according to the input question.
Relationship-Aware Spatial Perception Fusion for Realistic Scene Layout Generation
Experimental results including quantitative results, qualitative results and user studies on two different scene graph datasets demonstrate the proposed framework's ability to generate complex and logical layout with multiple objects from scene graph.
Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators
This work tackles two fundamental language-and-vision tasks: image-text matching and image captioning, and demonstrates that neural scene graph generators can learn effective visual relation features to facilitate grounding language to visual relations and subsequently improve the two end applications.
Character Matters: Video Story Understanding with Character-Aware Relations
This model specifically considers the characters in a video story, as well as the relations connecting different characters and objects, and enables weakly-supervised face naming through multi-instance co-occurrence matching and supports high-level reasoning utilizing Transformer structures.
Dual ResGCN for Balanced Scene GraphGeneration
A novel ResGCN is proposed that enhances object features in a cross attention manner and stack multiple contextual coefficients to alleviate the imbalance issue and enrich the prediction diversity.
In Defense of Scene Graphs for Image Captioning
The basic idea is to close the semantic gap between the two scene graphs one derived from the input image and the other from its caption, and leverage the spatial location of objects and the Human-Object-Interaction (HOI) labels as an additional HOI graph.
Scene graph parsing and its application in cross-modal reasoning tasks
OF THE DISSERTATION Scene Graph Parsing And Its Application in Cross-Modal Reasoning Tasks


Detecting Visual Relationships with Deep Relational Networks
The proposed Deep Relational Network is a novel formulation designed specifically for exploiting the statistical dependencies between objects and their relationships and achieves substantial improvement over state-of-the-art on two large data sets.
Scene Graph Generation from Objects, Phrases and Region Captions
This work proposes a novel neural network model, termed as Multi-level Scene Description Network (denoted as MSDN), to solve the three vision tasks jointly in an end-to-end manner and shows the joint learning across three tasks with the proposed method can bring mutual improvements over previous models.
Exploring Visual Relationship for Image Captioning
This paper introduces a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework that novelly integrates both semantic and spatial object relationships into image encoder.
Visual Translation Embedding Network for Visual Relation Detection
This work proposes a novel feature extraction layer that enables object-relation knowledge transfer in a fully-convolutional fashion that supports training and inference in a single forward/backward pass, and proposes the first end-toend relation detection network.
Large-Scale Visual Relationship Understanding
A new relationship detection model is developed that embeds objects and relations into two vector spaces where both discriminative capability and semantic affinity are preserved and can achieve superior performance even when the visual entity categories scale up to more than 80,000, with extremely skewed class distribution.
Neural Motifs: Scene Graph Parsing with Global Context
This work analyzes the role of motifs: regularly appearing substructures in scene graphs and introduces Stacked Motif Networks, a new architecture designed to capture higher order motifs in scene graph graphs that improves on the previous state-of-the-art by an average of 3.6% relative improvement across evaluation settings.
Scene Graph Generation by Iterative Message Passing
This work explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image, and proposes a novel end-to-end model that generates such structured scene representation from an input image.
Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection
A deep Variation-structured Re-inforcement Learning (VRL) framework is proposed to sequentially discover object relationships and attributes in the whole image, and an ambiguity-aware object mining scheme is used to resolve semantic ambiguity among object categories that the object detector fails to distinguish.
Weakly-Supervised Learning of Visual Relations
A novel approach for modeling visual relations between pairs of objects where the predicate is typically a preposition or a verb that links a pair of objects, and proposes a weakly-supervised discriminative clustering model to learn relations from image-level labels only.
Graph R-CNN for Scene Graph Generation
A novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images, is proposed and a new evaluation metric is introduced that is more holistic and realistic than existing metrics.