Improving Visual Relationship Detection Using Semantic Modeling of Scene Descriptions

Stephan Baier, Yunpu Ma, Volker Tresp

Structured scene descriptions of images are useful for the automatic processing and querying of large image databases. We show how combining a statistical semantic model with a visual model can improve the task of mapping images to their associated scene descriptions. In this paper we consider scene descriptions represented as a set of triples (subject, predicate, object), where each triple consists of a pair of visual objects that appear in the image, and the relationship… 
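The triple representation described above can be sketched as a small data structure. This is a minimal illustration, not the paper's actual implementation; the example triples and the query are hypothetical.

```python
from typing import NamedTuple, Set

class Triple(NamedTuple):
    """One (subject, predicate, object) fact about an image."""
    subject: str
    predicate: str
    object: str

# Hypothetical scene description for an image: a set of triples
# over visual objects detected in the image.
scene: Set[Triple] = {
    Triple("person", "rides", "horse"),
    Triple("horse", "on", "grass"),
}

# Such a set can be queried like a small knowledge graph,
# e.g. all relationships involving "horse":
horse_facts = {t for t in scene if "horse" in (t.subject, t.object)}
```

Treating the scene description as a set of triples is what makes the database-querying use case straightforward: standard graph or RDF-style queries apply directly.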

Improving Information Extraction from Images with Learned Semantic Models

The proposed model is compared to a novel conditional multi-way model for visual relationship detection, which does not include an explicitly trained visual prior model; potential relationships between the proposed methods and memory models of the human brain are also discussed.

Compensating Supervision Incompleteness with Prior Knowledge in Semantic Image Interpretation

This work uses Logic Tensor Networks, a novel Statistical Relational Learning framework, to perform zero-shot learning by exploiting both similarities with other seen relationships and background knowledge expressed as logical constraints between subjects, relations, and objects.

Category Specific Prediction Modules for Visual Relation Recognition

This work proposes a model that combines recent advances in Convolutional Neural Networks (deep learning) with a divide-and-conquer approach to relation detection, achieving recall rates comparable to the state of the art while remaining precise and accurate for the specific relation categories.

Classification by Attention: Scene Graph Classification with Prior Knowledge

This work takes a multi-task learning approach by introducing schema representations and implementing the classification as an attention layer between image-based representations and the schemata, allowing prior knowledge to emerge and propagate within the perception model.

Tackling the Unannotated: Scene Graph Generation with Bias-Reduced Models

This work takes a multi-task learning approach in which the classification is implemented as an attention layer between perception and prior knowledge, and shows that the model can accurately generate commonsense knowledge and that iteratively injecting this knowledge into scene representations leads to significantly higher classification performance.

Improving Semantic Annotation Using Semantic Modeling of Knowledge Embedding

A novel method for semantic modeling with prior-knowledge embedding jointly finds the semantic objects and their corresponding support relationships in images; the model parameters can be learned in a supervised learning stage.

Attention-Translation-Relation Network for Scalable Scene Graph Generation

A three-stage pipeline is proposed that employs multi-head attention driven by language and spatial features, translation embeddings, and multi-tasking to detect interacting pairs of objects; it maximizes the interpretability of the visual features and captures the nature of datasets of different scales.

Improving Visual Relation Detection using Depth Maps

This work proposes an additional metric, Macro Recall@K, demonstrates its strong performance on Visual Genome (VG), and confirms that effective utilization of depth maps within a simple yet competitive framework can improve visual relation detection by a margin of up to 8%.

Relation Transformer Network

This work presents the Relation Transformer Network, a customized transformer-based architecture that models complex object-to-object and edge-to-object interactions while taking global context into account.

Improving Visual Reasoning by Exploiting The Knowledge in Texts

A transformer-based model is proposed that creates structured knowledge from textual input, enabling the utilization of knowledge in texts; it achieves ∼8x more accurate results in scene graph classification, ∼3x in object classification, and ∼1.5x in predicate classification.

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.

Visual Relationship Detection with Language Priors

This work proposes a model that can scale to predict thousands of types of relationships from a few examples, and improves on prior work by leveraging language priors from semantic word embeddings to fine-tune the likelihood of a predicted relationship.
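The idea of rescoring a visual prediction with a word-embedding language prior can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's actual model: the embedding table, weights, and scores are all invented for the example.

```python
import math

# Toy 2-d "word embeddings" (illustrative assumption, not real vectors).
embeddings = {
    "person": [0.9, 0.1],
    "horse":  [0.7, 0.3],
    "rides":  [0.8, 0.2],
    "wears":  [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def score(subj, pred, obj, visual_score, alpha=0.5):
    # Language prior: how compatible the predicate's embedding is
    # with the subject and object embeddings. The averaging scheme
    # and the mixing weight alpha are illustrative choices.
    prior = 0.5 * (cosine(embeddings[subj], embeddings[pred])
                   + cosine(embeddings[obj], embeddings[pred]))
    return alpha * visual_score + (1 - alpha) * prior

# With equal visual scores, "rides" is favored over "wears" for the
# pair (person, horse) because its embedding is closer to both.
s_rides = score("person", "rides", "horse", visual_score=0.6)
s_wears = score("person", "wears", "horse", visual_score=0.6)
```

The point of such a prior is that plausible but rarely annotated relationships (e.g. "person rides elephant") inherit likelihood from semantically similar, frequently seen ones.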

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

The Visual Genome dataset is presented, containing over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects; it represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs.

Grouplet: A structured image representation for recognizing human and object interactions

  • Bangpeng Yao, Li Fei-Fei
  • Computer Science
    2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
  • 2010
It is shown that grouplets are more effective in classifying and detecting human-object interactions than other state-of-the-art methods, and can robustly distinguish humans playing instruments from humans merely co-occurring with the instruments without playing them.

Understanding Indoor Scenes Using 3D Geometric Phrases

A hierarchical scene model is presented for learning and reasoning about complex indoor scenes; it is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification.

Learning semantic relationships for better action retrieval in images

A novel neural network framework is proposed that jointly extracts the relationships between actions and uses them to train better action retrieval models, showing a significant improvement in mean AP compared to various baseline methods.

Recognition using visual phrases

It is shown that a visual phrase detector significantly outperforms a baseline which detects component objects and reasons about relations, even though visual phrase training sets tend to be smaller than those for objects.

Understanding web images by object relation network

An automatic system is presented which takes a raw image as input and, based on the image's visual appearance and a guide ontology, creates an Object Relation Network (ORN): a graph model representing the most probable meaning of the objects and their relations in the image.

Translating Video Content to Natural Language Descriptions

This paper generates a rich semantic representation of the visual content, including object and activity labels, and proposes to formulate the generation of natural language as a machine translation problem, using the semantic representation as the source language and the generated sentences as the target language.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, and further merges the RPN and Fast R-CNN into a single network by sharing their convolutional features.