Improving Visual Relationship Detection Using Semantic Modeling of Scene Descriptions

@inproceedings{Baier2017ImprovingVR,
  title={Improving Visual Relationship Detection Using Semantic Modeling of Scene Descriptions},
  author={Stephan Baier and Yunpu Ma and Volker Tresp},
  booktitle={International Semantic Web Conference (ISWC)},
  year={2017}
}
Structured scene descriptions of images are useful for the automatic processing and querying of large image databases. We show how combining a statistical semantic model with a visual model can improve the task of mapping images to their associated scene descriptions. In this paper we consider scene descriptions represented as sets of triples (subject, predicate, object), where each triple consists of a pair of visual objects, which appear in the image, and the relationship…
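
The following is a minimal sketch of the general idea, not the authors' exact formulation: a DistMult-style semantic prior trained on (subject, predicate, object) triples is blended with a visual model's predicate probability. All names, dimensions, and the mixing scheme are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_entities, n_predicates, dim = 100, 70, 32

# Latent embeddings that a semantic (link-prediction) model would learn.
E = rng.normal(size=(n_entities, dim))     # entity embeddings
P = rng.normal(size=(n_predicates, dim))   # predicate embeddings

def semantic_score(s, p, o):
    """DistMult score: <e_s, w_p, e_o> = sum_k e_s[k] * w_p[k] * e_o[k]."""
    return float(np.sum(E[s] * P[p] * E[o]))

def combined_score(visual_prob, s, p, o, alpha=0.5):
    """Blend the visual model's predicate probability with the
    sigmoid-squashed semantic prior; alpha is a mixing weight."""
    prior = 1.0 / (1.0 + np.exp(-semantic_score(s, p, o)))
    return alpha * visual_prob + (1 - alpha) * prior

# e.g. rescoring a triple such as (person, rides, horse) detected in an image
print(combined_score(visual_prob=0.8, s=3, p=12, o=47))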

Improving Information Extraction from Images with Learned Semantic Models

TLDR
The proposed model is compared to a novel conditional multi-way model for visual relationship detection that does not include an explicitly trained visual prior model, and potential relationships between the proposed methods and memory models of the human brain are discussed.

Compensating Supervision Incompleteness with Prior Knowledge in Semantic Image Interpretation

TLDR
Logic Tensor Networks, a novel Statistical Relational Learning framework, are used to exploit both similarities with other seen relationships and background knowledge, expressed as logical constraints between subjects, relations, and objects, in order to perform zero-shot learning.
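
A hedged sketch of the core mechanism behind Logic Tensor Networks: logical background knowledge becomes a differentiable penalty over the truth degrees the model assigns to grounded atoms. The specific rule and atom names below are illustrative, not taken from the paper.

import torch

def lukasiewicz_implies(a, b):
    """Fuzzy implication a -> b as the truth degree clamp(1 - a + b, 0, 1)."""
    return torch.clamp(1.0 - a + b, 0.0, 1.0)

# Model-predicted truth degrees for grounded atoms over pairs (x, y),
# e.g. wears(x, y) and clothing(y). Constraint: wears(x, y) -> clothing(y).
wears = torch.tensor([0.9, 0.7, 0.2], requires_grad=True)
clothing = torch.tensor([0.8, 0.3, 0.9], requires_grad=True)

# A universally quantified rule is aggregated (here: mean truth degree);
# 1 - satisfaction becomes a penalty added to the supervised loss.
constraint_loss = 1.0 - lukasiewicz_implies(wears, clothing).mean()
constraint_loss.backward()  # gradients push predictions toward the rule
print(float(constraint_loss))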

Category Specific Prediction Modules for Visual Relation Recognition

TLDR
This work proposes a model using recent advances in novel applications of Convolutional Neural Networks (deep learning) combined with a divide-and-conquer approach to relation detection, which provides recall rates comparable to state-of-the-art research while remaining precise and accurate for the specific relation categories.

Classification by Attention: Scene Graph Classification with Prior Knowledge

TLDR
This work takes a multi-task learning approach by introducing schema representations and implementing the classification as an attention layer between image-based representations and the schemata, allowing prior knowledge to emerge and propagate within the perception model.
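
A minimal sketch of "classification as attention" as described above: image-based region representations attend over learned schema (class) embeddings, and the attention weights double as class scores. Dimensions and names are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn.functional as F

n_regions, n_classes, dim = 5, 150, 64
region_feats = torch.randn(n_regions, dim)   # from the perception model
schemata = torch.randn(n_classes, dim)       # learned class schemata

# Scaled dot-product attention: regions are queries, schemata keys/values.
scores = region_feats @ schemata.T / dim ** 0.5  # (n_regions, n_classes)
attn = F.softmax(scores, dim=-1)                 # class distribution per region
context = attn @ schemata                        # schema knowledge flowing back

print(attn.argmax(dim=-1))  # predicted class per region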

Tackling the Unannotated: Scene Graph Generation with Bias-Reduced Models

TLDR
This work takes a multi-task learning approach, where the classification is implemented as an attention layer for perception and prior knowledge, and shows that the model can accurately generate commonsense knowledge and that iteratively injecting this knowledge into scene representations leads to significantly higher classification performance.

Improving Semantic Annotation Using Semantic Modeling of Knowledge Embedding

TLDR
A novel method for semantic modeling with prior knowledge embedding is proposed that jointly finds the semantic objects and their corresponding support relationships in images, with parameters learned in a supervised learning stage.

Attention-Translation-Relation Network for Scalable Scene Graph Generation

TLDR
A three-stage pipeline is proposed that employs multi-head attention driven by language and spatial features, translation embeddings, and multi-tasking to detect interacting object pairs; it maximizes the interpretability of the visual features and captures the nature of datasets of different scales.
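
A hedged sketch of the translation-embedding idea (in the spirit of TransE) used here for predicates: a relationship is scored by how well subject + predicate translates to object in embedding space. Shapes and names below are illustrative stand-ins, not the paper's trained components.

import numpy as np

rng = np.random.default_rng(1)
dim = 50
subj = rng.normal(size=dim)              # embedding of the subject's features
obj = rng.normal(size=dim)               # embedding of the object's features
predicates = rng.normal(size=(70, dim))  # one translation vector per predicate

# A lower distance ||s + p - o|| means a better-fitting predicate.
distances = np.linalg.norm(subj[None, :] + predicates - obj[None, :], axis=1)
best_predicate = int(np.argmin(distances))
print(best_predicate, distances[best_predicate])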

Improving Visual Relation Detection using Depth Maps

TLDR
This work proposes an additional metric, Macro Recall@K, demonstrates its value on Visual Genome (VG), and confirms that effectively utilizing depth maps within a simple yet competitive framework can improve visual relation detection by a margin of up to 8%.
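
The sketch below illustrates the difference between the standard (micro) Recall@K and a macro variant averaged over predicate classes, which keeps frequent predicates from dominating the score. The exact definition in the paper may differ; this is an illustrative implementation.

from collections import defaultdict

def macro_recall_at_k(gt_triples, topk_triples):
    """gt_triples / topk_triples: per-image sets of (subj, pred, obj),
    with topk_triples already truncated to the model's top-K outputs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for gt, topk in zip(gt_triples, topk_triples):
        for triple in gt:
            pred = triple[1]
            totals[pred] += 1
            hits[pred] += triple in topk
    per_predicate = [hits[p] / totals[p] for p in totals]
    return sum(per_predicate) / len(per_predicate)

gt = [{("person", "on", "horse"), ("person", "wears", "hat")}]
topk = [{("person", "on", "horse")}]
print(macro_recall_at_k(gt, topk))  # 0.5: "on" recalled, "wears" missed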

Relation Transformer Network

TLDR
This work presents the Relation Transformer Network, a customized transformer-based architecture that models complex object-to-object and edge-to-object interactions by taking global context into account.

Improving Visual Reasoning by Exploiting The Knowledge in Texts

TLDR
A transformer-based model that creates structured knowledge from textual input is proposed, enabling the utilization of the knowledge in texts; it achieves ∼8x more accurate results in scene graph classification, ∼3x in object classification, and ∼1.5x in predicate classification.

References

Showing 1-10 of 40 references

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

TLDR
This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%.

Visual Relationship Detection with Language Priors

TLDR
This work proposes a model that can scale to predicting thousands of types of relationships from a few examples, and improves on prior work by leveraging language priors from semantic word embeddings to fine-tune the likelihood of a predicted relationship.
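
A hedged sketch of a word-embedding language prior in the spirit of this paper: concatenated subject/object word vectors are projected to per-predicate scores that rescale the visual module's prediction. The projection, embeddings, and combination rule here are illustrative stand-ins, not the paper's trained model.

import numpy as np

rng = np.random.default_rng(2)
emb_dim, n_predicates = 300, 70
word_vec = {w: rng.normal(size=emb_dim) for w in ["person", "horse"]}

W = rng.normal(size=(n_predicates, 2 * emb_dim)) * 0.01  # learned in practice
b = np.zeros(n_predicates)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def language_prior(subj, obj):
    """f(R): predicate likelihoods from the pair's word embeddings."""
    x = np.concatenate([word_vec[subj], word_vec[obj]])
    return softmax(W @ x + b)

visual_scores = rng.random(n_predicates)          # V(R) from the vision model
final = visual_scores * language_prior("person", "horse")
print(int(np.argmax(final)))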

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

TLDR
The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs.

Grouplet: A structured image representation for recognizing human and object interactions

  • Bangpeng Yao, Li Fei-Fei
  • Computer Science
    2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
  • 2010
TLDR
It is shown that grouplets are more effective in classifying and detecting human-object interactions than other state-of-the-art methods, and that they can make a robust distinction between humans playing instruments and humans merely co-occurring with instruments without playing them.

Understanding Indoor Scenes Using 3D Geometric Phrases

TLDR
A hierarchical scene model for learning and reasoning about complex indoor scenes is presented that is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification.

Learning semantic relationships for better action retrieval in images

TLDR
A novel neural network framework is proposed which jointly extracts the relationships between actions and uses them to train better action retrieval models, showing a significant improvement in mean AP compared to different baseline methods.

Recognition using visual phrases

TLDR
It is shown that a visual phrase detector significantly outperforms a baseline which detects component objects and reasons about relations, even though visual phrase training sets tend to be smaller than those for objects.

Understanding web images by object relation network

TLDR
An automatic system is presented which takes a raw image as input and creates an Object Relation Network (ORN), a graph model representing the most probable meaning of the objects and their relations in the image, based on visual appearance and a guide ontology.

Translating Video Content to Natural Language Descriptions

TLDR
This paper generates a rich semantic representation of the visual content, including, e.g., object and activity labels, and proposes to formulate the generation of natural language as a machine translation problem, using the semantic representation as the source language and the generated sentences as the target language.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

TLDR
This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.
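
A minimal sketch of an RPN head as described here: a 3x3 conv slides over shared backbone features, feeding two sibling 1x1 convs that output objectness scores and box regressions for k anchors at each location. Channel sizes and the feature-map shape below are illustrative assumptions.

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # object / not object
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # box deltas per anchor

    def forward(self, feats):
        x = torch.relu(self.conv(feats))
        return self.cls(x), self.reg(x)

feats = torch.randn(1, 512, 38, 50)   # shared full-image conv feature map
scores, deltas = RPNHead()(feats)
print(scores.shape, deltas.shape)     # per-anchor outputs at every location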