Improving Visual Relation Detection using Depth Maps

Sahand Sharifzadeh, Sina Moayed Baharlou, Max Berrendorf, Volker Tresp
2020 25th International Conference on Pattern Recognition (ICPR)
State-of-the-art visual relation detection methods rely on features extracted from RGB images, including objects' 2D positions. We discuss different feature extraction strategies from depth maps and show their critical role in relation detection. Our experiments confirm that the performance of state-of-the-art visual relation detection approaches can be significantly improved by utilizing depth map information.
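
The late-fusion idea behind such depth-aware pipelines can be sketched as follows. This is a minimal illustration, not the paper's code; the feature sizes and the concatenation design are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-object features from two separate backbones:
# one trained on RGB crops, one on the corresponding depth-map crops.
rgb_feat = rng.standard_normal(256)    # hypothetical RGB feature vector
depth_feat = rng.standard_normal(128)  # hypothetical depth feature vector

# Late fusion by concatenation: the combined vector would feed the
# relation classifier instead of the RGB features alone.
fused = np.concatenate([rgb_feat, depth_feat])
```

The choice of where to fuse (input, mid-network, or at the feature level as here) is exactly the kind of extraction-strategy question the paper examines.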


Classification by Attention: Scene Graph Classification with Prior Knowledge

This work takes a multi-task learning approach by introducing schema representations and implementing the classification as an attention layer between image-based representations and the schemata, allowing for the prior knowledge to emerge and propagate within the perception model.

Tackling the Unannotated: Scene Graph Generation with Bias-Reduced Models

This work takes a multi-task learning approach in which classification is implemented as an attention layer between perception and prior knowledge, and shows that the model can accurately generate commonsense knowledge and that iteratively injecting this knowledge into scene representations leads to significantly higher classification performance.

Relation Transformer Network

This work presents the Relation Transformer Network, a customized transformer-based architecture that models complex object-to-object and edge-to-object interactions by taking global context into account.

Relationformer: A Unified Framework for Image-to-Graph Generation

This work proposes a unified one-stage transformer-based framework, namely Relationformer, that jointly predicts objects and their relations, and introduces a novel learnable token, the [rln]-token, which exploits local and global semantic reasoning in an image through a series of mutual associations.

Improving Visual Reasoning by Exploiting The Knowledge in Texts

A transformer-based model that creates structured knowledge from textual input is proposed, enabling the utilization of the knowledge in texts; it achieves ∼8x more accurate results in scene graph classification, ∼3x in object classification, and ∼1.5x in predicate classification.

Change Detection in Aerial Images Using Three-Dimensional Feature Maps

A robust method for change detection in aerial images that extracts three-dimensional features for segmentation of objects above a defined reference surface at each instant and demonstrates the robustness of the method in addressing the problems of conventional change detection methods.

Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework

This paper re-implemented and evaluated 21 models in the PyKEEN software package, and performed a large-scale benchmarking on four datasets, providing evidence that several architectures can obtain results competitive to the state of the art when configured carefully.

The Tensor Brain: Semantic Decoding for Perception and Memory

It is argued that a biological realization of perception and memory imposes constraints on information processing, and proposed that explicit perception and declarative memories require a semantic decoder, which is based on four layers.

A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs

This work proposes several new rank-based metrics that are more easily interpreted and compared, accompanied by a demonstration of their usage in a benchmarking of knowledge graph embedding models.

The Tensor Brain: A Unified Theory of Perception, Memory and Semantic Decoding

It is argued that it is important for the agent to represent specific entities, like Jack and Sparky, and not just attributes and classes, to analyze visual scenes, spatial and social networks, and as a prerequisite for an explicit episodic memory.



Learning Rich Features from RGB-D Images for Object Detection and Segmentation

A new geocentric embedding is proposed for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity to facilitate the use of perception in fields like robotics.
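
The geocentric idea can be illustrated with a deliberately simplified sketch: back-project a depth map through a pinhole camera, then derive disparity, height, and a gravity angle per pixel. The focal length, gravity direction (-Y), and the crude normal estimate are all my assumptions, not the paper's procedure:

```python
import numpy as np

def geocentric_channels(depth, f=500.0):
    """Toy HHA-like encoding: (disparity, height, angle-with-gravity)."""
    h, w = depth.shape
    ys, _ = np.mgrid[0:h, 0:w]
    # Back-project the vertical pixel coordinate to camera-space Y.
    Y = (ys - h / 2) * depth / f
    disparity = 1.0 / np.clip(depth, 1e-6, None)   # horizontal disparity
    height = Y.max() - Y                           # height above lowest point
    # Crude surface normal from depth gradients; angle vs. gravity (-Y axis).
    dzdy, dzdx = np.gradient(depth)
    normal = np.stack([-dzdx, -dzdy, np.ones_like(depth)])
    normal /= np.linalg.norm(normal, axis=0, keepdims=True)
    angle = np.degrees(np.arccos(np.clip(normal[1], -1.0, 1.0)))
    return np.stack([disparity, height, angle])

channels = geocentric_channels(np.full((4, 4), 2.0))  # flat wall at 2 m
```

For a flat, fronto-parallel surface the disparity channel is constant and the normals are perpendicular to gravity, so the angle channel is 90 degrees everywhere.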

Improving Visual Relationship Detection Using Semantic Modeling of Scene Descriptions

It is shown how the combination of a statistical semantic model and a visual model can improve on the task of mapping images to their associated scene description, and achieves superior performance compared to the state-of-the-art method from the Stanford computer vision group.

Visual Relationship Prediction via Label Clustering and Incorporation of Depth Information

In this paper, we investigate the use of an unsupervised label clustering technique and demonstrate that it enables substantial improvements in visual relationship prediction accuracy on the Person

Visual Translation Embedding Network for Visual Relation Detection

This work proposes a novel feature extraction layer that enables object-relation knowledge transfer in a fully-convolutional fashion, supporting training and inference in a single forward/backward pass, and presents the first end-to-end relation detection network.

Deeper Depth Prediction with Fully Convolutional Residual Networks

A fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps is proposed and a novel way to efficiently learn feature map up-sampling within the network is presented.

FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture

This paper proposes an encoder-decoder type network, where the encoder part is composed of two branches of networks that simultaneously extract features from RGB and depth images and fuse depth features into the RGB feature maps as the network goes deeper.
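
The fusion step described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation; FuseNet fuses by element-wise summation at matching encoder stages, which is what the sketch assumes:

```python
import numpy as np

def fuse_stage(rgb_fmap, depth_fmap):
    """Sum the depth branch's feature map into the RGB branch's feature map."""
    assert rgb_fmap.shape == depth_fmap.shape, "stages must align spatially"
    return rgb_fmap + depth_fmap  # element-wise summation

rng = np.random.default_rng(0)
rgb = rng.standard_normal((64, 32, 32))  # hypothetical stage-1 RGB features
dep = rng.standard_normal((64, 32, 32))  # matching depth-branch features
fused = fuse_stage(rgb, dep)             # continues down the RGB encoder
```

The asymmetry is the key design choice: depth features flow into the RGB branch at each stage, while the depth branch itself stays unmodified.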

Indoor Segmentation and Support Inference from RGBD Images

The goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships, to better understand how 3D cues can best inform a structured 3D interpretation.

Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

This work uses knowledge of linguistic statistics to regularize visual model learning and suggests that with this linguistic knowledge distillation, the model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.
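
The anchor mechanism at the heart of the RPN can be sketched as follows. The specific scales and ratios are the common defaults from the paper, but the helper itself is an illustration, not the reference implementation:

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate (x1, y1, x2, y2) anchor boxes centered at one location.

    Each anchor keeps area ~ scale**2 while varying its aspect ratio (h/w).
    """
    boxes = []
    for s in scales:
        for r in ratios:
            h = s * np.sqrt(r)
            w = s / np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

anchors = make_anchors(16, 16)  # 3 scales x 3 ratios = 9 anchors per location
```

Sliding this anchor set over every feature-map location is what makes the proposals nearly cost-free: the features are computed once and shared with the detector.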