HCVRD: A Benchmark for Large-Scale Human-Centered Visual Relationship Detection

Bohan Zhuang, Qi Wu, Chunhua Shen, Ian D. Reid, Anton van den Hengel
Visual relationship detection aims to capture interactions between pairs of objects in images. Relationships between objects and humans represent a particularly important subset of this problem, with implications for challenges such as understanding human behavior and identifying affordances, amongst others. In addressing this problem we first construct a large-scale human-centric visual relationship detection dataset (HCVRD), which provides many more types of relationship annotations…


Learning to Detect Human-Object Interactions With Knowledge

This work tackles the challenge of long-tail HOI categories by modeling the underlying regularities among verbs and objects in HOIs as well as general relationships, and addresses the necessity of dynamic image-specific knowledge retrieval by multi-modal learning, which leads to an enhanced semantic embedding space for HOI comprehension.

Detecting Human-Object Interactions with Action Co-occurrence Priors

This paper models the correlations among human-object interactions as action co-occurrence matrices and presents techniques to learn these priors and leverage them for more effective training, especially for rare classes.

DRG: Dual Relation Graph for Human-Object Interaction Detection

The proposed dual relation graph effectively captures discriminative cues from the scene to resolve ambiguity from local predictions and leads to favorable results compared to the state-of-the-art HOI detection algorithms on two large-scale benchmark datasets.

Deep Contextual Attention for Human-Object Interaction Detection

This work proposes a contextual attention framework for human-object interaction detection that leverages context by learning contextually-aware appearance features for human and object instances and adaptively selects relevant instance-centric context information to highlight image regions likely to contain human-object interactions.

Zero-Shot Human-Object Interaction Recognition via Affordance Graphs

A new approach for Zero-Shot Human-Object Interaction Recognition in the challenging setting that involves interactions with unseen actions (as opposed to just unseen combinations of seen actions and objects) is proposed and outperforms the current state of the art.

Learning to detect visual relations

A weakly-supervised approach is proposed which, given pre-trained object detectors, enables relation detectors to be learned using image-level labels only, maintaining performance close to fully-supervised models.

Interact as You Intend: Intention-Driven Human-Object Interaction Detection

The proposed human intention-driven HOI detection (iHOI) framework models human pose with the relative distances from body joints to the object instances and utilizes human gaze to guide the attended contextual regions in a weakly-supervised setting.

Cascaded Human-Object Interaction Recognition

This work introduces a cascade architecture for a multi-stage, coarse-to-fine HOI understanding, and makes the framework flexible to perform fine-grained pixel-wise relation segmentation; this provides a new glimpse into better relation modeling.

Few-Shot Human-Object Interaction Recognition With Semantic-Guided Attentive Prototypes Network

A Semantic-guided Attentive Prototypes Network (SAPNet) framework to learn a semantic-guided metric space where HOI recognition can be performed by computing distances to attentive prototypes of each class, and generates attentive prototypes guided by the category names of actions and objects.

Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions

Bongard-HOI, a new visual reasoning benchmark that focuses on compositional learning of human-object interactions (HOIs) from natural images, is introduced, inspired by two desirable characteristics of the classical Bongard problems: few-shot concept learning and context-dependent reasoning.

Learning to Detect Human-Object Interactions

Experiments demonstrate that the proposed Human-Object Region-based Convolutional Neural Networks (HO-RCNN), by exploiting human-object spatial relations through Interaction Patterns, significantly improves the performance of HOI detection over baseline approaches.

Visual Relationship Detection with Language Priors

This work proposes a model that can scale to predict thousands of types of relationships from a few examples and improves on prior work by leveraging language priors from semantic word embeddings to fine-tune the likelihood of a predicted relationship.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.

Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection

A deep Variation-structured Reinforcement Learning (VRL) framework is proposed to sequentially discover object relationships and attributes in the whole image, and an ambiguity-aware object mining scheme is used to resolve semantic ambiguity among object categories that the object detector fails to distinguish.

HICO: A Benchmark for Recognizing Human-Object Interactions in Images

An in-depth analysis of representative current approaches is performed and it is shown that DNNs enjoy a significant edge and that semantic knowledge can significantly improve HOI recognition, especially for uncommon categories.

Visual Semantic Role Labeling

The problem of Visual Semantic Role Labeling is introduced: given an image, the goal is to detect people doing actions, localize the objects of interaction, and associate objects in the scene with different semantic roles for each action.

Grouplet: A structured image representation for recognizing human and object interactions

  • Bangpeng Yao, Li Fei-Fei
  • 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010
It is shown that grouplets are more effective in classifying and detecting human-object interactions than other state-of-the-art methods and can make a robust distinction between humans playing the instruments and humans co-occurring with the instruments without playing.

ViP-CNN: Visual Phrase Guided Convolutional Neural Network

In ViP-CNN, a Phrase-guided Message Passing Structure (PMPS) is presented to establish the connection among relationship components and help the model consider the three problems jointly. Experimental results show that ViP-CNN outperforms the state-of-the-art method in both speed and accuracy.

Attribute-Based Classification for Zero-Shot Visual Object Categorization

We study the problem of object recognition for categories for which we have no training examples, a task also called zero-data or zero-shot learning. This situation has hardly been studied in…

Attend in Groups: A Weakly-Supervised Deep Learning Framework for Learning from Web Data

This work proposes an end-to-end weakly-supervised deep learning framework which is robust to the label noise in Web images and relies on two unified strategies, random grouping and attention, to effectively reduce the negative impact of noisy web image annotations.