Corpus ID: 236772495

GTNet: Guided Transformer Network for Detecting Human-Object Interactions

by A S M Iftekhar, Satish Kumar, R. Austin McEver, Suya You and B. S. Manjunath
The human-object interaction (HOI) detection task refers to localizing humans, localizing objects, and predicting the interactions between each human-object pair. HOI is considered one of the fundamental steps in truly understanding complex visual scenes. For detecting HOI, it is important to utilize relative spatial configurations and object semantics to find salient spatial regions of images that highlight the interactions between human-object pairs. This issue is addressed by the novel self-attention based Guided Transformer Network (GTNet).
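As a rough illustration of the task structure described above (the function and field names here are hypothetical, not GTNet's actual interface), HOI detection can be viewed as enumerating every detected human-object pair and scoring each pair over a verb vocabulary:

```python
from itertools import product

# Hypothetical sketch of the HOI task structure, not GTNet's API:
# detections are paired exhaustively, and a head would score each
# pair over the verb vocabulary.

def pair_candidates(humans, objects):
    """Enumerate every human-object pair to be scored for interactions."""
    return list(product(humans, objects))

humans = [{"box": (10, 20, 80, 200)}]                        # (x1, y1, x2, y2)
objects = [{"box": (60, 120, 140, 180), "label": "bicycle"},
           {"box": (0, 0, 30, 30), "label": "backpack"}]
pairs = pair_candidates(humans, objects)
# an interaction head would then assign each pair verb scores,
# e.g. {"ride": 0.91, "hold": 0.40, ...}
print(len(pairs))  # 2 pairs for 1 human and 2 objects
```

This exhaustive pairing is why spatial configuration and object semantics matter: they let the model suppress the many implausible pairs.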


What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions
We propose a novel one-stage semantic and spatial refined transformer (SSRT) to solve the human-object interaction detection task, which requires localizing humans and objects and predicting the interactions between each pair.
iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection
This paper proposes an instance-centric attention module that learns to dynamically highlight regions in an image conditioned on the appearance of each instance and allows an attention-based network to selectively aggregate features relevant for recognizing HOIs.
VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions
The proposed Visual-Spatial-Graph Network (VSGNet) architecture extracts visual features from the human-object pairs, refines the features with spatial configurations of the pair, and utilizes the structural connections between the pair via graph convolutions.
DRG: Dual Relation Graph for Human-Object Interaction Detection
The proposed dual relation graph effectively captures discriminative cues from the scene to resolve ambiguity from local predictions and leads to favorable results compared to the state-of-the-art HOI detection algorithms on two large-scale benchmark datasets.
Learning to Detect Human-Object Interactions
Experiments demonstrate that the proposed Human-Object Region-based Convolutional Neural Networks (HO-RCNN), by exploiting human-object spatial relations through Interaction Patterns, significantly improves the performance of HOI detection over baseline approaches.
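HO-RCNN's Interaction Pattern is, in essence, a two-channel binary mask rendered in the coordinate frame of the pair's union box (one channel for the human box, one for the object box). A minimal sketch, with the grid size and function name being our own choices rather than the paper's:

```python
import numpy as np

def interaction_pattern(h_box, o_box, size=64):
    """Rasterize a human box and an object box into a 2-channel
    binary mask over their union box (an HO-RCNN-style sketch)."""
    # union box of the pair
    x1 = min(h_box[0], o_box[0]); y1 = min(h_box[1], o_box[1])
    x2 = max(h_box[2], o_box[2]); y2 = max(h_box[3], o_box[3])
    w, h = x2 - x1, y2 - y1
    pattern = np.zeros((2, size, size), dtype=np.float32)
    for ch, (bx1, by1, bx2, by2) in enumerate([h_box, o_box]):
        # map the box into union-box coordinates, then scale to the grid
        gx1 = int((bx1 - x1) / w * size); gy1 = int((by1 - y1) / h * size)
        gx2 = int((bx2 - x1) / w * size); gy2 = int((by2 - y1) / h * size)
        pattern[ch, gy1:gy2, gx1:gx2] = 1.0
    return pattern
```

Encoding the pair's relative layout this way lets a small convolutional branch learn that, say, "ride" implies the human box sits above and overlaps the object box.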
Deep Contextual Attention for Human-Object Interaction Detection
This work proposes a contextual attention framework for human-object interaction detection that leverages context by learning contextually-aware appearance features for human and object instances and adaptively selects relevant instance-centric context information to highlight image regions likely to contain human-object interactions.
Learning Human-Object Interaction Detection Using Interaction Points
This paper proposes a novel fully-convolutional approach that directly detects the interactions between human-object pairs by predicting interaction points, which directly localize and classify the interaction.
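To our understanding, the interaction point in this formulation is the midpoint between the human and object box centers, which is where the interaction class is predicted. A hedged sketch of that geometry (helper names are ours):

```python
def box_center(box):
    """Center (cx, cy) of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def interaction_point(h_box, o_box):
    """Midpoint between the human center and the object center:
    the location at which an interaction class would be predicted."""
    (hx, hy), (ox, oy) = box_center(h_box), box_center(o_box)
    return ((hx + ox) / 2.0, (hy + oy) / 2.0)

print(interaction_point((0, 0, 10, 10), (10, 10, 30, 30)))  # (12.5, 12.5)
```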
HOTR: End-to-End Human-Object Interaction Detection with Transformers
This paper presents a novel framework, referred to as HOTR, which directly predicts a set of 〈human, object, interaction〉 triplets from an image based on a transformer encoder-decoder architecture and achieves state-of-the-art performance on two HOI detection benchmarks with an inference time under 1 ms after object detection.
ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection
ConsNet is proposed, a knowledge-aware framework that explicitly encodes the relations among objects, actions and interactions into an undirected graph called consistency graph, and exploits Graph Attention Networks (GATs) to propagate knowledge among HOI categories as well as their constituents.
Detecting Human-Object Interactions via Functional Generalization
This work presents an approach for detecting human-object interactions (HOIs) in images, based on the idea that humans interact with functionally similar objects in a similar manner, and demonstrates that using a generic object detector, the model can generalize to interactions involving previously unseen objects.
Visual Compositional Learning for Human-Object Interaction Detection
A deep Visual Compositional Learning (VCL) framework is devised: a simple yet efficient framework that effectively addresses human-object interaction detection, largely alleviates the long-tail distribution problem, and benefits low-shot and zero-shot HOI detection.