Corpus ID: 237260210

Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries

Qi Dong, Zhuowen Tu, Haofu Liao, Yuting Zhang, Vijay Mahadevan, Stefano Soatto
Computer vision applications such as visual relationship detection and human-object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (the triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end visual composite set detection. Different from existing Transformers in which queries…
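The abstract describes grouping part queries (subject, object, predicate) together with one sum query per triplet. A minimal sketch of that query layout, assuming placeholder vectors and hypothetical sizes (`NUM_TRIPLETS`, `D_MODEL`), not the authors' actual implementation:

```python
# Hedged sketch (not the PST code): each composite query groups three
# part queries (subject, object, predicate) with one sum query that
# represents the triplet as a whole, as described in the abstract.

NUM_TRIPLETS = 4   # hypothetical number of composite queries
D_MODEL = 8        # hypothetical embedding width

def make_composite_queries(num_triplets, d_model):
    """Build placeholder query vectors: for each triplet, three part
    queries plus one sum query (in a real model these are learnable)."""
    parts = ("subject", "object", "predicate")
    queries = []
    for _ in range(num_triplets):
        composite = {name: [0.0] * d_model for name in parts}
        composite["sum"] = [0.0] * d_model   # whole-triplet query
        queries.append(composite)
    return queries

queries = make_composite_queries(NUM_TRIPLETS, D_MODEL)
# Each composite query carries 3 part vectors + 1 sum vector, so a
# decoder would attend over 4 * NUM_TRIPLETS query vectors in total.
total_vectors = sum(len(q) for q in queries)
print(total_vectors)  # 16
```

The point of the structure is that part-level and triplet-level predictions can be decoded jointly rather than detecting entities and relations in separate stages.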


Visual Translation Embedding Network for Visual Relation Detection
This work proposes a novel feature extraction layer that enables object-relation knowledge transfer in a fully convolutional fashion, supports training and inference in a single forward/backward pass, and yields the first end-to-end relation detection network.
Pose-Aware Multi-Level Feature Network for Human Object Interaction Detection
This work develops a multi-branch deep network to learn a pose-augmented relation representation at three semantic levels, incorporating interaction context, object features and detailed semantic part cues, and demonstrates its efficacy in handling complex scenes.
Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation
This work uses knowledge of linguistic statistics to regularize visual model learning and shows that, with this linguistic knowledge distillation, the model outperforms state-of-the-art methods significantly, especially when predicting unseen relationships.
Detecting Unseen Visual Relations Using Analogies
This work learns a representation of visual relations that combines individual embeddings for subject, object and predicate together with a visual phrase embedding that represents the relation triplet, and demonstrates the benefits of this approach on three challenging datasets.
Detecting and Recognizing Human-Object Interactions
A novel model is proposed that learns to predict an action-specific density over target object locations based on the appearance of a detected person and efficiently infers interaction triplets in a clean, jointly trained end-to-end system the authors call InteractNet.
Visual Relation Detection with Multi-Level Attention
A multi-level attention visual relation detection model (MLA-VRD) is proposed, which generates salient appearance representations via a multi-stage appearance attention strategy and adaptively combines different cues with different importance weightings via a multi-cue attention strategy.
DRG: Dual Relation Graph for Human-Object Interaction Detection
The proposed dual relation graph effectively captures discriminative cues from the scene to resolve ambiguity in local predictions and achieves favorable results compared to state-of-the-art HOI detection algorithms on two large-scale benchmark datasets.
Detecting Visual Relationships with Deep Relational Networks
The proposed Deep Relational Network is a novel formulation designed specifically for exploiting the statistical dependencies between objects and their relationships, and achieves substantial improvement over the state of the art on two large datasets.
Large-Scale Visual Relationship Understanding
A new relationship detection model is developed that embeds objects and relations into two vector spaces where both discriminative capability and semantic affinity are preserved, and it achieves superior performance even when the visual entity categories scale up to more than 80,000 with an extremely skewed class distribution.
Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition
This work presents two new pooling cells that encourage feature interactions, and sheds light on how one could resolve ambiguous and noisy object and predicate annotations via Intra-Hierarchical trees (IH-tree).