RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection

Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni, Mingqian Tang
The task of Human-Object Interaction (HOI) detection targets fine-grained visual parsing of humans interacting with their environment, enabling a broad range of applications. Prior work has demonstrated the benefits of effective architecture design and the integration of relevant cues for more accurate HOI detection. However, the design of an appropriate pre-training strategy for this task remains underexplored by existing approaches. To address this gap, we propose Relational Language-Image Pre-training (RLIP).

Progressive Learning without Forgetting

This work focuses on two challenging problems in continual learning without access to any old data: the accumulation of catastrophic forgetting caused by a gradually fading knowledge space, and the uncontrolled tug-of-war dynamics that arise when balancing stability and plasticity during the learning of new tasks.

End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

This work designs an Interactive Score module, combined with a Two-stage Bipartite Matching algorithm, to distinguish interactive human-object pairs in an action-agnostic manner, and proposes a novel EoID framework based on vision-language knowledge distillation.

Learning to Detect Human-Object Interactions With Knowledge

This work tackles the challenge of long-tail HOI categories by modeling the underlying regularities among verbs and objects in HOIs as well as general relationships, and addresses the necessity of dynamic image-specific knowledge retrieval by multi-modal learning, which leads to an enhanced semantic embedding space for HOI comprehension.

HOTR: End-to-End Human-Object Interaction Detection with Transformers

This paper presents a novel framework, referred to as HOTR, which directly predicts a set of 〈human, object, interaction〉 triplets from an image based on a transformer encoder-decoder architecture and achieves state-of-the-art performance on two HOI detection benchmarks with an inference time under 1 ms after object detection.
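HOTR follows the DETR-style set-prediction recipe, in which each predicted triplet must be matched one-to-one against a ground-truth triplet before the loss is computed. The sketch below is an illustrative version of that bipartite-matching step using SciPy's Hungarian solver; the cost here is only a class-probability term (the paper's actual matching cost also includes localization terms), and `match_predictions` is a hypothetical helper name, not the paper's API.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_scores, gt_labels):
    """One-to-one bipartite matching between predicted HOI queries and
    ground-truth triplets (illustrative DETR/HOTR-style step).

    pred_scores: (num_queries, num_classes) class probabilities per query.
    gt_labels:   (num_gt,) ground-truth class index per triplet.
    Returns (query_idx, gt_idx) pairs minimising the total cost.
    """
    # Cost of assigning query q to ground truth g: the negative probability
    # the query assigns to g's class (real models add box/IoU terms).
    cost = -pred_scores[:, gt_labels]           # (num_queries, num_gt)
    q_idx, g_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(q_idx, g_idx))
```

Because the matching is one-to-one, unmatched queries are supervised as "no interaction", which is what lets these models skip heuristic post-hoc grouping of humans, objects, and actions.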

Visual Compositional Learning for Human-Object Interaction Detection

A deep Visual Compositional Learning (VCL) framework is devised: a simple yet efficient framework that effectively addresses human-object interaction detection, largely alleviates the long-tail distribution problem, and benefits low-shot and zero-shot HOI detection.

Grounded Language-Image Pre-training

A grounded language-image pre-training model for learning object-level, language-aware, and semantic-rich visual representations that unifies object detection and phrase grounding for pre-training and can leverage massive image-text pairs by generating grounding boxes in a self-training fashion.

GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection

A Guided-Embedding Network (GEN) is proposed to attain a two-branch pipeline without post-matching, together with a Visual-Linguistic Knowledge Transfer (VLKT) training strategy that enhances interaction understanding by transferring knowledge from the visual-linguistic pre-trained model CLIP.

No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques

We show that for human-object interaction detection, a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more sophisticated approaches.

QAHOI: Query-Based Anchors for Human-Object Interaction Detection

QAHOI (Query-Based Anchors for Human-Object Interaction detection) is a transformer-based method that leverages a multi-scale architecture to extract features from different spatial scales and uses query-based anchors to predict all the elements of an HOI instance.

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
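The pre-training task described above (CLIP) is a symmetric contrastive objective: within a batch of paired image and text embeddings, each image must pick out its own caption and vice versa. A minimal NumPy sketch of that loss, assuming L2-normalized embeddings and a fixed temperature (the real model learns the temperature and uses much larger batches):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (image, text) embeddings.

    image_emb, text_emb: (N, D) arrays; row i of each forms a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix

    def cross_entropy(logits, axis):
        # Softmax cross-entropy where the diagonal holds the correct pairs
        log_probs = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average of the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, axis=1) + cross_entropy(logits, axis=0))
```

Minimizing this loss pulls matched image-text pairs together and pushes all other in-batch pairings apart, which is what makes the resulting representations transferable without task-specific labels.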

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

This paper proposes to utilize a Verb Semantic Model (VSM) with semantic aggregation to benefit from this object-guided hierarchy, and to generate cross-modality-aware visual and semantic features via Cross-Modal Calibration (CMC).

iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection

This paper proposes an instance-centric attention module that learns to dynamically highlight regions in an image conditioned on the appearance of each instance and allows an attention-based network to selectively aggregate features relevant for recognizing HOIs.