Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition

Mingfei Han, David Junhao Zhang, Yali Wang, Rui Yan, L. Yao, Xiaojun Chang, Y. Qiao
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Learning spatial-temporal relations among multiple actors is crucial for group activity recognition. Different group activities often exhibit diverse interactions between actors in the video. Hence, it is often difficult to model complex group activities from a single view of spatial-temporal actor evolution. To tackle this problem, we propose a distinct Dual-path Actor Interaction (Dual-AI) framework, which flexibly arranges spatial and temporal transformers in two complementary orders…
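The idea of applying spatial and temporal attention in two complementary orders can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`self_attention`, `spatial_step`, `temporal_step`, `dual_path`) are hypothetical, the attention has no learned projections, and the way the two paths are fused is not covered by the abstract, so the sketch simply returns both outputs.

```python
import math

def self_attention(tokens):
    """Unparameterized self-attention over a list of feature vectors.

    Assumption: identity query/key/value projections, used only for illustration.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, tokens)) for j in range(d)])
    return out

def spatial_step(x):
    """Attend across actors within each frame; x is indexed [frame][actor][channel]."""
    return [self_attention(frame) for frame in x]

def temporal_step(x):
    """Attend across frames separately for each actor."""
    num_frames, num_actors = len(x), len(x[0])
    per_actor = [self_attention([x[t][n] for t in range(num_frames)])
                 for n in range(num_actors)]
    # Transpose back from [actor][frame] to [frame][actor].
    return [[per_actor[n][t] for n in range(num_actors)] for t in range(num_frames)]

def dual_path(x):
    """Return the spatial-then-temporal and temporal-then-spatial outputs."""
    spatial_then_temporal = temporal_step(spatial_step(x))
    temporal_then_spatial = spatial_step(temporal_step(x))
    return spatial_then_temporal, temporal_then_spatial

# Toy input: 2 frames, 3 actors, 2 feature channels.
features = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]],
]
path_a, path_b = dual_path(features)
```

Both paths keep the `[frame][actor][channel]` layout of the input, so the two views of actor evolution can be compared or combined actor by actor.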

Learning Action-guided Spatio-temporal Transformer for Group Activity Recognition

A novel Action-guided Spatio-Temporal transFormer (ASTFormer) is proposed to capture the interaction relations for group activity recognition by learning action-centric aggregation and modeling spatio-temporal action dependencies.

An Efficient Spatio-Temporal Pyramid Transformer for Action Detection

This work proposes to use local window attention to encode rich local spatio-temporal representations in the early stages while applying global attention modules to capture long-term space-time dependencies in the later stages, which serves as an effective and efficient end-to-end Transformer-based framework for action detection.

Human Action Recognition from Various Data Modalities: A Review

This paper presents a comprehensive survey of recent progress in deep learning methods for HAR based on the type of input data modality, including the fusion-based and the co-learning-based frameworks.



A Hierarchical Deep Temporal Model for Group Activity Recognition

A two-stage deep temporal model is presented: one LSTM model represents the action dynamics of individual people in a sequence, and another LSTM model aggregates person-level information for whole-activity understanding.

Attention is All you Need

A new, simple network architecture, the Transformer, is proposed. It is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
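As a brief illustration of the attention mechanism this entry refers to, the following sketch computes scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, on plain Python lists. It omits learned projections, multiple heads, and masking, and the function names are chosen here for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V, where each argument is a list of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# A zero query scores every key equally, so the output is the mean of the values.
result = scaled_dot_product_attention([[0.0, 0.0]],
                                      [[1.0, 0.0], [0.0, 1.0]],
                                      [[2.0], [4.0]])
```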

What are they doing? : Collective activity classification using spatio-temporal relationship among people

A new framework for pedestrian action categorization that enables the classification of actions whose semantics can only be analyzed by looking at the collective behavior of pedestrians in the scene, and that outperforms state-of-the-art action classification techniques.

Social Adaptive Module for Weakly-supervised Group Activity Recognition

This paper presents a new task named weakly-supervised group activity recognition (GAR) which differs from conventional GAR tasks in that only video-level labels are available, yet the important

GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer

A novel group activity recognition network termed GroupFormer captures spatial-temporal contextual information jointly to augment the individual and group representations effectively with a clustered spatial-temporal transformer.

Spatio-Temporal Dynamic Inference Network for Group Activity Recognition

The proposed Dynamic Inference Network (DIN), which is composed of a Dynamic Relation module and a Dynamic Walk module, achieves significant improvement over previous state-of-the-art methods on two popular datasets under the same setting, while incurring much less computational overhead in the reasoning module.

Learning Visual Context for Group Activity Recognition

This paper proposes a new reasoning paradigm for incorporating global contextual information, the Transformer-based Context Encoding (TCE) module, which enhances individual representations by encoding global contextual information into individual features and refining the aggregated information.

Actor-Transformers for Group Activity Recognition

This paper proposes an actor-transformer model able to learn and selectively extract information relevant for group activity recognition, and achieves state-of-the-art results on two publicly available benchmarks for group activity recognition.

Learning Actor Relation Graphs for Group Activity Recognition

This paper proposes to build a flexible and efficient Actor Relation Graph (ARG) to simultaneously capture the appearance and position relation between actors, and performs extensive experiments on two standard group activity recognition datasets.

Social Scene Understanding: End-to-End Multi-person Action Localization and Collective Activity Recognition

A single architecture is proposed that does not rely on external detection algorithms but rather is trained end-to-end to generate dense proposal maps that are refined via a novel inference scheme.