GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer

@inproceedings{Li2021GroupFormerGA,
  title={GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer},
  author={Shuaicheng Li and Qianggang Cao and Lingbo Liu and Kunlin Yang and Shinan Liu and Jun Hou and Shuai Yi},
  booktitle={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021},
  pages={13648--13657}
}
Group activity recognition is a crucial yet challenging problem, whose core lies in fully exploring spatial-temporal interactions among individuals and generating reasonable group representations. However, previous methods either model spatial and temporal information separately, or directly aggregate individual features to form group features. To address these issues, we propose a novel group activity recognition network termed GroupFormer. It captures spatial-temporal contextual information… 
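As a rough illustration of the spatial-temporal interaction modeling the abstract describes (not the authors' implementation, which uses clustered attention inside transformer blocks), the sketch below applies self-attention over actors within each frame (spatial) and then over frames for each actor (temporal), before pooling into a group representation. All function names and the identity query/key/value projections are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (n_tokens, d). Single-head scaled dot-product attention with
    # identity Q/K/V projections, purely for illustration.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def spatial_temporal_block(features):
    # features: (T frames, N actors, d) per-individual features.
    T, N, _ = features.shape
    # Spatial attention: relate actors to each other within every frame.
    spatial = np.stack([self_attention(features[t]) for t in range(T)])
    # Temporal attention: relate each actor's features across frames.
    temporal = np.stack(
        [self_attention(spatial[:, n]) for n in range(N)], axis=1
    )
    return temporal  # (T, N, d), spatial-temporal contextualized

def group_representation(features):
    # Aggregate contextualized actor features over frames and actors.
    out = spatial_temporal_block(features)
    return out.mean(axis=(0, 1))  # (d,) group-level feature
```

Joint spatial-temporal modeling of this kind is what distinguishes the approach from methods that handle the two axes separately or simply average raw actor features into a group vector.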


Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition

TLDR
A distinct Dual-path Actor Interaction (Dual-AI) framework, which arranges spatial and temporal transformers in two complementary orders, enhancing actor relations by integrating merits from different spatiotemporal paths, and introduces a novel Multi-scale Actor Contrastive Loss (MAC-Loss) between the two interactive paths of Dual-AI.

Learning Graph-based Residual Aggregation Network for Group Activity Recognition

TLDR
A novel Graph-based Residual AggregatIon Network (GRAIN) is proposed to model the differences among all persons of the whole group; it is end-to-end trainable and capable of extracting a comprehensive representation and inferring the group activity.

Hunting Group Clues with Transformers for Social Group Activity Recognition

TLDR
A novel framework for social group activity recognition that is designed in such a way that the attention modules identify and then aggregate features relevant to social group activities, generating an effective feature for each social group.

Multi-Perspective Representation to Part-Based Graph for Group Activity Recognition

TLDR
This paper establishes part-based graphs from different viewpoints to model the spatial relations among the different parts of an individual, and proposes an inter-actor part graph to explore part-level relations among actors, in which both visual and location relations are considered.

COMPOSER: Compositional Learning of Group Activity in Videos

TLDR
This work proposes COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally and achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality.

Spatio-Temporal Player Relation Modeling for Tactic Recognition in Sports Videos

TLDR
This work presents a novel spatio-temporal relation modeling approach, which captures both detailed player interactions and long-range group dynamics in tactics and is able to comprehensively describe team cooperation over time in a tactic.

Detector-Free Weakly Supervised Group Activity Recognition

TLDR
This work proposes a novel model for group activity recognition that depends neither on bounding box labels nor on an object detector; it localizes and encodes partial contexts of a group activity by leveraging the attention mechanism, and represents a video clip as a set of partial context embeddings.

COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality

TLDR
This work proposes COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally, and demonstrates the model’s strength and interpretability on two widely-used datasets.

Pyramid Region-based Slot Attention Network for Temporal Action Proposal Generation

TLDR
A novel Pyramid Region-based Slot Attention Network termed PRSA-Net is presented to learn a unified visual representation with rich temporal and semantic context for better proposal generation; it outperforms other state-of-the-art methods.

One-Shot Deep Model for End-to-End Multi-Person Activity Recognition

TLDR
To the best of our knowledge, TrAct-Net is the first end-to-end trainable model to solve the whole problem in a one-shot manner, and it achieves superior performance to combinations of state-of-the-art methods with far fewer model parameters and faster inference speed.

References

Showing 1-10 of 48 references

stagNet: An Attentive Semantic RNN for Group Activity Recognition

TLDR
A novel attentive semantic recurrent neural network (RNN) for understanding group activities in videos, dubbed stagNet, is proposed; it builds on spatio-temporal attention and a semantic graph, which are adopted to attend to key persons/frames for improved performance.

Progressive Relation Learning for Group Activity Recognition

TLDR
A novel method based on deep reinforcement learning is proposed to progressively refine the low-level features and high-level relations of group activities, and a semantic relation graph (SRG) is constructed to explicitly model the relations among persons.

Empowering Relational Network by Self-attention Augmented Conditional Random Fields for Group Activity Recognition

TLDR
Experiments show that the proposed approach surpasses the state-of-the-art methods on the widely used Volleyball and Collective Activity datasets.

Actor-Transformers for Group Activity Recognition

TLDR
This paper proposes an actor-transformer model able to learn and selectively extract information relevant for group activity recognition, and achieves state-of-the-art results on two publicly available benchmarks for group activity recognition.

A Hierarchical Deep Temporal Model for Group Activity Recognition

TLDR
A 2-stage deep temporal model is presented: one LSTM represents the action dynamics of individual people in a sequence, and another LSTM aggregates person-level information for whole-activity understanding.

Learning Actor Relation Graphs for Group Activity Recognition

TLDR
This paper proposes to build a flexible and efficient Actor Relation Graph (ARG) to simultaneously capture the appearance and position relation between actors, and performs extensive experiments on two standard group activity recognition datasets.

HiRF: Hierarchical Random Field for Collective Activity Recognition in Videos

TLDR
This paper addresses the problem of recognizing and localizing coherent activities of a group of people, called collective activities, in video with a new deep model, called Hierarchical Random Field (HiRF), which models only hierarchical dependencies between model variables.

Discriminative Latent Models for Recognizing Contextual Group Activities

TLDR
This paper proposes a novel framework for recognizing group activities which jointly captures the group activity, the individual person actions, and the interactions among them and introduces a new feature representation called the action context (AC) descriptor.

Joint Learning of Social Groups, Individuals Action and Sub-group Activities in Videos

TLDR
This paper proposes an end-to-end trainable framework for the social task, and sets the state-of-the-art results on two widely adopted benchmarks for the traditional group activity recognition task.

Social Scene Understanding: End-to-End Multi-person Action Localization and Collective Activity Recognition

TLDR
A single architecture is proposed that does not rely on external detection algorithms but rather is trained end-to-end to generate dense proposal maps that are refined via a novel inference scheme.