Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition

Hongjian Guo, Hanjing Wang, and Qiang Ji. "Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition." In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
A complex action consists of a sequence of atomic actions that interact with each other over a relatively long period of time. This paper introduces a probabilistic model named Uncertainty-Guided Probabilistic Transformer (UGPT) for complex action recognition. The self-attention mechanism of a Transformer is used to capture the long-term dynamics of complex actions. By explicitly modeling the distribution of the attention scores, we extend the deterministic Transformer to a…
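The abstract's central idea of placing a distribution over attention scores can be sketched as follows. This is an illustrative approximation, not the authors' exact formulation: the Gaussian parameterization of the logits, the fixed `log_sigma`, and the Monte Carlo averaging are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stochastic_attention(Q, K, V, log_sigma=-2.0, n_samples=8):
    """Treat attention logits as Gaussian and average over samples.

    Sketch only: we place N(mu, sigma^2) on the scaled dot-product
    logits and estimate a predictive mean and variance by Monte Carlo.
    """
    d_k = Q.shape[-1]
    mu = Q @ K.T / np.sqrt(d_k)                # deterministic logits
    samples = []
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        logits = mu + np.exp(log_sigma) * eps  # reparameterization trick
        samples.append(softmax(logits) @ V)
    samples = np.stack(samples)
    return samples.mean(axis=0), samples.var(axis=0)
```

The per-element variance across samples is what an uncertainty-guided model can use to weight or gate its predictions.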


Uncertainty-Based Spatial-Temporal Attention for Online Action Detection

This paper introduces a two-stream framework that combines a baseline model and a probabilistic model based on the input uncertainty, and quantifies the predictive uncertainty to generate spatial-temporal attention that focuses on regions and frames with large mutual information.

Human Action Recognition: A Taxonomy-Based Survey, Updates, and Opportunities

A taxonomy-based, rigorous study of human activity recognition techniques, discussing the best ways to acquire human action features derived from RGB and depth data, as well as the latest research on deep learning and hand-crafted techniques.

Vision Transformers for Action Recognition: A Survey

This is the first comprehensive survey of vision transformer techniques for action recognition; it investigates different network learning strategies, such as self-supervised and zero-shot learning, along with their associated losses for transformer-based action recognition.

Human Action Recognition From Various Data Modalities: A Review

This article presents a comprehensive survey of recent progress in deep learning methods for HAR based on the type of input data modality, including the fusion-based and the co-learning-based frameworks.

Efficiency 360: Efficient Vision Transformers

This paper discusses the efficiency of transformers in terms of memory, computation cost, and model performance, including accuracy, robustness, and fair and bias-free features, and introduces an Efficiency 360 framework that covers various aspects of the vision transformer to make it more efficient for industrial applications.



Exploring privileged information from simple actions for complex action recognition

The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities

The HTK toolkit, a state-of-the-art speech recognition engine, is evaluated in combination with multiple video feature descriptors, both for the recognition of cooking activities and for the semantic parsing of videos into action units.

Asynchronous Temporal Fields for Action Recognition

This work proposes a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network.

Learning latent temporal structure for complex event detection

A conditional model trained in a max-margin framework automatically discovers discriminative and interesting segments of video, while simultaneously achieving competitive accuracy on difficult detection and recognition tasks.

Bayesian Attention Belief Networks

On a variety of language understanding tasks, this paper shows that the proposed Bayesian attention belief networks method outperforms deterministic attention and state-of-the-art stochastic attention in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.

Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection

This work contributes a novel approach using a probabilistic representational model in combination with transformers to explicitly reason under uncertainties, namely uncertainty-guided transformer reasoning (UGTR), for camouflaged object detection.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applied successfully to English constituency parsing with both large and limited training data.
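The operation at the core of every Transformer-based entry in this list is scaled dot-product attention. A minimal single-head version, with illustrative shapes, can be written as:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights
```

Each row of `weights` is a probability distribution over the keys, which is exactly the quantity UGPT replaces with a random variable.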

Bayesian Transformer Language Models for Speech Recognition

  • Boyang Xue, J. Yu, H. Meng
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
A full Bayesian learning framework for Transformer LM estimation is proposed and efficient variational inference based approaches are used to estimate the latent parameter posterior distributions associated with different parts of the Transformer model architecture including multi-head self-attention, feed forward and embedding layers.
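The basic ingredient of such variational estimation, a mean-field Gaussian posterior over one weight matrix, can be sketched as follows. The shapes, initialization, and standard-normal prior here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

# q(W) = N(mu, sigma^2) elementwise, with sigma = softplus(rho) > 0.
mu = np.zeros((4, 4))
rho = np.full((4, 4), -3.0)

def sample_weight():
    """Reparameterized draw W = mu + sigma * eps, used inside the forward pass."""
    sigma = np.log1p(np.exp(rho))
    return mu + sigma * rng.standard_normal(mu.shape)

def kl_to_standard_normal():
    """KL(q(W) || N(0, I)) regularizer added to the variational objective."""
    sigma = np.log1p(np.exp(rho))
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))
```

Training would optimize `mu` and `rho` against the data likelihood plus this KL term; at inference, averaging over several `sample_weight()` draws yields a posterior-averaged prediction.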

Timeception for Complex Action Recognition

Timeception achieves impressive accuracy in recognizing the human activities of Charades, Breakfast Actions, and MultiTHUMOS, and it is demonstrated that Timeception learns long-range temporal dependencies and tolerates the varying temporal extents of complex actions.

Graph-based High-order Relation Modeling for Long-term Action Recognition

A Graph-based High-order Relation Modeling (GHRM) module exploits the high-order relations in long-term actions for long-term action recognition; a GHRM layer consists of a Temporal-GHRM branch and a Semantic-GHRM branch, which model the local temporal high-order relations and the global semantic high-order relations, respectively.