Cascade multi-head attention networks for action recognition

Jiaze Wang, Xiaojiang Peng, Yu Qiao. Computer Vision and Image Understanding.

Learning to Combine the Modalities of Language and Video for Temporal Moment Localization

Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips

This work proposes a simple yet effective Stacked Temporal Attention Module (STAM) that stacks multiple self-attention layers to compute temporal attention from global knowledge across clips, emphasizing the most discriminative features.

MFI: Multi-range Feature Interchange for Video Action Recognition

This paper proposes a novel network that captures both kinds of features in a unified 2D framework, replacing the original bottleneck blocks in ResNet with STI blocks and inserting several GRI modules between the STI blocks to form a Multi-range Feature Interchange (MFI) Network.

Multi-level Attention Fusion Network for Audio-visual Event Recognition

This study proposes the Multi-level Attention Fusion network (MAFnet), an architecture that dynamically fuses visual and audio information for event recognition, effectively improving accuracy in audio-visual event classification.

A resource conscious human action recognition framework using 26-layered deep convolutional neural network

A new 26-layered Convolutional Neural Network (CNN) architecture for accurate complex action recognition is designed, and a feature selection method named Poisson distribution along with Univariate Measures (PDaUM) is proposed.

LR-GCN: Latent Relation-Aware Graph Convolutional Network for Conversational Emotion Recognition

This paper proposes a novel approach named Latent Relation-Aware Graph Convolutional Network (LR-GCN), in which both the speaker dependency of the interlocutors and the latent correlations among the utterances are captured for ERC.

Whole-Body Keypoint and Skeleton Augmented RGB Networks for Video Action Recognition

This work investigates a new data modality in which Whole-Body Keypoint and Skeleton labels are used to capture refined body information, and designs an architecture that takes advantage of both three-dimensional convolutional neural networks and the Swin transformer to extract spatiotemporal features, resulting in advanced performance.

Learning Downstream Task by Selectively Capturing Complementary Knowledge from Multiple Self-supervisedly Learning Pretexts

This paper proposes a novel solution that leverages the attention mechanism to adaptively squeeze suitable representations for downstream tasks, and theoretically proves that gathering representations from diverse pretexts is more effective than using a single one.

Analysis of Deep Neural Networks for Human Activity Recognition in Videos—A Systematic Literature Review

This systematic study explores the deep learning techniques available for HAR, the challenges researchers face in building a robust model, and the state-of-the-art datasets used for evaluation, as well as recent advancements in stratified self-deriving feature-based deep learning architectures.

Deep Learning-Based Artistic Inheritance and Cultural Emotion Color Dissemination of Qin Opera

Hang Yu. Frontiers in Psychology, 2022.

The proposed emotion-analysis model for Qin opera, based on an attention residual network (ResNet), achieves high emotion-classification accuracy for Qin opera, and its classification performance improves as the dataset grows.

Lattice Long Short-Term Memory for Human Action Recognition

This work proposes Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations, which effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity.

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos

A unified Spatio-Temporal Attention Network (STAN) is proposed in the context of multiple modalities; unlike conventional attention-based deep networks, its temporal attention provides principled, global guidance across different modalities and video segments.

Long-term recurrent convolutional networks for visual recognition and description

A novel recurrent convolutional architecture suitable for large-scale visual learning and end-to-end trainable is proposed, and such models are shown to have distinct advantages over state-of-the-art recognition or generation models that are separately defined and/or optimized.

VideoLSTM convolves, attends and flows for action recognition

Effective Approaches to Attention-based Neural Machine Translation

A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.

Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification

It is shown that temporal information, especially longer-term patterns, may not be necessary to achieve competitive results on common trimmed video classification datasets, and a local feature integration framework based on attention clusters is proposed, which achieves competitive results across all of these datasets.

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Spatiotemporal Residual Networks for Video Action Recognition

The novel spatiotemporal ResNet is introduced and evaluated using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
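As context for the attention-based works listed here, the core operation the Transformer builds on is scaled dot-product attention, softmax(QKᵀ/√d_k)V. A minimal, dependency-free sketch (toy dimensions; the function name and example matrices are illustrative, not from any of the cited papers):

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of float vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output: attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: 2 queries, 3 key/value pairs, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(scaled_dot_product_attention(Q, K, V))
```

Each output row is a convex combination of the value vectors, with weights determined by query-key similarity; multi-head variants run several such attentions in parallel over projected inputs.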

Convolutional Two-Stream Network Fusion for Video Action Recognition

A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance is evaluated on standard benchmarks, where it achieves state-of-the-art results.