ACTION-Net: Multipath Excitation for Action Recognition

@inproceedings{Wang2021ACTIONNet,
  title={ACTION-Net: Multipath Excitation for Action Recognition},
  author={Zhengwei Wang and Qi She and Aljosa Smolic},
  booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}
  • Published 11 March 2021
  • Computer Science
Spatio-temporal, channel-wise, and motion patterns are three complementary and crucial types of information for video action recognition. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNNs can achieve good performance but are computationally intensive. In this work, we tackle this dilemma by designing a generic and effective module that can be embedded into 2D CNNs. To this end, we propose a spAtio-temporal, Channel and moTion excitatION (ACTION…
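The channel-excitation idea the abstract describes can be sketched roughly as follows (a minimal NumPy illustration under assumed shapes and weight names; not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_excitation(feat, w1, w2):
    """Hypothetical channel-excitation sketch for a (T, C, H, W) clip:
    pool over time and space, gate each channel with a learned weight,
    and rescale the features. w1 (C, C//r) and w2 (C//r, C) stand in
    for learned projection matrices with reduction ratio r."""
    pooled = feat.mean(axis=(0, 2, 3))                 # (C,) global descriptor
    gate = sigmoid(np.maximum(pooled @ w1, 0.0) @ w2)  # (C,) weights in (0, 1)
    return feat * gate[None, :, None, None]            # broadcast per-channel rescale

# toy usage: 4 frames, 8 channels, 5x5 spatial, reduction r = 2
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 5, 5))
w1 = rng.standard_normal((8, 4)) * 0.1
w2 = rng.standard_normal((4, 8)) * 0.1
out = channel_excitation(x, w1, w2)
```

Because the gate is a sigmoid, each channel is only attenuated or passed through and tensor shapes never change, which is what lets such a module be dropped into an existing 2D backbone.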

HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors

A novel spatial-temporal feature learning and fusion framework, termed ESTF, for event-stream-based human activity recognition, which projects the event streams into spatial and temporal embeddings using StemNet and encodes and fuses the dual-view representations using Transformer networks.

STSM: Spatio-Temporal Shift Module for Efficient Action Recognition

A plug-and-play Spatio-Temporal Shift Module (STSM), which is a both effective and high-performance module that can be easily inserted into other networks to increase or enhance the ability of the network to learn spatio-temporal features, effectively improving performance without increasing the number of parameters and computational complexity.

Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation

It is demonstrated that the proposed architecture outperforms previous CNN-based methods on the "Val Top-1 %" measure for the Something-Something v1 and Jester datasets, while META yields competitive results on the Moments-in-Time Mini dataset.

Multi-grained Spatio-Temporal Features Perceived Network for Event-based Lip-Reading

A novel Multi-grained Spatio-Temporal Features Perceived Network (MSTP) is proposed to perceive fine-grained spatio-temporal features from microsecond time-resolved event data, and the first event-based lip-reading dataset (DVS-Lip) is presented.

TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial Decoding

This work introduces sampling the network input from partially decoded videos at the GOP level, and proposes a plug-and-play multi-modal learning module (TEAM) for training the network with information from I-frames and P-frames in an end-to-end manner.

CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video

CycDA is proposed, a cycle-based approach for unsupervised image-to-video domain adaptation that leverages the joint spatial information in images and videos and trains an independent spatio-temporal model to bridge the modality gap.

Dynamic Temporal Filtering in Video Models

This paper presents a new recipe for temporal feature learning, namely the Dynamic Temporal Filter (DTF), which performs spatial-aware temporal modeling in the frequency domain with a large temporal receptive field and dynamically learns a specialized frequency filter for every spatial location to model its long-range temporal dynamics.

Real-Time Risk Assessment for Road Transportation of Hazardous Materials Based on GRU-DNN with Multimodal Feature Embedding

The proposed model outperforms other widely used models in overall comparisons of ACC, AUC, F1, and the PR-RE curve; prediction similarity can serve as an effective approach to improving robustness, with launched adversarial attacks detected at a high success rate.

STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition

The proposed SpatioTemporAl cRoss (STAR)-transformer, which can effectively represent two cross-modal features as a recognizable vector, achieves a promising performance improvement over previous state-of-the-art methods.

DPCNet: Dual Path Multi-Excitation Collaborative Network for Facial Expression Representation Learning in Videos

A Dual Path multi-excitation Collaborative Network (DPCNet) is proposed to learn the critical information for facial expression representation from fewer keyframes in videos and designs a multi-frame regularization loss to enforce the representation of multiple frames in the dual view to be semantically coherent.

STM: SpatioTemporal and Motion Encoding for Action Recognition

This work proposes the STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent spatio-temporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features, and replaces the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network.

TSM: Temporal Shift Module for Efficient Video Understanding

A generic and effective Temporal Shift Module (TSM) that can achieve the performance of 3D CNNs while maintaining 2D CNN complexity, and is extended to the online setting, enabling real-time, low-latency online video recognition and video object detection.
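The shift operation at the heart of TSM is simple enough to sketch (a minimal NumPy illustration of the idea; the shapes and the `fold_div` parameter are assumed conventions, not the paper's code):

```python
import numpy as np

def temporal_shift(feat, fold_div=8):
    """Sketch of a temporal shift over a (T, C, H, W) feature map: move
    one fraction of channels backward in time and another forward, so a
    2D convolution applied afterwards mixes information across frames
    at zero extra parameter cost."""
    t, c, h, w = feat.shape
    fold = c // fold_div
    out = np.zeros_like(feat)
    out[:-1, :fold] = feat[1:, :fold]                  # shift toward the past
    out[1:, fold:2 * fold] = feat[:-1, fold:2 * fold]  # shift toward the future
    out[:, 2 * fold:] = feat[:, 2 * fold:]             # leave remaining channels as-is
    return out

# toy clip: 4 frames, 8 channels, 1x1 spatial
x = np.arange(4 * 8, dtype=float).reshape(4, 8, 1, 1)
y = temporal_shift(x, fold_div=4)  # fold = 2 channels per direction
```

Boundary frames receive zeros in the shifted channels, mirroring the zero-padding along time that shift modules typically use.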

Squeeze-and-Excitation Networks

This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked to form SENet architectures that generalise extremely effectively across different datasets.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

A new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
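The residual-learning idea can be sketched in a few lines (a hypothetical fully connected residual block in NumPy; `w1`/`w2` are stand-ins for learned weights, not the paper's convolutional layers):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Minimal residual-learning sketch: the block learns a residual
    F(x) and outputs relu(F(x) + x), so the identity mapping is easy
    to represent and very deep stacks remain optimisable."""
    fx = relu(x @ w1) @ w2  # the learned residual F(x)
    return relu(fx + x)     # identity shortcut added before the activation

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 16))
w1 = rng.standard_normal((16, 16)) * 0.01  # near-zero weights: F(x) ~ 0
w2 = rng.standard_normal((16, 16)) * 0.01
y = residual_block(x, w1, w2)
```

With near-zero weights the block collapses to relu(x), which illustrates why identity shortcuts make deeper networks no harder to fit than shallower ones.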

TEA: Temporal Excitation and Aggregation for Action Recognition

This paper proposes a Temporal Excitation and Aggregation block, including a motion excitation module and a multiple temporal aggregation module, specifically designed to capture both short- and long-range temporal evolution, and achieves impressive results at low FLOPs on several action recognition benchmarks.

TEINet: Towards an Efficient Architecture for Video Recognition

The proposed TEINet achieves good recognition accuracy on these datasets while preserving high efficiency, and captures temporal structure flexibly and effectively while remaining efficient at model inference.

Motion Feature Network: Fixed Motion Filter for Action Recognition

This paper proposes MFNet (Motion Feature Network) containing motion blocks which make it possible to encode spatio-temporal information between adjacent frames in a unified network that can be trained end-to-end.

EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition

A new benchmark dataset named EgoGesture is introduced with sufficient size, variation, and reality to be able to train deep neural networks and provides an in-depth analysis on input modality selection and domain adaptation between different scenes.