ACTION-Net: Multipath Excitation for Action Recognition

@article{Wang2021ACTIONNetME,
  title={ACTION-Net: Multipath Excitation for Action Recognition},
  author={Zhengwei Wang and Qi She and Aljosa Smolic},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={13209-13218}
}
  • Zhengwei Wang, Qi She, A. Smolic
  • Published 11 March 2021
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Spatial-temporal, channel-wise, and motion patterns are three complementary and crucial types of information for video action recognition. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNNs can achieve good performance but are computationally intensive. In this work, we tackle this dilemma by designing a generic and effective module that can be embedded into 2D CNNs. To this end, we propose a spAtio-temporal, Channel and moTion excitatION (ACTION…
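
The abstract above describes a lightweight module that plugs into a frame-wise 2D CNN and re-weights its features along complementary excitation paths. As a rough illustration only (not the official ACTION-Net code), the sketch below shows one such path in PyTorch: per-frame channel descriptors are mixed across time by a small 1D convolution and turned into a sigmoid gate. The class name, n_segment argument, and reduction ratio are assumptions for the example, and features are assumed to be laid out as (batch*frames, channels, H, W).

# Illustrative sketch only (not the official ACTION-Net module): a channel-excitation
# path that a 2D-CNN video backbone could insert between residual blocks.
import torch
import torch.nn as nn

class ChannelExcitation(nn.Module):
    def __init__(self, channels: int, n_segment: int, reduction: int = 16):
        super().__init__()
        self.n_segment = n_segment
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        # 1D convolution over the frame axis mixes information across time
        self.temporal = nn.Conv1d(channels // reduction, channels // reduction,
                                  kernel_size=3, padding=1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // self.n_segment
        s = x.mean(dim=(2, 3), keepdim=True)                # squeeze space: (N*T, C, 1, 1)
        s = self.fc1(s)                                      # reduce channels
        s = s.view(n, self.n_segment, -1).transpose(1, 2)    # (N, C/r, T)
        s = self.temporal(s)                                 # temporal mixing
        s = s.transpose(1, 2).reshape(nt, -1, 1, 1)
        gate = torch.sigmoid(self.fc2(s))                    # per-frame channel gate
        return x * gate                                      # recalibrate the 2D features

if __name__ == "__main__":
    feats = torch.randn(2 * 8, 64, 14, 14)                  # 2 clips x 8 frames
    print(ChannelExcitation(64, n_segment=8)(feats).shape)

Because the gating branch operates on pooled descriptors, a module of this kind adds very few parameters and FLOPs relative to the 2D backbone it augments, which is the trade-off the abstract targets.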

STSM: Spatio-Temporal Shift Module for Efficient Action Recognition

TLDR
A plug-and-play Spatio-Temporal Shift Module (STSM), which is both an effective and a high-performance module that can be easily inserted into other networks to enhance their ability to learn spatio-temporal features, improving performance without increasing the number of parameters or the computational complexity.

Multi-grained Spatio-Temporal Features Perceived Network for Event-based Lip-Reading

TLDR
A novel Multi-grained Spatio-Temporal Features Perceived Network (MSTP) is proposed to perceive fine-grained spatio-temporal features from microsecond time-resolved event data, and the first event-based lip-reading dataset (DVS-Lip) is presented.

TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial Decoding

TLDR
This work introduces sampling the network input from partially decoded videos at the GOP level, and proposes a plug-and-play mulTi-modal lEArning Module (TEAM) for training the network using information from I-frames and P-frames in an end-to-end manner.

Physical Adversarial Attack meets Computer Vision: A Decade Survey

Although Deep Neural Networks (DNNs) have achieved impressive results in computer vision, their exposed vulnerability to adversarial attacks remains a serious concern. A series of works has shown…

Spatio-Temporal Self-Supervision Enhanced Transformer Networks for Action Recognition

TLDR
The authors' proposed STTNet can adaptively encode the spatial and temporal enhanced key features, which are respectively learned through the Temporal and Spatial Self-Supervised sub-modules using the unlabeled video data, in a nonlinear and nonlocal manner via the Transformer-based Spatio-Temporal Aggregator.

Spatial-Temporal Pyramid Graph Reasoning for Action Recognition

TLDR
A generic Spatial-Temporal Pyramid Graph Network (STPG-Net) is proposed to adaptively capture long-range spatial-temporal relations in video sequences at multiple scales and can be flexibly integrated into 2D and 3D backbone networks in a plug-and-play manner.

Uncertainty-Driven Action Quality Assessment

TLDR
A novel Uncertainty-Driven AQA (UD-AQA) model is proposed to generate multiple predictions using only a single branch, and an uncertainty-guided training strategy is designed to dynamically adjust the learning order of the samples from low uncertainty to high uncertainty.

Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition

TLDR
A Temporal Patch Shift method for efficient 3D self-attention modeling in transformers for video-based action recognition that achieves performance competitive with the state of the art on Something-Something V1 & V2, Diving-48, and Kinetics-400 while being much more efficient in computation and memory cost.

Motion Gait: Gait Recognition via Motion Excitation

TLDR
The Motion Excitation Module (MEM) is proposed, which learns the difference information between frames and intervals so as to obtain a representation of temporal motion changes, and the Fine Feature Extractor (FFE) is presented, which independently learns spatio-temporal representations of the human body according to horizontal parts of individuals.

Bi-Calibration Networks for Weakly-Supervised Video Representation Learning

TLDR
A new design of mutual calibration between query and text is introduced to boost weakly-supervised video representation learning, and Bi-Calibration Networks (BCN) are presented that couple the two calibrations to learn the amendment from text to query and vice versa.

References

Showing 1-10 of 49 references

STM: SpatioTemporal and Motion Encoding for Action Recognition

TLDR
This work proposes an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features, and replaces the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network.

TSM: Temporal Shift Module for Efficient Video Understanding

TLDR
A generic and effective Temporal Shift Module (TSM) that can achieve the performance of 3D CNNs while maintaining 2D CNN complexity, and is extended to the online setting, enabling real-time, low-latency online video recognition and video object detection.
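
The shift operation summarized above is simple enough to show directly. Below is a minimal PyTorch sketch of the temporal-shift idea, assuming features laid out as (batch*frames, channels, H, W); this version allocates a new output tensor for clarity.

# Minimal sketch of the temporal-shift idea: a fraction of channels is shifted one
# frame backward in time, a fraction one frame forward, and the rest stay in place,
# so the following 2D convolution sees information from neighboring frames at
# essentially zero extra FLOPs.
import torch

def temporal_shift(x: torch.Tensor, n_segment: int, fold_div: int = 8) -> torch.Tensor:
    nt, c, h, w = x.shape
    n = nt // n_segment
    x = x.view(n, n_segment, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # these channels look one frame ahead
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # these channels look one frame back
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels are unchanged
    return out.view(nt, c, h, w)

if __name__ == "__main__":
    feats = torch.randn(2 * 8, 64, 14, 14)                 # 2 clips x 8 frames
    print(temporal_shift(feats, n_segment=8).shape)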

Squeeze-and-Excitation Networks

TLDR
This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.
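
As a concrete illustration of the squeeze-and-excitation mechanism summarized above, here is a minimal PyTorch sketch; the reduction ratio of 16 is a typical choice rather than a requirement.

# Minimal SE-style block: global average pooling squeezes each channel to a scalar,
# a small bottleneck MLP models channel interdependencies, and a sigmoid gate
# rescales the original feature map channel by channel.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                 # squeeze: (N, C)
        gate = self.fc(s).view(n, c, 1, 1)     # excitation: per-channel weights
        return x * gate                        # recalibrate channel responses

if __name__ == "__main__":
    print(SEBlock(64)(torch.randn(2, 64, 14, 14)).shape)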

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

TLDR
I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics, and a new Two-Stream Inflated 3D ConvNet that is based on 2D ConvNet inflation is introduced.
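
The "inflation" mentioned above bootstraps a 3D network from pretrained 2D weights by repeating each 2D kernel along a new temporal axis and rescaling it. A minimal sketch of that idea follows; the function name and defaults are illustrative, not the paper's code.

# Inflate a pretrained 2D convolution into a 3D one: the 2D kernel is repeated
# T times along the temporal axis and divided by T, following the paper's
# boring-video (repeated-frame) initialization argument.
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

if __name__ == "__main__":
    c2 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
    c3 = inflate_conv2d(c2)
    print(c3(torch.randn(1, 3, 8, 32, 32)).shape)   # (1, 64, 8, 32, 32)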

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Deep Residual Learning for Image Recognition

TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
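
A minimal sketch of the residual idea summarized above: the stacked convolutions learn a correction F(x), and the identity shortcut adds x back, so each block only has to learn a small refinement even in very deep networks. This is a generic basic block for equal input/output channels, not the exact published architecture.

# Basic residual block: two 3x3 conv/BN layers learn the residual, and the
# identity shortcut is added before the final ReLU.
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(x + out)              # identity shortcut + learned residual

if __name__ == "__main__":
    print(BasicResidualBlock(64)(torch.randn(2, 64, 14, 14)).shape)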

TEA: Temporal Excitation and Aggregation for Action Recognition

TLDR
This paper proposes a Temporal Excitation and Aggregation block, including a motion excitation module and a multiple temporal aggregation module, specifically designed to capture both short- and long-range temporal evolution, and achieves impressive results at low FLOPs on several action recognition benchmarks.
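
The motion-excitation idea summarized above can be sketched as follows: channel-reduced feature differences between adjacent frames are pooled spatially and turned into a per-channel gate that emphasizes motion-sensitive channels. This is a rough sketch of the general mechanism, not TEA's exact module; the layer shapes, reduction ratio, and n_segment argument are assumptions.

# Rough motion-excitation sketch (not TEA's exact module): gate channels by the
# pooled difference between adjacent-frame features.
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels: int, n_segment: int, reduction: int = 16):
        super().__init__()
        self.n_segment = n_segment
        r = channels // reduction
        self.reduce = nn.Conv2d(channels, r, kernel_size=1)
        self.transform = nn.Conv2d(r, r, kernel_size=3, padding=1, groups=r)
        self.expand = nn.Conv2d(r, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // self.n_segment
        f = self.reduce(x).view(n, self.n_segment, -1, h, w)
        # feature-level difference between frame t+1 (transformed) and frame t
        nxt = self.transform(f[:, 1:].reshape(-1, f.shape[2], h, w))
        diff = nxt.view(n, self.n_segment - 1, -1, h, w) - f[:, :-1]
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)  # pad last frame
        gate = diff.mean(dim=(3, 4), keepdim=True)                      # pool space
        gate = torch.sigmoid(self.expand(gate.view(nt, -1, 1, 1)))      # per-channel gate
        return x + x * gate                                             # excite motion channels

if __name__ == "__main__":
    feats = torch.randn(2 * 8, 64, 14, 14)                              # 2 clips x 8 frames
    print(MotionExcitation(64, n_segment=8)(feats).shape)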

TEINet: Towards an Efficient Architecture for Video Recognition

TLDR
The proposed TEINet achieves good recognition accuracy on these datasets while preserving high efficiency; it captures temporal structure flexibly and effectively and remains efficient at model inference.

Motion Feature Network: Fixed Motion Filter for Action Recognition

TLDR
This paper proposes MFNet (Motion Feature Network) containing motion blocks which make it possible to encode spatio-temporal information between adjacent frames in a unified network that can be trained end-to-end.

EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition

TLDR
A new benchmark dataset named EgoGesture is introduced with sufficient size, variation, and reality to be able to train deep neural networks and provides an in-depth analysis on input modality selection and domain adaptation between different scenes.