Corpus ID: 235294205

TSI: Temporal Saliency Integration for Video Action Recognition

Authors: Haisheng Su, Jinyuan Feng, Dongliang Wang, Weihao Gan, Wei Wu, Y. Qiao
Efficient spatiotemporal modeling is an important yet challenging problem for video action recognition. Existing state-of-the-art methods exploit motion cues to assist short-term temporal modeling through temporal differences over consecutive frames. However, irrelevant noise is inevitably introduced by camera movement, and the motion magnitudes of different actions can vary greatly. In this paper, we propose a Temporal Saliency Integration (TSI) block, which mainly contains a…
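The temporal difference the abstract refers to is the generic frame-wise subtraction used as a cheap motion cue; the sketch below illustrates that baseline operation (not the TSI block itself, whose details are elided in this excerpt), with NumPy arrays standing in for video tensors:

```python
import numpy as np

def temporal_difference(frames: np.ndarray) -> np.ndarray:
    """Approximate motion cues as differences over consecutive frames.

    frames: array of shape (T, H, W, C) holding T video frames.
    Returns an array of shape (T - 1, H, W, C): frame t+1 minus frame t.
    """
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]

# Toy example: 4 frames of a 2x2 single-channel video.
video = np.arange(16, dtype=np.float32).reshape(4, 2, 2, 1)
diff = temporal_difference(video)
assert diff.shape == (3, 2, 2, 1)
# Note: global camera motion shifts every pixel, so the difference is
# nonzero everywhere -- the noise issue the abstract points out.
```

Because the subtraction responds to any pixel change, camera shake produces the same kind of signal as genuine object motion, which motivates the saliency-based weighting the paper proposes.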

Related Papers

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition
This paper explores StNet, a novel spatial-temporal network architecture for both local and global spatial-temporal modeling in videos; it outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity.
TEA: Temporal Excitation and Aggregation for Action Recognition
This paper proposes a Temporal Excitation and Aggregation block, including a motion excitation module and a multiple temporal aggregation module, specifically designed to capture both short- and long-range temporal evolution, and achieves impressive results at low FLOPs on several action recognition benchmarks.
Temporal Distinct Representation Learning for Action Recognition
This paper designs a sequential channel filtering mechanism, the Progressive Enhancement Module (PEM), which excites the discriminative channels of features from different frames step by step to avoid repeated information extraction, achieving improvements of 2.4% and 1.3% over the best competitor on two benchmarks.
STM: SpatioTemporal and Motion Encoding for Action Recognition
This work proposes an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features; the original residual blocks in the ResNet architecture are replaced with STM blocks to form a simple yet effective STM network.
TEINet: Towards an Efficient Architecture for Video Recognition
The proposed TEINet achieves good recognition accuracy on the evaluated datasets while preserving high efficiency; it captures temporal structure flexibly and effectively yet remains efficient at model inference.
ECO: Efficient Convolutional Network for Online Video Understanding
A network architecture that takes long-term content into account while enabling fast per-video processing, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.
Transferable Knowledge-Based Multi-Granularity Fusion Network for Weakly Supervised Temporal Action Detection
A novel framework for temporal action detection under weak supervision: convolutional kernels with varied dilation rates enlarge the receptive fields, and a cascaded module with the proposed Online Adversarial Erasing mechanism mines more relevant regions of target actions by feeding the erased feature maps of discovered regions back into the system.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Gate-Shift Networks for Video Action Recognition
An extensive evaluation of the proposed Gate-Shift Module is performed to study its effectiveness in video action recognition, achieving state-of-the-art results on Something Something-V1 and Diving48 datasets, and obtaining competitive results on EPIC-Kitchens with far less model complexity.
Recognize Actions by Disentangling Components of Dynamics
A new ConvNet architecture for video representation learning is proposed, which can derive disentangled components of dynamics purely from raw video frames, without the need of optical flow estimation.