Corpus ID: 236171291

EAN: Event Adaptive Network for Enhanced Action Recognition

  • Yuan Tian, Yichao Yan, Xiongkuo Min, Guo Lu, Guangtao Zhai, Guodong Guo, Zhiyong Gao
  • Published 22 July 2021
  • Computer Science
  • ArXiv
Efficiently modeling spatial-temporal information in videos is crucial for action recognition. To achieve this goal, state-of-the-art methods typically employ the convolution operator and dense interaction modules such as non-local blocks. However, these methods cannot accurately fit the diverse events in videos. On the one hand, the adopted convolutions have fixed scales and thus struggle with events of various scales. On the other hand, the dense interaction modeling paradigm only…


Temporal Segment Networks for Action Recognition in Videos
The proposed temporal segment network (TSN) framework models long-range temporal structure with a new segment-based sampling and aggregation scheme; it won the video classification track of the ActivityNet Challenge 2016 among 24 teams.
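The TSN scheme above can be sketched in a few lines: split the video into equal segments, sample one snippet index per segment, and average the per-snippet class scores into a video-level prediction. This is a minimal NumPy illustration of the sampling idea, not the paper's implementation; the one-hot "logits" are stand-in values.

```python
import numpy as np

def tsn_segment_sample(num_frames, num_segments, rng=None):
    """Segment-based sparse sampling: split the video into equal segments
    and pick one random snippet index from each (as done during training)."""
    rng = rng or np.random.default_rng(0)
    seg_len = num_frames / num_segments
    starts = (np.arange(num_segments) * seg_len).astype(int)
    offsets = rng.integers(0, max(int(seg_len), 1), size=num_segments)
    return np.minimum(starts + offsets, num_frames - 1)

def tsn_aggregate(snippet_logits):
    """Segmental consensus: average the per-snippet class scores."""
    return np.mean(snippet_logits, axis=0)

idx = tsn_segment_sample(num_frames=120, num_segments=3)
logits = np.stack([np.eye(4)[i % 4] for i in idx])  # stand-in per-snippet scores
video_score = tsn_aggregate(logits)
```

Sparse sampling is what makes TSN cheap: only a handful of snippets are forwarded through the backbone regardless of video length.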
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, their advantage over traditional methods is not so evident.
ECO: Efficient Convolutional Network for Online Video Understanding
A network architecture that takes long-term content into account while enabling fast per-video processing, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.
Self-supervised Motion Representation via Scattering Local Motion Cues
This paper proposes a novel context-guided motion upsampling layer that leverages the spatial context of video objects to learn the upsampling parameters efficiently, and demonstrates the effectiveness of the proposed motion representation on downstream video understanding tasks, e.g., action recognition.
Grouped Spatial-Temporal Aggregation for Efficient Action Recognition
  • Chenxu Luo, A. Yuille
  • Computer Science
  • 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
A novel decomposition method, grouped spatial-temporal aggregation (GST), decomposes the feature channels into parallel spatial and temporal groups, letting the two groups focus on static and dynamic cues separately.
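The channel decomposition behind GST can be sketched as follows. This is only an illustration of the split-and-concatenate structure under assumed NCTHW layout; in the actual block the spatial group would go through 2D convolutions and the temporal group through 3D convolutions, which are replaced by identity placeholders here.

```python
import numpy as np

def gst_split(features, spatial_ratio=0.5):
    """Split channels into a spatial group (2D path) and a temporal group
    (3D path), process the two in parallel, then concatenate them back."""
    c = features.shape[1]
    c_s = int(c * spatial_ratio)
    spatial_group = features[:, :c_s]    # would feed a 2D (per-frame) conv
    temporal_group = features[:, c_s:]   # would feed a 3D (spatio-temporal) conv
    # Placeholder paths: identity stands in for the two convolution branches.
    return np.concatenate([spatial_group, temporal_group], axis=1)

x = np.random.default_rng(0).standard_normal((2, 8, 4, 7, 7))  # N, C, T, H, W
y = gst_split(x)
```

Because each branch sees only a fraction of the channels, the block is cheaper than applying a full 3D convolution to every channel.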
Appearance-and-Relation Networks for Video Classification
  • Limin Wang, Wei Li, Wen Li, L. Gool
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
This paper presents a new architecture, termed Appearance-and-Relation Network (ARTNet), that learns video representations end-to-end. It is constructed by stacking multiple generic building blocks, called SMART, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner.
PAN: Towards Fast Action Recognition via Learning Persistence of Appearance
This paper designs a novel motion cue called Persistence of Appearance (PA), which is over 1000x faster than conventional optical flow in terms of motion modeling speed, and devises a global temporal fusion strategy called Various-timescale Aggregation Pooling (VAP) that adaptively models long-range temporal relationships across various timescales.
PAN: Persistent Appearance Network with an Efficient Motion Cue for Fast Action Recognition
A novel motion cue called Persistence of Appearance (PA) enables the network to distill motion information directly from adjacent RGB frames, achieving state-of-the-art results on three benchmark datasets: UCF101, HMDB51 and Kinetics.
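The appeal of a PA-style cue is that it comes straight from adjacent RGB frames, with no iterative matching as in optical flow. The sketch below is a hypothetical simplification (per-pixel magnitude of frame differences), not PAN's actual PA formulation, which learns its motion representation inside the network.

```python
import numpy as np

def frame_difference_cue(frames):
    """Hypothetical PA-style motion cue: per-pixel magnitude of the
    difference between adjacent RGB frames, aggregated over channels.
    Cheap compared to optical flow: one subtraction and one norm per pixel."""
    diffs = np.diff(frames.astype(np.float32), axis=0)  # (T-1, H, W, C)
    return np.linalg.norm(diffs, axis=-1)               # (T-1, H, W)

clip = np.zeros((4, 8, 8, 3), dtype=np.uint8)
clip[2, 2:5, 2:5] = 255  # a bright patch appears in frame 2
pa = frame_difference_cue(clip)
```

In this toy clip the cue is zero between identical frames and lights up exactly where the patch appears and disappears.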
TEA: Temporal Excitation and Aggregation for Action Recognition
This paper proposes a Temporal Excitation and Aggregation (TEA) block, comprising a motion excitation module and a multiple temporal aggregation module, specifically designed to capture short- and long-range temporal evolution respectively; it achieves impressive results at low FLOPs on several action recognition benchmarks.
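The motion-excitation idea can be sketched as a gating mechanism: temporal differences of the features produce a sigmoid attention map that re-weights the original features. This is a simplified NumPy illustration of the pattern under assumed NCTHW layout, not TEA's exact block (which includes channel reduction and convolutions around the difference).

```python
import numpy as np

def motion_excitation(features):
    """Sketch of a motion-excitation-style gate: channel-wise differences
    between adjacent time steps are spatially pooled into a sigmoid
    attention map that re-weights the input features residually."""
    diff = np.zeros_like(features)
    diff[:, :, 1:] = features[:, :, 1:] - features[:, :, :-1]  # temporal difference
    pooled = diff.mean(axis=(3, 4), keepdims=True)             # spatial pooling
    gate = 1.0 / (1.0 + np.exp(-pooled))                       # sigmoid attention
    return features + features * (gate - 0.5)                  # residual excitation

x = np.random.default_rng(1).standard_normal((1, 4, 3, 5, 5))  # N, C, T, H, W
y = motion_excitation(x)
```

Channels with little temporal change get a near-neutral gate (0.5, i.e. passed through unchanged by the residual form), while motion-sensitive channels are amplified or suppressed.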
Attend and Interact: Higher-Order Object Interactions for Video Understanding
It is demonstrated that modeling object interactions significantly improves accuracy for both action recognition and video captioning, while saving more than 3 times the computation of traditional pairwise relationships.