TEA: Temporal Excitation and Aggregation for Action Recognition

@article{Li2020TEATE,
  title={TEA: Temporal Excitation and Aggregation for Action Recognition},
  author={Y. Li and Bin Ji and Xintian Shi and Jianguo Zhang and Bin Kang and Limin Wang},
  journal={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
  pages={906-915}
}
Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short- and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences…
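To make the short-range motion modeling concrete, below is a minimal PyTorch-style sketch of a motion-excitation module driven by feature-level temporal differences, as the abstract describes. The reduction ratio, the depthwise transform on the next frame, the zero-padded last time step, and the residual excitation form are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Sketch: feature-level temporal differences drive a channel attention
    map that excites motion-sensitive channels (assumed configuration)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = channels // reduction
        self.squeeze = nn.Conv2d(channels, mid, kernel_size=1)      # reduce channels
        self.transform = nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=mid)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)       # restore channels
        self.pool = nn.AdaptiveAvgPool2d(1)                         # spatial pooling

    def forward(self, x):                       # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        feat = self.squeeze(x.reshape(n * t, c, h, w)).reshape(n, t, -1, h, w)
        # difference between a transformed next frame and the current frame
        nxt = self.transform(feat[:, 1:].reshape(n * (t - 1), -1, h, w))
        diff = nxt.reshape(n, t - 1, -1, h, w) - feat[:, :-1]
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)  # pad last step
        attn = torch.sigmoid(self.expand(self.pool(diff.reshape(n * t, -1, h, w))))
        return x + x * attn.reshape(n, t, c, 1, 1)  # residual channel excitation

Applied to an input of shape (N, T, C, H, W), e.g. MotionExcitation(64)(torch.randn(2, 8, 64, 14, 14)), the module returns a tensor of the same shape with motion-sensitive channels amplified.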
TSI: Temporal Saliency Integration for Video Action Recognition
TLDR
This paper proposes a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module, the latter designed to perform multi-scale temporal modeling through a group of separate 1D convolutions.
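The phrase "a group of separate 1D convolutions" suggests a simple sketch: split the channels into groups and give each group a temporal convolution with its own receptive field. The group count and dilation values below are illustrative assumptions in the spirit of the CTI module, not its published design.

import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Sketch: per-group 1D temporal convolutions with different dilations
    yield multi-scale temporal modeling at modest cost (assumed design)."""
    def __init__(self, channels, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert channels % len(dilations) == 0
        g = channels // len(dilations)
        self.branches = nn.ModuleList(
            nn.Conv1d(g, g, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x):                       # x: (N, C, T)
        chunks = x.chunk(len(self.branches), dim=1)
        return torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)

Each branch sees a different temporal scale (dilation 1 covers adjacent frames, dilation 4 a nine-frame span), and concatenation keeps the channel count unchanged.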
NUTA: Non-uniform Temporal Aggregation for Action Recognition
TLDR
This work proposes a method called non-uniform temporal aggregation (NUTA), which aggregates features only from informative temporal segments, and introduces a synchronization method that temporally aligns NUTA features with traditional uniformly sampled video features, so that both local and clip-level features can be combined.
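The core idea of aggregating only informative segments can be pictured with a soft attention pool: a learned scorer rates each temporal segment, and the clip descriptor is the score-weighted sum rather than a uniform average. The linear scorer and the soft weighting are illustrative assumptions; NUTA's actual selection and synchronization mechanisms are more involved.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonUniformAggregation(nn.Module):
    """Sketch: score-weighted temporal pooling that emphasizes informative
    segments instead of averaging uniformly (assumed scorer)."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)         # informativeness per segment

    def forward(self, x):                       # x: (N, T, D) segment features
        w = F.softmax(self.scorer(x), dim=1)    # weights over time, (N, T, 1)
        return (w * x).sum(dim=1)               # (N, D) clip-level feature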
Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition
TLDR
A rich and robust motion representation based on spatio-temporal self-similarity (STSS) is proposed, which effectively captures long-term interaction and fast motion in video, leading to robust action recognition.
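Self-similarity as a motion representation can be sketched by correlating each feature vector with a spatial neighborhood in the next frame. Restricting the comparison to a single temporal offset and a small window, as below, is a simplifying assumption; STSS uses a full space-time neighborhood.

import torch
import torch.nn.functional as F

def st_self_similarity(x, window=3):
    """Sketch: cosine similarities between frame-t features and the spatial
    neighborhood of frame t+1; returns (N, T-1, window*window, H, W)."""
    n, t, c, h, w = x.shape                     # x: (N, T, C, H, W)
    q = F.normalize(x[:, :-1], dim=2)           # unit-norm queries at frame t
    k = F.normalize(x[:, 1:], dim=2)            # unit-norm keys at frame t+1
    k = k.reshape(n * (t - 1), c, h, w)
    # gather each key's spatial neighborhood: (N*(T-1), C*window^2, H*W)
    neigh = F.unfold(k, kernel_size=window, padding=window // 2)
    neigh = neigh.reshape(n, t - 1, c, window * window, h, w)
    return (q.unsqueeze(3) * neigh).sum(dim=2)  # dot product over channels

The resulting similarity volume encodes how features move rather than what they look like, which is why self-similarity serves as a generalized motion descriptor that is largely insensitive to appearance.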
EAN: Event Adaptive Network for Enhanced Action Recognition
  • Yuan Tian, Yichao Yan, +4 authors Zhiyong Gao
  • Computer Science
  • ArXiv
  • 2021
TLDR
A unified action recognition framework that investigates the dynamic nature of video content through designs that are adaptive to the input video content, together with a novel and efficient Latent Motion Code module that further improves the framework's performance.
Learning Comprehensive Motion Representation for Action Recognition
TLDR
A Comprehensive Motion Representation (CMR) learning method for action recognition, which achieves competitive performance on Something-Something V1 & V2 and Kinetics-400, and outperforms the current state of the art on the temporal reasoning datasets Something-Something V1 and V2.
Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition
TLDR
A multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition, achieves remarkable performance on three challenging benchmark datasets.
AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition
TLDR
An adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling and achieves about 40% computation savings with accuracy comparable to state-of-the-art methods.
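A soft version of this channel-level fusion is straightforward to sketch: a lightweight gate inspects pooled descriptors of the current and previous frames and mixes the two feature maps per channel. The real AdaFuse makes hard keep/reuse/skip decisions (trained with Gumbel-Softmax) to realize the computation savings; the differentiable soft gate below is a simplifying assumption.

import torch
import torch.nn as nn

class AdaptiveTemporalFusion(nn.Module):
    """Sketch: per-channel gating between the current and previous frames'
    feature maps (soft stand-in for AdaFuse's hard decisions)."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid(),
        )

    def forward(self, cur, prev):               # both (N, C, H, W)
        desc = torch.cat([cur.mean(dim=(2, 3)), prev.mean(dim=(2, 3))], dim=1)
        g = self.gate(desc).unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1)
        return g * cur + (1 - g) * prev         # per-channel temporal fusion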
Adaptive Recursive Circle Framework for Fine-grained Action Recognition
  • Hanxi Lin, Xinxiao Wu, Jiebo Luo
  • Computer Science
  • ArXiv
  • 2021
TLDR
An Adaptive Recursive Circle (ARC) framework is proposed: a fine-grained decorator for pure feedforward layers that can facilitate fine-grained action recognition by introducing deeply refined features and multi-scale receptive fields at a low cost.
Video-ception Network: Towards Multi-Scale Efficient Asymmetric Spatial-Temporal Interactions
TLDR
A novel video representation method is proposed that fuses features spatially and temporally in an asymmetric way to model atomic actions spanning multiple spatial-temporal scales, and it is verified on several recent large-scale video datasets requiring strong temporal reasoning or appearance discrimination.
Shifted Chunk Transformer for Spatio-Temporal Representational Learning
Spatio-temporal representational learning has been widely adopted in various fields such as action recognition, video object segmentation, and action anticipation. Previous spatio-temporal…

References

Showing 1-10 of 56 references
Trajectory Convolution for Action Recognition
TLDR
This work proposes a new CNN architecture, TrajectoryNet, which incorporates trajectory convolution, a new operation for integrating features along the temporal dimension, to replace the existing temporal convolution.
StNet: Local and Global Spatial-Temporal Modeling for Action Recognition
TLDR
A novel spatial-temporal network (StNet) architecture is explored for both local and global spatial-temporal modeling in videos; it outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity.
Long-Term Temporal Convolutions for Action Recognition
TLDR
It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the work also studies the impact of different low-level representations, such as raw video pixel values and optical flow vector fields, and the importance of high-quality optical flow estimation for learning accurate action models.
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
TLDR
A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video and outperforms other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.
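The aggregation step can be sketched as a VLAD layer over all local spatio-temporal features: each feature is softly assigned to K learned anchors and the residuals are accumulated into one fixed-size clip descriptor. The anchor count and the two-stage normalization follow the common NetVLAD recipe and are assumptions here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalVLAD(nn.Module):
    """Sketch: NetVLAD-style aggregation of local features from the whole
    spatio-temporal extent of a video (assumed hyperparameters)."""
    def __init__(self, dim, num_anchors=32):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim) * 0.1)
        self.assign = nn.Linear(dim, num_anchors)

    def forward(self, x):                       # x: (N, L, D), L = T*H*W locations
        a = F.softmax(self.assign(x), dim=-1)   # soft assignments, (N, L, K)
        resid = x.unsqueeze(2) - self.anchors   # residuals to anchors, (N, L, K, D)
        vlad = (a.unsqueeze(-1) * resid).sum(dim=1)       # (N, K, D)
        vlad = F.normalize(vlad, dim=-1)        # intra-normalization per anchor
        return F.normalize(vlad.flatten(1), dim=-1)       # (N, K*D) descriptor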
STM: SpatioTemporal and Motion Encoding for Action Recognition
TLDR
This work proposes an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features, and replaces the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network.
Describing Videos by Exploiting Temporal Structure
TLDR
This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, including a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.
Recognize Actions by Disentangling Components of Dynamics
TLDR
A new ConvNet architecture for video representation learning is proposed, which can derive disentangled components of dynamics purely from raw video frames, without the need of optical flow estimation.
Temporal Bilinear Networks for Video Action Recognition
TLDR
This paper proposes a novel Temporal Bilinear (TB) model to capture the temporal pairwise feature interactions between adjacent frames and considers explicit quadratic bilinear transformations in the temporal domain for motion evolution and sequential relation modeling.
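A full bilinear form x_t^T W x_{t+1} between adjacent frames is quadratic in the channel count, so a sketch of temporal bilinear interaction typically uses a low-rank factorization (U x_t) * (V x_{t+1}). The rank and the factorized form below are assumptions, not the paper's exact model.

import torch
import torch.nn as nn

class TemporalBilinear(nn.Module):
    """Sketch: low-rank bilinear interaction between adjacent frame
    descriptors (assumed factorization and rank)."""
    def __init__(self, channels, rank=64):
        super().__init__()
        self.u = nn.Linear(channels, rank, bias=False)
        self.v = nn.Linear(channels, rank, bias=False)
        self.out = nn.Linear(rank, channels)

    def forward(self, x):                       # x: (N, T, C) frame descriptors
        pair = self.u(x[:, :-1]) * self.v(x[:, 1:])   # (N, T-1, rank)
        return self.out(pair)                   # pairwise temporal interactions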
Spatiotemporal Multiplier Networks for Video Action Recognition
TLDR
A general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features that combines the appearance and motion pathways of a two-stream architecture by motion gating and is trained end-to-end.
A Closer Look at Spatiotemporal Convolutions for Action Recognition
TLDR
A new spatiotemporal convolutional block "R(2+1)D" is designed which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
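The R(2+1)D factorization itself is compact enough to sketch: a t x k x k 3D convolution is decomposed into a 1 x k x k spatial convolution followed by a t x 1 x 1 temporal one, with a nonlinearity in between. Setting the intermediate width to out_channels below is a simplification; the paper chooses it so the parameter count matches the full 3D kernel.

import torch.nn as nn

class R2Plus1DConv(nn.Module):
    """Sketch: (2+1)D factorized spatiotemporal convolution, spatial then
    temporal, with an extra nonlinearity in between (simplified width)."""
    def __init__(self, in_channels, out_channels, k=3, t=3):
        super().__init__()
        mid = out_channels                      # simplified intermediate width
        self.spatial = nn.Conv3d(in_channels, mid, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid, out_channels, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)

    def forward(self, x):                       # x: (N, C, T, H, W)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

The extra ReLU between the two convolutions doubles the nonlinearities per block relative to a plain 3D convolution, which the paper credits for part of the accuracy gain.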