Corpus ID: 239998721

Temporal-attentive Covariance Pooling Networks for Video Recognition

@article{Gao2021TemporalattentiveCP,
  title={Temporal-attentive Covariance Pooling Networks for Video Recognition},
  author={Zilin Gao and Qilong Wang and Bingbing Zhang and Qinghua Hu and Peihua Li},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.14381}
}
For the video recognition task, a global representation summarizing the whole content of a video snippet plays an important role in the final performance. However, existing video architectures usually generate it with simple global average pooling (GAP), which has limited ability to capture the complex dynamics of videos. For the image recognition task, there is evidence that covariance pooling has stronger representation ability than GAP. Unfortunately, such plain covariance… 
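The contrast between GAP and covariance pooling is easy to see in code. Below is a minimal sketch, assuming a generic PyTorch feature tensor from a video backbone; the shapes and the plain (unnormalized) covariance are illustrative choices, not the paper's exact pipeline.

```python
# Minimal sketch contrasting global average pooling (GAP) with covariance
# pooling over video features. Shapes are illustrative assumptions.
import torch

# Hypothetical feature map from a video backbone:
# (batch, channels, time, height, width)
x = torch.randn(2, 256, 8, 7, 7)
B, C, T, H, W = x.shape

# GAP: collapse all spatio-temporal positions into one C-dim mean vector.
gap = x.mean(dim=(2, 3, 4))                      # (B, C)

# Covariance pooling: treat the T*H*W positions as samples of a C-dim
# variable and summarize them with a C x C covariance matrix, which keeps
# second-order channel interactions that GAP discards.
feats = x.reshape(B, C, -1)                      # (B, C, N), N = T*H*W
mean = feats.mean(dim=2, keepdim=True)           # (B, C, 1)
centered = feats - mean
cov = centered @ centered.transpose(1, 2) / (feats.shape[2] - 1)  # (B, C, C)

print(gap.shape, cov.shape)  # torch.Size([2, 256]) torch.Size([2, 256, 256])
```

The covariance matrix retains pairwise channel interactions, which is exactly the second-order information a single mean vector cannot express.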


References

Showing 1–10 of 75 references
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
TLDR: A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video and outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
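As a rough illustration of the aggregation idea in this entry, the sketch below softly assigns local descriptors to K learned "action words" and accumulates residuals, NetVLAD-style; the cluster count, shapes, and function name are our assumptions, not the paper's implementation.

```python
# VLAD-style aggregation over spatio-temporal features: each local
# descriptor is softly assigned to K centers and residuals are summed.
import torch
import torch.nn.functional as F

def vlad_aggregate(feats, centers):
    """feats: (N, C) local descriptors from all frames; centers: (K, C)."""
    assign = torch.softmax(feats @ centers.t(), dim=1)    # (N, K) soft assignment
    resid = feats.unsqueeze(1) - centers.unsqueeze(0)     # (N, K, C) residuals
    vlad = (assign.unsqueeze(-1) * resid).sum(dim=0)      # (K, C) accumulated
    return F.normalize(vlad.flatten(), dim=0)             # L2-normalized descriptor

# 8 frames x 49 spatial positions of 512-dim features, 64 "action words".
desc = vlad_aggregate(torch.randn(8 * 49, 512), torch.randn(64, 512))
print(desc.shape)  # torch.Size([32768])
```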
Deep Temporal Linear Encoding Networks
TLDR: A new video representation, called temporal linear encoding (TLE), embedded inside CNNs as a new layer, which captures appearance and motion throughout entire videos and outperforms current state-of-the-art methods on both datasets.
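A hedged sketch of the encoding idea: features from K sampled segments are aggregated element-wise, then expanded with a bilinear (outer-product) encoding. The shapes and the choice of product aggregation are illustrative assumptions drawn from the one-sentence summary above.

```python
# Temporal linear encoding sketch: element-wise aggregation across segments,
# then a bilinear encoding of the aggregate. Shapes are assumptions.
import torch

segs = torch.randn(3, 256, 7, 7)          # K=3 segment feature maps (C, H, W)
agg = segs.prod(dim=0)                    # element-wise temporal aggregation
x = agg.reshape(256, -1)                  # (C, HW)
tle = (x @ x.t()).flatten() / x.shape[1]  # bilinear (outer-product) encoding
print(tle.shape)  # torch.Size([65536])
```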
Video Modeling With Correlation Networks
TLDR: This paper proposes an alternative approach based on a learnable correlation operator that can be used to establish frame-to-frame matches over convolutional feature maps in the different layers of the network.
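To make the matching idea concrete, here is a minimal sketch in which every spatial position of frame t is compared with every position of frame t+1 by a channel-wise dot product. This global, non-learnable correlation is our simplification of the paper's learnable, local operator.

```python
# Frame-to-frame correlation over convolutional feature maps.
import torch

def frame_correlation(f_t, f_t1):
    """f_t, f_t1: (batch, channels, height, width) feature maps of two frames.
    Returns (batch, H*W, H*W) similarities between all position pairs."""
    B, C, H, W = f_t.shape
    a = f_t.reshape(B, C, H * W)                  # (B, C, HW)
    b = f_t1.reshape(B, C, H * W)                 # (B, C, HW)
    # Dot product over channels for every pair of positions, scaled by C.
    return torch.einsum('bci,bcj->bij', a, b) / C

corr = frame_correlation(torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14))
print(corr.shape)  # torch.Size([2, 196, 196])
```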
Purely Attention Based Local Feature Integration for Video Classification
TLDR: This paper investigates the potential of purely attention-based local feature integration over the channel dimension and proposes the channel pyramid attention schema, which splits features into sub-features at multiple scales for coarse-to-fine sub-feature interaction modeling.
Convolutional Two-Stream Network Fusion for Video Action Recognition
TLDR: A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed and evaluated on standard benchmarks, where it achieves state-of-the-art results.
Temporal Segment Networks for Action Recognition in Videos
TLDR: The proposed framework, called temporal segment network (TSN), aims to model long-range temporal structure with a new segment-based sampling and aggregation scheme, and won the video classification track at the ActivityNet challenge 2016 among 24 teams.
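The segment-based sampling scheme is simple to sketch: split the video into K equal segments and draw one frame index from each, so a sparse sample still covers the full duration. The function below is our illustration, not the official implementation.

```python
# TSN-style segment-based frame sampling.
import random

def segment_sample(num_frames: int, num_segments: int, train: bool = True):
    """Return one frame index per segment for a video of num_frames frames."""
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start, end = int(k * seg_len), int((k + 1) * seg_len)
        end = max(end, start + 1)  # guard against empty segments
        # Random offset at training time, segment centre at test time.
        idx = random.randrange(start, end) if train else (start + end) // 2
        indices.append(min(idx, num_frames - 1))
    return indices

print(segment_sample(120, 8, train=False))  # [7, 22, 37, 52, 67, 82, 97, 112]
```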
ECO: Efficient Convolutional Network for Online Video Understanding
TLDR: A network architecture that takes long-term content into account while enabling fast per-video processing, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.
Two-Stream Convolutional Networks for Action Recognition in Videos
TLDR: This work proposes a two-stream ConvNet architecture that incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow achieves very good performance despite limited training data.
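A minimal sketch of the two-stream layout: one ConvNet sees a single RGB frame, the other a stack of optical-flow fields, and their class scores are fused by averaging. The tiny backbone below is a stand-in assumption, not the original architecture.

```python
# Two-stream action recognition: RGB stream + stacked-optical-flow stream,
# fused by averaging per-stream softmax scores (late fusion).
import torch
import torch.nn as nn

def tiny_convnet(in_channels: int, num_classes: int) -> nn.Module:
    # Stand-in backbone; the original used much deeper ConvNets.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, num_classes),
    )

num_classes = 101
spatial = tiny_convnet(3, num_classes)         # one RGB frame
temporal = tiny_convnet(2 * 10, num_classes)   # 10 flow fields, x/y channels

rgb = torch.randn(4, 3, 224, 224)
flow = torch.randn(4, 20, 224, 224)

scores = (spatial(rgb).softmax(-1) + temporal(flow).softmax(-1)) / 2
print(scores.shape)  # torch.Size([4, 101])
```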
Appearance-and-Relation Networks for Video Classification
  • Limin Wang, Wei Li, Wen Li, Luc Van Gool
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018
TLDR: This paper presents a new architecture, termed Appearance-and-Relation Network (ARTNet), to learn video representations in an end-to-end manner; ARTNet is constructed by stacking multiple generic building blocks, called SMART, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner.
Gate-Shift Networks for Video Action Recognition
TLDR: An extensive evaluation of the proposed Gate-Shift Module studies its effectiveness in video action recognition, achieving state-of-the-art results on the Something-Something-V1 and Diving48 datasets and competitive results on EPIC-Kitchens with far less model complexity.