GCF-Net: Gated Clip Fusion Network for Video Action Recognition

@inproceedings{Hsiao2021GCFNetGC,
  title={GCF-Net: Gated Clip Fusion Network for Video Action Recognition},
  author={Jenhao Hsiao and Jiawei Chen and Chiu Man Ho},
  booktitle={ECCV Workshops},
  year={2021}
}
In recent years, most of the accuracy gains for video action recognition have come from newly designed CNN architectures (e.g., 3D-CNNs). These models are trained by applying a deep CNN to a single clip of fixed temporal length. Since each video segment is processed by the 3D-CNN module separately, the corresponding clip descriptor is local and the inter-clip relationships are inherently implicit. The common method of directly averaging the clip-level outputs as a video-level prediction is…
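
The clip-averaging baseline the abstract critiques can be sketched as follows: split the video into fixed-length clips, score each clip independently with a 3D-CNN, and take the plain mean of the clip-level outputs. This is only an illustrative sketch; `backbone` and the tensor shapes are assumptions, not the paper's actual code.

```python
# Illustrative sketch (PyTorch) of the clip-averaging baseline: each clip is
# scored independently by a 3D-CNN backbone and the video-level prediction is
# the plain mean of the clip-level outputs.
import torch
import torch.nn as nn

def video_prediction_by_averaging(clips: torch.Tensor, backbone: nn.Module) -> torch.Tensor:
    """clips: (num_clips, C, T, H, W) -> video-level class scores (num_classes,)."""
    clip_logits = torch.stack([backbone(clip.unsqueeze(0)).squeeze(0) for clip in clips])
    # Each clip descriptor stays local; inter-clip relationships remain implicit,
    # which is the limitation a gated clip fusion is meant to address.
    return clip_logits.mean(dim=0)
```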

Language-guided Multi-Modal Fusion for Video Action Recognition

This paper uses language-guided contrastive learning to substantially augment the video data, supporting the training of a multi-modal network, and successfully raises the accuracy of video action recognition on a large-scale benchmark video dataset.

TBAC: Transformers Based Attention Consensus for Human Activity Recognition

Using the TBAC module in place of classical consensus can improve the performance of CNN-based action recognition models such as the Channel Separated Convolutional Network (CSN), the Temporal Shift Module (TSM), and the Temporal Segment Network (TSN).

VideoCLIP: A Cross-Attention Model for Fast Video-Text Retrieval Task with Image CLIP

The proposed VideoCLIP is evaluated on two benchmark video-text datasets, MSRVTT and DiDeMo, and the results show that the model outperforms existing state-of-the-art methods while retrieving much faster than traditional query-agnostic search models.

References

Showing 10 of 29 references.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
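
The two-stream design summarized above combines a spatial network applied to RGB frames with a temporal network applied to stacked optical flow. A minimal sketch of late score fusion is given below, assuming simple averaging of the two streams' softmax scores; `spatial_net` and `temporal_net` are assumed modules, not the paper's released models.

```python
# Illustrative sketch of late fusion in a two-stream setup: a spatial net scores
# RGB frames, a temporal net scores stacked optical flow, and the class scores
# are averaged.
import torch
import torch.nn.functional as F

def two_stream_fusion(rgb, flow, spatial_net, temporal_net):
    """rgb: (B, 3, H, W); flow: (B, 2*L, H, W) stacked flow fields -> (B, num_classes)."""
    spatial_scores = F.softmax(spatial_net(rgb), dim=1)
    temporal_scores = F.softmax(temporal_net(flow), dim=1)
    return (spatial_scores + temporal_scores) / 2  # simple average of the two streams
```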

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

A new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.

Long-Term Temporal Convolutions for Action Recognition

It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the work also studies the impact of different low-level representations, such as raw pixel values and optical flow vector fields, and the importance of high-quality optical flow estimation for learning accurate action models.

SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition

It is demonstrated that the computational cost of action recognition on untrimmed videos can be dramatically reduced by invoking recognition only on the most salient clips, which also yields significant gains in recognition accuracy compared to analyzing all clips or randomly selected clips.
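
The salient-clip idea summarized above can be sketched as: score all clips with a cheap saliency model, then run the expensive recognizer only on the top-k clips. The module names below are hypothetical stand-ins, not SCSampler's actual components.

```python
# Illustrative sketch of salient-clip selection: a cheap saliency scorer ranks
# clips, the expensive action recognizer runs only on the top-k clips, and
# their predictions are averaged.
import torch

def recognize_with_salient_clips(clips, saliency_scorer, recognizer, k=10):
    """clips: (num_clips, C, T, H, W) -> video-level class scores."""
    with torch.no_grad():
        saliency = torch.stack([saliency_scorer(c.unsqueeze(0)).squeeze() for c in clips])
    top_idx = saliency.topk(min(k, len(clips))).indices
    logits = torch.stack([recognizer(clips[i].unsqueeze(0)).squeeze(0) for i in top_idx])
    return logits.mean(dim=0)
```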

A Closer Look at Spatiotemporal Convolutions for Action Recognition

A new spatiotemporal convolutional block "R(2+1)D" is designed which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

This paper devises multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters in the spatial domain (equivalent to a 2D CNN) plus 3 × 1 × 1 convolutions that construct temporal connections across adjacent feature maps in time.
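
The factorization described above can be sketched with two `nn.Conv3d` layers: a 1×3×3 spatial convolution followed by a 3×1×1 temporal convolution in place of a full 3×3×3 convolution. This is only a sketch of one serial arrangement; the paper devises several bottleneck variants that are not reproduced here.

```python
# Sketch of the described factorization: a full 3×3×3 convolution is replaced
# by a 1×3×3 spatial convolution (acting like a 2D CNN per frame) followed by a
# 3×1×1 temporal convolution (connecting adjacent feature maps in time).
import torch
import torch.nn as nn

class FactorizedSpatioTemporalConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.relu(self.temporal(self.relu(self.spatial(x))))
```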

Watching a Small Portion could be as Good as Watching All: Towards Efficient Video Classification

An end-to-end deep reinforcement learning approach that enables an agent to classify videos by watching only a very small portion of the frames; an adaptive stop network measures a confidence score and generates a timely trigger to stop the agent from watching further, improving efficiency without loss of accuracy.
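
The watch-a-small-portion idea can be sketched as a loop that accumulates per-frame predictions and stops once the running confidence crosses a threshold. Note the actual method trains an adaptive stop network with reinforcement learning; the fixed threshold below is only an illustrative stand-in for that learned trigger, and `frame_classifier` is a hypothetical module.

```python
# Illustrative sketch of early stopping while classifying a video: frames are
# consumed one by one, predictions are accumulated, and processing stops once
# the running confidence exceeds a threshold.
import torch
import torch.nn.functional as F

def classify_with_early_stop(frames, frame_classifier, confidence_threshold=0.9):
    """frames: (num_frames, C, H, W) -> (class scores, number of frames watched)."""
    running = None
    for t, frame in enumerate(frames, start=1):
        probs = F.softmax(frame_classifier(frame.unsqueeze(0)), dim=1).squeeze(0)
        running = probs if running is None else running + probs
        avg = running / t
        if avg.max() >= confidence_threshold:  # stand-in for the "timely trigger"
            break
    return avg, t
```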

Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos

This paper introduces a proposal method that aims to recover temporal segments containing actions in untrimmed videos, together with a learning framework to represent and retrieve activity proposals.

Temporal Shift Module for Efficient Video Understanding

A generic and effective Temporal Shift Module (TSM) that achieves the performance of 3D CNNs while maintaining 2D-CNN complexity, and ranked first on both the Something-Something V1 and V2 leaderboards at the time of this paper's submission.
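
TSM keeps 2D-CNN complexity by shifting a fraction of the channels along the temporal dimension so that a following 2D convolution can mix information across neighbouring frames. Below is a minimal sketch of that shift operation; the fraction (1/8 each way) follows the commonly cited offline setting, so consult the paper and code for the exact configuration.

```python
# Minimal sketch of the temporal shift: a small slice of channels is shifted
# forward in time, an equal slice backward, and the rest is left unchanged.
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """x: (B, T, C, H, W) -> same shape with part of the channels shifted in time."""
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels untouched
    return out
```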

Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization

This work proposes a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable, and shows that even non-attention-based models learn to localize discriminative regions of the input image.
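
Grad-CAM's localization maps are obtained by weighting a convolutional feature map with the gradients of the target class score. The sketch below assumes `feature_maps` and `target_score` come from a forward pass with gradients enabled; hooking them out of a real network is omitted.

```python
# Minimal sketch of the Grad-CAM computation: gradients of the class score
# w.r.t. a convolutional feature map are global-average-pooled into channel
# weights, the feature maps are combined with those weights, and a ReLU keeps
# only regions with a positive influence on the class.
import torch
import torch.nn.functional as F

def grad_cam(feature_maps: torch.Tensor, target_score: torch.Tensor) -> torch.Tensor:
    """feature_maps: (1, K, H, W) requiring grad; target_score: scalar class score."""
    grads = torch.autograd.grad(target_score, feature_maps, retain_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)      # global-average-pooled gradients
    cam = F.relu((weights * feature_maps).sum(dim=1))   # weighted sum over channels, then ReLU
    return cam / (cam.max() + 1e-8)                     # normalise to [0, 1] for visualisation
```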