Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition

Jinliang Zang, Le Wang, Zi-yi Liu, Qilin Zhang, Zhenxing Niu, Gang Hua, Nanning Zheng
Research in human action recognition has accelerated significantly since the introduction of powerful machine learning tools such as Convolutional Neural Networks (CNNs). However, effective and efficient methods for incorporation of temporal information into CNNs are still being actively explored in the recent literature. Motivated by the popular recurrent attention models in the research area of natural language processing, we propose the Attention-based Temporal Weighted CNN (ATW), which… 
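The attention-based temporal weighting described above can be sketched as a softmax attention over per-snippet class scores. The following pure-Python sketch is illustrative only; the function names and the scalar-logit parameterization are assumptions, not the paper's exact formulation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def atw_fuse(snippet_scores, attention_logits):
    """Fuse per-snippet class scores into a video-level prediction.

    snippet_scores: one class-score vector per video snippet.
    attention_logits: one scalar attention logit per snippet
    (in the paper these would be learned; here they are inputs).
    Returns the attention-weighted video-level score vector.
    """
    weights = softmax(attention_logits)
    n_classes = len(snippet_scores[0])
    return [sum(w * s[c] for w, s in zip(weights, snippet_scores))
            for c in range(n_classes)]

# With equal logits this reduces to uniform temporal averaging;
# unequal logits up-weight the more discriminative snippets.
fused = atw_fuse([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [0.0, 0.0, 0.0])
```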

Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network

The Attention-aware Temporal Weighted CNN (ATW CNN) for action recognition in videos embeds a visual attention model into a temporal weighted multi-stream CNN; by focusing on the more relevant video segments, the more discriminative snippets contribute substantially to the performance gains.

Convolutional Networks With Channel and STIPs Attention Model for Action Recognition in Videos

A channel and spatial-temporal interest points (STIPs) attention model (CSAM) based on CNNs is proposed to focus on the discriminative channels in networks and the informative spatial motion regions of human actions.

Convolutional Neural Networks with Generalized Attentional Pooling for Action Recognition

The proposed Generalized Attentional Pooling based Convolutional Neural Network algorithm for action recognition in still images achieves the state-of-the-art action recognition accuracy on the large-scale MPII still image dataset.

Low-Latency Human Action Recognition with Weighted Multi-Region Convolutional Neural Network

The proposed WMR ConvNet achieves state-of-the-art performance among competing low-latency algorithms, and even outperforms the 3D-ConvNet-based C3D algorithm, which requires video frame accumulation.

Sparse Deep LSTMs with Convolutional Attention for Human Action Recognition

An architecture is proposed for action recognition comprising a ResNet feature extractor, Conv-Attention-LSTM, BiLSTM, and fully connected layers; a sparse layer is added after each LSTM layer to overcome overfitting.

An Improved Attention-Based Spatiotemporal-Stream Model for Action Recognition in Videos

An improved spatiotemporal attention model based on the two-stream structure is proposed to recognize different actions in videos; the feasibility and effectiveness of the proposed method are validated on a Ping-Pong action dataset and the HMDB51 dataset.

Attention-based spatial-temporal hierarchical ConvLSTM network for action recognition in videos

Experimental results show that the authors’ proposed ST-HConvLSTM achieves state-of-the-art performance compared with other recent LSTM-like architectures and attention-based methods.

Refined Spatial Network for Human Action Recognition

A novel stacked spatial network (SSN) is proposed, which integrates multi-layer feature maps in an end-to-end manner and comprises two components representing semantic label information and slenderer local spatial information.

Dual Stream Spatio-Temporal Motion Fusion With Self-Attention For Action Recognition

This research proposes a dual stream spatiotemporal fusion architecture for human action classification that achieves accurate results with much fewer parameters as compared to the traditional deep neural networks.

Deep Spatial–Temporal Model Based Cross-Scene Action Recognition Using Commodity WiFi

A deep learning framework is presented that integrates spatial features learned by a convolutional neural network (CNN) into a multilayer bidirectional long short-term memory (Bi-LSTM) temporal model; it is the first work to explore deep spatial–temporal features for CSI-based action recognition.

Convolutional Two-Stream Network Fusion for Video Action Recognition

A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance on standard benchmarks where this architecture achieves state-of-the-art results is evaluated.

Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks

Factorized spatio-temporal convolutional networks (FstCN) are proposed that factorize the original 3D convolution kernel learning into a sequential process of learning 2D spatial kernels in the lower layers, followed by learning a 1D temporal kernel in the upper layers.
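One way to see the appeal of this factorization is to compare parameter counts of a full 3D kernel against a 2D-spatial-plus-1D-temporal decomposition. The channel layout of the temporal layer below is an assumption for illustration, not FstCN's exact design:

```python
def conv3d_params(c_in, c_out, kt, k):
    # Full 3D layer: c_out filters, each of size c_in x kt x k x k.
    return c_out * c_in * kt * k * k

def factorized_params(c_in, c_out, kt, k):
    # 2D spatial layer (c_in x 1 x k x k filters) followed by a
    # 1D temporal layer (c_out x kt x 1 x 1 filters) -- an assumed
    # channel layout for illustration.
    spatial = c_out * c_in * k * k
    temporal = c_out * c_out * kt
    return spatial + temporal

# For a typical 64->64 channel layer with a 3x3x3 kernel, the
# factorized form needs fewer than half the parameters.
full = conv3d_params(64, 64, 3, 3)
fact = factorized_params(64, 64, 3, 3)
```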

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Beyond short snippets: Deep networks for video classification

This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.

Describing Videos by Exploiting Temporal Structure

This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, and proposes a temporal attention mechanism that goes beyond local temporal modeling by learning to automatically select the most relevant temporal segments given the text-generating RNN.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
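A minimal sketch of the late fusion commonly used with such two-stream networks: the spatial and temporal streams' class probabilities are combined by a weighted average (the weighting parameter here is illustrative; fusing by a classifier trained on stacked stream scores is another option):

```python
def late_fusion(spatial_probs, temporal_probs, w_temporal=0.5):
    """Weighted average of the two streams' class probabilities.

    w_temporal controls how much the optical-flow (temporal) stream
    contributes; 0.5 is plain averaging.
    """
    return [(1.0 - w_temporal) * s + w_temporal * t
            for s, t in zip(spatial_probs, temporal_probs)]

def predict(spatial_probs, temporal_probs, w_temporal=0.5):
    """Return the index of the highest-scoring class after fusion."""
    fused = late_fusion(spatial_probs, temporal_probs, w_temporal)
    return max(range(len(fused)), key=fused.__getitem__)
```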

Long-Term Temporal Convolutions for Action Recognition

It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the work also studies the impact of different low-level representations, such as raw video pixel values and optical flow vector fields, and shows the importance of high-quality optical flow estimation for learning accurate action models.

3D Convolutional Neural Networks for Human Action Recognition

A novel 3D CNN model for action recognition is proposed that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
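The core operation can be sketched as a naive single-channel 3D convolution: each output value aggregates a window of adjacent frames, which is how motion information across frames enters the features. This is a bare illustrative sketch (no padding, stride 1, single channel), not the paper's full architecture:

```python
def conv3d_valid(volume, kernel):
    """Valid-mode 3D convolution over a T x H x W volume.

    volume: nested lists indexed [time][row][col].
    kernel: nested lists of size kt x kh x kw; each output value
    pools kt adjacent frames, capturing temporal (motion) cues.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        frame = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                row.append(sum(
                    volume[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                    for dt in range(kt)
                    for di in range(kh)
                    for dj in range(kw)))
            frame.append(row)
        out.append(frame)
    return out
```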

Large-Scale Video Classification with Convolutional Neural Networks

This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.

Action recognition with trajectory-pooled deep-convolutional descriptors

This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features, and achieves superior performance to the state of the art on these datasets.