Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition

@inproceedings{Zang2018AttentionbasedTW,
  title={Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition},
  author={Jinliang Zang and Le Wang and Zi-yi Liu and Qilin Zhang and Zhenxing Niu and Gang Hua and Nanning Zheng},
  booktitle={AIAI},
  year={2018}
}
Research in human action recognition has accelerated significantly since the introduction of powerful machine learning tools such as Convolutional Neural Networks (CNNs). However, effective and efficient methods for incorporation of temporal information into CNNs are still being actively explored in the recent literature. Motivated by the popular recurrent attention models in the research area of natural language processing, we propose the Attention-based Temporal Weighted CNN (ATW), which…
Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network
The Attention-aware Temporal Weighted CNN (ATW CNN) for action recognition in videos embeds a visual attention model into a temporal weighted multi-stream CNN; the more discriminative snippets, obtained by focusing on more relevant video segments, contribute substantially to the performance gains.
Convolutional Networks With Channel and STIPs Attention Model for Action Recognition in Videos
A channel and spatial-temporal interest points (STIPs) attention model (CSAM) based on CNNs is proposed to focus on the discriminative channels in networks and the informative spatial motion regions of human actions.
Convolutional Neural Networks with Generalized Attentional Pooling for Action Recognition
The proposed Generalized Attentional Pooling based Convolutional Neural Network algorithm for action recognition in still images achieves the state-of-the-art action recognition accuracy on the large-scale MPII still image dataset.
Low-Latency Human Action Recognition with Weighted Multi-Region Convolutional Neural Network
The proposed WMR ConvNet achieves the state-of-the-art performance among competing low-latency algorithms, and even outperforms the 3D ConvNet based C3D algorithm that requires video frame accumulation.
An Improved Attention-Based Spatiotemporal-Stream Model for Action Recognition in Videos
An improved spatiotemporal attention model based on the two-stream structure is proposed to recognize the different actions in videos; the feasibility and effectiveness of the proposed method are validated on a Ping-Pong action dataset and the HMDB51 dataset.
Attention-based spatial-temporal hierarchical ConvLSTM network for action recognition in videos
Experimental results show that the authors’ proposed ST-HConvLSTM achieves state-of-the-art performance compared with other recent LSTM-like architectures and attention-based methods.
Refined Spatial Network for Human Action Recognition
A novel stacked spatial network (SSN) is proposed, which integrates multi-layer feature maps in an end-to-end manner and comprises two components for representing semantic label information and local slenderer spatial information.
Joint spatial-temporal attention for action recognition
To extract robust motion representations of videos, a new spatial attention module based on 3D convolution is proposed, which can pay attention to the salient parts of the spatial areas, and a new bidirectional LSTM based temporal attention module is introduced.
Dual Stream Spatio-Temporal Motion Fusion With Self-Attention For Action Recognition
This research proposes a dual stream spatiotemporal fusion architecture for human action classification that achieves accurate results with much fewer parameters as compared to the traditional deep neural networks.
Deep Spatial–Temporal Model Based Cross-Scene Action Recognition Using Commodity WiFi
A deep learning framework that integrates spatial features learned from a convolutional neural network (CNN) into a multilayer bidirectional long short-term memory (Bi-LSTM) temporal model; this is the first work to explore deep spatial–temporal features for CSI-based action recognition.

References

Showing 1–10 of 49 references
Convolutional Two-Stream Network Fusion for Video Action Recognition
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance is evaluated on standard benchmarks, where this architecture achieves state-of-the-art results.
Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks
Factorized spatio-temporal convolutional networks (FstCN) are proposed that factorize the original 3D convolution kernel learning as a sequential process of learning 2D spatial kernels in the lower layers, followed by learning 1D temporal kernels in the upper layers.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Beyond short snippets: Deep networks for video classification
This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.
Describing Videos by Exploiting Temporal Structure
This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, and proposes a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Long-Term Temporal Convolutions for Action Recognition
It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the work also studies the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields, and the importance of high-quality optical flow estimation for learning accurate action models.
3D Convolutional Neural Networks for Human Action Recognition
A novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
Large-Scale Video Classification with Convolutional Neural Networks
This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Action recognition with trajectory-pooled deep-convolutional descriptors
This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features, and achieves superior performance to the state of the art on these datasets.