VideoLSTM convolves, attends and flows for action recognition

  Zhenyang Li, Kirill Gavrilyuk, Efstratios Gavves, Mihir Jain, Cees G. M. Snoek
  Computer Vision and Image Understanding


TORNADO: A Spatio-Temporal Convolutional Regression Network for Video Action Proposal

This paper proposes a spatio-temporal convolutional network that combines the advantages of regression-based detectors and LRCN by empowering a Convolutional LSTM with regression capability, achieving superior performance compared with the state of the art.

Interpretable Spatio-Temporal Attention for Video Action Recognition

  • Lili Meng, Bo Zhao, L. Sigal
  • Computer Science
    2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
  • 2019
This model not only improves video action recognition accuracy, but also localizes discriminative regions both spatially and temporally, despite being trained in a weakly-supervised manner with only classification labels (no bounding box or temporal annotations).

Recurrent Spatial-Temporal Attention Network for Action Recognition in Videos

The experimental results show that the proposed RSTAN outperforms other recent RNN-based approaches on UCF101 and HMDB51, and achieves state-of-the-art results on JHMDB.

Spatio-Temporal Self-Attention Weighted VLAD Neural Network for Action Recognition

This work presents a novel model with VLAD following spatio-temporal self-attention operations, named spatio-temporal self-attention weighted VLAD (ST-SAWVLAD), which aggregates not only detailed spatial information but also fine motion information from successive video frames.

NUTA: Non-uniform Temporal Aggregation for Action Recognition

This work proposes a method called the non-uniform temporal aggregation (NUTA), which aggregates features only from informative temporal segments, and introduces a synchronization method that allows the NUTA features to be temporally aligned with traditional uniformly sampled video features, so that both local and clip-level features can be combined.

VideoLightFormer: Lightweight Action Recognition using Transformers

This work proposes a novel, lightweight action recognition architecture, VideoLightFormer, which carefully extends the 2D convolutional Temporal Segment Network with transformers, while maintaining spatial and temporal video structure throughout the entire model.

A Comprehensive Study of Deep Video Action Recognition

A comprehensive survey of over 200 existing papers on deep learning for video action recognition is provided, starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models.

A motion-aware ConvLSTM network for action recognition

A spatio-temporal video recognition network in which a motion-aware long short-term memory module is introduced to estimate motion flow while extracting spatio-temporal features, and a dedicated optical flow estimator based on kernelized cross-correlation is incorporated.
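
For context, motion-aware ConvLSTM modules build on the standard ConvLSTM cell of Shi et al., which replaces the matrix products of an ordinary LSTM with convolutions (*) so that gates and states remain spatial feature maps (the peephole terms of the original formulation are omitted here for brevity):

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right)\\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right)\\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right)\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right)\\
H_t &= o_t \odot \tanh(C_t)
\end{aligned}
```

Because all operands are feature maps rather than vectors, the cell preserves spatial structure across time, which is what lets such modules also estimate motion within the recurrence.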

Action Recognition using Visual Attention

A soft-attention-based model using multi-layered Recurrent Neural Networks with Long Short-Term Memory units that are deep both spatially and temporally, for action recognition in videos.
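
The soft-attention mechanism in such models can be sketched as a softmax-weighted pooling over spatial locations of a convolutional feature map. This is a minimal, illustrative sketch: in the actual model the scoring vector is produced from the LSTM hidden state at each step, whereas here a fixed vector stands in for it.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention_pool(features, score_vec):
    """Soft attention over K spatial locations.

    features:  (K, D) conv feature vectors for one frame
    score_vec: (D,)   scoring vector; an assumption standing in for
               the LSTM-state-dependent scorer of the real model.
    Returns (alpha, pooled): attention weights summing to 1, and the
    attention-weighted feature vector fed to the recurrent unit.
    """
    scores = features @ score_vec     # (K,) relevance score per location
    alpha = softmax(scores)           # normalized attention weights
    pooled = alpha @ features         # (D,) expected feature under alpha
    return alpha, pooled

rng = np.random.default_rng(0)
feats = rng.normal(size=(49, 8))      # e.g. a 7x7 grid of 8-d features
alpha, pooled = soft_attention_pool(feats, rng.normal(size=8))
```

Because the weights are a softmax rather than a hard argmax, the pooling stays differentiable, which is what allows the attention maps to be learned end-to-end from classification labels alone.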

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
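
The core two-stream idea combines per-class scores from an appearance (RGB) stream and a motion (optical-flow) stream at test time. A minimal sketch of this late fusion, assuming softmax-score averaging with an illustrative temporal-stream weight (the function name and example logits are hypothetical, not from the paper):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def two_stream_fuse(spatial_logits, temporal_logits, w_temporal=1.0):
    """Late fusion of two-stream class scores by weighted averaging.

    w_temporal > 1 would favor the motion stream; the weighting
    scheme here is an illustrative assumption.
    """
    s = softmax(spatial_logits)
    t = softmax(temporal_logits)
    return (s + w_temporal * t) / (1.0 + w_temporal)

spatial = np.array([2.0, 0.5, 0.1])   # RGB stream favors class 0
temporal = np.array([0.2, 2.5, 0.1])  # flow stream strongly favors class 1
fused = two_stream_fuse(spatial, temporal)
# the more confident temporal stream wins: fused.argmax() is 1
```

Fusing after the softmax keeps the two streams independently trainable, which matters when, as the paper notes, training data is limited.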

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition

This paper introduces hybrid video classification architectures based on carefully designed unsupervised representations of hand-crafted spatio-temporal features classified by supervised deep networks.

Spatiotemporal Multiplier Networks for Video Action Recognition

A general ConvNet architecture for video action recognition based on multiplicative interactions of space-time features, which combines the appearance and motion pathways of a two-stream architecture by motion gating and is trained end-to-end.

Convolutional Two-Stream Network Fusion for Video Action Recognition

A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance on standard benchmarks where this architecture achieves state-of-the-art results is evaluated.

Spatiotemporal Residual Networks for Video Action Recognition

The novel spatiotemporal ResNet is introduced and evaluated using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.

Modeling video evolution for action recognition

The proposed method captures video-wide temporal information for action recognition by postulating that a function capable of temporally ordering the frames of a video captures well the evolution of appearance within the video.

Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos

A huge leap forward in action detection performance is achieved, with gains in mAP of 20% and 11% reported on the UCF-101 and J-HMDB-21 datasets respectively, compared to the state of the art.

Delving Deeper into Convolutional Networks for Learning Video Representations

A variant of the GRU model is introduced that leverages convolution operations to enforce sparse connectivity of the model units and share parameters across input spatial locations, mitigating the effect of low-level percepts on human action recognition and video captioning tasks.