Lattice Long Short-Term Memory for Human Action Recognition

  • Authors: Lin Sun, Kui Jia, Kevin Chen, Dit-Yan Yeung, Bertram E. Shi, Silvio Savarese
  • Published 13 August 2017
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. […] Additionally, we introduce a novel multi-modal training procedure for training our network.
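The abstract above is truncated, but the core idea of Lattice-LSTM (per the paper's title and venue) is to extend the LSTM recurrence so that memory cells at individual spatial locations learn their own hidden-state transitions, rather than sharing one recurrent transform across the whole feature map. As a rough illustration only, here is a minimal NumPy sketch of one such recurrent step; all names, shapes, and the per-location weight tensor `Wh` are assumptions for exposition, not the paper's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lattice_lstm_step(x, h_prev, c_prev, Wx, Wh):
    """One recurrent step where every spatial location (i, j) has its own
    recurrent weights Wh[i, j], instead of one transform shared across the
    feature map as in a standard ConvLSTM (illustrative sketch only).

    Hypothetical shapes:
      x, h_prev, c_prev : (H, W, D)   feature maps
      Wx                : (D, 4*D)    input weights, shared spatially
      Wh                : (H, W, D, 4*D)  location-dependent recurrent weights
    """
    # Input contribution (shared) + per-location recurrent contribution.
    gates = x @ Wx + np.einsum('ijd,ijdk->ijk', h_prev, Wh)  # (H, W, 4*D)
    i, f, o, g = np.split(gates, 4, axis=-1)  # gate pre-activations

    # Standard LSTM cell/hidden updates, applied location-wise.
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# Tiny usage example with random weights.
np.random.seed(0)
H, W, D = 2, 2, 3
x = np.random.randn(H, W, D)
h0 = np.zeros((H, W, D))
c0 = np.zeros((H, W, D))
Wx = 0.1 * np.random.randn(D, 4 * D)
Wh = 0.1 * np.random.randn(H, W, D, 4 * D)
h1, c1 = lattice_lstm_step(x, h0, c0, Wx, Wh)
print(h1.shape)  # (2, 2, 3)
```

The design point this illustrates is the trade-off Lattice-style recurrence makes: location-dependent transitions can capture richer local motion dynamics at the cost of many more recurrent parameters than a shared ConvLSTM kernel.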

Memory-Augmented Temporal Dynamic Learning for Action Recognition
This work proposes a memory-augmented temporal dynamic learning network, which learns to write the most evident information into an external memory module and ignore irrelevant information, and presents a differential memory controller that makes a discrete decision on whether the external memory module should be updated with the current feature.
Recurrent Spatiotemporal Feature Learning for Action Recognition
The proposed architecture is end-to-end trainable and flexible enough to be adapted to any CNN-based structure, and produces state-of-the-art performance over RNN-based approaches on two standard benchmarks for action recognition.
Relational Long Short-Term Memory for Video Action Recognition
This paper presents a new variant of Long Short-Term Memory, namely Relational LSTM, to address the challenge of relation reasoning across space and time between objects, and proposes a two-branch neural architecture consisting of the RelationalLSTM module as the non-local branch and a spatio-temporal pooling based local branch.
Action Recognition Based on Linear Dynamical Systems with Deep Features in Videos
Experimental results show that the proposed framework simultaneously expresses spatial and temporal structures, which in turn produces state-of-the-art results.
Temporal Action Localization Using Long Short-Term Dependency
A novel method, referred to as the Gemini Network, is developed for effective modeling of temporal structures and achieving high-performance temporal action localization on two challenging datasets, namely, THUMOS14 and ActivityNet.
Attend It Again: Recurrent Attention Convolutional Neural Network for Action Recognition
This study improves the performance of the recurrent attention convolutional neural network (RACNN) by proposing a novel "attention-again" model, a variant of the traditional attention model for recognizing human activities, embedded in two long short-term memory (LSTM) layers.
Temporal Segment Connection Network for Action Recognition
The proposed temporal segment connection network can effectively improve the utilization of temporal information and the ability to represent overall actions, thus significantly improving the accuracy of human action recognition.
A motion-aware ConvLSTM network for action recognition
A spatio-temporal video recognition network in which a motion-aware long short-term memory module is introduced to estimate motion flow alongside extracting spatio-temporal features, subsuming a specific optical flow estimator based on kernelized cross correlation.
Multi-stream Convolutional Neural Networks for Action Recognition in Video Sequences Based on Adaptive Visual Rhythms
A multi-stream network is the architecture of choice for incorporating temporal information, since it may benefit from pre-trained deep networks for images and from handcrafted features for initialization, and its training cost is usually lower than that of video-based networks.


Long-term recurrent convolutional networks for visual recognition and description
A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Spatiotemporal Residual Networks for Video Action Recognition
The novel spatiotemporal ResNet is introduced and evaluated using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.
Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks
Factorized spatio-temporal convolutional networks (FstCN) are proposed that factorize the original 3D convolution kernel learning into a sequential process of learning 2D spatial kernels in the lower layers, followed by learning 1D temporal kernels in the upper layers.
Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition
This paper argues that large-scale action recognition in video can be greatly improved by providing an additional modality in training data - namely, 3D human-skeleton sequences - aimed at […]
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Unsupervised Learning of Video Representations using LSTMs
This work uses Long Short Term Memory networks to learn representations of video sequences and evaluates the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets.
Action Recognition using Visual Attention
A soft attention based model is proposed for action recognition in videos, using multi-layered Recurrent Neural Networks with Long Short-Term Memory units that are deep both spatially and temporally.
VideoLSTM convolves, attends and flows for action recognition
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Convolutional Two-Stream Network Fusion for Video Action Recognition
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance on standard benchmarks where this architecture achieves state-of-the-art results is evaluated.