Learning Long-Term Dependencies for Action Recognition with a Biologically-Inspired Deep Network

Yemin Shi, Yonghong Tian, Yaowei Wang, Wei Zeng, Tiejun Huang
2017 IEEE International Conference on Computer Vision (ICCV)
Despite considerable research effort in recent years, efficiently learning long-term dependencies from sequences remains a challenging task. As one of the key models for sequence learning, the recurrent neural network (RNN) and its variants such as long short-term memory (LSTM) and the gated recurrent unit (GRU) are still not powerful enough in practice. One possible reason is that they have only feedforward connections, which differs from the biological neural system that is…
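The abstract contrasts gated recurrent units with biological circuits that also have feedback connections. As a reference point, here is a minimal NumPy sketch of a single GRU step (weight names, shapes, and the toy sequence are invented for illustration), showing that information flows strictly forward in time through the gates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One step of a gated recurrent unit (GRU).

    z gates how much of the previous state is kept; r gates how much
    of it feeds the candidate state. All information flows strictly
    forward in time (no feedback from later processing stages).
    """
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_cand        # convex blend

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0 else
          rng.standard_normal((d_h, d_h)) for i in range(6)]
h = np.zeros(d_h)
for t in range(10):               # unroll over a toy input sequence
    h = gru_cell(rng.standard_normal(d_in), h, *params)
print(h.shape)  # (8,)
```

Because each new state is a convex, elementwise blend of the previous state and a bounded candidate, the hidden activations stay in [-1, 1] regardless of sequence length.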

Figures and Tables from this paper

Citations

Learning long-range spatial dependencies with horizontal gated-recurrent units
This work introduces the horizontal gated-recurrent unit (hGRU) to learn intrinsic horizontal connections -- both within and across feature columns, and demonstrates that a single hGRU layer matches or outperforms all tested feedforward hierarchical baselines including state-of-the-art architectures which have orders of magnitude more free parameters.
MCRM: Mother Compact Recurrent Memory, a Biologically Inspired Recurrent Neural Network Architecture
Empirical results show that MCRMs outperform previously used architectures and have a compact memory pattern consisting of neurons that act explicitly in both long-term and short-term fashions.
Temporal Feedback Convolutional Recurrent Neural Networks for Keyword Spotting
This work proposes a novel convolutional recurrent neural network architecture with temporal feedback connections, inspired by the feedback pathways from the brain to the ears in the human auditory system, and shows that the proposed model consistently outperforms a comparable model without temporal feedback across different input/output settings in the CRNN framework.
Temporal Attentive Network for Action Recognition
The key idea in TAN is that not all postures contribute equally to the successful recognition of an action, so a temporal attention mechanism is introduced in the form of a Long Short-Term Memory (LSTM) network.
ODN: Opening the Deep Network for Open-Set Action Recognition
The Open Deep Network (ODN) is proposed, which can effectively detect and recognize new categories with little human intervention and can even achieve comparable performance to some closed-set methods.
Temporal Action Localization Using Long Short-Term Dependency
A novel method, referred to as the Gemini Network, is developed for effective modeling of temporal structures and achieving high-performance temporal action localization on two challenging datasets, namely, THUMOS14 and ActivityNet.
Human Action Recognition in Unconstrained Trimmed Videos Using Residual Attention Network and Joints Path Signature
Experiments on three benchmark datasets indicate that the proposed framework, which combines a residual-attention module with a joint path-signature feature (JPSF) representation, achieves state-of-the-art performance.
Deep Concept-wise Temporal Convolutional Networks for Action Localization
Existing action localization approaches adopt shallow temporal convolutional networks (i.e., TCNs) on the 1D feature maps extracted from video frames. In this paper, we empirically find that stacking more…


References

A Clockwork RNN
This paper introduces a simple, yet powerful modification to the simple RNN architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate.
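The summary above can be made concrete with a minimal sketch of the CW-RNN update rule: the hidden state is partitioned into modules, and module i is recomputed only when the time step is a multiple of its clock period. This is a simplification (the actual CW-RNN also masks the recurrent matrix so faster modules read only from slower-or-equal ones); all names and shapes here are invented:

```python
import numpy as np

def cw_rnn_step(t, h, x, W_in, W_hh, periods):
    """One Clockwork RNN step (simplified sketch).

    The hidden state h is split into len(periods) equal modules;
    module i is recomputed only when t % periods[i] == 0, otherwise
    its previous value is carried over unchanged.
    """
    block = h.shape[0] // len(periods)
    pre = np.tanh(W_in @ x + W_hh @ h)   # full pre-activation
    h_new = h.copy()
    for i, T in enumerate(periods):
        if t % T == 0:                   # module i ticks this step
            h_new[i * block:(i + 1) * block] = pre[i * block:(i + 1) * block]
    return h_new

rng = np.random.default_rng(1)
d_in, msize, periods = 3, 2, [1, 2, 4, 8]
d_h = msize * len(periods)
W_in = rng.standard_normal((d_h, d_in))
W_hh = rng.standard_normal((d_h, d_h))
# at t=1 only the fastest (period-1) module recomputes
h1 = cw_rnn_step(1, np.zeros(d_h), rng.standard_normal(d_in),
                 W_in, W_hh, periods)
```

Because the slow modules change rarely, they can carry context across many time steps at low cost, which is the mechanism behind the paper's long-term dependency claims.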
Long-term recurrent convolutional networks for visual recognition and description
A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Recurrent nets that time and count (F. Gers, J. Schmidhuber; Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, IJCNN 2000)
Surprisingly, LSTM augmented by "peephole connections" from its internal cells to its multiplicative gates can learn the fine distinction between sequences of spikes separated by either 50 or 49 discrete time steps, without the help of any short training exemplars.
Long Short-Term Memory
A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
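The "constant error carousel" mentioned above refers to the additive cell update c = f * c_prev + i * g: with the forget gate near 1, error flows back through time almost unchanged. A minimal sketch of one LSTM step (the packed-weight layout and shapes are invented for this illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the input/forget/output/candidate
    transforms in one matrix (an invented packing for this sketch).

    The cell update is additive: c = f * c_prev + i * g.  With f
    close to 1 the cell state carries information (and gradients)
    across long time lags -- the constant error carousel.
    """
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g           # additive carousel path
    h = o * np.tanh(c)               # gated output
    return h, c

rng = np.random.default_rng(2)
d_in, d_h = 3, 5
W = rng.standard_normal((4 * d_h, d_in))
U = rng.standard_normal((4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.standard_normal((20, d_in)):  # toy 20-step sequence
    h, c = lstm_cell(x, h, c, W, U, b)
```

The peephole variant in the Gers and Schmidhuber entry above additionally feeds c_prev into the gate pre-activations, letting the gates inspect the cell state directly.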
Learning Longer Memory in Recurrent Neural Networks
This paper shows that learning longer-term patterns in real data, such as natural language, is possible with gradient descent by using a slight structural modification of the simple recurrent neural network architecture.
Hierarchical Recurrent Neural Networks for Long-Term Dependencies
This paper proposes to use a more general type of a priori knowledge, namely that temporal dependencies are structured hierarchically, which implies that long-term dependencies are represented by variables with a long time scale.
Beyond short snippets: Deep networks for video classification
This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.
Long-Term Temporal Convolutions for Action Recognition
It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the work also studies the impact of different low-level representations, such as raw video pixel values and optical flow vector fields, and the importance of high-quality optical flow estimation for learning accurate action models.
Describing Videos by Exploiting Temporal Structure
This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, and proposes a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
This paper proposes to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure, to create sentence descriptions of open-domain videos with large vocabularies.