The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions

  • Sepp Hochreiter
  • Published 1 April 1998
  • Computer Science
  • Int. J. Uncertain. Fuzziness Knowl. Based Syst.
Recurrent nets are in principle capable of storing past inputs to produce the currently desired output. Because of this property, recurrent nets are used in time series prediction and process control. Practical applications involve temporal dependencies spanning many time steps, e.g. between relevant inputs and desired outputs. In this case, however, gradient-based learning methods take too much time. The greatly increased learning time arises because the error vanishes as it gets propagated…
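
The vanishing error described in the abstract can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes a hypothetical scalar recurrent unit h_t = tanh(w * h_{t-1} + x_t), and the names `backprop_factors`, `w`, `inputs`, and `h0` are made up for illustration. It tracks how strongly the final state depends on earlier states, which is the factor that scales a backpropagated error.

```python
import math


def backprop_factors(w, inputs, h0=0.0):
    """Return |d h_T / d h_(T-k)| for k = 0, 1, ..., T.

    Assumed toy model (not from the paper): h_t = tanh(w * h_(t-1) + x_t).
    """
    # Forward pass: store every hidden state h_0 .. h_T.
    hs = [h0]
    for x in inputs:
        hs.append(math.tanh(w * hs[-1] + x))

    # Backward pass: apply the chain rule one time step at a time.
    # For this model, d h_t / d h_(t-1) = w * (1 - h_t ** 2).
    factor = 1.0
    factors = [factor]
    for h in reversed(hs[1:]):
        factor *= w * (1.0 - h * h)
        factors.append(abs(factor))
    return factors


# 50 time steps with a recurrent weight below 1 in magnitude.
factors = backprop_factors(w=0.9, inputs=[0.5] * 50)
for k in (0, 10, 25, 50):
    print(f"{k:2d} steps back: {factors[k]:.3e}")
```

Because |w| < 1 and the tanh slope never exceeds 1, every step back in time multiplies the error by a number smaller than 1, so under these assumptions the factor shrinks exponentially with the time lag.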

Linear Antisymmetric Recurrent Neural Networks

This paper suggests a new recurrent network structure called Linear Antisymmetric RNN (LARNN), based on the numerical solution to an Ordinary Differential Equation (ODE) with stability properties resulting in a stable solution, which corresponds to long-term memory.

Learning Long Term Dependencies with Recurrent Neural Networks

It is shown that RNNs and especially normalised recurrent neural networks (NRNNs) unfolded in time are indeed very capable of learning time lags of at least a hundred time steps and it is demonstrated that the problem of a vanishing gradient does not apply to these networks.

Recurrent Neural Networks

This paper shows that learning longer term patterns in real data, such as in natural language, is perfectly possible using gradient descent, which is achieved by using a slight structural modification of the simple recurrent neural network architecture.

Learning Longer Memory in Recurrent Neural Networks

This paper shows that learning longer term patterns in real data, such as in natural language, is perfectly possible using gradient descent, by using a slight structural modification of the simple recurrent neural network architecture.

Short-Term Memory Optimization in Recurrent Neural Networks by Autoencoder-based Initialization

An initialization schema that pretrains the weights of a recurrent neural network to approximate the linear autoencoder of the input sequences is introduced and it is shown how such pretraining can better support solving hard classification tasks with long sequences.

Reinforcement learning with recurrent neural networks

RNNs can map and reconstruct (partially observable) Markov decision processes well, and the resulting inner state of the network can be used as a basis for standard RL algorithms, which forms a novel connection between recurrent neural networks (RNN) and reinforcement learning (RL) techniques.

Backpropagation-decorrelation: online recurrent learning with O(N) complexity

  • J. Steil
  • Computer Science
    2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541)
  • 2004
A new learning rule for fully recurrent neural networks is introduced which combines important principles: one-step backpropagation of errors and the usage of temporal memory in the network dynamics by means of decorrelation of activations.

Learning from Predictions: Fusing Training and Autoregressive Inference for Long-Term Spatiotemporal Forecasts

The results show that BPTT-SA effectively reduces iterative error propagation in convolutional RNNs and convolutional autoencoder RNNs, and demonstrate its capabilities in long-term prediction of high-dimensional fluid flows.

Decoupling Hierarchical Recurrent Neural Networks With Locally Computable Losses

It is empirically shown that in (deep) HRNNs, propagating gradients back from higher to lower levels can be replaced by locally computable losses, without harming the learning capability of the network, over a wide range of tasks.

Learning long-term dependencies with gradient descent is difficult

This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.

Learning State Space Trajectories in Recurrent Neural Networks

A procedure is described for finding ∂E/∂w_ij, where E is an error functional of the temporal trajectory of the states of a continuous recurrent network and w_ij are the weights of that network; it seems particularly suited for temporally continuous domains.

Encoding sequential structure: experience with the real-time recurrent learning algorithm

  • A. Smith, D. Zipser
  • Computer Science
    International 1989 Joint Conference on Neural Networks
  • 1989
It is shown that recurrent nets trained with the RTRL (real-time recurrent learning) algorithm are able to learn tasks that Elman nets appear unable to learn. Moreover, they learn a more stringent

Gradient calculations for dynamic recurrent neural networks: a survey

The author discusses advantages and disadvantages of temporally continuous neural networks in contrast to clocked ones and presents some "tricks of the trade" for training, using, and simulating continuous time and recurrent neural networks.

Learning Complex, Extended Sequences Using the Principle of History Compression

A simple principle for reducing the descriptions of event sequences without loss of information is introduced and this insight leads to the construction of neural architectures that learn to divide and conquer by recursively decomposing sequences.

Learning long-term dependencies in NARX recurrent neural networks

It is shown that the long-term dependencies problem is lessened for a class of architectures called nonlinear autoregressive models with exogenous (NARX) recurrent neural networks, which have powerful representational capabilities.

Credit Assignment through Time: Alternatives to Backpropagation

This work considers and compares alternative algorithms and architectures on tasks for which the span of the input/output dependencies can be controlled and shows performance qualitatively superior to that obtained with backpropagation.

Long Short-Term Memory

A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
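
The idea of constant error flow mentioned in this summary can be sketched in a heavily simplified form. The example below assumes a single linear unit with a self-connection and ignores gates and every other part of a full LSTM cell; `carried_error`, `self_weight`, and `local_slope` are hypothetical names introduced only for this illustration.

```python
def carried_error(self_weight, steps, local_slope=1.0):
    """Factor by which an error is scaled after flowing back `steps` steps
    through one self-connected unit.

    Simplifying assumption: each step back multiplies the error by
    self_weight * local_slope, where local_slope is the derivative of the
    unit's activation function (1.0 for a linear unit).
    """
    factor = 1.0
    for _ in range(steps):
        factor *= self_weight * local_slope
    return factor


# Linear unit with self-weight 1.0: the error is carried back unchanged.
constant = carried_error(self_weight=1.0, steps=1000)

# Self-weight slightly below 1.0: the error decays exponentially.
decayed = carried_error(self_weight=0.9, steps=1000)

print(constant, decayed)
```

Under these assumptions, a self-weight of exactly 1.0 combined with a linear activation is the only setting in which the factor neither shrinks nor grows over 1000 steps, which is the intuition behind keeping the error flow constant.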

LSTM can Solve Hard Long Time Lag Problems

This work shows that problems used to promote various previous algorithms can be solved more quickly by random weight guessing than by the proposed algorithms, and uses LSTM, its own recent algorithm, to solve a hard problem.

Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks

These simulations suggest that recurrent controller networks trained by Kalman filter methods can combine the traditional features of state-space controllers and observers in a homogeneous architecture for nonlinear dynamical systems, while simultaneously exhibiting less sensitivity than do purely feedforward controller networks to changes in plant parameters and measurement noise.