# The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions

@article{Hochreiter1998TheVG, title={The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions}, author={Sepp Hochreiter}, journal={Int. J. Uncertain. Fuzziness Knowl. Based Syst.}, year={1998}, volume={6}, pages={107-116} }

Recurrent nets are in principle capable of storing past inputs to produce the currently desired output. Because of this property, recurrent nets are used in time series prediction and process control. Practical applications involve temporal dependencies spanning many time steps, e.g. between relevant inputs and desired outputs. In this case, however, gradient-based learning methods take too much time. The sharply increased learning time arises because the error vanishes as it gets propagated…
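The effect described in the abstract can be illustrated with a toy model. The sketch below is not taken from the paper: it assumes a one-unit recurrent net `h_t = tanh(w * h_{t-1} + x_t)`, and the function name `backprop_factor`, the weight `0.9`, and the constant input `0.5` are all hypothetical choices. It multiplies the per-step derivatives that an error signal is scaled by when it is propagated back through time.

```python
import math

def backprop_factor(w, inputs, h0=0.0):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t).

    Illustrative toy model (an assumption, not the paper's setup).
    """
    h = h0
    factor = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        factor *= w * (1.0 - h * h)
    return abs(factor)

inputs = [0.5] * 100  # hypothetical constant input sequence
for steps in (1, 10, 50, 100):
    # The factor shrinks roughly exponentially with the number of steps.
    print(steps, backprop_factor(0.9, inputs[:steps]))
```

Because each per-step derivative has magnitude at most `|w|` here, the product is bounded by `0.9 ** steps`, so an error signal from 100 steps in the future contributes almost nothing to the weight update.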

## 1,892 Citations

### Linear Antisymmetric Recurrent Neural Networks

- Computer Science
- L4DC
- 2020

This paper suggests a new recurrent network structure called Linear Antisymmetric RNN (LARNN), based on the numerical solution to an Ordinary Differential Equation (ODE) with stability properties resulting in a stable solution, which corresponds to long-term memory.

### Learning Long Term Dependencies with Recurrent Neural Networks

- Computer Science
- ICANN
- 2006

It is shown that RNNs and especially normalised recurrent neural networks (NRNNs) unfolded in time are indeed very capable of learning time lags of at least a hundred time steps and it is demonstrated that the problem of a vanishing gradient does not apply to these networks.

### Recurrent Neural Networks

- Computer Science
- 2015

This paper shows that learning longer term patterns in real data, such as in natural language, is perfectly possible using gradient descent, which is achieved by using a slight structural modification of the simple recurrent neural network architecture.

### Learning Longer Memory in Recurrent Neural Networks

- Computer Science
- ICLR
- 2015

This paper shows that learning longer term patterns in real data, such as in natural language, is perfectly possible using gradient descent, by using a slight structural modification of the simple recurrent neural network architecture.

### Short-Term Memory Optimization in Recurrent Neural Networks by Autoencoder-based Initialization

- Computer Science
- ArXiv
- 2020

An initialization schema that pretrains the weights of a recurrent neural network to approximate the linear autoencoder of the input sequences is introduced and it is shown how such pretraining can better support solving hard classification tasks with long sequences.

### Reinforcement learning with recurrent neural networks

- Computer Science
- 2008

RNN can well map and reconstruct (partially observable) Markov decision processes and the resulting inner state of the network can be used as a basis for standard RL algorithms, which forms a novel connection between recurrent neural networks (RNN) and reinforcement learning (RL) techniques.

### Backpropagation-decorrelation: online recurrent learning with O(N) complexity

- Computer Science
- 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541)
- 2004

A new learning rule for fully recurrent neural networks is introduced which combines important principles: one-step backpropagation of errors and the usage of temporal memory in the network dynamics by means of decorrelation of activations.

### Learning from Predictions: Fusing Training and Autoregressive Inference for Long-Term Spatiotemporal Forecasts

- Computer Science
- SSRN Electronic Journal
- 2023

The results show that BPTT-SA effectively reduces iterative error propagation in convolutional RNNs and convolutional autoencoder RNNs, and demonstrate its capabilities in long-term prediction of high-dimensional fluid flows.

### Decoupling Hierarchical Recurrent Neural Networks With Locally Computable Losses

- Computer Science
- ArXiv
- 2019

It is empirically shown that in (deep) HRNNs, propagating gradients back from higher to lower levels can be replaced by locally computable losses, without harming the learning capability of the network, over a wide range of tasks.

## 32 References

### Learning long-term dependencies with gradient descent is difficult

- Computer Science
- IEEE Trans. Neural Networks
- 1994

This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.

### Learning State Space Trajectories in Recurrent Neural Networks

- Computer Science
- Neural Computation
- 1989

A procedure for finding ∂E/∂w_ij, where E is an error functional of the temporal trajectory of the states of a continuous recurrent network and w_ij are the weights of that network, which seems particularly suited for temporally continuous domains.

### Encoding sequential structure: experience with the real-time recurrent learning algorithm

- Computer Science
- International 1989 Joint Conference on Neural Networks
- 1989

It is shown that recurrent nets trained with the RTRL (real-time recurrent learning) algorithm are able to learn tasks that Elman nets appear unable to learn. Moreover, they learn a more stringent…

### Gradient calculations for dynamic recurrent neural networks: a survey

- Computer Science
- IEEE Trans. Neural Networks
- 1995

The author discusses advantages and disadvantages of temporally continuous neural networks in contrast to clocked ones and presents some "tricks of the trade" for training, using, and simulating continuous time and recurrent neural networks.

### Learning Complex, Extended Sequences Using the Principle of History Compression

- Computer Science
- Neural Computation
- 1992

A simple principle for reducing the descriptions of event sequences without loss of information is introduced and this insight leads to the construction of neural architectures that learn to divide and conquer by recursively decomposing sequences.

### Learning long-term dependencies in NARX recurrent neural networks

- Computer Science
- IEEE Trans. Neural Networks
- 1996

It is shown that the long-term dependencies problem is lessened for a class of architectures called nonlinear autoregressive models with exogenous (NARX) recurrent neural networks, which have powerful representational capabilities.

### Credit Assignment through Time: Alternatives to Backpropagation

- Computer Science
- NIPS
- 1993

This work considers and compares alternative algorithms and architectures on tasks for which the span of the input/output dependencies can be controlled and shows performance qualitatively superior to that obtained with backpropagation.

### Long Short-Term Memory

- Computer Science
- Neural Computation
- 1997

A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
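The idea of constant error flow can be sketched with a single linear unit, under strong simplifying assumptions: the snippet below omits gates and nonlinearities entirely, and the function name `error_after` and the comparison weight `0.9` are hypothetical. It only shows why a linear self-connection with weight 1.0 passes an error signal back unchanged, while a weight below 1.0 makes it decay.

```python
def error_after(steps, self_weight):
    """Error signal left after flowing back `steps` steps through a
    linear self-connection c_t = self_weight * c_{t-1} + input_t.

    Simplified illustration (an assumption): no gates, no nonlinearity.
    """
    error = 1.0
    for _ in range(steps):
        # d c_t / d c_{t-1} = self_weight for a linear unit
        error *= self_weight
    return error

print(error_after(1000, 1.0))  # self-weight 1.0: error stays constant
print(error_after(1000, 0.9))  # self-weight below 1.0: error decays towards zero
```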

### LSTM can Solve Hard Long Time Lag Problems

- Computer Science
- NIPS
- 1996

This work shows that problems used to promote various previous algorithms can be solved more quickly by random weight guessing than by the proposed algorithms, and uses LSTM, its own recent algorithm, to solve a hard problem.

### Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks

- Engineering, Computer Science
- IEEE Trans. Neural Networks
- 1994

These simulations suggest that recurrent controller networks trained by Kalman filter methods can combine the traditional features of state-space controllers and observers in a homogeneous architecture for nonlinear dynamical systems, while simultaneously exhibiting less sensitivity than do purely feedforward controller networks to changes in plant parameters and measurement noise.