# Learning long-term dependencies with gradient descent is difficult

@article{Bengio1994LearningLD, title={Learning long-term dependencies with gradient descent is difficult}, author={Yoshua Bengio and Patrice Y. Simard and Paolo Frasconi}, journal={IEEE Transactions on Neural Networks}, year={1994}, volume={5}, number={2}, pages={157-166} }

Recurrent neural networks can be used to map input sequences to output sequences, such as in recognition, production, or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These…
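The abstract's argument can be illustrated with a small numerical sketch (not from the paper itself): when the gradient is backpropagated through a recurrent network whose Jacobian is contractive, its norm shrinks exponentially with the number of time steps, so contributions from distant inputs become negligible. Here the linear part of the Jacobian product is modeled by repeated multiplication with `W.T`, where `W` has spectral norm 0.9; all names and values are illustrative.

```python
import numpy as np

# Sketch of the vanishing-gradient effect: with a recurrent matrix whose
# largest singular value is below 1 (and a squashing nonlinearity whose
# derivative is at most 1), the backpropagated gradient norm decays
# geometrically in the number of time steps.
rng = np.random.default_rng(0)
n = 20
W = rng.standard_normal((n, n))
W *= 0.9 / np.linalg.norm(W, 2)   # rescale so the spectral norm is 0.9

grad = rng.standard_normal(n)     # dL/dh_T at the final time step
norms = []
for t in range(100):              # backpropagate through 100 time steps
    grad = W.T @ grad             # linear part of the recurrent Jacobian
    norms.append(np.linalg.norm(grad))

# After 100 steps the gradient norm is tiny relative to the first step,
# so the long-term components of the gradient are effectively invisible.
print(norms[0], norms[-1])
```

The ratio `norms[-1] / norms[0]` is bounded by roughly 0.9^99, which is the quantitative sense in which gradient descent "faces an increasingly difficult problem as the duration of the dependencies increases".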

## 6,240 Citations

Learning long-term dependencies is not as difficult with NARX networks

- Computer Science, NIPS
- 1995

Although NARX networks do not circumvent the problem of long-term dependencies, they can greatly improve performance on such problems; experimental results are presented showing that NARX networks can often retain information for two to three times as long as conventional recurrent networks.

Learning Long-Term Dependencies in Segmented Memory Recurrent Neural Networks

- Computer Science, ISNN
- 2004

This paper proposes a novel architecture called the Segmented-Memory Recurrent Neural Network (SMRNN), which is trained using an extended, gradient-based real-time recurrent learning algorithm.

Tree Memory Networks for Sequence Processing

- Computer Science, ICANN
- 2019

A new layer type (the Tree Memory Unit), whose weight application scales logarithmically in the sequence length, is introduced, which can lead to more efficient learning on sequences with long-term dependencies.

Learning long-term dependencies by the selective addition of time-delayed connections to recurrent neural networks

- Computer Science, Neurocomputing
- 2002

Time Delay Learning by Gradient Descent in Recurrent Neural Networks

- Computer Science, ICANN
- 2005

It is demonstrated that the principle of time delay learning by gradient descent, although efficient for feed-forward neural networks and theoretically adaptable to RNNs, is difficult to apply in the latter case.

Sampling-Based Gradient Regularization for Capturing Long-Term Dependencies in Recurrent Neural Networks

- Computer Science, ICONIP
- 2016

An analytical framework is constructed to estimate the contribution of each training example to the norm of the long-term components of the target function's gradient, and it is used to hold the norm of the gradients within a range suitable for stochastic gradient descent (SGD) training.

Short-Term Memory Optimization in Recurrent Neural Networks by Autoencoder-based Initialization

- Computer Science, arXiv
- 2020

An initialization scheme is introduced that pretrains the weights of a recurrent neural network to approximate the linear autoencoder of the input sequences, and it is shown how such pretraining can better support solving hard classification tasks with long sequences.

Longer RNNs

- Computer Science
- 2016

This work introduces a new training method and a new recurrent architecture using residual connections, both aimed at increasing the range of dependencies that can be modeled by RNNs.

Recurrent Neural Networks

- Computer Science
- 2015

This paper shows that learning longer term patterns in real data, such as in natural language, is perfectly possible using gradient descent, which is achieved by using a slight structural modification of the simple recurrent neural network architecture.

Can recurrent neural networks warp time?

- Computer Science, ICLR
- 2018

It is proved that learnable gates in a recurrent model formally provide quasi-invariance to general time transformations in the input data, which leads to a new way of initializing gate biases in LSTMs and GRUs.
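The gate-bias initialization that this time-warping analysis leads to ("chrono initialization" in that paper) can be sketched as follows: forget-gate biases are drawn as log(U(1, T_max - 1)) so that the LSTM's characteristic forgetting times are spread out up to a maximum dependency length T_max, and input-gate biases are set to the negative of the forget-gate biases. The `hidden_size` and `t_max` values below are illustrative.

```python
import numpy as np

# Sketch of chrono initialization for LSTM gate biases.
rng = np.random.default_rng(0)
hidden_size, t_max = 8, 100   # t_max ~ longest dependency to be captured

b_forget = np.log(rng.uniform(1.0, t_max - 1.0, size=hidden_size))
b_input = -b_forget

# sigmoid(b_forget) is close to 1 for the larger draws, so the matching
# memory cells decay slowly, i.e. they start with long time constants.
print(np.round(1.0 / (1.0 + np.exp(-b_forget)), 3))
```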

## References

Showing 1-10 of 50 references

The problem of learning long-term dependencies in recurrent networks

- Computer Science, IEEE International Conference on Neural Networks
- 1993

Results are presented showing that learning long-term dependencies in recurrent neural networks using gradient descent is a very difficult task, and how this difficulty arises when robustly latching bits of information with certain attractors.

A Learning Algorithm for Continually Running Fully Recurrent Neural Networks

- Computer Science, Neural Computation
- 1989

The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal…

Inserting rules into recurrent neural networks

- Computer Science, Neural Networks for Signal Processing II: Proceedings of the 1992 IEEE Workshop
- 1992

Simulations show that training recurrent networks with different amounts of partial knowledge to recognize simple grammars improves training times by orders of magnitude, even when only a small fraction of all transitions are inserted as rules.

A Focused Backpropagation Algorithm for Temporal Pattern Recognition

- Computer Science, Complex Syst.
- 1989

A specialized connectionist architecture and a corresponding specialization of the backpropagation learning algorithm that operates efficiently on temporal sequences are introduced; the approach should scale better than conventional recurrent architectures with respect to sequence length.

Induction of Multiscale Temporal Structure

- Computer Science, NIPS
- 1991

Simulation experiments indicate that, using hidden units that operate with different time constants, slower time-scale hidden units are able to pick up global structure that simply cannot be learned by standard backpropagation.
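A slow time-scale hidden unit of the kind summarized above can be sketched as a leaky integrator (a simplification, not necessarily the paper's exact formulation): h_t = (1 - 1/tau) h_{t-1} + (1/tau) x_t. A large time constant `tau` makes the unit average over long spans, so it tracks global structure, while tau = 1 recovers an ordinary fast unit.

```python
import numpy as np

# Leaky-integrator sketch of a hidden unit with time constant tau.
def leaky_unit(xs, tau):
    h, out = 0.0, []
    for x in xs:
        h = (1.0 - 1.0 / tau) * h + (1.0 / tau) * x
        out.append(h)
    return np.array(out)

xs = np.sign(np.sin(np.linspace(0, 20 * np.pi, 1000)))  # fast square wave
fast = leaky_unit(xs, tau=1.0)    # tau = 1: follows the input exactly
slow = leaky_unit(xs, tau=200.0)  # tau = 200: smooths toward the long-run mean
print(fast.std(), slow.std())
```

The slow unit's output varies far less than the fast unit's, which is the sense in which units with long time constants respond to the signal's global rather than local structure.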

Unified Integration of Explicit Knowledge and Learning by Example in Recurrent Networks

- Computer Science, IEEE Trans. Knowl. Data Eng.
- 1995

A novel unified approach for integrating explicit knowledge and learning by example in recurrent networks is proposed, which is accomplished by using a technique based on linear programming, instead of learning from random initial weights.

Using random weights to train multilayer networks of hard-limiting units

- Computer Science, IEEE Trans. Neural Networks
- 1992

A gradient descent algorithm suitable for training multilayer feedforward networks of processing units with hard-limiting output functions is presented and its performance is similar to that of conventional backpropagation applied to networks of units with sigmoidal characteristics.

A method of training multi-layer networks with heaviside characteristics using internal representations

- Computer Science, IEEE International Conference on Neural Networks
- 1993

A learning algorithm is presented that uses internal representations, which are continuous random variables, for the training of multilayer networks whose neurons have Heaviside characteristics and does not require 'bit flipping' on the internal representations to reduce output error.

BPS: a learning algorithm for capturing the dynamic nature of speech

- Computer Science, 1989 International Joint Conference on Neural Networks
- 1989

A novel backpropagation learning algorithm for a particular class of dynamic neural networks in which some units have a local feedback is proposed. Hence these networks can be trained to respond to…

The "Moving Targets" Training Algorithm

- Computer Science, NIPS
- 1989

A simple method for training the dynamical behavior of a neural network with a gradient-based procedure is presented; the optimization is carried out in the hidden part of state space either instead of, or in addition to, weight space.