Learning long-term dependencies with gradient descent is difficult

@article{Bengio1994LearningLD,
  title={Learning long-term dependencies with gradient descent is difficult},
  author={Yoshua Bengio and Patrice Y. Simard and Paolo Frasconi},
  journal={IEEE Transactions on Neural Networks},
  year={1994},
  volume={5},
  number={2},
  pages={157-166}
}
Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production, or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These…
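As a side illustration of the difficulty described above (not taken from the paper), the sketch below follows a gradient back through a simple tanh RNN: the backpropagated signal contains a product of per-step Jacobians, and with a contractive recurrent matrix its norm decays roughly exponentially with the temporal distance. All names and constants here are illustrative.

import numpy as np

# Minimal sketch: h_k = tanh(W h_{k-1} + x_k). The gradient of a quantity at
# time T with respect to the state at an earlier time t contains the product
# prod_k diag(1 - h_k^2) W of per-step Jacobians; its norm shrinks quickly
# when the recurrent weights are contractive, which is the vanishing-gradient
# effect referred to in the abstract.
rng = np.random.default_rng(0)
n, T = 32, 60
W = rng.normal(scale=0.4 / np.sqrt(n), size=(n, n))  # spectral norm well below 1

h = np.zeros(n)
jac_prod = np.eye(n)                          # accumulated product of per-step Jacobians
norms = []
for _ in range(T):
    h = np.tanh(W @ h + rng.normal(size=n))   # random inputs just drive the dynamics
    J = (1 - h ** 2)[:, None] * W             # Jacobian dh_k / dh_{k-1}
    jac_prod = J @ jac_prod
    norms.append(np.linalg.norm(jac_prod, 2))

print([f"{v:.2e}" for v in norms[::10]])      # norms decay toward zero as T - t grows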

Citations

Learning long-term dependencies is not as difficult with NARX networks
TLDR
Although NARX networks do not circumvent the problem of long-term dependencies, they can greatly improve performance on such problems, and experimental results are presented showing that NARX networks can often retain information for two to three times as long as conventional recurrent networks.
Learning Long-Term Dependencies in Segmented Memory Recurrent Neural Networks
TLDR
This paper proposes a novel architecture called the Segmented-Memory Recurrent Neural Network (SMRNN), which is trained with an extended, gradient-based real-time recurrent learning algorithm.
Tree Memory Networks for Sequence Processing
TLDR
A new layer type (the Tree Memory Unit) is introduced whose weight application scales logarithmically in the sequence length, which can lead to more efficient learning on sequences with long-term dependencies.
Time Delay Learning by Gradient Descent in Recurrent Neural Networks
TLDR
It is demonstrated that the principle of time-delay learning by gradient descent, although efficient for feed-forward neural networks and theoretically adaptable to RNNs, proves difficult to use in the recurrent case.
Sampling-Based Gradient Regularization for Capturing Long-Term Dependencies in Recurrent Neural Networks
TLDR
An analytical framework is constructed to estimate the contribution of each training example to the norm of the long-term components of the target function's gradient, and it is used to hold the norm of the gradients in a suitable range for stochastic gradient descent (SGD) training.
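The entry above is about keeping the norm of long-term gradient components in a workable range for SGD. The sketch below is not that paper's sampling-based regularizer; it only shows the standard related practice of global gradient-norm clipping, with hypothetical toy gradients standing in for those of a recurrent model.

import numpy as np

def clip_global_norm(grads, max_norm=1.0):
    # Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

rng = np.random.default_rng(0)
grads = [rng.normal(size=(16, 16)) * 10, rng.normal(size=16) * 10]  # deliberately large
clipped, norm_before = clip_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(round(norm_before, 2), round(norm_after, 2))  # large value -> 1.0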
Short-Term Memory Optimization in Recurrent Neural Networks by Autoencoder-based Initialization
TLDR
An initialization scheme is introduced that pretrains the weights of a recurrent neural network to approximate the linear autoencoder of the input sequences, and it is shown how such pretraining can better support solving hard classification tasks with long sequences.
Longer RNNs
TLDR
This work introduces a new training method and a new recurrent architecture using residual connections, both aimed at increasing the range of dependencies that can be modeled by RNNs.
RECURRENT NEURAL NETWORKS
TLDR
This paper shows that learning longer-term patterns in real data, such as natural language, is possible using gradient descent, achieved by a slight structural modification of the simple recurrent neural network architecture.
Can recurrent neural networks warp time?
TLDR
It is proved that learnable gates in a recurrent model formally provide quasi-invariance to general time transformations in the input data, which leads to a new way of initializing gate biases in LSTMs and GRUs.
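The gate-bias initialization mentioned above can be sketched as follows; this is a hedged reconstruction of the chrono-style initialization associated with that result (exact details may differ), assuming an LSTM whose relevant dependency lengths reach up to a hypothetical T_max.

import numpy as np

# Spread forget-gate biases so that different units retain information over
# characteristic time scales ranging from a few steps up to roughly T_max.
rng = np.random.default_rng(0)
hidden_size, T_max = 8, 100

b_f = np.log(rng.uniform(1.0, T_max - 1.0, size=hidden_size))  # forget-gate biases
b_i = -b_f                                                     # input-gate biases mirror them

forget_at_init = 1.0 / (1.0 + np.exp(-b_f))    # sigmoid(b_f), close to 1 for long scales
time_scales = 1.0 / (1.0 - forget_at_init)     # approximate memory horizon per unit
print(np.round(np.sort(time_scales), 1))       # spread roughly between 2 and T_max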

References

Showing 1-10 of 50 references
The problem of learning long-term dependencies in recurrent networks
TLDR
Results are presented showing that learning long-term dependencies in recurrent neural networks using gradient descent is a very difficult task, and that this difficulty arises when bits of information must be robustly latched with certain attractors.
A Learning Algorithm for Continually Running Fully Recurrent Neural Networks
The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal…
Inserting rules into recurrent neural networks
  • Colin Giles, C. Omlin
  • Computer Science
    Neural Networks for Signal Processing II Proceedings of the 1992 IEEE Workshop
  • 1992
TLDR
Simulations show that training recurrent networks with different amounts of partial knowledge to recognize simple grammars improves the training times by orders of magnitude, even when only a small fraction of all transitions are inserted as rules.
A Focused Backpropagation Algorithm for Temporal Pattern Recognition
  • M. Mozer
  • Computer Science
    Complex Syst.
  • 1989
TLDR
A specialized connectionist architecture and corresponding specialization of the backpropagation learning algorithm that operates efficiently on temporal sequences is introduced, and should scale better than conventional recurrent architectures with respect to sequence length.
Induction of Multiscale Temporal Structure
TLDR
Simulation experiments indicate that, using hidden units that operate with different time constants, slower time-scale hidden units are able to pick up global structure that simply cannot be learned by standard backpropagation.
Unified Integration of Explicit Knowledge and Learning by Example in Recurrent Networks
TLDR
A novel unified approach for integrating explicit knowledge and learning by example in recurrent networks is proposed, which is accomplished by using a technique based on linear programming, instead of learning from random initial weights.
Using random weights to train multilayer networks of hard-limiting units
TLDR
A gradient descent algorithm suitable for training multilayer feedforward networks of processing units with hard-limiting output functions is presented and its performance is similar to that of conventional backpropagation applied to networks of units with sigmoidal characteristics.
A method of training multi-layer networks with heaviside characteristics using internal representations
TLDR
A learning algorithm is presented that uses internal representations, which are continuous random variables, for the training of multilayer networks whose neurons have Heaviside characteristics and does not require 'bit flipping' on the internal representations to reduce output error.
BPS: a learning algorithm for capturing the dynamic nature of speech
A novel backpropagation learning algorithm for a particular class of dynamic neural networks in which some units have local feedback is proposed. Hence these networks can be trained to respond to…
The "Moving Targets" Training Algorithm
TLDR
A simple gradient-based method for training the dynamical behavior of a neural network is presented; the optimization is carried out in the hidden part of state space either instead of, or in addition to, weight space.