Corpus ID: 16074905

Understanding the exploding gradient problem

@article{Pascanu2012UnderstandingTE,
  title={Understanding the exploding gradient problem},
  author={Razvan Pascanu and Tomas Mikolov and Yoshua Bengio},
  journal={ArXiv},
  year={2012},
  volume={abs/1211.5063}
}
Training recurrent neural networks is more troublesome than training feedforward ones because of the vanishing and exploding gradient problems detailed in Bengio et al. (1994). Key Result: In the experimental section, the comparison between this heuristic solution and standard SGD provides empirical evidence for our hypothesis and shows that such a heuristic is required to reach state-of-the-art results on a character-level prediction task and a polyphonic music prediction task.
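
To make the cited problem concrete: for a simple recurrent network with hidden state h_t = σ(W_rec h_{t-1} + W_in x_t), the error signal propagated back from step t to step k is multiplied by one Jacobian per time step. A standard sketch of the argument, in generic notation rather than the paper's exact symbols, is:

```latex
\frac{\partial h_t}{\partial h_k}
  = \prod_{i=k+1}^{t} \operatorname{diag}\bigl(\sigma'(z_i)\bigr)\, W_{rec},
\qquad
\left\lVert \frac{\partial h_t}{\partial h_k} \right\rVert
  \le \bigl(\gamma\, \lVert W_{rec} \rVert\bigr)^{t-k},
\qquad \gamma = \sup_z \lvert \sigma'(z) \rvert,
```

where z_i = W_rec h_{i-1} + W_in x_i. If γ‖W_rec‖ < 1 this factor shrinks exponentially in t - k (vanishing gradients); if it can exceed 1, the factor can grow exponentially (exploding gradients).

The heuristic solution compared against standard SGD is gradient norm clipping: when the norm of the gradient exceeds a threshold, rescale the gradient to that threshold before the update. A minimal NumPy sketch of that rule follows; the threshold value and toy gradient shapes are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def clip_gradient_norm(grads, threshold):
    """Rescale a list of gradient arrays so their global L2 norm
    does not exceed `threshold` (gradient norm clipping)."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > threshold:
        scale = threshold / global_norm
        grads = [g * scale for g in grads]
    return grads

# Illustrative usage with made-up gradients and a made-up threshold.
grads = [np.random.randn(4, 4) * 100.0, np.random.randn(4) * 100.0]
clipped = clip_gradient_norm(grads, threshold=1.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # <= 1.0
```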

Citations

On the vanishing and exploding gradient problem in Gated Recurrent Units
AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks
  • Computer Science
  • 2018
TLDR
This paper draws connections between recurrent networks and ordinary differential equations and proposes a special form of recurrent network called AntisymmetricRNN, able to capture long-term dependencies thanks to the stability property of its underlying differential equation.
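Roughly, that construction parameterizes the recurrent weight matrix through an antisymmetric matrix (W = M - Mᵀ, whose eigenvalues are purely imaginary) and takes small Euler steps of the resulting ODE, which keeps the hidden-state dynamics stable. The sketch below shows one such step; the layer sizes, step size, and diffusion constant are illustrative assumptions, not settings from that paper.

```python
import numpy as np

def antisymmetric_rnn_step(h, x, M, V, b, eps=0.01, gamma=0.01):
    """One Euler step of an antisymmetric recurrent update (sketch).

    The antisymmetric part is M - M.T; a small gamma * I diffusion
    term is subtracted for numerical stability, and eps is the step
    size used to discretize the underlying ODE."""
    W = (M - M.T) - gamma * np.eye(M.shape[0])
    return h + eps * np.tanh(W @ h + V @ x + b)

# Illustrative shapes (assumptions for the example only).
rng = np.random.default_rng(0)
hidden_size, input_size = 8, 3
h = np.zeros(hidden_size)
M = rng.normal(size=(hidden_size, hidden_size))
V = rng.normal(size=(hidden_size, input_size))
b = np.zeros(hidden_size)
h = antisymmetric_rnn_step(h, rng.normal(size=input_size), M, V, b)
```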
uRNN: An Approach to Bounded Gradients
TLDR
This work proves certain additional bounds related to gradients norms and implements various forms of unitary recurrent networks, which exhibit signs of “long-term memory” on both a concocted pathological task and language modelling.
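For intuition on the bounded-gradient claim: a unitary (here, real orthogonal) recurrent matrix preserves the L2 norm of any vector, so the linear part of each backward Jacobian neither shrinks nor amplifies the error signal. A tiny numerical check under assumed sizes, not code from that work:

```python
import numpy as np

rng = np.random.default_rng(0)
# QR factorization of a random square matrix yields an orthogonal W.
W, _ = np.linalg.qr(rng.normal(size=(8, 8)))
v = rng.normal(size=8)
# Norm preservation: ||W v|| equals ||v|| up to floating-point error.
print(np.linalg.norm(W @ v), np.linalg.norm(v))
```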
Advances in optimizing recurrent networks
TLDR
Experiments reported here evaluate the use of clipping gradients, spanning longer time ranges with leaky integration, advanced momentum techniques, using more powerful output probability models, and encouraging sparser gradients to help symmetry breaking and credit assignment.
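As one concrete illustration of the leaky-integration idea in that list: each hidden unit keeps an exponential moving average of its own activation, so information decays slowly instead of being overwritten at every step, which helps span longer time ranges. The update below is a generic leaky-integrator sketch with an assumed leak rate, not the exact formulation from that paper.

```python
import numpy as np

def leaky_rnn_step(h, x, W, U, b, alpha=0.1):
    """Leaky-integration recurrent update (sketch): with alpha close
    to 0, the hidden state changes slowly, so past information is
    retained over longer time ranges."""
    candidate = np.tanh(W @ h + U @ x + b)
    return (1.0 - alpha) * h + alpha * candidate
```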
Chapter 2 Properties and Training in Recurrent Neural Networks
  • Computer Science
  • 2020
TLDR
This chapter describes the basic concepts behind the functioning of recurrent neural networks and explains the general properties that are common to several existing architectures, and discusses several ways of regularizing the system, highlighting their advantages and drawbacks.
Target Propagation in Recurrent Neural Networks
TLDR
This paper presents a novel algorithm for training recurrent networks, target propagation through time (TPTT), that outperforms standard backpropagation through time (BPTT) on four out of the five problems used for testing.
Counterfactual Learning of Recurrent Neural Networks
  • Computer Science
  • 2020
TLDR
Current techniques (Truncated Backpropagation Through Time) are inherently limited to learning short-term time dependencies, which makes them impractical for very long sequences.
Initialization of Weights in Neural Networks
TLDR
This paper discusses different approaches to weight initialization and compares their results on a few datasets to identify the technique that achieves higher accuracy in less training time.
Sparsity through evolutionary pruning prevents neuronal networks from overfitting
...
...

References

Showing 1-10 of 23 references
The problem of learning long-term dependencies in recurrent networks
TLDR
Results are presented showing that learning long-term dependencies in recurrent neural networks using gradient descent is a very difficult task, and showing how this difficulty arises when robustly latching bits of information with certain attractors.
Learning long-term dependencies with gradient descent is difficult
TLDR
This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.
New results on recurrent network training: unifying the algorithms and accelerating convergence
TLDR
An on-line version of the proposed algorithm, which is based on approximating the error gradient, has lower computational complexity in computing the weight update than the competing techniques for most typical problems and reaches the error minimum in a much smaller number of iterations.
Learning representations by back-propagating errors
TLDR
Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.
Generating Text with Recurrent Neural Networks
TLDR
The power of RNNs trained with the new Hessian-Free optimizer by applying them to character-level language modeling tasks is demonstrated, and a new RNN variant that uses multiplicative connections which allow the current input character to determine the transition matrix from one hidden state vector to the next is introduced.
A Learning Algorithm for Continually Running Fully Recurrent Neural Networks
The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
TLDR
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
Bifurcations of Recurrent Neural Networks in Gradient Descent Learning
TLDR
Some of the factors underlying successful training of recurrent networks are investigated, such as choice of initial connections, choice of input patterns, teacher forcing, and truncated learning equations.
A neurodynamical model for working memory
Reservoir computing approaches to recurrent neural network training
...
...