Advances in optimizing recurrent networks

@article{Bengio2013AdvancesIO,
  title={Advances in optimizing recurrent networks},
  author={Yoshua Bengio and Nicolas Boulanger-Lewandowski and Razvan Pascanu},
  journal={2013 IEEE International Conference on Acoustics, Speech and Signal Processing},
  year={2013}
}
After a more than decade-long period of relatively little research activity in the area of recurrent neural networks, several new developments will be reviewed here that have allowed substantial progress both in understanding and in technical solutions towards more efficient training of recurrent networks. These advances have been motivated by and related to the optimization issues surrounding deep learning. Although recurrent networks are extremely powerful in what they can in principle… 

Tables from this paper

Learning Multiple Timescales in Recurrent Neural Networks

The results show that partitioning hidden layers under distinct temporal constraints enables the learning of multiple timescales, which contributes to the understanding of the fundamental conditions that allow RNNs to self-organize into accurate temporal abstractions.
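One common way to impose distinct temporal constraints on partitions of a hidden layer is to give each unit its own leaky-integration time constant, so that "slow" units retain state longer than "fast" ones. The sketch below illustrates that idea under assumed names and wiring; it is not the paper's exact model.

```python
import numpy as np

def leaky_rnn_step(h_prev, x, W, U, b, tau):
    """One step of a leaky-integrator RNN in which each hidden unit i
    has its own time constant tau[i]; larger tau means a slower unit
    that integrates information over a longer timescale."""
    h_tilde = np.tanh(W @ x + U @ h_prev + b)  # candidate state
    alpha = 1.0 / tau                          # per-unit update rate
    return (1.0 - alpha) * h_prev + alpha * h_tilde

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((n_hid, n_in)) * 0.1
U = rng.standard_normal((n_hid, n_hid)) * 0.1
b = np.zeros(n_hid)
# Partition the layer: two fast units (tau=1) and two slow units (tau=10).
tau = np.array([1.0, 1.0, 10.0, 10.0])

h = np.zeros(n_hid)
for t in range(20):
    h = leaky_rnn_step(h, rng.standard_normal(n_in), W, U, b, tau)
```

With tau=1 a unit reduces to an ordinary tanh RNN unit; the slow partition changes by only a tenth of the candidate update per step.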

Learning Longer Memory in Recurrent Neural Networks

This paper shows that learning longer term patterns in real data, such as in natural language, is perfectly possible using gradient descent, by using a slight structural modification of the simple recurrent neural network architecture.
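One structural modification in this spirit is to add a slowly changing context state alongside the ordinary hidden state, so long-range information decays gently rather than being overwritten each step. The cell below is an illustrative sketch of that idea; the matrix names, nonlinearity, and wiring are assumptions, not the paper's exact architecture.

```python
import numpy as np

def slow_context_step(h_prev, s_prev, x, A, B, P, R, alpha=0.95):
    """One step of a recurrent cell with a slowly changing context
    state s (the structural modification) next to the fast hidden
    state h. alpha close to 1 makes s a long-timescale summary of
    the input history."""
    s = (1.0 - alpha) * (B @ x) + alpha * s_prev   # slow context state
    h = np.tanh(A @ x + P @ s + R @ h_prev)        # fast hidden state
    return h, s

rng = np.random.default_rng(1)
n_in, n_hid, n_ctx = 3, 5, 2
A = rng.standard_normal((n_hid, n_in)) * 0.1
B = rng.standard_normal((n_ctx, n_in)) * 0.1
P = rng.standard_normal((n_hid, n_ctx)) * 0.1
R = rng.standard_normal((n_hid, n_hid)) * 0.1

h, s = np.zeros(n_hid), np.zeros(n_ctx)
for t in range(10):
    h, s = slow_context_step(h, s, rng.standard_normal(n_in), A, B, P, R)
```

Because the context update is linear with a fixed decay, its gradient path back through time shrinks only by alpha per step, which is what makes longer-term patterns learnable by plain gradient descent.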

On Fast Dropout and its Applicability to Recurrent Networks

This paper analyzes fast dropout, a recent regularization method for generalized linear models and neural networks from a back-propagation inspired perspective and shows that it implements a quadratic form of an adaptive, per-parameter regularizer, which rewards large weights in the light of underfitting, penalizes them for overconfident predictions and vanishes at minima of an unregularized training loss.

Residual Recurrent Neural Networks for Learning Sequential Representations

The results show that reformulating the RNN unit to learn residual functions with reference to the hidden state gives state-of-the-art performance, outperforms LSTM and GRU layers in terms of speed, and supports accuracy competitive with that of the other methods.
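The residual reformulation amounts to having the unit predict a correction to the previous hidden state rather than a replacement for it. A minimal sketch of that update rule, with assumed parameter names and nonlinearity:

```python
import numpy as np

def residual_rnn_step(h_prev, x, W, U, b):
    """Recurrent update reformulated to learn a residual function with
    reference to the hidden state: h_t = h_{t-1} + f(x_t, h_{t-1}).
    A sketch of the idea, not the paper's exact cell."""
    delta = np.tanh(W @ x + U @ h_prev + b)  # learned residual function
    return h_prev + delta                    # identity shortcut
```

The identity shortcut gives the gradient a direct path through time, analogous to residual connections in deep feed-forward networks.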

Recent Advances in Recurrent Neural Networks

A survey on RNNs and several new advances for newcomers and professionals in the field are presented and the research challenges are introduced.

MomentumRNN: Integrating Momentum into Recurrent Neural Networks

It is theoretically proven and numerically demonstrated that MomentumRNNs alleviate the vanishing-gradient issue in training RNNs, and it is shown that other advanced momentum-based optimization methods, such as Adam and Nesterov accelerated gradient with restart, can be easily incorporated into the MomentumRNN framework.
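The core idea is to carry an auxiliary velocity state inside the recurrence, echoing heavy-ball momentum in optimization. The step below is a sketch of that construction; the exact wiring, gating, and parameter names of MomentumRNN differ, so treat this as illustrative only.

```python
import numpy as np

def momentum_rnn_step(h_prev, v_prev, x, U, W, b, mu=0.9, s=1.0):
    """Momentum-augmented recurrent step: a velocity state v
    accumulates the input drive with momentum coefficient mu and
    step size s before it enters the hidden-state update."""
    v = mu * v_prev + s * (W @ x)   # momentum (velocity) state
    h = np.tanh(U @ h_prev + v + b) # hidden state
    return h, v
```

Because v is a geometrically weighted sum of past input drives, information from earlier time steps keeps contributing to the update even when the direct recurrent path has contracted.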

Sampling-Based Gradient Regularization for Capturing Long-Term Dependencies in Recurrent Neural Networks

An analytical framework is constructed to estimate the contribution of each training example to the norm of the long-term components of the target function's gradient, and it is used to hold the norm of the gradients in a range suitable for stochastic gradient descent (SGD) training.

Regularizing Recurrent Networks - On Injected Noise and Norm-based Methods

The remembering and generalization ability of RNNs on polyphonic musical datasets is evaluated, and the evidence leads to the conclusion that training with noise does not improve performance, as had been conjectured by a few earlier works on RNN optimization.

Conditional Computation in Deep and Recurrent Neural Networks

Two cases of conditional computation are explored: in the feed-forward case, a technique is developed that trades off accuracy for potential computational benefits, and in the recurrent case, techniques that yield practical speed benefits on a language modeling task are demonstrated.

Learning Recurrent Neural Networks with Hessian-Free Optimization

This work solves the long-standing problem of how to effectively train recurrent neural networks on complex and difficult sequence modeling problems that may contain long-term data dependencies, and offers a new interpretation of the generalized Gauss-Newton matrix of Schraudolph, which is used within the HF approach of Martens.

Training Recurrent Neural Networks

A new probabilistic sequence model that combines Restricted Boltzmann Machines and RNNs is described, more powerful than similar models while being less difficult to train, and a random parameter initialization scheme is described that allows gradient descent with momentum to train RNNs on problems with long-term dependencies.
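Initialization schemes of this kind typically control the scale of the recurrent weight matrix so that state (and gradient) neither dies out nor blows up at the start of training. One common family is echo-state-style scaling to a target spectral radius, sketched below as an illustration; the exact scheme in the thesis may differ.

```python
import numpy as np

def scale_to_spectral_radius(W, rho=1.1):
    """Rescale a random recurrent weight matrix so its spectral radius
    (largest eigenvalue magnitude) equals rho. Values of rho near 1
    keep recurrent dynamics near the edge of stability at init."""
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (rho / radius)

rng = np.random.default_rng(0)
W_rec = scale_to_spectral_radius(rng.standard_normal((100, 100)))
```

Since eigenvalues scale linearly with the matrix, one division is enough to hit the target radius exactly.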

Learning long-term dependencies with gradient descent is difficult

This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.
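The trade-off can be seen numerically: the gradient that flows back T steps involves a product of T recurrent Jacobians, and when the recurrent matrix is contractive (spectral norm below 1, which is what lets the network latch onto information robustly) that product vanishes geometrically. A small demonstration, ignoring the nonlinearity's Jacobian for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Gaussian recurrent matrix scaled so its spectral norm is well below 1.
W = rng.standard_normal((n, n)) * (0.3 / np.sqrt(n))

# Accumulate the product of Jacobians over t steps and track its norm.
J, norms = np.eye(n), []
for t in range(30):
    J = J @ W
    norms.append(np.linalg.norm(J, 2))
```

The norms shrink roughly like the spectral norm raised to the number of steps, so the long-term component of the gradient becomes negligible next to the short-term one.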

Greedy Layer-Wise Training of Deep Networks

These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

Context dependent recurrent neural network language model

This paper improves recurrent neural network language model performance by providing, with each word, a contextual real-valued input vector that conveys information about the sentence being modeled; the vector is obtained by performing Latent Dirichlet Allocation on a block of preceding text.

Understanding the exploding gradient problem

The analysis is used to justify the simple yet effective solution of clipping the norm of the exploded gradient, and the comparison between this heuristic solution and standard SGD provides empirical evidence for the hypothesis that such a heuristic is required to reach state-of-the-art results on a character prediction task and a polyphonic music prediction task.
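Gradient norm clipping is a one-line heuristic: if the gradient's norm exceeds a threshold, rescale it to that threshold while preserving its direction. A minimal NumPy sketch:

```python
import numpy as np

def clip_gradient_norm(grad, threshold):
    """Rescale grad so its L2 norm does not exceed threshold,
    preserving its direction (the norm-clipping heuristic for
    exploding gradients)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```

In practice the threshold is a hyperparameter (often set near the average gradient norm observed early in training), and the clipping is applied to the concatenation of all parameter gradients at each update.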

Hierarchical Recurrent Neural Networks for Long-Term Dependencies

This paper proposes to use a more general type of a-priori knowledge, namely that the temporal dependencies are structured hierarchically, which implies that long-term dependencies are represented by variables with a long time scale.

Why Does Unsupervised Pre-training Help Deep Learning?

The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.

Learning Deep Architectures for AI

The motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks, are discussed.