Corpus ID: 17272965

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

Andrew M. Saxe, James L. McClelland, Surya Ganguli
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. […] We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial…
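The setting the abstract describes can be illustrated in a few lines. Below is a minimal sketch (my own illustration under simplifying assumptions, not the authors' code): with whitened inputs, gradient descent on a two-layer linear network y = W2 W1 x reduces to minimizing ||T − W2 W1||² for a target map T, and each singular mode of T is learned along its own sigmoidal trajectory.

```python
import numpy as np

# Hedged sketch of learning dynamics in a deep linear network.
# Assumption: whitened inputs, so the loss is 0.5 * ||T - W2 @ W1||_F^2.
rng = np.random.default_rng(0)
d = 4
T = np.diag([3.0, 2.0, 1.0, 0.5])        # target map with known singular values
W1 = 0.01 * rng.standard_normal((d, d))  # small random initialization
W2 = 0.01 * rng.standard_normal((d, d))
lr = 0.05
for _ in range(4000):
    E = T - W2 @ W1                      # error in the composite map
    W1 += lr * W2.T @ E                  # gradient descent on 0.5 * ||E||^2
    W2 += lr * E @ W1.T
# The composite map recovers the target's singular values, with the strong
# modes (3, 2) learned earlier than the weak ones (1, 0.5) during training.
print(np.round(np.linalg.svd(W2 @ W1, compute_uv=False), 2))
```

Tracking the singular values of W2 @ W1 over iterations shows the staggered, sigmoidal mode-by-mode learning curves that the paper solves for exactly.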


Exact natural gradient in deep linear networks and its application to the nonlinear case

This work derives an exact expression for the natural gradient in deep linear networks, which exhibit pathological curvature similar to the nonlinear case, and provides for the first time an analytical solution for its convergence rate.

On the information bottleneck theory of deep learning

This work studies the information bottleneck (IB) theory of deep learning, and finds that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa.

Step Size Matters in Deep Learning

The relationship between the step size of the algorithm and the solutions that can be obtained with this algorithm is shown, providing an explanation for several phenomena observed in practice, including the deterioration in the training error with increased depth, the hardness of estimating linear mappings with large singular values, and the distinct performance of deep residual networks.


An analysis of the dynamics of training deep neural networks under a generalized family of natural gradient methods that apply curvature corrections, together with precise analytical solutions, reveals that curvature-corrected update rules preserve many features of gradient descent: the learning trajectory of each singular mode under natural gradient descent follows precisely the same path as under gradient descent.


The results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
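The "dynamical isometry" initialization principle mentioned here can be sketched numerically (an illustration under my own assumptions, not the cited authors' code): a deep product of orthogonal layers keeps every singular value of the end-to-end map exactly 1, while a product of variance-preserving Gaussian layers spreads the spectrum over many orders of magnitude as depth grows.

```python
import numpy as np

# Hedged sketch: compare the singular spectrum of a deep product of
# orthogonal layers vs. scaled Gaussian layers.
rng = np.random.default_rng(0)
d, depth = 64, 20

def product_svals(make_layer):
    J = np.eye(d)
    for _ in range(depth):
        J = make_layer() @ J
    return np.linalg.svd(J, compute_uv=False)

orth = lambda: np.linalg.qr(rng.standard_normal((d, d)))[0]   # random orthogonal
gauss = lambda: rng.standard_normal((d, d)) / np.sqrt(d)      # variance-preserving scale

print(product_svals(orth)[[0, -1]])   # max and min singular values stay at 1
print(product_svals(gauss)[[0, -1]])  # spectrum spreads badly with depth
```

The orthogonal product is perfectly conditioned at any depth, which is the property a "good initialization" preserves for gradient propagation.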

An analytic theory of generalization dynamics and transfer learning in deep linear networks

An analytic theory of the nonlinear dynamics of generalization in deep linear networks, both within and across tasks, is developed; it reveals that knowledge transfer depends sensitively, but computably, on the SNRs and input feature alignments of pairs of tasks.

On the generalization of learning algorithms that do not converge

This work proposes a notion of statistical algorithmic stability (SAS) that extends classical algorithmic stability to non-convergent algorithms, studies its connection to generalization, and proves that the stability of a learning algorithm's time-asymptotic behavior relates to its generalization.

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

This work uses powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian, and reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.
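The object analyzed here, the input-output Jacobian, is straightforward to compute directly for a small network. The sketch below is my own illustration (not the paper's free-probability machinery): for a deep tanh network the Jacobian is the product D_L W_L … D_1 W_1, where each D_l is diagonal with entries tanh'(z_l) at that layer's pre-activations.

```python
import numpy as np

# Hedged sketch: accumulate the input-output Jacobian of a deep tanh
# network by the chain rule, using orthogonal weight matrices.
rng = np.random.default_rng(1)
d, depth = 32, 10
Ws = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(depth)]

def forward(x):
    h = x
    for W in Ws:
        h = np.tanh(W @ h)
    return h

x = rng.standard_normal(d)
J = np.eye(d)
h = x
for W in Ws:
    z = W @ h
    J = np.diag(1.0 - np.tanh(z) ** 2) @ W @ J  # chain rule factor: D_l @ W_l
    h = np.tanh(z)
print(np.linalg.svd(J, compute_uv=False)[:3])   # leading Jacobian singular values
```

The paper's point is that the whole distribution of these singular values, not just their mean square, governs how faithfully error signals propagate.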

Statistical Mechanics of Deep Linear Neural Networks: The Back-Propagating Renormalization Group

This work is the first exact statistical mechanical study of learning in a family of Deep Neural Networks, and the first development of the Renormalization Group approach to the weight space of these systems.



Understanding the difficulty of training deep feedforward neural networks

The objective here is to better understand why standard gradient descent from random initialization performs so poorly with deep neural networks, in order to explain these recent relative successes and help design better algorithms in the future.

Greedy Layer-Wise Training of Deep Networks

These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

On the importance of initialization and momentum in deep learning

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
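The two ingredients this summary names, classical momentum and a slowly increasing momentum schedule, can be sketched on a toy problem. The staged ramp below is my own assumption in the spirit of the summary, not the authors' exact recipe, and an ill-conditioned quadratic stands in for a deep objective.

```python
import numpy as np

# Hedged sketch: classical momentum with a momentum parameter mu that
# ramps slowly toward 0.99, on the quadratic loss 0.5 * w^T A w.
A = np.diag([10.0, 1.0, 0.1])        # curvatures spanning two orders of magnitude
w = np.array([1.0, 1.0, 1.0])
v = np.zeros(3)
lr = 0.05
for t in range(2000):
    mu = min(0.99, 1.0 - 1.0 / (2 * (t // 50 + 1)))  # staged ramp toward 0.99
    g = A @ w                        # gradient of the quadratic
    v = mu * v - lr * g              # classical momentum update
    w = w + v
print(np.linalg.norm(w))             # distance to the minimum at w = 0
```

Starting with modest momentum and increasing it avoids early instability while still accelerating the poorly conditioned directions late in training.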

Learning hierarchical category structure in deep neural networks

This work considers training a neural network on data generated by a hierarchically structured probabilistic generative process, and finds solutions to the dynamics of error-correcting learning in linear three layer neural networks.

The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training

The experiments confirm and clarify the advantage of unsupervised pre-training, and empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.

Why Does Unsupervised Pre-training Help Deep Learning?

The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.

Neural Networks: Tricks of the Trade

It is shown how nonlinear semi-supervised embedding algorithms popular for use with "shallow" learning techniques such as kernel methods can be easily applied to deep multi-layer architectures.

Neural networks and principal component analysis: Learning from examples without local minima
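The classic result of this paper can be sketched numerically (my own illustration, with a hand-built covariance for reproducibility): a linear autoencoder x̂ = B A x trained by gradient descent, despite its nonconvex loss, ends at the PCA solution, so the columns of B span the data's top principal subspace.

```python
import numpy as np

# Hedged sketch of the linear-autoencoder / PCA connection. We descend the
# loss 0.5 * E[||x - B A x||^2] written in terms of the covariance C.
rng = np.random.default_rng(0)
d, k = 6, 2
U0 = np.linalg.qr(rng.standard_normal((d, d)))[0]
C = U0 @ np.diag([3.0, 2.0, 0.1, 0.05, 0.02, 0.01]) @ U0.T  # known covariance
A = 0.1 * rng.standard_normal((k, d))
B = 0.1 * rng.standard_normal((d, k))
lr = 0.05
for _ in range(5000):
    E = C - B @ (A @ C)              # residual of the covariance reconstruction
    A += lr * B.T @ E                # gradient step for A
    B += lr * E @ A.T                # gradient step for B
P = B @ np.linalg.pinv(B)            # projector onto the learned subspace
print(np.allclose(P, U0[:, :k] @ U0[:, :k].T, atol=1e-3))
```

The learned subspace matches the span of the top-k eigenvectors of C, consistent with the paper's claim that the only local minima are the global PCA solutions.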

Learning Deep Architectures for AI

The motivations and principles of learning algorithms for deep architectures are discussed, in particular those exploiting unsupervised learning of single-layer models, such as Restricted Boltzmann Machines, as building blocks for constructing deeper models such as Deep Belief Networks.

Effect of Batch Learning in Multilayer Neural Networks

An experimental study of multilayer perceptrons and linear neural networks (LNN) shows that batch learning induces strong overtraining in both models in overrealizable cases, which means the degradation of generalization error caused by surplus units can be alleviated.