• Corpus ID: 52815952

Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function

@article{Tarnowski2019DynamicalII,
  title={Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function},
  author={Wojciech Tarnowski and Piotr Warchol and Stanislaw Jastrzebski and Jacek Tabor and Maciej A. Nowak},
  journal={ArXiv},
  year={2019},
  volume={abs/1809.08848}
}
We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do that by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the large network width and depth limit. The resulting singular value spectrum depends on a single parameter, which we calculate for a variety of popular activation functions, by… 
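The claim lends itself to a quick numerical check. The sketch below is my own illustration, not the paper's construction: it assumes residual blocks of the form x -> x + a * W2 @ phi(W1 @ x) with Gaussian 1/sqrt(width) weights and a small branch scale a (choices of width, depth, and scaling are assumptions), accumulates the input-output Jacobian by the chain rule, and inspects its singular values. A spectrum concentrated near 1 is the signature of dynamical isometry, and swapping in another activation and its derivative probes the "any activation function" claim empirically.

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def relu_prime(x):
    return (x > 0.0).astype(x.dtype)

def resnet_jacobian(width=200, depth=50, branch_scale=0.1,
                    phi=relu, phi_prime=relu_prime):
    # One residual block: x -> x + branch_scale * W2 @ phi(W1 @ x)
    # Its Jacobian:        I + branch_scale * W2 @ diag(phi'(W1 @ x)) @ W1
    x = rng.standard_normal(width)
    jac = np.eye(width)
    for _ in range(depth):
        W1 = rng.standard_normal((width, width)) / np.sqrt(width)
        W2 = rng.standard_normal((width, width)) / np.sqrt(width)
        h = W1 @ x
        # W2 * phi_prime(h) scales the columns of W2, i.e. W2 @ diag(phi'(h))
        block_jac = np.eye(width) + branch_scale * (W2 * phi_prime(h)) @ W1
        x = x + branch_scale * (W2 @ phi(h))
        jac = block_jac @ jac  # chain rule across blocks
    return jac

svals = np.linalg.svd(resnet_jacobian(), compute_uv=False)
print(f"singular values: min={svals.min():.2f} mean={svals.mean():.2f} max={svals.max():.2f}")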

Figures from this paper

The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry
TLDR
Experimental results show that the dependence of the Fisher information matrix (FIM) on depth determines the appropriate learning rate for convergence in the initial phase of online training of DNNs.
Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks
TLDR
The results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization
TLDR
This work studies the Neural Tangent Kernel (NTK), which describes the gradient-descent training dynamics of wide networks, and proves that the NTKs of networks with Gaussian and orthogonal weight initializations coincide when the network width is infinite, concluding that the training speed-up from orthogonal initialization is a finite-width effect in the small-learning-rate regime.
A Comprehensive and Modularized Statistical Framework for Gradient Norm Equality in Deep Neural Networks
TLDR
A novel metric called Block Dynamical Isometry, which measures the change of the gradient norm in individual blocks, is proposed and shown to be the universal principle behind existing techniques, together with a novel normalization technique named second moment normalization, which has 30 percent less computational overhead than batch normalization without accuracy loss and performs better at micro batch sizes.
Spectrum Concentration in Deep Residual Learning: A Free Probability Approach
TLDR
It is empirically demonstrated that the proposed initialization scheme learns orders of magnitude faster than the classical ones, attesting to the strong practical relevance of this investigation.
Residual Networks as Nonlinear Systems: Stability Analysis using Linearization
TLDR
It is found that there is a dramatic jump in the magnitude of adversarial perturbations towards the end of the final stage of the network that is not present in the case of random perturbations.
On the Relationship Between Topology and Gradient Propagation in Deep Networks
  • Computer Science
  • 2020
TLDR
This paper establishes a theoretical link between NN-Mass, a topological property of neural architectures, and gradient flow characteristics, and shows that NN-Mass can identify models of similar accuracy despite significantly different size and compute requirements.
On Random Matrices Arising in Deep Neural Networks: General I.I.D. Case
  • L. Pastur, V. Slavin
  • Computer Science, Mathematics
    Random Matrices: Theory and Applications
  • 2022
TLDR
This paper generalizes the results of [22] to the case where the entries of the synaptic weight matrices are merely independent identically distributed random variables with zero mean and finite fourth moment, and extends the property of so-called macroscopic universality to the random matrices considered.
Advancing Deep Residual Learning by Solving the Crux of Degradation in Spiking Neural Networks
TLDR
This paper identifies the crux of the degradation problem and proposes a novel residual block for SNNs, which is able to significantly extend the depth of directly trained SNNs, e.g., up to 482 layers on CIFAR-10 and 104 layers on ImageNet, without observing any slight degradation.
...

References

SHOWING 1-10 OF 54 REFERENCES
Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
TLDR
This work uses powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian, and reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.
Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks
TLDR
This work demonstrates that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme, and presents an algorithm for generating such random initial orthogonal convolution kernels.
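As a loose illustration of orthogonal initialization in general (a dense-layer analogue of my own, not the delta-orthogonal convolution-kernel algorithm presented in that paper), a Haar-random orthogonal weight matrix, whose singular values are all exactly 1, can be sampled via the QR decomposition of a Gaussian matrix:

import numpy as np

def orthogonal_init(n, rng=None):
    # Haar-random orthogonal matrix via QR of a square Gaussian matrix.
    rng = np.random.default_rng() if rng is None else rng
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Sign correction so the result is uniform over the orthogonal group.
    return q * np.sign(np.diag(r))

W = orthogonal_init(256, np.random.default_rng(0))
print(np.allclose(W.T @ W, np.eye(256)))  # True: every singular value of W is 1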
Mean Field Residual Networks: On the Edge of Chaos
TLDR
It is shown, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth.
The Emergence of Spectral Universality in Deep Networks
TLDR
This work uses powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth.
Spectrum Concentration in Deep Residual Learning: A Free Probability Approach
TLDR
It is empirically demonstrated that the proposed initialization scheme learns orders of magnitude faster than the classical ones, attesting to the strong practical relevance of this investigation.
The Loss Surfaces of Multilayer Networks
TLDR
It is proved that recovering the global minimum becomes harder as the network size increases, and that this is in practice irrelevant, as the global minimum often leads to overfitting.
Geometry of Neural Network Loss Surfaces via Random Matrix Theory
TLDR
An analytical framework and a set of tools from random matrix theory that allow us to compute an approximation of the distribution of eigenvalues of the Hessian matrix at critical points of varying energy are introduced.
Nonlinear random matrix theory for deep learning
TLDR
This work demonstrates that the pointwise nonlinearities typically applied in neural networks can be incorporated into a standard method of proof in random matrix theory known as the moments method, and identifies an intriguing new class of activation functions with favorable properties.
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
TLDR
It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.
Understanding the difficulty of training deep feedforward neural networks
TLDR
The objective is to understand why standard gradient descent from random initialization performs so poorly with deep neural networks, to better understand recent relative successes, and to help design better algorithms in the future.
...