# Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function

@article{Tarnowski2019DynamicalII,
  title   = {Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function},
  author  = {Wojciech Tarnowski and Piotr Warcho{\l} and Stanis{\l}aw Jastrz{\k{e}}bski and Jacek Tabor and Maciej A. Nowak},
  journal = {ArXiv},
  year    = {2019},
  volume  = {abs/1809.08848}
}

We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do that by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the large network width and depth limit. The resulting singular value spectrum depends on a single parameter, which we calculate for a variety of popular activation functions, by…
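The concentration of the Jacobian's singular values around 1 can be checked numerically. Below is a minimal sketch (not the paper's derivation) of a toy residual update x ← x + W·φ(x) with Gaussian weights whose variance scales as 1/(NL); the function name `resnet_jacobian` and the specific scaling are illustrative assumptions:

```python
import numpy as np

width, depth = 200, 50

def resnet_jacobian(width, depth, phi_prime, rng):
    """Input-output Jacobian of x <- x + W @ phi(x) layers at a random point."""
    J = np.eye(width)
    for _ in range(depth):
        # Weight variance 1/(width * depth), a common residual scaling choice
        W = rng.normal(0.0, 1.0 / np.sqrt(width * depth), size=(width, width))
        # Diagonal matrix of activation derivatives at a random pre-activation
        D = np.diag(phi_prime(rng.normal(size=width)))
        J = (np.eye(width) + W @ D) @ J
    return J

# tanh'(x) = 1 - tanh(x)^2
J = resnet_jacobian(width, depth, lambda x: 1 - np.tanh(x) ** 2,
                    np.random.default_rng(0))
s = np.linalg.svd(J, compute_uv=False)
# For this scaling the singular values stay close to 1 rather than
# exploding or vanishing with depth
print(s.min(), s.max())
```

Swapping the tanh derivative for, e.g., a ReLU indicator `lambda x: (x > 0).astype(float)` shows the same qualitative concentration, in line with the activation-independence claim above.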

## 28 Citations

The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry

- Computer Science, AISTATS
- 2021

An experimental result shows that FIM's dependence on the depth determines the appropriate size of the learning rate for convergence at the initial phase of the online training of DNNs.

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

- Computer Science, ICLR
- 2020

The results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.

On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization

- Computer Science, IJCAI
- 2021

This work studies the Neural Tangent Kernel (NTK), which describes the gradient descent training dynamics of wide networks, and proves that the NTKs of Gaussian- and orthogonally-initialized networks coincide in the infinite-width limit, concluding that any training speed-up from orthogonal initialization is a finite-width effect in the small learning rate regime.

A Comprehensive and Modularized Statistical Framework for Gradient Norm Equality in Deep Neural Networks

- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2022

A novel metric called Block Dynamical Isometry is proposed, which measures the change of gradient norm in individual blocks and turns out to be a universal principle behind existing techniques; a normalization technique named second moment normalization is also introduced, with 30 percent less computational overhead than batch normalization, no accuracy loss, and better performance at micro batch sizes.

Spectrum Concentration in Deep Residual Learning: A Free Probability Approach

- Computer Science, IEEE Access
- 2019

It is empirically demonstrated that the proposed initialization scheme learns orders of magnitude faster than classical ones, attesting to the strong practical relevance of this investigation.

Residual Networks as Nonlinear Systems: Stability Analysis using Linearization

- Mathematics, ArXiv
- 2019

It is found that there is a dramatic jump in the magnitude of adversarial perturbations towards the end of the final stage of the network that is not present in the case of random perturbations.

On the Relationship Between Topology and Gradient Propagation in Deep Networks

- Computer Science
- 2020

This paper establishes a theoretical link between NN-Mass, a topological property of neural architectures, and gradient flow characteristics, and uses it to identify models of similar accuracy despite significantly different size and compute requirements.

On Random Matrices Arising in Deep Neural Networks: General I.I.D. Case

- Computer Science, Mathematics, Random Matrices: Theory and Applications
- 2022

This paper generalizes the results of [22] to the case where the entries of the synaptic weight matrices are just independent identically distributed random variables with zero mean and finite fourth moment, and extends the so-called macroscopic universality property to the considered random matrices.

Advancing Deep Residual Learning by Solving the Crux of Degradation in Spiking Neural Networks

- Computer Science, ArXiv
- 2022

This paper identifies the crux and proposes a novel residual block for SNNs, which is able to significantly extend the depth of directly trained SNNs, e.g., up to 482 layers on CIFAR-10 and 104 layers on ImageNet, without observing any degradation problem.

## References

SHOWING 1-10 OF 54 REFERENCES

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

- Computer Science, NIPS
- 2017

This work uses powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian, and reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

- Computer Science, ICML
- 2018

This work demonstrates that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme, and presents an algorithm for generating such random initial orthogonal convolution kernels.
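The orthogonal weight initialization these works rely on can be sketched for a single fully connected layer (the orthogonal convolution kernels of the paper above are more involved). Below is a minimal Haar-orthogonal draw via a QR decomposition; the function name `orthogonal_init` is an illustrative assumption:

```python
import numpy as np

def orthogonal_init(n, rng):
    """Draw an n x n orthogonal matrix, Haar-distributed, via QR of a Gaussian."""
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    # Fix column signs so the distribution is uniform (Haar) over O(n)
    Q *= np.sign(np.diag(R))
    return Q

W = orthogonal_init(64, np.random.default_rng(1))
# Every singular value of an orthogonal matrix is exactly 1, so a layer
# initialized this way neither stretches nor shrinks gradients
print(np.allclose(W @ W.T, np.eye(64)))  # True
```

The sign correction on the QR factor matters: without it, `np.linalg.qr` biases the diagonal of R toward positive values and the resulting Q is not uniformly distributed over the orthogonal group.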

Mean Field Residual Networks: On the Edge of Chaos

- Computer Science, NIPS
- 2017

It is shown, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth.

The Emergence of Spectral Universality in Deep Networks

- Computer Science, AISTATS
- 2018

This work uses powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth.

Spectrum Concentration in Deep Residual Learning: A Free Probability Approach

- Computer Science, IEEE Access
- 2019

It is empirically demonstrated that the proposed initialization scheme learns orders of magnitude faster than classical ones, attesting to the strong practical relevance of this investigation.

The Loss Surfaces of Multilayer Networks

- Computer Science, AISTATS
- 2015

It is proved that recovering the global minimum becomes harder as the network size increases, and that this is irrelevant in practice, since the global minimum often leads to overfitting.

Geometry of Neural Network Loss Surfaces via Random Matrix Theory

- Computer Science, ICML
- 2017

An analytical framework and a set of tools from random matrix theory that allow us to compute an approximation of the distribution of eigenvalues of the Hessian matrix at critical points of varying energy are introduced.

Nonlinear random matrix theory for deep learning

- Computer Science, NIPS
- 2017

This work demonstrates that the pointwise nonlinearities typically applied in neural networks can be incorporated into a standard method of proof in random matrix theory known as the moments method, and identifies an intriguing new class of activation functions with favorable properties.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

- Computer Science, ICLR
- 2014

It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

Understanding the difficulty of training deep feedforward neural networks

- Computer Science, AISTATS
- 2010

The objective here is to better understand why standard gradient descent from random initialization performs so poorly on deep neural networks, in order to explain the recent relative successes and help design better algorithms in the future.