Universal characteristics of deep neural network loss surfaces from random matrix theory
@article{Baskerville2022UniversalCO,
  title={Universal characteristics of deep neural network loss surfaces from random matrix theory},
  author={Nicholas P. Baskerville and Jonathan P. Keating and Francesco Mezzadri and Joseph Najnudel and Diego Granziol},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.08601}
}
This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local statistics to derive practical implications for deep neural networks based on a realistic model of their Hessians. In particular we derive universal aspects of outliers in the spectra of deep neural networks and demonstrate the important role of random matrix local laws in popular pre-conditioning…
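The "local statistics" the abstract refers to can be illustrated with a toy check. The sketch below (not taken from the paper; all sizes and names are illustrative) builds a GOE-like surrogate for a network Hessian and computes adjacent-eigenvalue spacing ratios, the kind of local observable whose universal value (~0.536 for GOE, ~0.386 for uncorrelated spectra) underlies such universality arguments.

```python
# Toy sketch: spacing-ratio statistics of a surrogate "Hessian".
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Stand-in for a network Hessian: a real symmetric (GOE-like) random matrix.
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)

eigs = np.linalg.eigvalsh(H)
spacings = np.diff(eigs)  # gaps between ordered eigenvalues
ratios = np.minimum(spacings[:-1], spacings[1:]) / np.maximum(spacings[:-1], spacings[1:])

print(f"mean spacing ratio: {ratios.mean():.4f}  (GOE prediction ~0.536)")
```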
References
Showing 1-10 of 66 references
Nonlinear random matrix theory for deep learning
- Computer Science · NIPS
- 2017
This work demonstrates that the pointwise nonlinearities typically applied in neural networks can be incorporated into a standard method of proof in random matrix theory known as the moments method, and identifies an intriguing new class of activation functions with favorable properties.
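A minimal sketch of the object this moments-method analysis concerns: the empirical spectrum of the Gram matrix of nonlinearly transformed random features. The dimensions and the choice of tanh below are illustrative assumptions, not the paper's exact setup.

```python
# Spectrum of M = Y Y^T / m with Y = f(W X), f a pointwise nonlinearity.
import numpy as np

rng = np.random.default_rng(1)
n0, n1, m = 500, 500, 1000  # input dim, feature dim, number of samples

X = rng.standard_normal((n0, m)) / np.sqrt(n0)
W = rng.standard_normal((n1, n0))  # random (untrained) weights
Y = np.tanh(W @ X)                 # pointwise nonlinearity applied entrywise

M = Y @ Y.T / m
eigs = np.linalg.eigvalsh(M)
print("support of empirical spectrum:", eigs.min(), eigs.max())
```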
On Random Matrices Arising in Deep Neural Networks: General I.I.D. Case
- Computer Science, Mathematics · Random Matrices: Theory and Applications
- 2022
This paper generalizes the results of [22] to the case where the entries of the synaptic weight matrices are just independent identically distributed random variables with zero mean and finite fourth moment, and extends the so-called macroscopic universality property to the random matrices considered.
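A toy numerical check of macroscopic universality in this spirit (my own construction, with illustrative sizes): the bulk spectrum of the same matrix model looks the same whether the weight entries are Gaussian or merely i.i.d. with matching mean and variance.

```python
# Compare bulk spectra for Gaussian vs. Rademacher weight entries.
import numpy as np

rng = np.random.default_rng(2)
n, m = 400, 800
X = rng.standard_normal((n, m)) / np.sqrt(n)

def bulk_spectrum(W):
    Y = W @ X
    return np.linalg.eigvalsh(Y @ Y.T / m)

W_gauss = rng.standard_normal((n, n)) / np.sqrt(n)
W_rad = rng.choice([-1.0, 1.0], size=(n, n)) / np.sqrt(n)  # same mean and variance

for name, W in [("Gaussian", W_gauss), ("Rademacher", W_rad)]:
    e = bulk_spectrum(W)
    print(f"{name}: mean={e.mean():.3f}, max={e.max():.3f}")
```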
Appearance of random matrix theory in deep learning
- Computer Science · Physica A: Statistical Mechanics and its Applications
- 2021
Random matrix analysis of deep neural network weight matrices
- Computer Science · ArXiv
- 2022
The weight matrices of trained deep neural networks are studied using methods from random matrix theory (RMT), and it is shown that the statistics of most of the singular values follow universal RMT predictions, suggesting that they are random and do not contain system-specific information.
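A hedged sketch of this kind of check: count how many squared singular values of a weight matrix fall outside the Marchenko-Pastur bulk. Here a random matrix stands in where trained weights would be loaded in practice; sizes are illustrative.

```python
# Count eigenvalues of W W^T outside the Marchenko-Pastur bulk.
import numpy as np

rng = np.random.default_rng(3)
n, p = 256, 512
W = rng.standard_normal((n, p)) / np.sqrt(p)  # replace with a trained layer's weights

sv = np.linalg.svd(W, compute_uv=False)
lam = sv**2  # eigenvalues of W W^T

q = n / p
mp_lower, mp_upper = (1 - np.sqrt(q))**2, (1 + np.sqrt(q))**2
outliers = np.sum((lam < mp_lower) | (lam > mp_upper))
print(f"{outliers} of {n} eigenvalues fall outside the Marchenko-Pastur bulk "
      f"[{mp_lower:.3f}, {mp_upper:.3f}]")
```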
On Random Matrices Arising in Deep Neural Networks. Gaussian Case
- Computer Science, Mathematics
- 2020
The paper deals with the distribution of singular values of products of random matrices arising in the analysis of deep neural networks, using a version of the standard techniques of random matrix theory under the assumption that the entries of the data matrices are independent Gaussian random variables.
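A minimal sketch of the object analysed here: singular values of a product of independent Gaussian matrices, as arises when composing the linear parts of several layers. Depth and dimensions below are illustrative assumptions.

```python
# Singular values of a product of independent Gaussian matrices.
import numpy as np

rng = np.random.default_rng(4)
n, depth = 300, 4

P = np.eye(n)
for _ in range(depth):
    P = (rng.standard_normal((n, n)) / np.sqrt(n)) @ P

sv = np.linalg.svd(P, compute_uv=False)  # returned in descending order
print("largest squared singular value:", sv[0]**2)
print("median squared singular value:", np.median(sv**2))
```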
The loss surfaces of neural networks with general activation functions
- Computer Science · ArXiv
- 2020
A new path through the spin glass complexity calculations is charted using supersymmetric methods in random matrix theory, which may prove useful in other contexts.
Beyond Random Matrix Theory for Deep Networks
- Mathematics · ArXiv
- 2020
This work investigates whether the Wigner semi-circle and Marchenko-Pastur distributions, often used for deep neural network theoretical analysis, match empirically observed spectral densities, and considers two new classes of matrix ensembles: random Wigner/Wishart ensemble products and percolated Wigner/Wigner ensembles, both of which better match observed spectra.
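A rough sketch of one such ensemble, guided only by the summary above (the construction and sizes are my assumptions): symmetrise the product of a Wigner matrix and a Wishart matrix and inspect its spectrum.

```python
# Spectrum of a Wigner/Wishart product ensemble.
import numpy as np

rng = np.random.default_rng(5)
n, m = 400, 800

A = rng.standard_normal((n, n))
Wig = (A + A.T) / np.sqrt(2 * n)  # Wigner matrix
X = rng.standard_normal((n, m))
Wis = X @ X.T / m                 # Wishart matrix (positive semi-definite)

# Symmetric matrix with the same spectrum as Wig @ Wis, built via Wis^(1/2).
evals, evecs = np.linalg.eigh(Wis)
sqrt_Wis = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
M = sqrt_Wis @ Wig @ sqrt_Wis

eigs = np.linalg.eigvalsh(M)
print("spectral range of the product ensemble:", eigs.min(), eigs.max())
```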
The Emergence of Spectral Universality in Deep Networks
- Computer Science · AISTATS
- 2018
This work uses powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth.
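A small sketch of the quantity this reference studies: the squared singular values of a deep network's input-output Jacobian at random initialisation. The architecture, width, tanh nonlinearity and weight scale below are my assumptions, and biases are omitted for brevity.

```python
# Input-output Jacobian spectrum of a deep random tanh network.
import numpy as np

rng = np.random.default_rng(6)
width, depth, sigma_w = 300, 10, 1.0

x = rng.standard_normal(width)
J = np.eye(width)
for _ in range(depth):
    W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
    h = W @ x
    x = np.tanh(h)
    D = np.diag(1.0 - np.tanh(h)**2)  # derivative of tanh at the pre-activations
    J = D @ W @ J                     # chain rule, layer by layer

sq_sv = np.linalg.svd(J, compute_uv=False)**2
print("mean / max squared singular value of the Jacobian:", sq_sv.mean(), sq_sv.max())
```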
The Loss Surfaces of Multilayer Networks
- Computer Science · AISTATS
- 2015
It is proved that recovering the global minimum becomes harder as the network size increases, and that it is in practice irrelevant, as recovering the global minimum often leads to overfitting.
A random matrix theory approach to damping in deep learning
- Computer Science
- 2022
A novel random-matrix-theory-based damping learner for second-order optimisers, inspired by linear shrinkage estimation, is developed, and it is demonstrated that the derived method works well with adaptive gradient methods such as Adam.
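A hedged sketch of the linear-shrinkage idea behind such damping (the function name `shrinkage_damped_step` and the constants are mine, not the paper's): blend a noisy curvature estimate with a multiple of the identity before inverting it, instead of choosing a damping constant by hand.

```python
# Linear shrinkage applied to a curvature estimate before preconditioning.
import numpy as np

def shrinkage_damped_step(H_est, grad, rho=0.1):
    """Precondition `grad` with a linearly shrunk curvature estimate."""
    n = H_est.shape[0]
    mu = np.trace(H_est) / n  # shrinkage target: scaled identity
    H_shrunk = (1 - rho) * H_est + rho * mu * np.eye(n)
    return np.linalg.solve(H_shrunk, grad)

rng = np.random.default_rng(7)
n = 100
G = rng.standard_normal((n, 5 * n))
H_est = G @ G.T / (5 * n)          # noisy positive semi-definite curvature estimate
grad = rng.standard_normal(n)
print(shrinkage_damped_step(H_est, grad)[:3])
```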