Random matrix analysis of deep neural network weight matrices

Matthias Thamm, Max Staats, Bernd Rosenow
Neural networks have been used successfully in a variety of fields, which has led to a great deal of interest in developing a theoretical understanding of how they store the information needed to perform a particular task. We study the weight matrices of trained deep neural networks using methods from random matrix theory (RMT) and show that the statistics of most of the singular values follow universal RMT predictions. This suggests that they are random and do not contain system specific… 
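The kind of comparison the abstract describes can be illustrated with a minimal sketch. It is not the authors' code. It assumes a weight matrix whose entries are i.i.d. with standard deviation `sigma`, in which case the singular values of the matrix, normalized by the square root of its larger dimension, should fall inside the Marchenko-Pastur bulk. The function names `marchenko_pastur_edges` and `normalized_singular_values`, the matrix shape, and the planted rank-one spike are all hypothetical choices made for illustration.

```python
import numpy as np


def marchenko_pastur_edges(n_rows, n_cols, sigma=1.0):
    """Bulk edges for singular values of W / sqrt(max(n_rows, n_cols)).

    Assumes i.i.d. entries with standard deviation `sigma` (an assumption
    of this sketch, not a claim about any particular trained network).
    """
    q = min(n_rows, n_cols) / max(n_rows, n_cols)  # aspect ratio, q <= 1
    lower = sigma * (1.0 - np.sqrt(q))
    upper = sigma * (1.0 + np.sqrt(q))
    return lower, upper


def normalized_singular_values(weights):
    """Singular values of `weights`, scaled by 1 / sqrt(larger dimension)."""
    scale = np.sqrt(max(weights.shape))
    return np.linalg.svd(weights, compute_uv=False) / scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_rows, n_cols = 400, 200

    # Purely random "weights": all singular values should sit in the bulk.
    noise = rng.normal(size=(n_rows, n_cols))
    lower, upper = marchenko_pastur_edges(n_rows, n_cols, sigma=1.0)
    s_noise = normalized_singular_values(noise)
    print("bulk edges:", lower, upper)
    print("noise range:", s_noise.min(), s_noise.max())

    # Plant one strong rank-one component as a stand-in for learned structure.
    u = np.ones(n_rows) / np.sqrt(n_rows)
    v = np.ones(n_cols) / np.sqrt(n_cols)
    spiked = noise + 5.0 * np.sqrt(n_rows) * np.outer(u, v)
    s_spiked = normalized_singular_values(spiked)
    print("outliers above bulk:", int(np.sum(s_spiked > upper)))
```

In this toy setting, singular values inside the bulk are indistinguishable from noise, while a value well above the upper edge signals structure that is not random. A careful analysis would also examine local statistics such as level spacings, which this sketch omits.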


Boundary between noise and information applied to filtering neural network weight matrices
An algorithm is introduced that both removes small singular values and reduces the magnitude of large singular values, counteracting the effect of level repulsion between the noise and the information parts of the spectrum.
Universal characteristics of deep neural network loss surfaces from random matrix theory
This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local


Beyond Random Matrix Theory for Deep Networks
This work investigates whether the Wigner semi-circle and Marcenko-Pastur distributions, often used for deep neural network theoretical analysis, match empirically observed spectral densities, and considers two new classes of matrix ensembles: random Wigner/Wishart ensemble products and percolated Wigner/Wishart ensembles, both of which better match observed spectra.
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
A theory is presented that identifies 5+1 Phases of Training, corresponding to increasing amounts of Implicit Self-Regularization. It demonstrates that DNN optimization with larger batch sizes leads to less well implicitly regularized models, and it provides an explanation for the generalization gap phenomenon.
The Emergence of Spectral Universality in Deep Networks
This work uses powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth.
Understanding the difficulty of training deep feedforward neural networks
The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
Understanding deep learning (still) requires rethinking generalization
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity.
Deep learning generalizes because the parameter-function map is biased towards simple functions
This paper argues that the parameter-function map of many DNNs should be exponentially biased towards simple functions, and provides clear evidence for this strong simplicity bias in a model DNN for Boolean functions, as well as in much larger fully connected and convolutional networks applied to CIFAR10 and MNIST.
Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data
The techniques can be used to identify when a pretrained DNN has problems that cannot be detected simply by examining training/test accuracies, and it is shown how poorly trained (and/or poorly fine-tuned) models may exhibit both Scale Collapse and unusually large PL exponents, in particular for recent NLP models.
A Random Matrix Approach to Neural Networks
It is proved that, as $n,p,T$ grow large at the same rate, the resolvent $Q=(G+\gamma I_T)^{-1}$, for $\gamma>0$, behaves similarly to its counterpart in sample covariance matrix models, which enables the estimation of the asymptotic performance of single-layer random neural networks.
Qualitatively characterizing neural network optimization problems
A simple analysis technique is introduced to look for evidence that state-of-the-art neural networks are overcoming local optima, and finds that, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.