Nonlinear random matrix theory for deep learning

  • Jeffrey Pennington, Pratik Worah
  • Journal of Statistical Mechanics: Theory and Experiment

Neural network configurations with random weights play an important role in the analysis of deep learning. They define the initial loss landscape and are closely related to kernel and random feature methods. Despite the fact that these networks are built out of random matrices, the vast and powerful machinery of random matrix theory has so far found limited success in studying them. A main obstacle in this direction is that neural networks are nonlinear, which prevents the straightforward… 

Eigenvalue Distribution of Large Random Matrices Arising in Deep Neural Networks: Orthogonal Case

  • L. Pastur
  • Computer Science
    Journal of Mathematical Physics
  • 2022
This paper justifies the validity of the mean field approximation in the infinite width limit for deep untrained neural networks and extends the macroscopic universality of random matrix theory to this new class of random matrices.

Random matrix analysis of deep neural network weight matrices

The weight matrices of trained deep neural networks are studied using methods from random matrix theory (RMT) and it is shown that the statistics of most of the singular values follow universal RMT predictions, suggesting that they are random and do not contain system specific information.

The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network

This work extends a recently developed framework for studying spectra of nonlinear random matrices to characterize an important measure of curvature, namely the eigenvalues of the Fisher information matrix, and finds that linear networks suffer worse conditioning than nonlinear networks and that nonlinear networks are generically non-degenerate.

Largest Eigenvalues of the Conjugate Kernel of Single-Layered Neural Networks

The asymptotic limit of the largest eigenvalue for the nonlinear model is identified with that of an information-plus-noise random matrix, establishing a possible phase transition depending on the function f and the distribution of W and X.

A Random Matrix Perspective on Mixtures of Nonlinearities for Deep Learning

Intriguingly, it is found that a mixture of nonlinearities can outperform the best single nonlinearity on the noisy autoencoding task, suggesting that mixtures of nonlinearities might be useful for approximate kernel methods or neural network architecture design.

Analysis of One-Hidden-Layer Neural Networks via the Resolvent Method

The Stieltjes transform of the limiting spectral distribution satisfies a quartic self-consistent equation up to some error terms, which is exactly the equation obtained by Pennington and Worah and Benigni and Péché with the moment method approach.

Universal characteristics of deep neural network loss surfaces from random matrix theory

This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local

On the Approximation Lower Bound for Neural Nets with Random Weights

It is shown that, despite the well-known fact that a shallow neural network is a universal approximator, a random net cannot achieve zero approximation error even for smooth functions, and it is proved that if the proposal distribution is compactly supported, then the lower bound is positive.

A Random Matrix Perspective on Mixtures of Nonlinearities in High Dimensions

This work analyzes the performance of random feature regression with features F = f ( WX + B ) for a random weight matrix W and bias vector B, obtaining exact formulae for the asymptotic training and test errors for data generated by a linear teacher model.
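
The random feature setup described above can be sketched numerically. The following is a minimal illustration, not the paper's method: it assumes Gaussian data, a Gaussian weight matrix W and bias B, a tanh nonlinearity, and ridge regression on the features. The function name `random_feature_ridge` and all parameter names and defaults are hypothetical choices made for this sketch.

```python
import numpy as np


def random_feature_ridge(n0=50, n1=200, m=400, m_test=400,
                         ridge=1e-2, f=np.tanh, seed=0):
    """Ridge regression on random features F = f(WX + B).

    All sizes, scalings and the ridge penalty are illustrative assumptions.
    Columns of X are samples; labels come from a linear teacher.
    """
    rng = np.random.default_rng(seed)

    # Gaussian training and test inputs, one sample per column.
    X = rng.standard_normal((n0, m))
    X_test = rng.standard_normal((n0, m_test))

    # Linear teacher: y = beta^T x, with beta of roughly unit norm.
    beta = rng.standard_normal(n0) / np.sqrt(n0)
    y = beta @ X
    y_test = beta @ X_test

    # Random (untrained) first layer: weights W and a bias column B.
    W = rng.standard_normal((n1, n0)) / np.sqrt(n0)
    B = rng.standard_normal((n1, 1))  # broadcast across samples

    F = f(W @ X + B)
    F_test = f(W @ X_test + B)

    # Ridge solution a = (F F^T + ridge * m * I)^{-1} F y.
    A = F @ F.T + ridge * m * np.eye(n1)
    a = np.linalg.solve(A, F @ y)

    train_err = float(np.mean((a @ F - y) ** 2))
    test_err = float(np.mean((a @ F_test - y_test) ** 2))
    return train_err, test_err


if __name__ == "__main__":
    train_err, test_err = random_feature_ridge()
    print(f"train MSE: {train_err:.4f}, test MSE: {test_err:.4f}")
```

Sweeping `n1 / m` or `ridge` in this sketch gives empirical error curves that could be compared with asymptotic formulae of the kind the summary mentions; the formulae themselves are not reproduced here.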

Universal statistics of Fisher information in deep neural networks: mean field approach

Novel statistics of the Fisher information matrix (FIM) are revealed that are universal among a wide class of DNNs; they can be connected to a norm-based capacity measure of generalization ability and used to quantitatively estimate an appropriately sized learning rate for gradient methods to converge.



A Correspondence Between Random Neural Networks and Statistical Field Theory

This work shows that the distribution of pre-activations in random neural networks can be exactly mapped onto lattice models in statistical physics, and argues that several previous investigations of stochastic networks actually studied a particular factorial approximation to the full lattice model.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

This work uses powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian, and reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.
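
The quantity discussed above, the singular value distribution of a deep network's input-output Jacobian, can also be examined empirically. The sketch below is an assumption-laden illustration rather than the free-probability calculation: it assumes a bias-free tanh network of constant width, and compares Gaussian weights with Haar-orthogonal weights obtained from a QR decomposition. The function name `jacobian_singular_values` and its parameters are hypothetical.

```python
import numpy as np


def jacobian_singular_values(width=100, depth=10, sigma_w=1.0,
                             orthogonal=False, seed=0):
    """Singular values of the input-output Jacobian of a random tanh network.

    The network is x_l = tanh(W_l x_{l-1}) with no biases (an assumption of
    this sketch). The Jacobian is the product of D_l W_l over layers, where
    D_l = diag(tanh'(h_l)).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    J = np.eye(width)

    for _ in range(depth):
        if orthogonal:
            # QR of a Gaussian matrix; the sign fix makes Q Haar-distributed.
            Q, R = np.linalg.qr(rng.standard_normal((width, width)))
            W = sigma_w * Q * np.sign(np.diag(R))
        else:
            W = sigma_w * rng.standard_normal((width, width)) / np.sqrt(width)

        h = W @ x
        x = np.tanh(h)
        D = 1.0 - x ** 2              # tanh'(h) = 1 - tanh(h)^2
        J = (D[:, None] * W) @ J      # left-multiply by D_l W_l

    return np.linalg.svd(J, compute_uv=False)


if __name__ == "__main__":
    for orth in (False, True):
        s = jacobian_singular_values(orthogonal=orth)
        print(f"orthogonal={orth}: max={s.max():.3e}, min={s.min():.3e}")
```

Printing the largest and smallest singular values for both initializations gives a rough sense of how widely the spectrum spreads with depth, which is the design consideration the summary points to.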

Exponential expressivity in deep neural networks through transient chaos

The theoretical analysis of the expressive power of deep networks broadly applies to arbitrary nonlinearities, and provides a quantitative underpinning for previously abstract notions about the geometry of deep functions.

The Loss Surfaces of Multilayer Networks

It is proved that recovering the global minimum becomes harder as the network size increases, and that doing so is irrelevant in practice because the global minimum often leads to overfitting.

The spectrum of kernel random matrices

Surprisingly, it is shown that in high-dimensions, and for the models the authors analyze, the problem becomes essentially linear—which is at odds with heuristics sometimes used to justify the usage of these methods.

A Random Matrix Approach to Neural Networks

It is proved that, as $n,p,T$ grow large at the same rate, the resolvent $Q=(G+\gamma I_T)^{-1}$, for $\gamma>0$, behaves similarly to the resolvents met in sample covariance matrix models, which enables the estimation of the asymptotic performance of single-layer random neural networks.
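
The resolvent $Q=(G+\gamma I_T)^{-1}$ above can be computed directly for a finite example. The sketch below assumes, for illustration only, that $G = \frac{1}{T}\Sigma^\top \Sigma$ with $\Sigma = \sigma(WX)$, a Gaussian $W$ of size $n \times p$, Gaussian data $X$ of size $p \times T$, and $\sigma = \tanh$; none of these choices is stated in the summary. The function name `resolvent_trace` is hypothetical. It checks that the normalized trace of $Q$ agrees with the same quantity computed from the eigenvalues of $G$.

```python
import numpy as np


def resolvent_trace(n=150, p=100, T=200, gamma=0.5, sigma=np.tanh, seed=0):
    """Normalized trace of Q = (G + gamma * I_T)^{-1} for a random feature Gram matrix.

    Assumed model (not taken from the source): G = Sigma^T Sigma / T with
    Sigma = sigma(W X), W of shape (n, p) and X of shape (p, T).
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, p))
    X = rng.standard_normal((p, T)) / np.sqrt(p)

    S = sigma(W @ X)                 # n x T feature matrix
    G = S.T @ S / T                  # T x T Gram matrix, positive semidefinite

    # Direct computation of the resolvent and its normalized trace.
    Q = np.linalg.inv(G + gamma * np.eye(T))
    trace_direct = float(np.trace(Q) / T)

    # Same quantity from the spectrum: (1/T) * sum_i 1 / (lambda_i + gamma).
    eigs = np.linalg.eigvalsh(G)
    trace_spectral = float(np.mean(1.0 / (eigs + gamma)))

    return trace_direct, trace_spectral


if __name__ == "__main__":
    direct, spectral = resolvent_trace()
    print(f"(1/T) tr Q direct:   {direct:.6f}")
    print(f"(1/T) tr Q spectral: {spectral:.6f}")
```

The normalized trace is the Stieltjes transform of the empirical spectral distribution of $G$ evaluated at $-\gamma$, which is why the two computations agree up to floating-point error.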

Spectral density of products of Wishart dilute random matrices. Part I: the dense case

This work shows that the spectral density is a solution of a polynomial equation of degree $M+1$, obtains exact expressions for it for $M=1$, $2$ and $3$, and makes some observations for general $M$, based admittedly on some weak numerical evidence.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

On the Expressive Power of Deep Neural Networks

We propose a new approach to the problem of neural network expressivity, which seeks to characterize how structural properties of a neural network family affect the functions it is able to compute.