Characterizing the Spectrum of the NTK via a Power Series Expansion

Michael Murray, Hui Jin, Benjamin Bowman, Guido Montúfar
Under mild conditions on the network initialization, we derive a power series expansion for the Neural Tangent Kernel (NTK) of arbitrarily deep feedforward networks in the infinite-width limit. We provide expressions for the coefficients of this power series, which depend on both the Hermite coefficients of the activation function and the depth of the network. We observe that faster decay of the Hermite coefficients leads to faster decay of the NTK coefficients and explore the role of depth…
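The mechanism described in the abstract can be illustrated numerically in the simplest case. The sketch below is an assumption-laden illustration for a single hidden layer (not the paper's general arbitrary-depth formulas): it computes the Hermite coefficients a_k of the ReLU activation by quadrature, assembles the standard two-layer NTK power series NTK(ρ) = Σ_k (k+1) a_k² ρ^k for inputs on the unit sphere, and checks the truncated series against the closed-form arc-cosine expression for ReLU. Function names and the truncation level are illustrative choices.

```python
import numpy as np

def hermite_coeffs(phi, K, x=None):
    """Coefficients a_k = E[phi(g) he_k(g)], g ~ N(0,1), where he_k are the
    orthonormal probabilist's Hermite polynomials (three-term recurrence)."""
    if x is None:
        x = np.linspace(-12.0, 12.0, 200_001)  # wide enough for the Gaussian tail
    dx = x[1] - x[0]
    f = phi(x) * np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)  # phi times N(0,1) density
    a = np.empty(K)
    he_prev = np.ones_like(x)  # he_0(x) = 1
    he_curr = x.copy()         # he_1(x) = x
    a[0] = np.sum(f * he_prev) * dx
    if K > 1:
        a[1] = np.sum(f * he_curr) * dx
    for k in range(1, K - 1):
        # he_{k+1} = (x he_k - sqrt(k) he_{k-1}) / sqrt(k+1)
        he_prev, he_curr = he_curr, (x * he_curr - np.sqrt(k) * he_prev) / np.sqrt(k + 1.0)
        a[k + 1] = np.sum(f * he_curr) * dx
    return a

K = 30
a = hermite_coeffs(lambda t: np.maximum(t, 0.0), K)  # ReLU activation

# Two-layer NTK on the unit sphere as a power series in rho = <x, y>:
#   NTK(rho) = sum_k (k + 1) * a_k^2 * rho^k   (both layers trained)
rho = 0.5
k_idx = np.arange(K)
ntk_series = np.sum((k_idx + 1) * a**2 * rho**k_idx)

# Closed form for ReLU via the arc-cosine kernels, to validate the truncation
theta = np.arccos(rho)
ntk_exact = (np.sqrt(1.0 - rho**2) + rho * (np.pi - theta)) / (2.0 * np.pi) \
            + rho * (np.pi - theta) / (2.0 * np.pi)
```

For ReLU the squared Hermite coefficients decay polynomially, so the NTK coefficients (k+1)a_k² do as well, which is the kind of coefficient-decay behavior the abstract relates to the decay of the activation's Hermite coefficients; at ρ = 1 the series sums to E[φ(g)²] + E[φ′(g)²] = 1.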

Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for Deep ReLU Networks

Tight bounds on the smallest eigenvalue of NTK matrices for deep ReLU networks are provided, both in the infinite-width limit and for finite widths.

The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies

It is shown theoretically and experimentally that a shallow neural network without bias cannot represent or learn simple, low-frequency functions with odd frequencies, and specific predictions are made of the time it will take a network to learn functions of varying frequency.

Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks

It is shown that the eigenvalue distributions of the CK and NTK converge to deterministic limits, and the agreement of these asymptotic predictions with the observed spectra for both synthetic and CIFAR-10 training data is demonstrated.

Explicit loss asymptotics in the gradient descent training of neural networks

This work shows that the learning trajectory of a wide network in a lazy training regime can be characterized by an explicit asymptotic at large training times, based on spectral analysis of the integral operator representing the linearized evolution of a large network trained on the expected loss.

Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime

The proof exploits the low-effective-rank property of the Fisher Information Matrix at initialization, which implies a low effective dimension of the model (far smaller than the number of parameters); it is concluded that local capacity control from the low effective rank of the Fisher Information Matrix remains theoretically underexplored.

The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network

This work extends a recently developed framework for studying spectra of nonlinear random matrices to characterize an important measure of curvature, namely the eigenvalues of the Fisher information matrix, and finds that linear networks suffer worse conditioning than nonlinear networks and that nonlinear networks are generically non-degenerate.

Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks

It is concluded that the damped-deviations viewpoint offers a simple and unifying perspective on the dynamics of optimizing the squared error.

Learning curves for Gaussian process regression with power-law priors and targets

It is shown that the generalization error of kernel ridge regression (KRR) has the same asymptotics as that of Gaussian process regression (GPR) when the eigenspectrum of the prior and the eigenexpansion coefficients of the target function decay with a power-law rate β.

Frequency Bias in Neural Networks for Input of Non-Uniform Density

The Neural Tangent Kernel model is used to explore the effect of variable input density on training dynamics, and convergence results for deep, fully connected networks are proved with respect to the spectral decomposition of the NTK.

Universal statistics of Fisher information in deep neural networks: mean field approach

Novel statistics of the FIM are revealed that are universal among a wide class of DNNs; these statistics can be connected to a norm-based capacity measure of generalization ability and used to quantitatively estimate an appropriately sized learning rate for gradient methods to converge.