Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks

Blake Bordelon, Cengiz Pehlevan
We analyze feature learning in infinite width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of… 

Meta-Principled Family of Hyperparameter Scaling Strategies

A one-parameter family of hyperparameter scaling strategies that interpolates between neural-tangent scaling and mean-field/maximal-update scaling is derived, revealing a proper way to scale depth with width so that the resulting large-scale models maintain their representation-learning ability.
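As a toy illustration of the endpoints such a family interpolates between (an assumed two-layer setup, not the paper's exact parameterization), one can compare the size of the network output at initialization under an NTK-style readout multiplier n**(-1/2) versus a mean-field multiplier n**(-1):

```python
import numpy as np

# Toy sketch (assumed setup, not the paper's exact family): a two-layer net
# with readout multiplier n**(-exponent); exponent = 0.5 is NTK-style
# scaling, exponent = 1.0 is mean-field scaling.
rng = np.random.default_rng(3)

def output_scale(n, exponent, trials=200):
    """Typical |f(x)| at initialization for hidden width n."""
    x = rng.standard_normal(10) / np.sqrt(10)   # unit-scale input
    outs = []
    for _ in range(trials):
        W = rng.standard_normal((n, 10))        # first-layer weights
        a = rng.standard_normal(n)              # readout weights
        outs.append(abs(n ** (-exponent) * (a @ np.tanh(W @ x))))
    return float(np.mean(outs))

for n in (64, 1024):
    print(n, output_scale(n, 0.5), output_scale(n, 1.0))
# NTK scaling keeps the output order one as n grows; under mean-field
# scaling it shrinks like 1/sqrt(n).
```

The mean-field output vanishing at infinite width is what forces the weights to move by an order-one amount during training, which is why that end of the family retains feature learning.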

The Influence of Learning Rule on Representation Dynamics in Wide Neural Networks

It is shown that the initial correlation ρ between forward- and backward-pass weights alters the inductive bias of feedback alignment (FA) in both the lazy and rich regimes, a step towards understanding learned representations in neural networks.

A theory of representation learning in deep neural networks gives a deep generalisation of kernel methods

A new infinite-width limit, the representation learning limit, is developed that exhibits representation learning mirroring that in finite-width networks while remaining extremely tractable.

A Kernel Analysis of Feature Learning in Deep Neural Networks

  • Abdulkadir Canatar, C. Pehlevan
  • Computer Science, Biology
    2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
  • 2022
This work empirically studies the kernels induced by the layer representations during training by analyzing their alignment to the network's target function, and shows that representations from earlier to deeper layers increasingly align with the target task for both training and test sets, implying better generalization.
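A common way to quantify this kind of alignment (a minimal sketch of the standard kernel-target alignment measure, not necessarily the exact statistic used in the paper) is the normalized quadratic form between the kernel matrix and the targets:

```python
import numpy as np

def kernel_alignment(K, y):
    """A(K, y) = y^T K y / (||K||_F ||y||^2); lies in [0, 1] for PSD K."""
    return (y @ K @ y) / (np.linalg.norm(K, "fro") * (y @ y))

rng = np.random.default_rng(0)
y = rng.standard_normal(50)
K_target = np.outer(y, y)        # rank-one kernel built from the target itself
G = rng.standard_normal((50, 50))
K_random = G @ G.T               # random positive semi-definite kernel

print(kernel_alignment(K_target, y))  # 1.0 by construction
print(kernel_alignment(K_random, y))  # typically much smaller
```

Alignment increasing with depth and with training, in this metric, means the kernel's top eigenspace rotates towards the target function.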

Second-order regression models exhibit progressive sharpening to the edge of stability

This work proves that, for quadratic objectives in two dimensions, this second-order regression model exhibits progressive sharpening of the NTK eigenvalue towards a value that differs slightly from the edge of stability, which is explicitly computed.

Dynamical Mean Field Theory of Kernel Evolution in Wide Neural Networks

A collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points are constructed, providing a reduced description of network activity through training.

Decomposing neural networks as mappings of correlation functions

The mapping between probability distributions implemented by a deep feed-forward network is studied as an iterated transformation of distributions, where the non-linearity in each layer transfers information between different orders of correlation functions, identifying the statistics in the data that are essential for the task.

Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

This paper shows that the number of hidden units only needs to be larger than a quantity that depends on the regularity properties of the data and is independent of the dimension, and generalizes the analysis to unbounded activation functions.

A self consistent theory of Gaussian Processes captures feature learning effects in finite CNNs

This work considers DNNs trained with noisy gradient descent on a large training set and derives a self-consistent Gaussian Process theory accounting for strong finite-DNN and feature learning effects and identifies a sharp transition between a feature learning regime and a lazy learning regime in this model.

A Theory of Neural Tangent Kernel Alignment and Its Influence on Training

This work seeks to theoretically understand kernel alignment, a prominent and ubiquitous structural change that aligns the NTK with the target function, and identifies factors in network architecture and data structure that drive kernel alignment.

Wide neural networks of any depth evolve as linear models under gradient descent

This work shows that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
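The linearization in question can be checked directly on a small network (a minimal numpy sketch under assumed tanh architecture, not the paper's code): the linear model is f(theta0) plus the gradient at initialization contracted with the parameter displacement.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, width = 4, 512
x = rng.standard_normal(n_in)

# One-hidden-layer network f(W, a) = a . tanh(W x) at NTK-style scale.
W0 = rng.standard_normal((width, n_in)) / np.sqrt(n_in)
a0 = rng.standard_normal(width) / np.sqrt(width)

def f(W, a):
    return a @ np.tanh(W @ x)

# Exact gradients of f at (W0, a0).
h = np.tanh(W0 @ x)
grad_a = h                              # df/da_i = tanh(w_i . x)
grad_W = np.outer(a0 * (1 - h**2), x)   # df/dW_ij = a_i tanh'(w_i . x) x_j

def f_lin(W, a):
    """First-order Taylor expansion of f around (W0, a0)."""
    return f(W0, a0) + np.sum(grad_W * (W - W0)) + grad_a @ (a - a0)

# For a small parameter step, network and linearization agree up to a
# second-order remainder.
dW = 1e-3 * rng.standard_normal(W0.shape)
da = 1e-3 * rng.standard_normal(width)
gap = abs(f(W0 + dW, a0 + da) - f_lin(W0 + dW, a0 + da))
print(gap)
```

The paper's result is that, at infinite width, gradient descent keeps the parameters close enough to initialization that this remainder stays negligible for the whole trajectory.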

Statistical Mechanics of Deep Linear Neural Networks: The Backpropagating Kernel Renormalization

This work is the first exact statistical mechanical study of learning in a family of deep neural networks, and the first successful theory of learning through the successive integration of degrees of freedom in the learned weight space.

Unified Field Theory for Deep and Recurrent Neural Networks

A unified and systematic derivation of the mean-field theory for both architectures that starts from first principles by employing established methods from statistical physics of disordered systems is presented, exposing that Gaussian processes are but the lowest order of a systematic expansion in 1/n.

Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks

A new spectral principle is identified: as the size of the training set grows, kernel machines and neural networks fit successively higher spectral modes of the target function.
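This spectral principle is easy to see in plain kernel regression (a toy periodic-kernel setup chosen for illustration, not the paper's experiments): with few samples the fit captures the low-frequency (large-eigenvalue) part of the target, and the high-frequency mode is only learned once the training set is large enough.

```python
import numpy as np

def rbf(a, b, ell=0.5):
    """Periodic RBF kernel on the circle (assumed toy kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-np.sin(d / 2) ** 2 / ell**2)

def fit_predict(x_tr, y_tr, x_te, ridge=1e-8):
    K = rbf(x_tr, x_tr) + ridge * np.eye(len(x_tr))
    return rbf(x_te, x_tr) @ np.linalg.solve(K, y_tr)

target = lambda x: np.sin(x) + np.sin(8 * x)   # low + high frequency mode
x_te = np.linspace(0, 2 * np.pi, 400, endpoint=False)

def mode_errors(n):
    """Residual Fourier coefficient on each target mode after fitting n points."""
    x_tr = np.linspace(0, 2 * np.pi, n, endpoint=False)
    resid = target(x_te) - fit_predict(x_tr, target(x_tr), x_te)
    low = abs(resid @ np.sin(x_te)) / 200       # sum of sin^2 over the grid is 200
    high = abs(resid @ np.sin(8 * x_te)) / 200
    return low, high

print(mode_errors(6))    # low mode essentially fit, high mode still missed
print(mode_errors(64))   # both modes fit
```

On the uniform grid the Fourier modes are exactly the kernel's eigenfunctions, so the residual coefficients track how far down the spectrum the fit has reached.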

Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

This work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.

Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification

This work analyzes in closed form the learning dynamics of stochastic gradient descent for a single-layer neural network classifying a high-dimensional Gaussian mixture where each cluster is assigned one of two labels, and explores the performance of the algorithm as a function of the control parameters, shedding light on how it navigates the loss landscape.

Mean Field Residual Networks: On the Edge of Chaos

It is shown, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth.
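The depth dependence is visible even in a linear residual toy model (an assumed simplification, not the paper's full mean-field analysis): with He-style 1/width weight variance and an unscaled residual branch, the activation variance compounds multiplicatively with depth, whereas shrinking the branch with depth keeps it bounded.

```python
import numpy as np

# Toy linear residual network x_{l+1} = x_l + s * W_l x_l, with
# He-style Var(W_ij) = 1/width. Tracks the activation variance at the
# final layer as a function of the branch scale s.
rng = np.random.default_rng(4)

def final_variance(depth, branch_scale, width=256):
    x = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        x = x + branch_scale * W @ x
    return float(np.var(x))

# With s = 1 the variance grows like (1 + s^2)^depth, so a sensible
# initialization must shrink the branch as depth grows, e.g. s = 1/sqrt(depth).
print(final_variance(100, 1.0))
print(final_variance(100, 100 ** -0.5))
```

The exploding first line is exactly the failure mode of depth-independent schemes such as Xavier or He in residual networks; the second line shows why a depth-aware branch scale fixes it.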