Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks

@article{Bordelon2022SelfConsistentDF,
  title={Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks},
  author={Blake Bordelon and Cengiz Pehlevan},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.09653}
}
We analyze feature learning in infinite width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of… 
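
The deterministic kernel order parameters described here have a simple finite-width analogue that can be tracked numerically. Below is a minimal sketch (my own toy setup, not the paper's code or exact parametrization): a two-layer tanh network with a mean-field-style 1/N readout and a width-scaled learning rate so that features actually move, recording hidden activations at several training times and forming the two-time feature kernels Phi(t, s) = phi(t) phi(s)^T / N.

```python
# Minimal finite-width sketch of the two-time feature-kernel order parameters.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
kx, ky, k1, k2 = jax.random.split(key, 4)
N, D, P = 512, 10, 20                      # width, input dim, number of samples
X = jax.random.normal(kx, (P, D))
y = jax.random.normal(ky, (P, 1))

params = {"W1": jax.random.normal(k1, (D, N)),
          "w2": jax.random.normal(k2, (N, 1))}

def hidden(params):
    return jnp.tanh(X @ params["W1"] / jnp.sqrt(D))    # phi(h), shape (P, N)

def loss(params):
    f = hidden(params) @ params["w2"] / N              # mean-field readout scale 1/N
    return 0.5 * jnp.mean((f - y) ** 2)

@jax.jit
def step(params, lr=1.0 * N):                          # learning rate scaled with width
    grads = jax.grad(loss)(params)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

snapshots = []                                         # phi recorded at a few times t
for t in range(301):
    if t % 75 == 0:
        snapshots.append(hidden(params))
    params = step(params)

# Phi[t, s] is the P x P kernel between activations at recorded times t and s,
# the finite-width counterpart of the deterministic DMFT order parameters.
Phi = jnp.stack([jnp.stack([a @ b.T / N for b in snapshots]) for a in snapshots])
print(Phi.shape)                                       # (T, T, P, P)
```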

The Influence of Learning Rule on Representation Dynamics in Wide Neural Networks

It is unclear how changing the learning rule of a deep neural network alters its learning dynamics and representations. To gain insight into the relationship between learned features, function…

A theory of representation learning in deep neural networks gives a deep generalisation of kernel methods

A new infinite-width limit, the representation learning limit, is developed that exhibits representation learning mirroring that in finite-width networks while remaining extremely tractable.

References


Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks

For Word2Vec and few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, exact formulas for the infinite-width limits are derived and found to outperform both NTK baselines and finite-width networks.

Feature Learning in Infinite-Width Neural Networks

It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT, and that any such infinite-width limit can be computed using the Tensor Programs technique.
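
A rough way to see this claim at finite width: under the NTK parametrization, the amount by which the hidden-layer feature kernel moves during training shrinks as the width grows. The sketch below is my own illustration under that assumption (a small tanh network on random data), not the Tensor Programs construction.

```python
# Relative movement of the feature kernel after training, as a function of width,
# under NTK parametrization (1/sqrt(fan_in) forward scaling, O(1) learning rate).
import jax
import jax.numpy as jnp

def train_and_measure(N, key, steps=200, lr=1.0):
    kx, ky, k1, k2 = jax.random.split(key, 4)
    D, P = 5, 16
    X = jax.random.normal(kx, (P, D))
    y = jax.random.normal(ky, (P, 1))
    params = {"W1": jax.random.normal(k1, (D, N)),
              "w2": jax.random.normal(k2, (N, 1))}

    def hidden(p):
        return jnp.tanh(X @ p["W1"] / jnp.sqrt(D))     # NTK parametrization

    def loss(p):
        out = hidden(p) @ p["w2"] / jnp.sqrt(N)        # 1/sqrt(N) readout
        return 0.5 * jnp.mean((out - y) ** 2)

    def feature_kernel(p):                             # Phi = phi phi^T / N
        h = hidden(p)
        return h @ h.T / N

    K0 = feature_kernel(params)
    grad_fn = jax.jit(jax.grad(loss))
    for _ in range(steps):
        params = jax.tree_util.tree_map(lambda w, g: w - lr * g,
                                        params, grad_fn(params))
    dK = feature_kernel(params) - K0
    return jnp.linalg.norm(dK) / jnp.linalg.norm(K0)

key = jax.random.PRNGKey(0)
for N in [128, 512, 2048]:
    print(N, float(train_and_measure(N, key)))   # relative kernel change shrinks with N
```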

On Lazy Training in Differentiable Programming

This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
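
The kind of scaling discussed in the lazy-training analysis can be reproduced in a few lines: multiply the centered model output by a factor alpha and divide the loss by alpha^2; as alpha grows, the trained network stays close to its first-order Taylor expansion around the initialization. The sketch below is a hedged toy version with my own choice of architecture, data, and hyperparameters.

```python
# How far the trained model strays from its linearization, for small vs large alpha.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(1)
kx, ky, k1, k2 = jax.random.split(key, 4)
D, N, P = 5, 64, 16
X = jax.random.normal(kx, (P, D))
y = jax.random.normal(ky, (P, 1))

params0 = {"W1": jax.random.normal(k1, (D, N)),
           "w2": jax.random.normal(k2, (N, 1))}

def f(p):
    return jnp.tanh(X @ p["W1"] / jnp.sqrt(D)) @ p["w2"] / jnp.sqrt(N)

def run(alpha, steps=400, lr=0.2):
    # Scaled model alpha * (f(w) - f(w0)) trained on a loss rescaled by 1/alpha^2.
    f0 = f(params0)
    def loss(p):
        return 0.5 / alpha**2 * jnp.mean((alpha * (f(p) - f0) - y) ** 2)
    grad_fn = jax.jit(jax.grad(loss))
    p = params0
    for _ in range(steps):
        p = jax.tree_util.tree_map(lambda w, g: w - lr * g, p, grad_fn(p))
    # Distance of the trained function from the first-order Taylor expansion of f
    # around the initialization, relative to how far the function moved overall.
    dp = jax.tree_util.tree_map(lambda a, b: a - b, p, params0)
    f_lin = f0 + jax.jvp(f, (params0,), (dp,))[1]
    return jnp.linalg.norm(f(p) - f_lin) / jnp.linalg.norm(f(p) - f0)

for alpha in [0.3, 1.0, 10.0, 100.0]:
    print(alpha, float(run(alpha)))   # deviation from the linearization shrinks
                                      # as alpha grows, i.e. training gets lazier
```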

Neural tangent kernel: convergence and generalization in neural networks (invited paper)

This talk introduces this formalism, gives a number of results on the Neural Tangent Kernel, and explains how they provide insight into the dynamics of neural networks during training and into their generalization features.
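
For concreteness, the empirical (finite-width) Neural Tangent Kernel of a small network can be computed directly from parameter Jacobians. The following is a minimal sketch with a hypothetical two-layer tanh model, not code from the cited work.

```python
# Empirical NTK: Theta(x, x') = <grad_theta f(x), grad_theta f(x')>.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2, kx = jax.random.split(key, 3)
D, N, P = 4, 32, 8
params = {"W1": jax.random.normal(k1, (D, N)) / jnp.sqrt(D),
          "w2": jax.random.normal(k2, (N, 1)) / jnp.sqrt(N)}
X = jax.random.normal(kx, (P, D))

def f(params, X):
    return (jnp.tanh(X @ params["W1"]) @ params["w2"]).squeeze(-1)   # (P,)

# Jacobian of the outputs with respect to every parameter tensor, flattened
# into one (P, num_params) matrix; the NTK is then its Gram matrix.
jac = jax.jacobian(f)(params, X)                       # pytree of (P, *param_shape)
J = jnp.concatenate([j.reshape(P, -1) for j in jax.tree_util.tree_leaves(jac)], axis=1)
ntk = J @ J.T                                          # (P, P) empirical NTK
print(ntk.shape)
```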

JAX: composable transformations of Python+NumPy programs, 2018

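A brief usage sketch of the composable transformations this JAX citation refers to (grad, jit, and vmap applied to an ordinary NumPy-style function):

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.tanh(jnp.dot(w, x)) ** 2

grad_loss = jax.grad(loss)                         # reverse-mode gradient in w
fast_grad = jax.jit(grad_loss)                     # XLA-compiled version
batched = jax.vmap(fast_grad, in_axes=(None, 0))   # map over a batch of inputs x

w = jnp.ones(3)
xs = jnp.arange(12.0).reshape(4, 3)
print(batched(w, xs).shape)                        # (4, 3): one gradient per example
```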

A theory of representation learning in deep neural networks gives a deep generalisation of kernel methods

A new infinite-width limit, the representation learning limit, is developed that exhibits representation learning mirroring that in finite-width networks while remaining extremely tractable.

The Principles of Deep Learning Theory

For the first time, the exciting practical advances in modern artificial intelligence capabilities can be matched with a set of effective principles, providing a timeless blueprint for theoretical research in deep learning.

On the training dynamics of deep networks with L2 regularization

A dynamical schedule for the regularization parameter that improves performance and speeds up training is proposed, and empirical relations between the performance of the model, the L2 coefficient, the learning rate, and the number of training steps are uncovered.
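
The paper's specific schedule is not reproduced here; the sketch below only shows, generically and with a placeholder decay law, where a time-dependent L2 coefficient lambda(t) enters an SGD update.

```python
# Generic sketch of SGD with a dynamical L2 coefficient (placeholder schedule).
import jax
import jax.numpy as jnp

def lam_schedule(step, lam0=1e-3, decay=1e-3):
    # Placeholder decay law, standing in for (not reproducing) the paper's schedule.
    return lam0 / (1.0 + decay * step)

def sgd_l2_step(params, grad_fn, step, lr=0.1):
    lam_t = lam_schedule(step)
    grads = grad_fn(params)
    # Loss gradient plus lambda(t) * params: L2 regularization whose strength
    # changes over the course of training.
    return jax.tree_util.tree_map(lambda p, g: p - lr * (g + lam_t * p), params, grads)

# Toy usage: linear regression on random data.
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (32, 8))
y = X @ jnp.ones(8)
params = {"w": jnp.zeros(8)}
grad_fn = jax.jit(jax.grad(lambda p: jnp.mean((X @ p["w"] - y) ** 2)))
for t in range(200):
    params = sgd_l2_step(params, grad_fn, t)
print(float(jnp.mean((X @ params["w"] - y) ** 2)))
```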

Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification

This work analyzes in closed form the learning dynamics of stochastic gradient descent for a single-layer neural network classifying a high-dimensional Gaussian mixture in which each cluster is assigned one of two labels, and explores the performance of the algorithm as a function of the control parameters, shedding light on how it navigates the loss landscape.
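
A toy finite-dimensional version of this setup is easy to simulate: online SGD for a single-layer classifier on a two-cluster Gaussian mixture, tracking low-dimensional summary statistics (the overlap of the weights with the cluster mean and the weight norm) of the kind a dynamical mean-field description evolves in closed form. The sketch below uses my own scalings and hyperparameters, not the paper's.

```python
# Online SGD on a two-cluster Gaussian mixture, tracking simple order parameters.
import jax
import jax.numpy as jnp

d, steps, lr = 1000, 10000, 0.5
key = jax.random.PRNGKey(0)
kmu, kw, kdata = jax.random.split(key, 3)
mu = jax.random.normal(kmu, (d,)) / jnp.sqrt(d)        # unit-norm cluster mean
w = jax.random.normal(kw, (d,)) / jnp.sqrt(d)          # initial weights

def sample(k):
    klabel, knoise = jax.random.split(k)
    y = jnp.where(jax.random.bernoulli(klabel), 1.0, -1.0)   # cluster label +/- 1
    x = y * mu + jax.random.normal(knoise, (d,))             # x ~ N(y * mu, I)
    return x, y

@jax.jit
def step(w, k):
    x, y = sample(k)
    loss = lambda w: jax.nn.softplus(-y * jnp.dot(w, x))     # logistic loss
    return w - (lr / d) * jax.grad(loss)(w)                  # online SGD, lr/d scaling

for t, k in enumerate(jax.random.split(kdata, steps)):
    w = step(w, k)
    if t % 2000 == 0:
        # Overlap with the cluster mean and squared weight norm over training time.
        print(t, float(jnp.dot(w, mu)), float(jnp.dot(w, w)))
```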
...