# A theory of representation learning in deep neural networks gives a deep generalisation of kernel methods

```bibtex
@inproceedings{Yang2021ATO,
  title={A theory of representation learning in deep neural networks gives a deep generalisation of kernel methods},
  author={Adam X. Yang and Maxime Robeyns and Edward Milsom and Nandi Schoots and Laurence Aitchison},
  year={2021}
}
```

The successes of modern deep neural networks (DNNs) are founded on their ability to transform inputs across multiple layers to build good high-level representations. It is therefore critical to understand this process of representation learning. However, we cannot use standard theoretical approaches involving infinite width limits, as they eliminate representation learning. We therefore develop a new infinite width limit, the representation learning limit, that exhibits representation learning…
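To make the contrast concrete: in the standard infinite-width limit, a deep network's prior reduces to a fixed kernel computed layer by layer from the inputs alone, with no representation learning. The sketch below illustrates this with the well-known NNGP recursion for a ReLU network (the order-1 arc-cosine kernel); the function name and default hyperparameters are illustrative choices, not taken from the paper.

```python
import numpy as np

def nngp_relu_kernel(X, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel of an infinitely wide ReLU network (illustrative sketch).

    The kernel is a deterministic function of the inputs, computed
    layer by layer -- nothing in it adapts to training targets, which
    is exactly the sense in which this limit lacks representation learning.
    """
    # Input-layer kernel: K^0(x, x') = sigma_b^2 + sigma_w^2 <x, x'> / d
    K = sigma_b2 + sigma_w2 * (X @ X.T) / X.shape[1]
    for _ in range(depth):
        d = np.sqrt(np.diag(K))
        # Cosine similarity under the current kernel, clipped for numerical safety.
        c = np.clip(K / np.outer(d, d), -1.0, 1.0)
        theta = np.arccos(c)
        # Arc-cosine kernel of order 1, the closed form for the ReLU nonlinearity.
        K = sigma_b2 + sigma_w2 / (2 * np.pi) * np.outer(d, d) * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta)
        )
    return K
```

With `sigma_w2=2.0` (the usual He-style variance) the diagonal of the kernel is preserved across layers, so the recursion is stable at any depth; the representation learning limit developed in the paper modifies this picture by letting the intermediate kernels themselves adapt.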

## 2 Citations

### Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks

- Computer Science, ArXiv
- 2022

Comparisons of the self-consistent solution to various approximation schemes including the static NTK approximation, gradient independence assumption, and leading-order perturbation theory are provided, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description.

### Neural Networks as Paths through the Space of Representations

- Computer Science, ArXiv
- 2022

This work develops a simple idea for interpreting the layer-by-layer construction of useful representations, and formalizes this intuitive idea of “distance” by leveraging recent work on metric representational similarity, and shows how it leads to a rich space of geometric concepts.

## References

Showing 1–10 of 47 references.

### Adam: A Method for Stochastic Optimization

- Computer Science, ICLR
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

### Doubly Stochastic Variational Inference for Deep Gaussian Processes

- Computer Science, NIPS
- 2017

This work presents a doubly stochastic variational inference algorithm, which does not force independence between layers in deep Gaussian processes (DGPs), and provides strong empirical evidence that the inference scheme for DGPs works well in practice in both classification and regression.

### Why bigger is not always better: on finite and infinite neural networks

- Computer Science, ICML
- 2020

This work gives analytic results characterising the prior over representations and representation learning in finite deep linear networks and shows empirically that the representations in SOTA architectures such as ResNets trained with SGD are much closer to those suggested by the deep linear results than by the corresponding infinite network.

### Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks

- Computer Science, ArXiv
- 2022

Comparisons of the self-consistent solution to various approximation schemes including the static NTK approximation, gradient independence assumption, and leading-order perturbation theory are provided, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description.

### Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

- Computer Science, NeurIPS
- 2021

This work shows that, in the recently discovered Maximal Update Parametrization (µP), many optimal HPs remain stable even as model size changes, and introduces a new HP tuning paradigm, µTransfer, which parametrizes the target model in µP, tunes the HPs indirectly on a smaller model, and zero-shot transfers them to the full-sized model, i.e., without directly tuning the latter at all.

### Separation of scales and a thermodynamic description of feature learning in some CNNs

- Computer Science, ArXiv
- 2021

It is shown that DNN layers couple only through the second moment (kernels) of their activations and pre-activations, which indicates a separation of scales occurring in fully trained over-parameterized deep convolutional neural networks (CNNs).

### Asymptotics of representation learning in finite Bayesian neural networks

- Computer Science, NeurIPS
- 2021

It is argued that the leading finite-width corrections to the average feature kernels for any Bayesian network with linear readout and Gaussian likelihood have a largely universal form.

### Exact marginal prior distributions of finite Bayesian neural networks

- Computer Science, NeurIPS
- 2021

This work derives exact solutions for the function-space priors, at individual input examples, of a class of finite fully-connected feedforward Bayesian neural networks, characterizing their tail decay and large-width behavior.

### Statistical Mechanics of Deep Linear Neural Networks: The Back-Propagating Renormalization Group

- Computer Science, ArXiv
- 2020

This work is the first exact statistical mechanical study of learning in a family of Deep Neural Networks, and the first development of the Renormalization Group approach to the weight space of these systems.

### Feature Learning in Infinite-Width Neural Networks

- Computer Science, ArXiv
- 2020

It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT, and that any such infinite-width limit can be computed using the Tensor Programs technique.