Corpus ID: 250420855

A theory of representation learning in deep neural networks gives a deep generalisation of kernel methods

@inproceedings{Yang2021ATO,
  title={A theory of representation learning in deep neural networks gives a deep generalisation of kernel methods},
  author={Adam X. Yang and Maxime Robeyns and Edward Milsom and Nandi Schoots and Laurence Aitchison},
  year={2021}
}
The successes of modern deep neural networks (DNNs) are founded on their ability to transform inputs across multiple layers to build good high-level representations. It is therefore critical to understand this process of representation learning. However, we cannot use standard theoretical approaches involving infinite width limits, as they eliminate representation learning. We therefore develop a new infinite width limit, the representation learning limit, that exhibits representation learning…
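
The claim that standard infinite-width limits eliminate representation learning can be seen directly from the NNGP kernel recursion: in that limit every layer's kernel is a deterministic function of the inputs alone, so it cannot adapt to the training targets. Below is a minimal sketch, assuming a fully-connected ReLU network and the standard arc-cosine kernel recursion (illustrative background, not the paper's new limit):

```python
import numpy as np

def relu_nngp_layer(K, sigma_w2=2.0, sigma_b2=0.0):
    """One step of the NNGP kernel recursion for a ReLU network.

    K is the (n, n) kernel of the previous layer. The output depends only
    on K (and hence only on the inputs), never on the training targets.
    """
    d = np.sqrt(np.diag(K))
    cos_theta = np.clip(K / np.outer(d, d), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # Arc-cosine kernel of order 1 (Cho & Saul): E[relu(u) relu(v)]
    J = (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)
    return sigma_w2 * np.outer(d, d) * J + sigma_b2

X = np.random.randn(5, 3)          # five inputs fix the layer-0 kernel
K = X @ X.T / X.shape[1]
for _ in range(4):                 # four hidden layers
    K = relu_nngp_layer(K)         # labels never enter this loop
```

Because the kernel never sees the labels, the representation stays frozen at its prior value; the representation learning limit developed in the paper is constructed precisely so that these layer kernels can adapt to the data.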

Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks

Comparisons of the self-consistent solution to various approximation schemes, including the static NTK approximation, the gradient independence assumption, and leading-order perturbation theory, show that each of these approximations can break down in regimes where the general self-consistent solution still provides an accurate description.
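
For context, the neural tangent kernel behind the "static NTK approximation" admits a one-line definition (standard notation, assumed here rather than quoted from the paper):

$$\Theta_t(x, x') = \big\langle \nabla_\theta f_{\theta_t}(x),\; \nabla_\theta f_{\theta_t}(x') \big\rangle .$$

In the standard infinite-width limit $\Theta_t$ remains at its initialization value throughout training; the self-consistent dynamical field theory instead tracks how such kernels evolve, which is why it stays accurate where the static approximation breaks down.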

Neural Networks as Paths through the Space of Representations

This work develops a simple idea for interpreting the layer-by-layer construction of useful representations, formalizes the intuitive notion of "distance" between representations by leveraging recent work on metric representational similarity, and shows how this leads to a rich space of geometric concepts.
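
To make the "distance between layer representations" idea concrete, here is a minimal sketch that measures dissimilarity between consecutive layers; linear CKA is used as a common stand-in for a representational similarity measure, and the random activations are placeholders, so neither is necessarily the choice made in that work:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between representations X (n, d1) and Y (n, d2)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic_xy = np.linalg.norm(Y.T @ X, 'fro') ** 2
    hsic_xx = np.linalg.norm(X.T @ X, 'fro') ** 2
    hsic_yy = np.linalg.norm(Y.T @ Y, 'fro') ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)

# Treat a network as a path: sum the dissimilarity of consecutive layers.
layers = [np.random.randn(100, 64) for _ in range(5)]   # placeholder activations
steps = [1.0 - linear_cka(a, b) for a, b in zip(layers, layers[1:])]
path_length = sum(steps)
```

(1 − CKA is not a proper metric, unlike the metrics the representational-similarity literature works with, but it conveys the path-length intuition.)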

References

Showing 1-10 of 47 references

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
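
The update itself is compact; a minimal NumPy sketch of a single Adam step with the usual default hyperparameters (the toy quadratic objective below is only for illustration):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: adaptive estimates of the first and second moments."""
    m = beta1 * m + (1 - beta1) * grad              # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2         # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                    # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 201):
    grad = 2.0 * (theta - 1.0)                      # gradient of ||theta - 1||^2
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
```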

Doubly Stochastic Variational Inference for Deep Gaussian Processes

This work presents a doubly stochastic variational inference algorithm, which does not force independence between layers in Deep Gaussian processes, and provides strong empirical evidence that the inference scheme for DGPs works well in practice in both classification and regression.
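
A sketch of the bound this scheme optimizes (notation assumed here rather than quoted): it factorizes over data points, yet the marginal at the top layer is obtained by sampling through the layers,

$$\mathcal{L} = \sum_{n=1}^{N} \mathbb{E}_{q(f_n^{L})}\big[\log p(y_n \mid f_n^{L})\big] - \sum_{l=1}^{L} \mathrm{KL}\big[q(\mathbf{U}^{l}) \,\big\|\, p(\mathbf{U}^{l})\big],$$

where each sample $f_n^{l} \sim q(f_n^{l} \mid f_n^{l-1})$ depends on the layer below, so correlations between layers are propagated rather than assumed away.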

Why bigger is not always better: on finite and infinite neural networks

This work gives analytic results characterising the prior over representations and representation learning in finite deep linear networks, and shows empirically that the representations in state-of-the-art architectures such as ResNets trained with SGD are much closer to those suggested by the deep linear results than by the corresponding infinite network.
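
The contrast between finite and infinite networks can be reproduced in a few lines: sample the prior over final-layer Gram matrices of a deep linear network and watch it concentrate as width grows. This is an illustrative experiment under standard iid Gaussian initialization, not the paper's analytic calculation:

```python
import numpy as np

def sample_gram(width, depth, X, n_samples=200):
    """Sample the Gram matrix of a deep linear network's final-layer features."""
    grams = []
    for _ in range(n_samples):
        H = X
        for _ in range(depth):
            W = np.random.randn(H.shape[1], width) / np.sqrt(H.shape[1])
            H = H @ W
        grams.append(H @ H.T / width)
    return np.array(grams)

X = np.random.randn(4, 10)
narrow = sample_gram(width=16,   depth=3, X=X)
wide   = sample_gram(width=2048, depth=3, X=X)
# Fluctuations in the prior over representations shrink as width grows,
# which is what removes representation learning in the infinite limit.
print(narrow.std(axis=0).mean(), wide.std(axis=0).mean())
```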

Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks

Comparisons of the self-consistent solution to various approximation schemes, including the static NTK approximation, the gradient independence assumption, and leading-order perturbation theory, show that each of these approximations can break down in regimes where the general self-consistent solution still provides an accurate description.

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

This work shows that, in the recently discovered Maximal Update Parametrization (µP), many optimal HPs remain stable even as model size changes, and introduces a new HP tuning paradigm, µTransfer: parametrize the target model in µP, tune the HPs indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all.
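
A schematic of the workflow, with the width scaling stated as an assumption for illustration (the paper specifies the full µP parametrization table; here only the commonly cited rule that hidden-layer Adam learning rates scale like 1/width is used, and the tuned value is hypothetical):

```python
def mup_scaled_lr(base_lr, base_width, width):
    """Rescale a hidden-layer Adam learning rate when width changes (assumed µP rule)."""
    return base_lr * base_width / width

base_width, target_width = 256, 8192

# 1. Tune hyperparameters on the narrow proxy model (cheap sweeps).
best_base_lr = 3e-4            # hypothetical value found at width 256

# 2. Zero-shot transfer: reuse the tuned value at the target width, no re-tuning.
target_lr = mup_scaled_lr(best_base_lr, base_width, target_width)
```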

Separation of scales and a thermodynamic description of feature learning in some CNNs

It is shown that DNN layers couple only through the second moments (kernels) of their activations and pre-activations, indicating a separation of scales in fully trained over-parameterized deep convolutional neural networks (CNNs).

Asymptotics of representation learning in finite Bayesian neural networks

It is argued that the leading finite-width corrections to the average feature kernels for any Bayesian network with linear readout and Gaussian likelihood have a largely universal form.

Exact marginal prior distributions of finite Bayesian neural networks

This work derives exact function-space priors, for individual input examples, of a class of finite fully-connected feedforward Bayesian neural networks, and characterizes them in terms of their tail decay and large-width behavior.

Statistical Mechanics of Deep Linear Neural Networks: The Back-Propagating Renormalization Group

This work is the first exact statistical mechanical study of learning in a family of Deep Neural Networks, and the first development of the Renormalization Group approach to the weight space of these systems.

Feature Learning in Infinite-Width Neural Networks

It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features (which is crucial for pretraining and transfer learning, as with BERT), and that any such feature-learning infinite-width limit can be computed using the Tensor Programs technique.