# Why bigger is not always better: on finite and infinite neural networks

@article{Aitchison2020WhyBI, title={Why bigger is not always better: on finite and infinite neural networks}, author={Laurence Aitchison}, journal={ArXiv}, year={2020}, volume={abs/1910.08013} }

Recent work has argued that neural networks can be understood theoretically by taking the number of channels to infinity, at which point the outputs become Gaussian process (GP) distributed. However, we note that infinite Bayesian neural networks lack a key facet of the behaviour of real neural networks: the fixed kernel, determined only by network hyperparameters, implies that they cannot do any form of representation learning. The lack of representation or equivalently kernel learning leads…
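The fixed-kernel point can be made concrete with a short sketch (illustrative, not from the paper; function name and hyperparameters are assumptions). In the infinite-width limit, each layer's kernel is computed purely from the previous layer's kernel and the weight/bias variances, so the kernel is fully determined by hyperparameters before any data is seen:

```python
import numpy as np

def relu_nngp_kernel(K, sigma_w=1.0, sigma_b=0.0):
    """One step of the NNGP kernel recursion for an infinitely wide
    ReLU layer (the arc-cosine kernel).  The output depends only on
    the previous kernel and the weight/bias variances, i.e. on
    hyperparameters, never on training targets -- which is why the
    infinite-width kernel cannot adapt to the data."""
    diag = np.sqrt(np.diag(K))
    outer = np.outer(diag, diag)                   # sqrt(K_xx * K_x'x')
    cos_theta = np.clip(K / outer, -1.0, 1.0)      # guard rounding error
    theta = np.arccos(cos_theta)
    J = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    return sigma_b ** 2 + sigma_w ** 2 * outer * J

# The kernel of a two-hidden-layer network is fixed before any data is seen:
X = np.random.randn(4, 3)
K = X @ X.T                                        # input (linear) kernel
for _ in range(2):
    K = relu_nngp_kernel(K, sigma_w=np.sqrt(2.0))  # sigma_w^2 = 2 preserves scale
```

Because no step of this recursion touches the targets, posterior inference with this kernel is GP regression with a data-independent covariance, which is the sense in which representation learning is absent.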

## 30 Citations

Deep neural networks with dependent weights: Gaussian Process mixture limit, heavy tails, sparsity and compressibility

- Computer Science, ArXiv
- 2022

The infinite-width limit of deep feedforward neural networks whose weights are dependent and modelled via a mixture of Gaussian distributions is studied; it is shown that, in this regime, the weights are compressible and feature learning is possible.

Asymptotics of representation learning in finite Bayesian neural networks

- Computer Science, NeurIPS
- 2021

It is argued that the leading finite-width corrections to the average feature kernels for any Bayesian network with linear readout and Gaussian likelihood have a largely universal form.

Deep kernel machines: exact inference with representation learning in infinite Bayesian neural networks

- Computer Science
- 2021

This work gives a proof of unimodality for linear kernels, and a number of experiments in the nonlinear case in which all deep kernel machine initializations the authors tried converged to the same solution.

Supplemental Material for: “Asymptotics of representation learning in finite Bayesian neural networks”

- Computer Science
- 2021

The cumulant generating function of learned features for a MLP and general form of the perturbative layer integrals for a deep linear network are studied.

Depth induces scale-averaging in overparameterized linear Bayesian neural networks

- Computer Science, 2021 55th Asilomar Conference on Signals, Systems, and Computers
- 2021

Finite deep linear Bayesian neural networks are interpreted as data-dependent scale mixtures of Gaussian process predictors across output channels to study representation learning in these networks, allowing us to connect limiting results obtained in previous studies within a unified framework.

Separation of scales and a thermodynamic description of feature learning in some CNNs

- Computer Science, ArXiv
- 2021

It is shown that DNN layers couple only through the second moment (kernels) of their activations and pre-activations, which indicates a separation of scales occurring in fully trained over-parameterized deep convolutional neural networks (CNNs).

Deep kernel processes

- Computer Science, ICML
- 2021

A tractable deep kernel process, the deep inverse Wishart process, is defined, and a doubly-stochastic inducing-point variational inference scheme is given that operates on the Gram matrices, not on the features, as in DGPs.

A fast point solver for deep nonlinear function approximators

- Computer Science, ArXiv
- 2021

A Newton-like method for DKPs that converges in around 10 steps is presented, exploiting matrix solvers initially developed in the control theory literature, and is generalised to arbitrary DKP architectures by developing “kernel backprop” and algorithms for “kernel autodiff”.

Complexity from Adaptive-Symmetries Breaking: Global Minima in the Statistical Mechanics of Deep Neural Networks

- Physics, ArXiv
- 2022

This work states that in complex systems such as DNA molecules, different nucleotide sequences consist of different weak bonds with similar free energy; energy fluctuations would break the symmetries that conserve the free energy of the nucleotide sequences, which, selected by the environment, would lead to organisms with different phenotypes.

Do autoencoders need a bottleneck for anomaly detection?

- Computer Science, ArXiv
- 2022

It is found that non-bottlenecked architectures can outperform their bottlenecked counterparts on the popular task of CIFAR (inliers) vs SVHN (anomalies), among other tasks, shedding light on the potential of developing non-bottlenecked AEs for improving anomaly detection.

## References

Showing 1–10 of 18 references

Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

- Computer Science, ICLR
- 2019

This work derives an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and introduces a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible.
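The Monte Carlo idea can be sketched in a few lines (an illustrative reconstruction, not the authors' code; names and defaults are assumptions): sample finite random networks from the prior and average the empirical covariance of their post-activation features, which converges to the NNGP kernel as width and the number of samples grow.

```python
import numpy as np

def mc_nngp_kernel(X, width=1024, n_samples=100, sigma_w2=2.0, seed=0):
    """Monte Carlo estimate of the one-hidden-layer ReLU NNGP kernel:
    draw random weight matrices from the prior and average the
    empirical covariance of the hidden-layer features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    K = np.zeros((n, n))
    for _ in range(n_samples):
        W = rng.normal(0.0, np.sqrt(sigma_w2 / d), size=(d, width))
        H = np.maximum(X @ W, 0.0)      # ReLU post-activations
        K += (H @ H.T) / width          # empirical feature covariance
    return K / n_samples
```

This is useful precisely in the situation the summary describes: when the analytic kernel has too many terms to be computationally feasible, the estimate still concentrates on it at Monte Carlo rates.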

Gaussian Process Behaviour in Wide Deep Neural Networks

- Computer Science, ICLR
- 2018

It is shown that, under broad conditions, as the authors make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.

Deep Convolutional Networks as shallow Gaussian Processes

- Computer Science, ICLR
- 2019

We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many…

Wide Neural Networks with Bottlenecks are Deep Gaussian Processes

- Computer Science, J. Mach. Learn. Res.
- 2020

This paper considers the wide limit of BNNs where some hidden layers, called "bottlenecks", are held at finite width, and produces a composition of GPs that is a "bottleneck neural network Gaussian process" (bottleneck NNGP).

Enhanced Convolutional Neural Tangent Kernels

- Computer Science, ArXiv
- 2019

The resulting kernel, CNN-GP with LAP and horizontal flip data augmentation, achieves 89% accuracy, matching the performance of AlexNet, which is the best such result the authors know of for a classifier that is not a trained neural network.

On Exact Computation with an Infinitely Wide Neural Net

- Computer Science, NeurIPS
- 2019

The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which it is called Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.

Deep Neural Networks as Gaussian Processes

- Computer Science, ICLR
- 2018

The exact equivalence between infinitely wide deep networks and GPs is derived, and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.

Neural Ordinary Differential Equations

- Computer Science, NeurIPS
- 2018

This work shows how to scalably backpropagate through any ODE solver, without access to its internal operations, which allows end-to-end training of ODEs within larger models.

Neural tangent kernel: convergence and generalization in neural networks (invited paper)

- Computer Science, NeurIPS
- 2018

This talk introduces the Neural Tangent Kernel formalism, gives a number of results on it, and explains how they provide insight into the dynamics of neural networks during training and into their generalization properties.

Deep Gaussian Processes for Regression using Approximate Expectation Propagation

- Computer Science, ICML
- 2016

A new approximate Bayesian learning scheme is developed that enables DGPs to be applied to a range of medium to large scale regression problems for the first time and is almost always better than state-of-the-art deterministic and sampling-based approximate inference methods for Bayesian neural networks.