Corpus ID: 204744071

Why bigger is not always better: on finite and infinite neural networks

@article{Aitchison2020WhyBI,
  title={Why bigger is not always better: on finite and infinite neural networks},
  author={Laurence Aitchison},
  journal={ArXiv},
  year={2020},
  volume={abs/1910.08013}
}
Recent work has argued that neural networks can be understood theoretically by taking the number of channels to infinity, at which point the outputs become Gaussian process (GP) distributed. However, we note that infinite Bayesian neural networks lack a key facet of the behaviour of real neural networks: the fixed kernel, determined only by network hyperparameters, implies that they cannot do any form of representation learning. The lack of representation (or, equivalently, kernel) learning leads…
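The fixed-kernel point can be made concrete with the standard NNGP recursion for a fully connected ReLU network. The sketch below is only an illustration under assumed hyperparameters (depth, weight variance sigma_w2, bias variance sigma_b2), not code from the paper: the kernel is a function of the inputs and hyperparameters alone, so the training labels only ever enter through a subsequent GP regression step with this fixed kernel.

import numpy as np

def nngp_relu_kernel(X, depth=3, sigma_w2=2.0, sigma_b2=0.1):
    """NNGP kernel of an infinitely wide fully connected ReLU network.

    X: (P, D) array of inputs; returns the (P, P) kernel matrix.
    Note that K depends only on X and the hyperparameters -- never on labels.
    """
    K = sigma_b2 + sigma_w2 * (X @ X.T) / X.shape[1]   # layer-1 pre-activation covariance
    for _ in range(depth):
        d = np.sqrt(np.diag(K))
        cos_theta = np.clip(K / np.outer(d, d), -1.0, 1.0)
        theta = np.arccos(cos_theta)
        # E[relu(u) relu(v)] for Gaussian (u, v) with covariance K (arc-cosine kernel)
        E = np.outer(d, d) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
        K = sigma_b2 + sigma_w2 * E
    return K

X = np.random.randn(5, 10)          # 5 toy inputs in 10 dimensions
K = nngp_relu_kernel(X)
print(K.shape)                      # (5, 5); fixed once X and the hyperparameters are fixed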

Citations

Deep neural networks with dependent weights: Gaussian Process mixture limit, heavy tails, sparsity and compressibility
TLDR
The infinite-width limit of deep feedforward neural networks whose weights are dependent, and modelled via a mixture of Gaussian distributions, is studied; it is shown that, in this regime, the weights are compressible and feature learning is possible.
Asymptotics of representation learning in finite Bayesian neural networks
TLDR
It is argued that the leading finite-width corrections to the average feature kernels for any Bayesian network with linear readout and Gaussian likelihood have a largely universal form.
Deep kernel machines: exact inference with representation learning in infinite Bayesian neural networks
TLDR
This work gives a proof of unimodality for linear kernels, and reports a number of experiments in the nonlinear case in which all deep kernel machine initializations the authors tried converged to the same solution.
Supplemental Material for: “Asymptotics of representation learning in finite Bayesian neural networks”
TLDR
The cumulant generating function of learned features for an MLP and the general form of the perturbative layer integrals for a deep linear network are studied.
Depth induces scale-averaging in overparameterized linear Bayesian neural networks
TLDR
Finite deep linear Bayesian neural networks are interpreted as data-dependent scale mixtures of Gaussian process predictors across output channels in order to study representation learning in these networks, connecting limiting results obtained in previous studies within a unified framework.
Separation of scales and a thermodynamic description of feature learning in some CNNs
TLDR
It is shown that DNN layers couple only through the second moments (kernels) of their activations and pre-activations, which indicates a separation of scales occurring in fully trained over-parameterized deep convolutional neural networks (CNNs).
Deep kernel processes
TLDR
A tractable deep kernel process, the deep inverse Wishart process, is defined, and a doubly-stochastic inducing-point variational inference scheme is given that operates on the Gram matrices, not on the features, as in DGPs.
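To give a rough sense of what it means to work on Gram matrices rather than features, the sketch below samples from a deep Wishart process prior. It is a simplification for illustration only (the paper's tractable model is the deep inverse Wishart process, and the kernel, width, and variances here are arbitrary toy choices): each layer's Gram matrix is drawn from a Wishart distribution whose mean is a kernel of the previous layer's Gram matrix, so finite width leaves genuine randomness in the representation.

import numpy as np
from scipy.stats import wishart

def relu_kernel(G, sigma_w2=2.0, sigma_b2=0.1):
    """Arc-cosine (ReLU) kernel applied to a Gram matrix G."""
    d = np.sqrt(np.diag(G))
    cos_theta = np.clip(G / np.outer(d, d), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    E = np.outer(d, d) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    return sigma_b2 + sigma_w2 * E

def sample_deep_wishart(X, depth=3, width=50):
    """Draw one sample of per-layer Gram matrices from a deep Wishart prior."""
    G = X @ X.T / X.shape[1]                 # input Gram matrix
    grams = [G]
    for _ in range(depth):
        K = relu_kernel(G)
        # G_{l+1} = F F^T / width with the columns of F drawn iid from N(0, K),
        # i.e. a Wishart sample whose mean is K.
        G = wishart(df=width, scale=K / width).rvs()
        grams.append(G)
    return grams

X = np.random.randn(4, 10)
for l, G in enumerate(sample_deep_wishart(X)):
    print(f"layer {l}: Gram matrix of shape {G.shape}")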
A fast point solver for deep nonlinear function approximators
TLDR
A Newton-like method for DKPs is developed that converges in around 10 steps, exploiting matrix solvers initially developed in the control theory literature, and generalises to arbitrary DKP architectures via “kernel backprop” and algorithms for “kernel autodiff”.
Complexity from Adaptive-Symmetries Breaking: Global Minima in the Statistical Mechanics of Deep Neural Networks
TLDR
This work states that in complex systems such as DNA molecules, different nucleotide sequences consist of different weak bonds with similar free energy; energy fluctuations would break the symmetries that conserve the free energy of the nucleotide sequences, which, when selected by the environment, would lead to organisms with different phenotypes.
Do autoencoders need a bottleneck for anomaly detection?
TLDR
It is found that non-bottlenecked architectures can outperform their bottlenecked counterparts on the popular task of CIFAR (inliers) vs SVHN (anomalies), among other tasks, shedding light on the potential of developing non-bottlenecked AEs for improving anomaly detection.

References

SHOWING 1-10 OF 18 REFERENCES
Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes
TLDR
This work derives an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and introduces a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible.
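The Monte Carlo idea can be sketched in a few lines. The example below is only an illustration (it uses a small fully connected ReLU network rather than the CNNs treated in the paper, and the width, depth, and sample count are arbitrary): sample many networks from the prior, evaluate them on the inputs of interest, and average outer products of the outputs to estimate the corresponding GP kernel.

import numpy as np

def mc_nngp_kernel(X, width=512, depth=3, sigma_w2=2.0, sigma_b2=0.1, n_samples=2000):
    """Monte Carlo estimate of the output covariance of a random ReLU MLP prior."""
    P, D = X.shape
    K_hat = np.zeros((P, P))
    for _ in range(n_samples):
        h, fan_in = X, D
        for _ in range(depth):
            W = np.random.randn(fan_in, width) * np.sqrt(sigma_w2 / fan_in)
            b = np.random.randn(width) * np.sqrt(sigma_b2)
            h = np.maximum(h @ W + b, 0.0)       # ReLU hidden layer
            fan_in = width
        w_out = np.random.randn(fan_in) * np.sqrt(sigma_w2 / fan_in)
        b_out = np.random.randn() * np.sqrt(sigma_b2)
        f = h @ w_out + b_out                    # scalar output for each input
        K_hat += np.outer(f, f)
    return K_hat / n_samples

X = np.random.randn(5, 10)
print(np.round(mc_nngp_kernel(X), 2))  # approaches the analytic NNGP kernel as width and samples grow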
Gaussian Process Behaviour in Wide Deep Neural Networks
TLDR
It is shown that, under broad conditions, as the architecture is made increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.
Deep Convolutional Networks as shallow Gaussian Processes
We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many convolutional filters.
Wide Neural Networks with Bottlenecks are Deep Gaussian Processes
TLDR
This paper considers the wide limit of BNNs in which some hidden layers, called “bottlenecks”, are held at finite width, and shows that this limit produces a composition of GPs, termed a “bottleneck neural network Gaussian process” (bottleneck NNGP).
Enhanced Convolutional Neural Tangent Kernels
TLDR
The resulting kernel, CNN-GP with LAP and horizontal flip data augmentation, achieves 89% accuracy, matching the performance of AlexNet, which is the best such result the authors know of for a classifier that is not a trained neural network.
On Exact Computation with an Infinitely Wide Neural Net
TLDR
The current paper gives the first efficient exact algorithm for computing the extension of the NTK to convolutional neural nets, which is called the Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.
Deep Neural Networks as Gaussian Processes
TLDR
The exact equivalence between infinitely wide deep networks and GPs is derived, and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.
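For reference, prediction with the infinite-width network is ordinary GP regression with the fixed NNGP kernel, using the standard conditioning formulas below (generic GP notation, not taken from the paper). The training targets y only enter here; the kernel K itself is set by the architecture and hyperparameters, which is precisely the sense in which no representation is learned.

\begin{aligned}
\boldsymbol{\mu}_* &= K(X_*, X)\,\bigl[K(X, X) + \sigma_n^2 I\bigr]^{-1}\mathbf{y},\\
\boldsymbol{\Sigma}_* &= K(X_*, X_*) - K(X_*, X)\,\bigl[K(X, X) + \sigma_n^2 I\bigr]^{-1}K(X, X_*).
\end{aligned}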
Neural Ordinary Differential Equations
TLDR
This work shows how to scalably backpropagate through any ODE solver, without access to its internal operations, which allows end-to-end training of ODEs within larger models.
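The claim about backpropagating through any ODE solver refers to the adjoint sensitivity method. The sketch below is an illustrative reconstruction on a toy scalar ODE with made-up dynamics and loss, not the paper's code: the gradient with respect to the parameter is obtained by solving an augmented ODE backwards in time, treating the forward solver as a black box.

import numpy as np
from scipy.integrate import solve_ivp

# Toy dynamics dz/dt = f(z, theta) = theta * z with loss L = 0.5 * z(T)^2.
def f(z, theta):
    return theta * z

def df_dz(z, theta):
    return theta

def df_dtheta(z, theta):
    return z

def forward(z0, theta, T):
    sol = solve_ivp(lambda t, z: f(z, theta), (0.0, T), [z0], rtol=1e-8, atol=1e-10)
    return sol.y[0, -1]

def adjoint_grad(z0, theta, T):
    """dL/dtheta via the adjoint method: integrate [z, a, dL/dtheta] backwards in time."""
    zT = forward(z0, theta, T)
    aT = zT                                    # a(T) = dL/dz(T) for L = 0.5 * z(T)^2
    def augmented(t, s):
        z, a, _ = s
        return [f(z, theta),                   # recover the state trajectory backwards
                -a * df_dz(z, theta),          # da/dt = -a * df/dz
                -a * df_dtheta(z, theta)]      # accumulates dL/dtheta as we integrate T -> 0
    sol = solve_ivp(augmented, (T, 0.0), [zT, aT, 0.0], rtol=1e-8, atol=1e-10)
    return sol.y[2, -1]

z0, theta, T = 1.0, 0.3, 2.0
print(adjoint_grad(z0, theta, T))              # adjoint-method gradient
print(T * forward(z0, theta, T) ** 2)          # analytic dL/dtheta for this toy problem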
Neural tangent kernel: convergence and generalization in neural networks (invited paper)
TLDR
This talk will introduce this formalism, give a number of results on the Neural Tangent Kernel, and explain how they provide insight into the dynamics of neural networks during training and into their generalization properties.
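The (empirical) neural tangent kernel of a finite network is just the Gram matrix of parameter gradients, Theta(x, x') = grad_theta f(x) . grad_theta f(x'); the results discussed in the talk concern the infinite-width limit, where this kernel becomes deterministic and essentially constant during training. The sketch below is a generic illustration with a one-hidden-layer ReLU network and hand-coded gradients, not material from the talk.

import numpy as np

def empirical_ntk(X, W1, w2):
    """Empirical NTK of f(x) = w2 . relu(W1 @ x), via explicit parameter gradients."""
    grads = []
    for x in X:
        h_pre = W1 @ x                         # hidden pre-activations, shape (H,)
        h = np.maximum(h_pre, 0.0)
        mask = (h_pre > 0).astype(float)       # relu'(h_pre)
        g_W1 = np.outer(w2 * mask, x)          # df/dW1, shape (H, D)
        g_w2 = h                               # df/dw2, shape (H,)
        grads.append(np.concatenate([g_W1.ravel(), g_w2]))
    J = np.stack(grads)                        # (P, n_params) Jacobian
    return J @ J.T                             # Theta[i, j] = grad_i . grad_j

rng = np.random.default_rng(0)
D, H = 10, 256
X = rng.standard_normal((5, D))
W1 = rng.standard_normal((H, D)) / np.sqrt(D)
w2 = rng.standard_normal(H) / np.sqrt(H)
print(empirical_ntk(X, W1, w2).shape)          # (5, 5) kernel matrix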
Deep Gaussian Processes for Regression using Approximate Expectation Propagation
TLDR
A new approximate Bayesian learning scheme is developed that enables DGPs to be applied to a range of medium to large scale regression problems for the first time and is almost always better than state-of-the-art deterministic and sampling-based approximate inference methods for Bayesian neural networks.