• Corpus ID: 203610462

Non-Gaussian processes and neural networks at finite widths

@inproceedings{Yaida2019NonGaussianPA,
  title={Non-Gaussian processes and neural networks at finite widths},
  author={Sho Yaida},
  booktitle={Mathematical and Scientific Machine Learning},
  year={2019}
}
  • Sho Yaida
  • Published in Mathematical and Scientific Machine Learning, 25 September 2019
  • Computer Science
Gaussian processes are ubiquitous in nature and engineering. A case in point is a class of neural networks in the infinite-width limit, whose priors correspond to Gaussian processes. Here we perturbatively extend this correspondence to finite-width neural networks, yielding non-Gaussian processes as priors. The methodology developed herein allows us to track the flow of preactivation distributions by progressively integrating out random variables from lower to higher layers, reminiscent of… 
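
The finite-width non-Gaussianity treated perturbatively here can be seen numerically even in a toy setting. Below is a minimal NumPy sketch (an illustration of the general phenomenon, not code from the paper): it samples random one-hidden-layer tanh networks at several widths and estimates the excess kurtosis of the output preactivation at a fixed input, which shrinks toward zero as the width grows and the prior approaches a Gaussian process.

import numpy as np

rng = np.random.default_rng(0)

def sample_output(width, n_samples=200_000, q1=1.0):
    # Toy model: with i.i.d. Gaussian weights and 1/sqrt(fan-in) scaling, the
    # first-layer preactivations at a fixed input are i.i.d. N(0, q1); the output
    # preactivation contracts their tanh image with the second-layer weights.
    a = rng.normal(0.0, np.sqrt(q1), size=(n_samples, width))
    w2 = rng.standard_normal((n_samples, width)) / np.sqrt(width)
    return np.sum(w2 * np.tanh(a), axis=1)

for n in (2, 8, 32):
    z = sample_output(n)
    # Excess kurtosis = connected four-point function / variance^2; zero for a Gaussian.
    excess_kurtosis = np.mean(z**4) / np.mean(z**2) ** 2 - 3.0
    print(f"width {n:3d}: excess kurtosis ~ {excess_kurtosis:+.3f}")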

Citations

Non-asymptotic approximations of neural networks by Gaussian processes

This work studies the extent to which wide neural networks initialized with random weights may be approximated by Gaussian processes, establishing explicit convergence rates for the central limit theorem in an infinite-dimensional functional space metrized with a natural transportation distance.

Exact priors of finite neural networks

This work derives exact solutions for the output priors for individual input examples of a class of finite fully-connected feedforward Bayesian neural networks.

Neural networks and quantum field theory

This work proposes a theoretical understanding of neural networks in terms of Wilsonian effective field theory, valid for any of the many architectures that become a GP in an asymptotic limit, a property preserved under certain types of training.

On the asymptotics of wide networks with polynomial activations

This work proves the conjecture for deep networks with polynomial activation functions, greatly extending the validity of earlier results on wide-network asymptotics and pointing out a difference in the asymptotic behavior of networks with analytic (and non-linear) activation functions and those with piecewise-linear activations such as ReLU.

Finite Versus Infinite Neural Networks: an Empirical Study

Improved best practices for using NNGP and NT kernels for prediction are developed, including a novel ensembling technique that achieves state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class the authors consider.

Asymptotics of Wide Convolutional Neural Networks

It is found that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width, consistent with finite width models generalizing either better or worse than their infinite width counterparts.

Predicting the outputs of finite networks trained with noisy gradients

This work presents a DNN training protocol involving noise whose outcome is mappable to a certain non-Gaussian stochastic process, and shows that this mapping predicts the outputs of empirical finite networks with high accuracy, improving upon the accuracy of GP predictions by over an order of magnitude.

Explaining Neural Scaling Laws

This work identifies variance-limited and resolution-limited scaling behavior for both dataset and model size, yielding four related scaling regimes with respect to the number of model parameters P and the dataset size D.

Explicitly Bayesian Regularizations in Deep Learning

A novel probabilistic representation for the hidden layers of CNNs is introduced, and it is demonstrated that CNNs correspond to Bayesian networks with serial connections; thus CNNs have explicitly Bayesian regularizations based on Bayesian regularization theory.

Generalization bounds for deep learning

Desiderata for techniques that predict generalization errors for deep learning models in supervised learning are introduced, and a marginal-likelihood PAC-Bayesian bound is derived that fulfills desiderata 1-3 and 5.

References

Showing 1–10 of 36 references

Finite size corrections for neural network Gaussian processes

It is demonstrated that for an ensemble of large, finite, fully connected networks with a single hidden layer the distribution of outputs at initialization is well described by a Gaussian perturbed by the fourth Hermite polynomial for weights drawn from a symmetric distribution.
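
Schematically, such a perturbed Gaussian takes an Edgeworth-like form; the expression below is a generic sketch, with ε standing in for the width-suppressed coefficient rather than the paper's exact result:

\[
p(z) \;\approx\; \frac{1}{\sqrt{2\pi q}}\, e^{-z^{2}/2q}
\left[\, 1 + \epsilon\, \mathrm{He}_{4}\!\left(\frac{z}{\sqrt{q}}\right) \right],
\qquad \epsilon = O\!\left(\frac{1}{N}\right),
\]

where He₄(u) = u⁴ − 6u² + 3 is the fourth (probabilists') Hermite polynomial, q the Gaussian variance, and N the hidden-layer width.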

Computing with Infinite Networks

For neural networks with a wide class of weight-priors, it can be shown that in the limit of an infinite number of hidden units the prior over functions tends to a Gaussian process.

Gaussian Process Behaviour in Wide Deep Neural Networks

It is shown that, under broad conditions, as the architecture is made increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.

Deep Neural Networks as Gaussian Processes

The exact equivalence between infinitely wide deep networks and GPs is derived and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.
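
The equivalence rests on the now-standard NNGP kernel recursion for fully-connected networks, written here in generic notation as a sketch rather than in this paper's exact conventions. With activation φ, weight variance σ_w²/fan-in, and bias variance σ_b²,

\[
K^{(\ell+1)}(x, x') \;=\; \sigma_b^{2} \;+\; \sigma_w^{2}\,
\mathbb{E}_{(u,v)\sim\mathcal{N}\left(0,\,\Sigma^{(\ell)}(x,x')\right)}\!\left[\phi(u)\,\phi(v)\right],
\qquad
K^{(1)}(x, x') \;=\; \sigma_b^{2} + \sigma_w^{2}\,\frac{x \cdot x'}{d},
\]

where Σ^{(ℓ)}(x, x') is the 2×2 matrix with entries K^{(ℓ)}(x, x), K^{(ℓ)}(x, x'), and K^{(ℓ)}(x', x'); in the infinite-width limit the output prior is a Gaussian process with covariance given by the final layer's K.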

Priors for Infinite Networks

In this chapter, I show that priors over network parameters can be defined in such a way that the corresponding priors over functions computed by the network reach reasonable limits as the number of hidden units goes to infinity.

Wide neural networks of any depth evolve as linear models under gradient descent

This work shows that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
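
Concretely, the linear model in question is the first-order Taylor expansion of the network in its parameters (standard notation, sketched here rather than quoted from the paper):

\[
f_{\mathrm{lin}}(x;\theta) \;=\; f(x;\theta_{0}) \;+\; \nabla_{\theta} f(x;\theta_{0})\,(\theta - \theta_{0}),
\]

whose gradient-descent dynamics are governed by the neural tangent kernel evaluated at initialization; in the infinite-width limit the full network and its linearization remain close throughout training.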

Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

This work derives an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and introduces a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible.
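
The Monte Carlo idea is straightforward to sketch. The toy NumPy example below (a fully-connected stand-in rather than the paper's CNN setting; the function names and hyperparameters are illustrative) draws many random initializations, evaluates each network on a fixed set of inputs, and averages products of outputs to estimate the NNGP covariance.

import numpy as np

rng = np.random.default_rng(1)

def random_net_outputs(X, width=256, sigma_w=1.0, sigma_b=0.1):
    # One random two-hidden-layer tanh network, evaluated on every row of X.
    h = X
    for fan_in in (X.shape[1], width):
        W = rng.normal(0.0, sigma_w / np.sqrt(fan_in), size=(fan_in, width))
        b = rng.normal(0.0, sigma_b, size=width)
        h = np.tanh(h @ W + b)
    w_out = rng.normal(0.0, sigma_w / np.sqrt(width), size=width)
    return h @ w_out                      # one scalar output per input row

def monte_carlo_nngp_kernel(X, n_draws=2000):
    # Estimate K(x, x') = E_init[f(x) f(x')] by averaging over random networks.
    outputs = np.stack([random_net_outputs(X) for _ in range(n_draws)])
    return outputs.T @ outputs / n_draws

X = rng.standard_normal((5, 3))           # five toy inputs in three dimensions
print(monte_carlo_nngp_kernel(X).round(3))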

Deep Information Propagation

The presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks, and a mean field theory for backpropagation is developed which shows that the ordered and chaotic phases correspond to regions of vanishing and exploding gradients, respectively.
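
The mean field theory referenced here is usually written as a pair of layer-to-layer maps, sketched below in the standard notation of the signal-propagation literature rather than this paper's exact expressions: the preactivation variance q and the susceptibility χ that controls how correlations and gradients grow or shrink with depth,

\[
q^{(\ell+1)} \;=\; \sigma_{b}^{2} + \sigma_{w}^{2} \int \mathcal{D}z\; \phi\!\big(\sqrt{q^{(\ell)}}\,z\big)^{2},
\qquad
\chi \;=\; \sigma_{w}^{2} \int \mathcal{D}z\; \phi'\!\big(\sqrt{q^{*}}\,z\big)^{2},
\]

with Dz the standard Gaussian measure and q* the variance fixed point; χ < 1 corresponds to the ordered phase (vanishing gradients), χ > 1 to the chaotic phase (exploding gradients), and χ = 1 to the order-to-chaos critical point.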

Asymptotics of Wide Networks from Feynman Diagrams

The method is an adaptation of Feynman diagrams, a standard tool for computing multivariate Gaussian integrals, and is applied to study training dynamics, improving existing bounds and deriving new results on wide network evolution during stochastic gradient descent.

Neural tangent kernel: convergence and generalization in neural networks (invited paper)

This talk introduces the Neural Tangent Kernel formalism, gives a number of results on it, and explains how these give insight into the dynamics of neural networks during training and into their generalization features.
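
For reference, the kernel in question is Θ(x, x') = ∇_θ f(x; θ) · ∇_θ f(x'; θ); in the infinite-width limit it is deterministic at initialization and constant during training, so full-batch gradient flow on a squared loss reduces to linear dynamics in function space (a standard sketch, not the talk's exact statement):

\[
\frac{d f_{t}(\mathcal{X})}{dt} \;=\; -\,\eta\, \Theta(\mathcal{X}, \mathcal{X})\,\big(f_{t}(\mathcal{X}) - \mathcal{Y}\big),
\qquad
f_{t}(\mathcal{X}) \;=\; \mathcal{Y} + e^{-\eta\,\Theta t}\,\big(f_{0}(\mathcal{X}) - \mathcal{Y}\big),
\]

where X and Y denote the training inputs and targets.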