# Infinitely deep neural networks as diffusion processes

```bibtex
@inproceedings{Peluchetti2020InfinitelyDN,
  title     = {Infinitely deep neural networks as diffusion processes},
  author    = {Stefano Peluchetti and Stefano Favaro},
  booktitle = {AISTATS},
  year      = {2020}
}
```

When the parameters are independently and identically distributed (initialized), neural networks exhibit undesirable properties as the number of layers increases, e.g. a vanishing dependence on the input and a concentration on restrictive families of functions, including constant functions. We consider parameter distributions that shrink as the number of layers increases, in order to recover well-behaved stochastic processes in the limit of infinite depth. This leads us to set forth a…
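The idea can be illustrated with a minimal numerical sketch (my own illustration, not the paper's exact construction): if each residual layer's drift contribution is scaled by 1/L and its noise contribution by 1/√L, the layer recursion is exactly an Euler–Maruyama step of a stochastic differential equation, so the hidden state stays well behaved as depth L grows instead of collapsing or exploding.

```python
import numpy as np

def deep_resnet_forward(x0, L, rng):
    """One forward pass of a depth-L residual chain with shrinking parameters.

    Per-layer update: x <- x + tanh(x) * b / L + w / sqrt(L),
    i.e. an Euler-Maruyama step of dX = tanh(X) * b dt + dW with dt = 1/L.
    (Toy scalar version for illustration only.)
    """
    x = x0
    w = rng.normal(size=L)  # i.i.d. unscaled "weights"
    b = rng.normal(size=L)  # i.i.d. unscaled "biases"
    for k in range(L):
        x = x + np.tanh(x) * b[k] / L + w[k] / np.sqrt(L)
    return x

rng = np.random.default_rng(0)
samples = [deep_resnet_forward(1.0, 500, rng) for _ in range(500)]
v = float(np.var(samples))
print(v)  # terminal variance stays O(1) even at large depth
```

Without the 1/L and 1/√L scalings, the same recursion with i.i.d. O(1) parameters degenerates with depth, which is the pathology the abstract describes.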

## 17 Citations

Doubly infinite residual neural networks: a diffusion process approach

- Computer Science, J. Mach. Learn. Res., 2021

This paper reviews the results of Peluchetti and Favaro (2020), extending them to convolutional ResNets, and establishes analogous backward-propagation results, which relate directly to the problem of training fully-connected deep ResNets.

Doubly infinite residual networks: a diffusion process approach

- Computer Science, ArXiv, 2020

The forward-propagation results of Peluchetti and Favaro (2020) are extended to the setting of convolutional ResNets, and the results point to a limited expressive power of doubly infinite ResNets when the unscaled parameters are i.i.d. and the residual blocks are shallow.

Quantitative Gaussian Approximation of Randomly Initialized Deep Neural Networks

- Computer Science, ArXiv, 2022

The authors' explicit inequalities indicate how the hidden and output layer sizes affect the Gaussian behaviour of the network, and quantitatively recover the distributional convergence results in the wide limit, i.e., when all the hidden layer sizes become large.

Scaling Properties of Deep Residual Networks

- Computer Science, ICML, 2021

Findings cast doubts on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit.

Estimating Full Lipschitz Constants of Deep Neural Networks

- Computer Science, Mathematics, ArXiv, 2020

Estimates of the Lipschitz constants of the gradient of a deep neural network and the network itself with respect to the full set of parameters are developed and can be used to set the step size of stochastic gradient descent methods.

Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations

- Computer Science, AISTATS, 2022

This approach brings continuous-depth Bayesian neural nets to a competitive comparison against discrete-depth alternatives, while inheriting the memory-efficient training and tunable precision of Neural ODEs.

Scaling ResNets in the Large-depth Regime

- Computer Science, ArXiv, 2022

This analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index, and exhibits a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.

Learning Continuous-Time Dynamics by Stochastic Differential Networks

- Computer Science, ArXiv, 2020

This work applies Variational Bayesian method and proposes a flexible continuous-time framework named Variational Stochastic Differential Networks (VSDN), which can model high-dimensional nonlinear stochastic dynamics by deep neural networks.

Stochastic Normalizing Flows

- Mathematics, Computer Science, ArXiv, 2020

Stochastic normalizing flows are introduced, an extension of continuous normalizing flows for maximum likelihood estimation and variational inference (VI) using stochastic differential equations (SDEs), and VI can be applied to the optimization of hyperparameters in stochastic MCMC.

Stochastic continuous normalizing flows: training SDEs as ODEs

- Computer Science, Mathematics, UAI, 2021

Using the theory of rough paths, the underlying Brownian motion is treated as a latent variable and approximated, enabling the treatment of SDEs as random ordinary differential equations, which can be trained using existing techniques.
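The idea in that snippet can be sketched concretely (a toy illustration under my own naming, not the paper's implementation): once the Brownian increments are sampled and held fixed as latent variables, the SDE dX = f(X) dt + g(X) dW becomes a deterministic ODE driven by that fixed path, and solving it twice with the same path gives identical trajectories.

```python
import numpy as np

def integrate_given_path(x0, f, g, dW, dt):
    """Euler scheme for dX = f(X) dt + g(X) dW with a *fixed* noise path dW.

    Because dW is held fixed, this is a deterministic (random) ODE solve,
    amenable to ordinary ODE training techniques.
    """
    x = x0
    for dw in dW:
        x = x + f(x) * dt + g(x) * dw
    return x

rng = np.random.default_rng(0)
n = 1000
dt = 1.0 / n
dW = rng.normal(0.0, np.sqrt(dt), size=n)  # latent Brownian increments, sampled once

# Ornstein-Uhlenbeck-style toy SDE: drift -x, constant diffusion 0.5
x1 = integrate_given_path(1.0, lambda x: -x, lambda x: 0.5, dW, dt)
x1_again = integrate_given_path(1.0, lambda x: -x, lambda x: 0.5, dW, dt)
print(x1)  # identical to x1_again: same latent path, same trajectory
```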

## References

Showing 1–10 of 40 references.

Deep Information Propagation

- Computer Science, ICLR, 2017

The presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks, and a mean field theory for backpropagation is developed that shows that the ordered and chaotic phases correspond to regions of vanishing and exploding gradient respectively.

Neural Ordinary Differential Equations

- Computer Science, NeurIPS, 2018

This work shows how to scalably backpropagate through any ODE solver, without access to its internal operations, which allows end-to-end training of ODEs within larger models.
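The continuous-depth model underlying that work can be sketched in a few lines (a minimal fixed-step Euler forward pass with a hand-rolled vector field of my own choosing, not the paper's adaptive solver or adjoint method): a neural ODE evolves the hidden state by dx/dt = f(x, θ), the continuous analogue of a ResNet's update x_{k+1} = x_k + f(x_k).

```python
import numpy as np

def f(x, theta):
    # toy one-layer vector field for illustration
    return np.tanh(theta @ x)

def odeint_euler(x0, theta, t1, steps):
    """Fixed-step Euler integration of dx/dt = f(x, theta) from t=0 to t1."""
    x = x0.copy()
    dt = t1 / steps
    for _ in range(steps):
        x = x + f(x, theta) * dt
    return x

theta = np.array([[0.0, -1.0],
                  [1.0,  0.0]])        # rotation-like vector field
x1 = odeint_euler(np.array([1.0, 0.0]), theta, 1.0, 200)
print(x1)
```

The snippet's point about backpropagation is that gradients through such a solve can be obtained by integrating an adjoint ODE backwards in time, without storing the solver's internal steps; the sketch above shows only the forward dynamics.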

Gaussian Process Behaviour in Wide Deep Neural Networks

- Computer Science, ICLR, 2018

It is shown that, under broad conditions, as the architecture is made increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.

The Emergence of Spectral Universality in Deep Networks

- Computer Science, AISTATS, 2018

This work uses powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth.

Deep Convolutional Networks as shallow Gaussian Processes

- Computer Science, ICLR, 2019

We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many…

Deep Neural Networks as Gaussian Processes

- Computer Science, ICLR, 2018

The exact equivalence between infinitely wide deep networks and GPs is derived, and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

- Computer Science, NeurIPS, 2019

This work shows that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.

Initialization of ReLUs for Dynamical Isometry

- Computer Science, NeurIPS, 2019

The joint signal output distribution is derived exactly, without mean-field assumptions, for fully-connected networks with Gaussian weights and biases, and deviations from the mean-field results are analyzed.

Exponential expressivity in deep neural networks through transient chaos

- Computer Science, NIPS, 2016

The theoretical analysis of the expressive power of deep networks broadly applies to arbitrary nonlinearities, and provides a quantitative underpinning for previously abstract notions about the geometry of deep functions.

Mean Field Residual Networks: On the Edge of Chaos

- Computer Science, NIPS, 2017

It is shown, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth.