Corpus ID: 209315572

Infinitely deep neural networks as diffusion processes

@inproceedings{Peluchetti2020InfinitelyDN,
  title={Infinitely deep neural networks as diffusion processes},
  author={Stefano Peluchetti and Stefano Favaro},
  booktitle={AISTATS},
  year={2020}
}
When the parameters are independently and identically distributed (initialized), neural networks exhibit undesirable properties that emerge as the number of layers increases, e.g. a vanishing dependency on the input and a concentration on restrictive families of functions, including constant functions. We consider parameter distributions that shrink as the number of layers increases in order to recover well-behaved stochastic processes in the limit of infinite depth. This leads to set forth a… 
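A minimal NumPy sketch of the idea in the abstract: a fully-connected residual network whose layer-wise Gaussian parameters have variance shrinking like 1/depth behaves like an Euler–Maruyama discretization of a diffusion, whereas unscaled i.i.d. parameters wash out the dependence on the input as depth grows. The 1/depth scaling, the tanh residual blocks, and the chosen widths are illustrative assumptions for this sketch, not the authors' exact construction.

import numpy as np

def resnet_forward(x, depth, width, scale_with_depth=True, seed=0):
    # Fully-connected ResNet with i.i.d. Gaussian parameters at each layer.
    # With scale_with_depth=True the per-layer parameter variance is 1/depth,
    # so each residual update mimics one Euler-Maruyama step of an SDE.
    rng = np.random.default_rng(seed)
    h = x.astype(float).copy()
    var = 1.0 / depth if scale_with_depth else 1.0
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(var / width), size=(width, width))
        b = rng.normal(0.0, np.sqrt(var), size=width)
        h = h + np.tanh(W @ h + b)  # residual block
    return h

width, depth = 64, 1000
x1, x2 = np.ones(width), -np.ones(width)
for scaled in (True, False):
    # Same seed, so both inputs pass through the same random network.
    h1 = resnet_forward(x1, depth, width, scale_with_depth=scaled)
    h2 = resnet_forward(x2, depth, width, scale_with_depth=scaled)
    rel = np.linalg.norm(h1 - h2) / max(np.linalg.norm(h1), np.linalg.norm(h2))
    print(f"variance ~ 1/depth: {scaled}, relative output distance: {rel:.3f}")

With the 1/depth scaling the two inputs remain clearly separated at the output; with unscaled parameters the hidden state grows with depth and the relative distance between the two outputs collapses, illustrating the vanishing dependency on the input described above.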

Doubly infinite residual neural networks: a diffusion process approach
TLDR
This paper reviews the results of Peluchetti and Favaro (2020), extending them to convolutional ResNets, and establishes analogous backward-propagation results, which directly relate to the problem of training fully-connected deep ResNets.
Doubly infinite residual networks: a diffusion process approach
TLDR
The forward-propagation results of Peluchetti and Favaro (2020) are extended to the setting of convolutional ResNets, and results point to a limited expressive power of doubly infinite ResNets when the unscaled parameters are i.i.d. and residual blocks are shallow.
Quantitative Gaussian Approximation of Randomly Initialized Deep Neural Networks
TLDR
The authors' explicit inequalities indicate how the hidden and output layer sizes affect the Gaussian behaviour of the network and quantitatively recover the distributional convergence results in the wide limit, i.e., if all the hidden layer sizes become large.
Scaling Properties of Deep Residual Networks
TLDR
Findings cast doubt on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit.
Estimating Full Lipschitz Constants of Deep Neural Networks
TLDR
Estimates of the Lipschitz constants of the gradient of a deep neural network and the network itself with respect to the full set of parameters are developed and can be used to set the step size of stochastic gradient descent methods.
Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations
TLDR
This approach brings continuous-depth Bayesian neural nets to a competitive comparison against discrete-depth alternatives, while inheriting the memory-efficient training and tunable precision of Neural ODEs.
Scaling ResNets in the Large-depth Regime
TLDR
This analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index, and exhibits a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
Learning Continuous-Time Dynamics by Stochastic Differential Networks
TLDR
This work applies Variational Bayesian method and proposes a flexible continuous-time framework named Variational Stochastic Differential Networks (VSDN), which can model high-dimensional nonlinear stochastic dynamics by deep neural networks.
Stochastic Normalizing Flows
TLDR
Stochastic normalizing flows are introduced, an extension of continuous normalizing flows for maximum likelihood estimation and variational inference (VI) using stochastic differential equations (SDEs), which can apply VI to the optimization of hyperparameters in stochastic MCMC.
Stochastic continuous normalizing flows: training SDEs as ODEs
TLDR
Using the theory of rough paths, the underlying Brownian motion is treated as a latent variable and approximated, enabling the treatment of SDEs as random ordinary differential equations, which can be trained using existing techniques.
...

References

SHOWING 1-10 OF 40 REFERENCES
Deep Information Propagation
TLDR
The presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks, and a mean field theory for backpropagation is developed that shows that the ordered and chaotic phases correspond to regions of vanishing and exploding gradient respectively.
Neural Ordinary Differential Equations
TLDR
This work shows how to scalably backpropagate through any ODE solver, without access to its internal operations, which allows end-to-end training of ODEs within larger models.
Gaussian Process Behaviour in Wide Deep Neural Networks
TLDR
It is shown that, under broad conditions, as the authors make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.
The Emergence of Spectral Universality in Deep Networks
TLDR
This work uses powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth.
Deep Convolutional Networks as shallow Gaussian Processes
We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many convolutional filters…
Deep Neural Networks as Gaussian Processes
TLDR
The exact equivalence between infinitely wide deep networks and GPs is derived and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
TLDR
This work shows that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
Initialization of ReLUs for Dynamical Isometry
TLDR
The joint signal output distribution is derived exactly, without mean field assumptions, for fully-connected networks with Gaussian weights and biases, and deviations from the mean field results are analyzed.
Exponential expressivity in deep neural networks through transient chaos
TLDR
The theoretical analysis of the expressive power of deep networks broadly applies to arbitrary nonlinearities, and provides a quantitative underpinning for previously abstract notions about the geometry of deep functions.
Mean Field Residual Networks: On the Edge of Chaos
TLDR
It is shown, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth.
...