
Doubly infinite residual neural networks: a diffusion process approach

@article{Peluchetti2021DoublyIR,
  title={Doubly infinite residual neural networks: a diffusion process approach},
  author={Stefano Peluchetti and Stefano Favaro and Philipp Hennig},
  journal={J. Mach. Learn. Res.},
  year={2021},
  volume={22},
  pages={175:1-175:48}
}
Modern neural networks featuring a large number of layers (depth) and units per layer (width) have achieved remarkable performance across many domains. While there exists a vast literature on the interplay between infinitely wide neural networks and Gaussian processes, little is known about analogous interplays with respect to infinitely deep neural networks. Neural networks with independent and identically distributed (i.i.d.) initializations exhibit undesirable forward and backward…
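
The paper's central construction can be sketched in a few lines: a residual network whose i.i.d. Gaussian parameters have variance scaled by the reciprocal of the depth, so that a forward pass resembles an Euler-Maruyama discretization of a diffusion process. The snippet below is a minimal illustration under that assumption, not the paper's code; all names and hyperparameters are made up for the example.

import numpy as np

def residual_forward(x, depth, width, sigma_w=1.0, sigma_b=0.0, rng=None):
    """One forward pass of a depth-`depth` residual net with 1/depth-scaled Gaussian parameters."""
    rng = np.random.default_rng() if rng is None else rng
    h = x.copy()
    for _ in range(depth):
        # i.i.d. Gaussian weights/biases whose variance shrinks like 1/depth
        W = rng.normal(0.0, sigma_w / np.sqrt(width * depth), size=(width, width))
        b = rng.normal(0.0, sigma_b / np.sqrt(depth), size=width)
        h = h + np.tanh(h @ W + b)  # residual update: h_{l+1} = h_l + f(h_l)
    return h

x0 = np.random.default_rng(0).normal(size=16)      # width-16 input
print(residual_forward(x0, depth=1000, width=16)[:4])
# The output stays O(1) even at depth 1000; with depth-independent variances the
# same network would exhibit the undesirable forward propagation mentioned above.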


References

Showing 1-10 of 45 references
Infinitely deep neural networks as diffusion processes
TLDR: This work considers parameter distributions that shrink as the number of layers increases, recovering well-behaved stochastic processes in the limit of infinite depth and establishing a link between infinitely deep residual networks and solutions to stochastic differential equations.
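
Schematically (the notation here is illustrative and not the paper's exact statement), the construction above replaces a fixed i.i.d. initialization with depth-scaled parameters, so that one residual update acts as one Euler-Maruyama step of a stochastic differential equation:

\[
  x_{l+1} = x_l + f(x_l; \theta_l), \qquad \theta_l \sim \mathcal{N}\!\left(0, \tfrac{\Sigma_\theta}{L}\right),
\]
\[
  x_{\lfloor tL \rfloor} \;\xrightarrow{\;L \to \infty\;}\; X_t, \qquad
  \mathrm{d}X_t = \mu(X_t)\,\mathrm{d}t + \sigma(X_t)\,\mathrm{d}B_t .
\]
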
Deep Neural Networks as Gaussian Processes
TLDR: The exact equivalence between infinitely wide deep networks and GPs is derived, and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, so that GP predictions typically outperform those of finite-width networks.
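
The wide-network limit summarized above has a concrete computational form: the NNGP kernel follows a layer-wise recursion, which for ReLU activations admits a closed-form (arc-cosine) expectation. The sketch below implements that standard recursion with illustrative hyperparameters; it is not code from the paper.

import numpy as np

def relu_expectation(kxx, kyy, kxy):
    """E[relu(u) relu(v)] for (u, v) ~ N(0, [[kxx, kxy], [kxy, kyy]])."""
    c = np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0)
    theta = np.arccos(c)
    return np.sqrt(kxx * kyy) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def nngp_kernel(x, y, depth, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel between inputs x and y for a depth-`depth` ReLU network."""
    d = x.shape[0]
    kxx = sigma_b2 + sigma_w2 * (x @ x) / d
    kyy = sigma_b2 + sigma_w2 * (y @ y) / d
    kxy = sigma_b2 + sigma_w2 * (x @ y) / d
    for _ in range(depth):
        kxx, kyy, kxy = (
            sigma_b2 + sigma_w2 * relu_expectation(kxx, kxx, kxx),
            sigma_b2 + sigma_w2 * relu_expectation(kyy, kyy, kyy),
            sigma_b2 + sigma_w2 * relu_expectation(kxx, kyy, kxy),
        )
    return kxy

x, y = np.ones(10), np.linspace(-1.0, 1.0, 10)
print(nngp_kernel(x, y, depth=5))   # GP prior covariance between the two inputs
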
Gaussian Process Behaviour in Wide Deep Neural Networks
TLDR: It is shown that, under broad conditions, as the architecture is made increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.
Mean Field Residual Networks: On the Edge of Chaos
TLDR: It is shown, theoretically as well as empirically, that common initializations such as the Xavier or He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth.
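
A short way to see why depth enters (a schematic mean-field recursion for a generic residual update x^{l+1} = x^l + W^l \phi(x^l) + b^l, not a formula taken from this page): the pre-activation variance q^l satisfies

\[
  q^{l+1} = q^{l} + \sigma_w^2 \,\mathbb{E}_{z \sim \mathcal{N}(0,1)}\!\left[\phi\!\left(\sqrt{q^{l}}\, z\right)^2\right] + \sigma_b^2 ,
\]

so q^l grows monotonically with depth, which is why a single depth-independent choice of \sigma_w^2 and \sigma_b^2 cannot be optimal and the initialization variances must depend on the depth.
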
Deep Information Propagation
TLDR: The presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth of random networks; a mean field theory for backpropagation is developed which shows that the ordered and chaotic phases correspond to regions of vanishing and exploding gradients, respectively.
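
For context, the order-to-chaos transition referred to above is usually characterized by a single mean-field quantity (the notation below follows the standard mean-field literature rather than this page):

\[
  \chi_1 = \sigma_w^2 \int \mathcal{D}z \,\left[\phi'\!\left(\sqrt{q^{*}}\, z\right)\right]^2 ,
  \qquad \mathcal{D}z = \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,\mathrm{d}z ,
\]

where q^{*} is the fixed point of the variance map q^{l+1} = \sigma_w^2 \int \mathcal{D}z\, \phi(\sqrt{q^{l}} z)^2 + \sigma_b^2. The ordered phase (\chi_1 < 1) corresponds to vanishing gradients, the chaotic phase (\chi_1 > 1) to exploding gradients, and \chi_1 = 1 is the edge of chaos.
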
Exponential expressivity in deep neural networks through transient chaos
TLDR: The theoretical analysis of the expressive power of deep networks broadly applies to arbitrary nonlinearities and provides a quantitative underpinning for previously abstract notions about the geometry of deep functions.
Neural Ordinary Differential Equations
TLDR: This work shows how to scalably backpropagate through any ODE solver, without access to its internal operations, which allows end-to-end training of ODEs within larger models.
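
The ODE-block idea summarized above can be illustrated with a naive fixed-step integrator; the paper itself backpropagates through black-box adaptive solvers via the adjoint method, which this toy forward pass does not attempt. The vector field and all names below are illustrative.

import numpy as np

def odeblock_forward(x, params, t0=0.0, t1=1.0, steps=100):
    """Forward map of an ODE block dh/dt = f(h, t), integrated with fixed-step RK4."""
    W, b = params
    f = lambda h, t: np.tanh(h @ W + b)   # illustrative learned vector field
    h, dt = x.copy(), (t1 - t0) / steps
    for i in range(steps):
        t = t0 + i * dt
        k1 = f(h, t)
        k2 = f(h + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = f(h + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = f(h + dt * k3, t + dt)
        h = h + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return h

rng = np.random.default_rng(0)
params = (rng.normal(0.0, 0.1, size=(8, 8)), np.zeros(8))
print(odeblock_forward(rng.normal(size=8), params)[:3])
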
Understanding the difficulty of training deep feedforward neural networks
TLDR: The objective is to better understand why standard gradient descent from random initialization performs poorly on deep neural networks, in order to explain recent relative successes and help design better algorithms in the future.
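
The practical outcome of that analysis is the normalized ("Xavier") initialization, chosen so that activation and gradient variances stay roughly constant across layers; a minimal sketch:

import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Normalized initialization of Glorot & Bengio (2010):
    W ~ U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))],
    which gives Var(W) = 2 / (fan_in + fan_out)."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(784, 256)
print(W.var(), 2.0 / (784 + 256))   # empirical variance vs. the target value
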
Identity Mappings in Deep Residual Networks
TLDR: The propagation formulations behind the residual building blocks suggest that the forward and backward signals can be directly propagated from one block to any other block when using identity mappings as the skip connections and after-addition activation.
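
The point about identity skip connections can be seen in a stripped-down contrast between the original (post-activation) block and the pre-activation block; batch normalization and convolutions are omitted and the dense-layer form is only illustrative.

import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def post_activation_block(x, W1, W2):
    """Original ResNet block: the nonlinearity is applied after the addition."""
    return relu(x + W2 @ relu(W1 @ x))

def pre_activation_block(x, W1, W2):
    """Pre-activation block: the skip path is a pure identity, so signals and
    gradients pass from block to block unchanged."""
    return x + W2 @ relu(W1 @ relu(x))

Keeping the skip path free of any nonlinearity is what allows the direct block-to-block propagation described above.
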
On the Impact of the Activation Function on Deep Neural Networks Training
TLDR: A comprehensive theoretical analysis of the Edge of Chaos is given, and it is shown that one can indeed tune the initialization parameters and the activation function in order to accelerate training and improve performance.
...