# The Future is Log-Gaussian: ResNets and Their Infinite-Depth-and-Width Limit at Initialization

```bibtex
@article{Li2021TheFI,
  title   = {The Future is Log-Gaussian: ResNets and Their Infinite-Depth-and-Width Limit at Initialization},
  author  = {Mufan Bill Li and Mihai Nica and Daniel M. Roy},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2106.04013}
}
```

Theoretical results show that neural networks can be approximated by Gaussian processes in the infinite-width limit. However, for fully connected networks, it has been previously shown that for any fixed network width, n, the Gaussian approximation gets worse as the network depth, d, increases. Given that modern networks are deep, this raises the question of how well modern architectures, like ResNets, are captured by the infinite-width limit. To provide a better approximation, we study ReLU…
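The paper's title refers to the output distribution of deep ResNets at initialization being log-Gaussian rather than Gaussian in the joint depth-and-width limit. A minimal numerical sketch of that claim follows; the `1/sqrt(depth)` residual-branch scaling, the `sqrt(2)` ReLU variance correction, and the choice of statistic (log of the squared output norm) are illustrative assumptions, not the paper's exact construction.

```python
# Sketch: sample many random ReLU ResNets at initialization and collect the
# log of the squared output norm. Under the log-Gaussian picture, this
# statistic should look roughly normal when depth and width grow together.
import numpy as np

def resnet_forward(x, depth, width, rng):
    """One random ResNet at init: h <- h + sqrt(2/depth) * W @ relu(h).
    The branch scaling here is an assumption chosen to keep norms stable."""
    h = x
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        h = h + np.sqrt(2.0 / depth) * W @ np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(0)
width, depth, trials = 64, 64, 500
x = rng.standard_normal(width)
log_sq_norms = np.array([
    np.log(np.sum(resnet_forward(x, depth, width, rng) ** 2))
    for _ in range(trials)
])
print(f"mean={log_sq_norms.mean():.2f}, std={log_sq_norms.std():.2f}")
```

A histogram of `log_sq_norms` (rather than of the norms themselves) is the natural diagnostic here: Gaussianity of the log is exactly what distinguishes the log-Gaussian limit from the usual infinite-width Gaussian process picture.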


## 2 Citations

Deep Stable neural networks: large-width asymptotics and convergence rates

- Computer Science, Mathematics · ArXiv
- 2021

This paper establishes sup-norm convergence rates of a deep Stable NN to a Stable SP, quantifying the critical difference between the settings of “joint growth” and “sequential growth” of the width over the NN’s layers, and provides the first result on convergence rates for infinitely wide deep NNs.

Precise characterization of the prior predictive distribution of deep ReLU networks

- Computer Science · ArXiv
- 2021

This work derives a precise characterization of the prior predictive distribution of finite-width ReLU networks with Gaussian weights based on the Meijer-G function, demonstrating that the moments of the distribution converge to those of a normal log-normal mixture in the infinite depth limit.
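The normal log-normal mixture named in the snippet above can be sampled directly. The parameterization below, `X = exp(Z/2) * G` with independent standard Gaussians `Z` and `G` (i.e. a Gaussian whose log-variance is itself Gaussian), is one common form of such a mixture and is an assumption for illustration:

```python
# Sketch: a normal log-normal mixture has visibly heavier tails than a
# Gaussian, which the excess kurtosis makes quantitative.
import numpy as np

rng = np.random.default_rng(2)
m = 100_000
Z = rng.standard_normal(m)      # Gaussian log-variance
G = rng.standard_normal(m)      # conditionally Gaussian sample
X = np.exp(Z / 2.0) * G         # normal log-normal mixture

# Excess kurtosis is 0 for a Gaussian; here it is strictly positive
# (analytically 3e - 3 ≈ 5.15 for this parameterization).
kurt = np.mean(X**4) / np.mean(X**2) ** 2 - 3.0
print(f"excess kurtosis ≈ {kurt:.2f}")
```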

## References

Showing 1–10 of 62 references.

Finite Depth and Width Corrections to the Neural Tangent Kernel

- Computer Science, Mathematics · ICLR
- 2020

The results suggest that, unlike relatively shallow and wide networks, deep and wide ReLU networks are capable of learning data-dependent features even in the so-called lazy training regime.

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

- Computer Science, Mathematics · NeurIPS
- 2019

This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
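The first-order Taylor expansion described above can be checked numerically on a small network. The specific architecture, width, and perturbation scale below are assumptions chosen for illustration; the point is only that, for a wide network, a small parameter perturbation moves the output almost exactly as the linearized model predicts:

```python
# Sketch: compare f(W + dW) against its linearization f(W) + <grad_W f, dW>
# for a one-hidden-layer ReLU network. ReLU is piecewise linear, so the
# linearization errs only on units whose activation pattern flips.
import numpy as np

rng = np.random.default_rng(1)
n, d_in = 1024, 3                      # wide hidden layer, tiny input
x = rng.standard_normal(d_in)
W = rng.standard_normal((n, d_in)) / np.sqrt(d_in)   # hidden weights
v = rng.standard_normal(n) / np.sqrt(n)              # output weights

def f(W_):
    return float(v @ np.maximum(W_ @ x, 0.0))

# Gradient of f w.r.t. W on the current activation pattern.
active = (W @ x > 0).astype(float)
grad_W = np.outer(v * active, x)

# A small perturbation, of the kind an early gradient step would produce.
dW = 1e-2 * rng.standard_normal((n, d_in)) / np.sqrt(d_in)
exact = f(W + dW)
linear = f(W) + float(np.sum(grad_W * dW))
print(abs(exact - linear))  # tiny relative to |f(W)|
```

Only the few hidden units whose pre-activation sits within `~1e-2` of zero can flip sign under this perturbation, and each contributes at most `O(1/sqrt(n))` error, which is why the gap shrinks as the width grows.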

Feature Learning in Infinite-Width Neural Networks

- Computer Science, Physics · ArXiv
- 2020

It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features (a capability crucial for pretraining and transfer learning, as with BERT), and that any such infinite-width limit can be computed using the Tensor Programs technique.

Deep Neural Networks as Gaussian Processes

- Computer Science, Mathematics · ICLR
- 2018

The exact equivalence between infinitely wide deep networks and GPs is derived, and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.

A Convergence Theory for Deep Learning via Over-Parameterization

- Computer Science, Mathematics · ICML
- 2019

This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in *polynomial time*, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.

On Exact Computation with an Infinitely Wide Neural Net

- Computer Science, Mathematics · NeurIPS
- 2019

The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which it is called Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.

Gaussian Process Behaviour in Wide Deep Neural Networks

- Computer Science, Mathematics · ICLR
- 2018

It is shown that, under broad conditions, as the architecture is made increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.

How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?

- Computer Science, Mathematics · ICLR
- 2021

This work establishes sharp optimization and generalization guarantees for deep ReLU networks under various assumptions made in previous work, and shows that a network width polylogarithmic in $n$ and $\epsilon^{-1}$ suffices.


Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

- Computer Science, Physics · ArXiv
- 2019

This work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.