Corpus ID: 235367976

The Future is Log-Gaussian: ResNets and Their Infinite-Depth-and-Width Limit at Initialization

By Mufan Bill Li, Mihai Nica, and Daniel M. Roy
Theoretical results show that neural networks can be approximated by Gaussian processes in the infinite-width limit. However, for fully connected networks, it has been previously shown that for any fixed network width, n, the Gaussian approximation gets worse as the network depth, d, increases. Given that modern networks are deep, this raises the question of how well modern architectures, like ResNets, are captured by the infinite-width limit. To provide a better approximation, we study ReLU… 
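The depth-versus-width tension described in the abstract can be seen in a small numerical experiment. The sketch below is an illustration, not the paper's method: the width, depth, sample count, and He-style initialization are all assumptions. It samples a fixed-width fully connected ReLU network at random initialization and measures how far the output distribution drifts from Gaussian (excess kurtosis 0) as depth grows:

```python
import numpy as np

def mlp_output(width, depth, x, rng):
    """One forward pass of a He-initialized fully connected ReLU net."""
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / h.shape[0]), size=(width, h.shape[0]))
        h = np.maximum(W @ h, 0.0)
    w_out = rng.normal(0.0, np.sqrt(1.0 / width), size=width)
    return w_out @ h

def excess_kurtosis(s):
    """Sample excess kurtosis; 0 for a Gaussian."""
    z = (s - s.mean()) / s.std()
    return (z ** 4).mean() - 3.0

rng = np.random.default_rng(0)
x = np.ones(32)
# Same width n = 32, two different depths d.
shallow = np.array([mlp_output(32, 3, x, rng) for _ in range(2000)])
deep = np.array([mlp_output(32, 50, x, rng) for _ in range(2000)])

# At fixed width, increasing depth makes the output distribution
# heavier-tailed, i.e. it drifts away from the Gaussian limit.
print(excess_kurtosis(shallow), excess_kurtosis(deep))
```

The ratio d/n, not width alone, governs how non-Gaussian the network is at initialization, which is the regime the paper's infinite-depth-and-width limit is designed to capture.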


Deep Stable neural networks: large-width asymptotics and convergence rates
This paper establishes sup-norm convergence rates of a deep Stable NN to a Stable SP, quantifying the critical difference between the settings of "joint growth" and "sequential growth" of the width over the NN's layers, and provides the first result on convergence rates for infinitely wide deep NNs.
Precise characterization of the prior predictive distribution of deep ReLU networks
This work derives a precise characterization of the prior predictive distribution of finite-width ReLU networks with Gaussian weights based on the Meijer-G function, demonstrating that the moments of the distribution converge to those of a normal log-normal mixture in the infinite depth limit.
Finite Depth and Width Corrections to the Neural Tangent Kernel
The results suggest that, unlike relatively shallow and wide networks, deep and wide ReLU networks are capable of learning data-dependent features even in the so-called lazy training regime.
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
Feature Learning in Infinite-Width Neural Networks
It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT, and that any such infinite-width limit can be computed using the Tensor Programs technique.
Deep Neural Networks as Gaussian Processes
The exact equivalence between infinitely wide deep networks and GPs is derived, and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.
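The NN–GP equivalence for ReLU networks comes with an explicit kernel recursion: each layer maps the previous layer's kernel through the degree-1 arc-cosine kernel of Cho & Saul. A minimal sketch, assuming He-style weight variance 2/fan-in, no biases, and an input kernel given by a normalized dot product (all assumptions for illustration):

```python
import numpy as np

def relu_nngp(x1, x2, depth):
    """Cross-covariance of an infinitely wide depth-`depth` ReLU net
    at two inputs, via the arc-cosine kernel recursion."""
    # Input-layer kernel entries (normalized dot products).
    k11 = x1 @ x1 / x1.size
    k22 = x2 @ x2 / x2.size
    k12 = x1 @ x2 / x1.size
    for _ in range(depth):
        c = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)  # correlation
        theta = np.arccos(c)
        # Degree-1 arc-cosine kernel, scaled by weight variance 2.
        k12 = np.sqrt(k11 * k22) / np.pi * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta)
        )
        # With variance 2/fan-in, the diagonal entries k11, k22 are
        # preserved exactly, so only k12 needs updating.
    return k12

x1 = np.array([1.0, 0.0])
x2 = np.array([0.0, 1.0])
print(relu_nngp(x1, x2, depth=1), relu_nngp(x1, x2, depth=5))
```

Orthogonal inputs start uncorrelated, and each ReLU layer pushes their correlation upward toward 1 while staying below the diagonal value, which is the depth-wise kernel degeneracy that finite-width corrections (and the log-Gaussian limit above) aim to quantify.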
A Convergence Theory for Deep Learning via Over-Parameterization
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
On Exact Computation with an Infinitely Wide Neural Net
The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which it is called Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.
Gaussian Process Behaviour in Wide Deep Neural Networks
It is shown that, under broad conditions, as the architecture is made increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.
How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?
This work establishes sharp optimization and generalization guarantees for deep ReLU networks under various assumptions made in previous work, and shows that a network width polylogarithmic in $n$ and $\epsilon^{-1}$ suffices. This contrasts with a recent line of research on the extremely overparameterized setting, which requires the network width to be larger than a high-degree polynomial of the training sample size.
Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation
This work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.