• Corpus ID: 212675182

# A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth

@article{Lu2020AMA,
title={A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth},
author={Yiping Lu and Chao Ma and Yulong Lu and Jianfeng Lu and Lexing Ying},
journal={ArXiv},
year={2020},
volume={abs/2003.05508}
}
• Published 11 March 2020
• Computer Science
• ArXiv
Training deep neural networks with stochastic gradient descent (SGD) can often achieve zero training loss on real-world tasks although the optimization landscape is known to be highly non-convex. To understand the success of SGD for training deep neural networks, this work presents a mean-field analysis of deep residual networks, based on a line of works that interpret the continuum limit of the deep residual network as an ordinary differential equation when the network capacity tends to…
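The continuum-limit view in the abstract can be illustrated with a toy sketch (not the paper's code; the scalar block `f` and shared weight below are illustrative assumptions): an L-block ResNet whose residual branches are scaled by 1/L performs a forward-Euler discretization of an ODE, so outputs of deeper networks converge to the ODE solution.

```python
import math

# Hedged sketch: a deep ResNet with 1/L residual scaling behaves like the
# forward-Euler discretization of dx/dt = f(x, theta(t)) as depth L grows.

def f(x, theta):
    """One toy residual block: a scalar tanh unit with weight theta."""
    return math.tanh(theta * x)

def resnet_forward(x0, thetas):
    """Forward pass of an L-block ResNet with 1/L residual scaling."""
    L = len(thetas)
    x = x0
    for theta in thetas:
        x = x + (1.0 / L) * f(x, theta)  # x_{l+1} = x_l + (1/L) f(x_l, theta_l)
    return x

# With a shared weight theta = 1, the deep limit is the solution of
# dx/dt = tanh(x) at t = 1: deeper networks give nearly the same output.
print(resnet_forward(0.5, [1.0] * 8))
print(resnet_forward(0.5, [1.0] * 1024))
```

As L grows the discrete updates trace out the same trajectory, which is the sense in which depth acts as the discretization parameter of the continuum limit.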
39 Citations

## Figures and Tables from this paper

Global Convergence of Gradient Descent for Multi-Layer ResNets in the Mean-Field Regime
• Computer Science, Mathematics
• 2021
It is shown that if the ResNet is sufficiently large, with depth and width depending algebraically on the accuracy and confidence levels, first order optimization methods can find global minimizers that fit the training data.
Overparameterization of deep ResNet: zero loss and mean-field analysis
• Computer Science
J. Mach. Learn. Res.
• 2022
This work uses a mean-field-limit argument to prove that gradient descent for parameter training becomes a partial differential equation (PDE) that characterizes gradient flow for a probability distribution in the large-N limit, and shows that the solution to the PDE converges in the training time to a zero-loss solution.
Neural Network Approximation: Three Hidden Layers Are Enough
• Computer Science, Mathematics
Neural Networks
• 2021
Convergence Analysis of Deep Residual Networks
• Computer Science
ArXiv
• 2022
A matrix-vector description of general deep neural networks with shortcut connections is given, and an explicit expression for the networks is formulated using the notions of activation domains and activation matrices to characterize the convergence of deep residual networks.
Deep Architecture Connectivity Matters for Its Convergence: A Fine-Grained Analysis
• Computer Science
• 2022
This work theoretically characterizes the impact of connectivity patterns on the convergence of DNNs under gradient descent training at fine granularity, and shows that a simple filtration of "unpromising" connectivity patterns can trim down the number of models to evaluate and significantly accelerate large-scale neural architecture search without any overhead.
Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks
• Computer Science, Mathematics
SSRN Electronic Journal
• 2022
We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function. We further show that the
On Feature Learning in Neural Networks with Global Convergence Guarantees
• Computer Science
ArXiv
• 2022
A model of wide multi-layer NNs whose second-to-last layer is trained via GF is studied, for which it is proved that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
Provably convergent quasistatic dynamics for mean-field two-player zero-sum games
• Computer Science
ArXiv
• 2022
Inspired by the continuous dynamics of probability distributions, a quasistatic Langevin gradient descent method with inner-outer iterations is derived and tested on different problems, including training mixtures of GANs.
Extending the limit of molecular dynamics with ab initio accuracy to 10 billion atoms
• Computer Science
PPoPP
• 2022
This work opens the door for unprecedentedly large-scale molecular dynamics simulations based on ab initio accuracy and can be potentially utilized in studying more realistic applications such as mechanical properties of metals, semiconductor devices, batteries, etc.
Learning Poisson systems and trajectories of autonomous systems via Poisson neural networks
• Computer Science, Mathematics
IEEE transactions on neural networks and learning systems
• 2022
This work demonstrates through several simulations that PNNs are capable of handling very accurately several challenging tasks, including the motion of a particle in the electromagnetic potential, the nonlinear Schrödinger equation, and pixel observations of the two-body problem.

## References

Showing 1–10 of 91 references
Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations
• arXiv preprint arXiv:1710.10121
• 2017
Neural Ordinary Differential Equations
• Computer Science
NeurIPS
• 2018
This work shows how to scalably backpropagate through any ODE solver, without access to its internal operations, which allows end-to-end training of ODEs within larger models.
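The adjoint idea behind that summary can be sketched for a scalar linear ODE (a hedged toy, not the paper's implementation; `solve_gradient` and its Euler integrators are illustrative): the gradient of a loss on x(T) is obtained by integrating an adjoint ODE backward in time, rather than differentiating through the solver's internals.

```python
import math

# Hedged sketch of the adjoint sensitivity method for dx/dt = theta * x
# with loss L = x(T). The adjoint a(t) = dL/dx(t) satisfies the backward
# ODE  da/dt = -a * df/dx,  and  dL/dtheta = integral_0^T a(t) * df/dtheta dt.

def solve_gradient(theta, x0=1.0, T=1.0, n=10000):
    h = T / n
    # forward pass: Euler-integrate dx/dt = theta * x, storing the path
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] + h * theta * xs[-1])
    # backward pass: a(T) = dL/dx(T) = 1, integrated backward in time
    a = 1.0
    grad = 0.0
    for k in range(n, 0, -1):
        grad += h * a * xs[k - 1]  # accumulate a(t) * df/dtheta, df/dtheta = x
        a += h * a * theta         # step the adjoint backward (da = a*theta*h)
    return xs[-1], grad

x_T, g = solve_gradient(theta=0.7)
# analytic check (x0 = 1, T = 1): x(T) = e^theta and dx(T)/dtheta = e^theta
print(x_T, g, math.exp(0.7))
```

The backward pass here only needs the forward trajectory and the adjoint state, which is what makes the method memory-efficient for black-box solvers.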
Representing smooth functions as compositions of near-identity functions with implications for deep network optimization
• Mathematics
ArXiv
• 2018
We show that any smooth bi-Lipschitz $h$ can be represented exactly as a composition $h_m \circ ... \circ h_1$ of functions $h_1,...,h_m$ that are close to the identity in the sense that each
Residual Networks Behave Like Ensembles of Relatively Shallow Networks
• Computer Science
NIPS
• 2016
This work proposes a novel interpretation of residual networks, showing that they can be seen as a collection of many paths of differing length, and reveals one of the key characteristics that seems to enable the training of very deep networks: residual networks avoid the vanishing gradient problem by introducing short paths that can carry gradient throughout the extent of very deep networks.
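The many-paths interpretation can be made exact in a toy linear setting (an illustrative sketch, not the paper's experiments; the scalar weights `ws` are assumptions): with linear blocks f_l(x) = w_l * x, the ResNet product (1 + w_L)...(1 + w_1) x expands into 2^L terms, one per subset of blocks the signal passes through.

```python
from itertools import combinations

# Hedged sketch: a 3-block linear ResNet equals the sum over all 2^3 paths.
ws = [0.1, 0.2, 0.3]  # illustrative block weights
x = 1.0

# ordinary forward pass: x <- x + w * x at each block
out = x
for w in ws:
    out = out + w * out

# ensemble-of-paths view: one term per subset of blocks traversed
paths = 0.0
for r in range(len(ws) + 1):
    for subset in combinations(ws, r):
        p = x
        for w in subset:
            p *= w
        paths += p

print(out, paths)  # identical: the product expansion equals the path sum
```

Most of the 2^L paths are short (small subsets dominate when the w_l are small), which is the linear-model caricature of the paper's claim that residual networks behave like ensembles of relatively shallow networks.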
Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations
• Computer Science
ICML
• 2018
It is shown that many effective networks, such as ResNet, PolyNet, FractalNet and RevNet, can be interpreted as different numerical discretizations of differential equations; a connection is also established between stochastic control and noise injection in the training process, which helps to improve generalization of the networks.
Scalable Gradients for Stochastic Differential Equations
• Computer Science
AISTATS
• 2020
The adjoint sensitivity method scalably computes gradients of solutions to ordinary differential equations. We generalize this method to stochastic differential equations, allowing time-efficient and
Are deep ResNets provably better than linear predictors?
• Computer Science
NeurIPS
• 2019
The main theorem on deep ResNets shows, under simple geometric conditions, that any critical point in the optimization landscape is either at least as good as the best linear predictor, or the Hessian at this critical point has a strictly negative eigenvalue.
Convergence Theory of Learning Over-parameterized ResNet: A Full Characterization.
• Computer Science
• 2019
This paper fully characterizes the convergence theory of gradient descent for learning over-parameterized ResNet with different values of $\tau$, and shows that for $\tau \le 1/\sqrt{L}$ gradient descent is guaranteed to converge to the global minima, and especially when $\tau \le 1/L$ the convergence is independent of the network depth.
Augmented Neural ODEs
• Computer Science
NeurIPS
• 2019
Augmented Neural ODEs are introduced which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.