• Corpus ID: 212675182

A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth

@article{Lu2020AMA,
  title={A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth},
  author={Yiping Lu and Chao Ma and Yulong Lu and Jianfeng Lu and Lexing Ying},
  journal={ArXiv},
  year={2020},
  volume={abs/2003.05508}
}
Training deep neural networks with stochastic gradient descent (SGD) can often achieve zero training loss on real-world tasks although the optimization landscape is known to be highly non-convex. To understand the success of SGD for training deep neural networks, this work presents a mean-field analysis of deep residual networks, based on a line of works that interpret the continuum limit of the deep residual network as an ordinary differential equation when the network capacity tends to… 
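The continuum-limit viewpoint mentioned in the abstract can be made concrete with a small sketch. The snippet below is illustrative only and not code from the paper (the tanh residual map, the 1/L step scaling, and all variable names are assumptions for exposition): it builds a depth-L ResNet whose forward pass is exactly a forward-Euler discretization of the ODE dx/dt = f(x, θ(t)) on [0, 1], so that growing the depth L approaches the ODE limit.

import numpy as np

def residual_map(x, W):
    # f(x, theta_l): one residual update; tanh is an illustrative choice
    return np.tanh(W @ x)

def resnet_forward(x, weights):
    # x_{l+1} = x_l + (1/L) * f(x_l, theta_l)
    # i.e. forward Euler for dx/dt = f(x, theta(t)) with step size 1/L on [0, 1]
    L = len(weights)
    for W in weights:
        x = x + residual_map(x, W) / L
    return x

# usage: a depth-1000 ResNet acting on a 4-dimensional input
rng = np.random.default_rng(0)
d, L = 4, 1000
weights = [0.1 * rng.standard_normal((d, d)) for _ in range(L)]
x0 = rng.standard_normal(d)
xL = resnet_forward(x0, weights)

In this picture the layer index l plays the role of time t = l/L, and a mean-field analysis in this setting tracks the distribution of the parameters θ(t) along this time axis rather than the individual layers.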

Citations

On the Global Convergence of Gradient Descent for Multi-Layer ResNets in the Mean-Field Regime
TLDR
It is shown that if the ResNet is sufficiently large, with depth and width depending algebraically on the accuracy and confidence levels, first order optimization methods can find global minimizers that fit the training data.
Overparameterization of deep ResNet: zero loss and mean-field analysis
TLDR
This work uses a mean-field-limit argument to prove that, in the large-network limit, gradient descent for parameter training becomes a partial differential equation (PDE) that characterizes gradient flow for a probability distribution, and shows that the solution of the PDE converges over the training time to a zero-loss solution (a generic form of such a gradient-flow PDE is sketched at the end of this list).
Neural Network Approximation: Three Hidden Layers Are Enough
Convergence Analysis of Deep Residual Networks
TLDR
A matrix-vector description of general deep neural networks with shortcut connections is given and an explicit expression for the networks is formulated by using the notions of activation domains and activation matrices to characterize the convergence of deep Residual Networks.
Deep Architecture Connectivity Matters for Its Convergence: A Fine-Grained Analysis
TLDR
This work theoretically characterizes the impact of connectivity patterns on the convergence of DNNs under gradient descent training at fine granularity, and shows that a simple filtration of "unpromising" connectivity patterns can trim down the number of models to evaluate, significantly accelerating large-scale neural architecture search without any overhead.
Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks
We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function. We further show that the…
On Feature Learning in Neural Networks with Global Convergence Guarantees
TLDR
A model of wide multi-layer NNs whose second-to-last layer is trained via GF is studied, for which it is proved that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
Provably convergent quasistatic dynamics for mean-field two-player zero-sum games
TLDR
Inspired by the continuous dynamics of probability distributions, a quasistatic Langevin gradient descent method with inner-outer iterations is derived and tested on different problems, including training mixtures of GANs.
Extending the limit of molecular dynamics with ab initio accuracy to 10 billion atoms
TLDR
This work opens the door for unprecedentedly large-scale molecular dynamics simulations based on ab initio accuracy and can be potentially utilized in studying more realistic applications such as mechanical properties of metals, semiconductor devices, batteries, etc.
Learning Poisson systems and trajectories of autonomous systems via Poisson neural networks
TLDR
This work demonstrates through several simulations that PNNs are capable of handling very accurately several challenging tasks, including the motion of a particle in the electromagnetic potential, the nonlinear Schrödinger equation, and pixel observations of the two-body problem.
…
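Several of the follow-up works listed above (for example, the zero-loss mean-field analysis) describe training in the infinite-depth limit as a gradient flow of a probability distribution over layer parameters. As a purely illustrative sketch, with notation assumed here rather than taken from any of the papers above ($\rho_t$ the parameter distribution at training time $t$, $E[\rho]$ the population loss), such a Wasserstein gradient-flow PDE takes the generic form

$\partial_t \rho_t(\theta) = \nabla_\theta \cdot \Big( \rho_t(\theta)\, \nabla_\theta \frac{\delta E[\rho_t]}{\delta \rho}(\theta) \Big),$

i.e. the continuity equation for parameters moving along the negative variational gradient of the loss; a zero-loss result then amounts to showing that $E[\rho_t] \to 0$ as $t \to \infty$.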

References

SHOWING 1-10 OF 91 REFERENCES
Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017.
Neural Ordinary Differential Equations
TLDR
This work shows how to scalably backpropagate through any ODE solver, without access to its internal operations, which allows end-to-end training of ODEs within larger models.
Representing smooth functions as compositions of near-identity functions with implications for deep network optimization
We show that any smooth bi-Lipschitz $h$ can be represented exactly as a composition $h_m \circ ... \circ h_1$ of functions $h_1,...,h_m$ that are close to the identity in the sense that each…
Residual Networks Behave Like Ensembles of Relatively Shallow Networks
TLDR
This work proposes a novel interpretation of residual networks, showing that they can be seen as a collection of many paths of differing length, and reveals one of the key characteristics that seem to enable the training of very deep networks: residual networks avoid the vanishing gradient problem by introducing short paths which can carry gradient throughout the extent of very deep networks.
Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations
TLDR
It is shown that many effective networks, such as ResNet, PolyNet, FractalNet and RevNet, can be interpreted as different numerical discretizations of differential equations, and a connection is established between stochastic control and noise injection in the training process, which helps to improve generalization of the networks.
Scalable Gradients for Stochastic Differential Equations
The adjoint sensitivity method scalably computes gradients of solutions to ordinary differential equations. We generalize this method to stochastic differential equations, allowing time-efficient and…
Are deep ResNets provably better than linear predictors?
TLDR
The main theorem on deep ResNets shows, under simple geometric conditions, that any critical point in the optimization landscape is either at least as good as the best linear predictor, or the Hessian at this critical point has a strictly negative eigenvalue.
Convergence Theory of Learning Over-parameterized ResNet: A Full Characterization.
TLDR
This paper fully characterizes the convergence theory of gradient descent for learning over-parameterized ResNet with different values of $\tau$, showing that for $\tau \le 1/\sqrt{L}$ gradient descent is guaranteed to converge to the global minima, and that in particular when $\tau \le 1/L$ the convergence is independent of the network depth.
Mathematical Theory of Optimal Processes
Augmented Neural ODEs
TLDR
Augmented Neural ODEs are introduced which, in addition to being more expressive models, are empirically more stable, generalize better, and have a lower computational cost than Neural ODEs.
…