Corpus ID: 236965657

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions

@article{Jentzen2021APO,
  title={A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions},
  author={Arnulf Jentzen and Adrian Riekert},
  journal={ArXiv},
  year={2021},
  volume={abs/2108.04620}
}
Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Despite the great success of GD type optimization methods in numerical simulations for the training of ANNs with ReLU activation, it remains – even in the simplest situation of the plain vanilla GD optimization method with random initializations and ANNs with one hidden layer – an open problem to prove (or disprove) the conjecture…
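For orientation, the setting described in the abstract can be made concrete with a small numerical experiment. The following is a minimal illustrative sketch (not the article's construction; the target function, network width, learning rate, and data distribution are assumptions chosen here) of plain-vanilla gradient descent with a random initialization for a fully-connected ANN with one hidden layer and ReLU activation, fitting a piecewise linear target function:

```python
# Minimal sketch (illustrative assumptions, not the article's method): plain-vanilla
# gradient descent with a random initialization for a one-hidden-layer ReLU network,
# fitting a piecewise linear target function on [0, 1].
import numpy as np

rng = np.random.default_rng(0)

# Piecewise linear target function f: [0, 1] -> R (illustrative choice).
def target(x):
    return np.where(x < 0.5, 2.0 * x, 2.0 - 2.0 * x)

# Training data drawn from the (uniform) input distribution.
n_samples, width = 256, 32
x = rng.uniform(0.0, 1.0, size=(n_samples, 1))
y = target(x)

# Random initialization of the one-hidden-layer ReLU network.
w1 = rng.normal(scale=1.0, size=(1, width))
b1 = np.zeros(width)
w2 = rng.normal(scale=1.0 / np.sqrt(width), size=(width, 1))
b2 = np.zeros(1)

def forward(x):
    z = x @ w1 + b1             # pre-activations of the hidden layer
    h = np.maximum(z, 0.0)      # ReLU activation
    return z, h, h @ w2 + b2    # realization function of the network

lr = 0.1
for step in range(2000):
    z, h, pred = forward(x)
    err = pred - y                          # residual of the empirical risk
    # Backpropagation for the squared-error (empirical) risk.
    grad_pred = 2.0 * err / n_samples
    grad_w2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ w2.T
    grad_z = grad_h * (z > 0.0)             # ReLU "derivative" (set to 0 at the kink)
    grad_w1 = x.T @ grad_z
    grad_b1 = grad_z.sum(axis=0)
    # Plain-vanilla GD step (no momentum, no adaptivity).
    w1 -= lr * grad_w1
    b1 -= lr * grad_b1
    w2 -= lr * grad_w2
    b2 -= lr * grad_b2

print("final empirical risk:", float(np.mean((forward(x)[2] - y) ** 2)))
```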
1 Citation
Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation
TLDR: Two basic results for GF differential equations are established in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation, under the assumption that the probability distribution of the input data of the considered supervised learning problem is absolutely continuous with a bounded density function.
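For orientation, the "GF differential equations" mentioned in this summary are gradient flows of the risk, and gradient descent arises as their explicit Euler discretization. A minimal sketch in assumed notation ($\mathcal{L}$ for the risk function, $\Theta$ for the ANN parameter vector, $\gamma$ for the step size; none of this is taken verbatim from the cited article, and the ReLU risk is only generalized-differentiable at the kinks):

```latex
% Sketch under assumed notation (not the cited article's statement):
% a GF trajectory and its gradient descent (explicit Euler) discretization.
\begin{align}
  \frac{\mathrm{d}}{\mathrm{d}t}\,\Theta_t &= -\nabla\mathcal{L}(\Theta_t),
    \qquad \Theta_0 = \theta_{\mathrm{init}},\\
  \Theta_{n+1} &= \Theta_n - \gamma\,\nabla\mathcal{L}(\Theta_n)
    \qquad \text{(gradient descent step with learning rate } \gamma\text{)}.
\end{align}
```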

References

Showing 1-10 of 59 references
A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions
TLDR: This article proves, in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation, that the risk of the gradient descent optimization method does indeed converge to zero in the special situation where the target function under consideration is a constant function.
Convergence analysis for gradient flows in the training of artificial neural networks with ReLU activation
TLDR: This article proves, in the case of a 1-dimensional affine linear target function and in the case where the probability distribution of the input data coincides with the standard uniform distribution, that the risk of every bounded GF trajectory converges to zero if the initial risk is sufficiently small; the article also treats the case where the target function is possibly multi-dimensional and continuous.
A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions
TLDR: This article studies the stochastic gradient descent (SGD) optimization method in the training of fully-connected feedforward artificial neural networks with ReLU activation and proves that the risk of the SGD process converges to zero if the target function under consideration is constant.
Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks
TLDR: This work analyzes for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss and proves that two conditions which guarantee efficient convergence from random initializations do in fact hold, under the assumptions of nondegenerate inputs and overparameterization.
Gradient descent optimizes over-parameterized deep ReLU networks
TLDR: The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.
Strong error analysis for stochastic gradient descent optimization algorithms
Stochastic gradient descent (SGD) optimization algorithms are key ingredients in a series of machine learning applications. In this article we perform a rigorous strong error analysis for SGD…
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
TLDR: Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows the use of a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.
Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates
TLDR: This article establishes, for every $\gamma, \nu \in (0,\infty)$, essentially matching lower and upper bounds for the mean square error of the SGD process with learning rates depending on $\gamma$ and $\nu$, associated to a simple quadratic stochastic optimization problem.
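As a concrete illustration of this kind of statement, the following minimal sketch runs SGD on one simple quadratic stochastic optimization problem with polynomially decaying learning rates and estimates the resulting mean square error; the objective $\mathbb{E}[(\theta - Z)^2/2]$, the noise model, and the learning-rate form $\gamma n^{-\nu}$ are assumptions made here for illustration, not details taken from the cited article.

```python
# Minimal sketch (assumptions, not the cited article's setup): SGD for the quadratic
# stochastic objective E[(theta - Z)^2 / 2] with Z ~ N(mu, 1), whose minimizer is mu,
# using polynomially decaying learning rates gamma * n**(-nu).
import numpy as np

def sgd_quadratic(gamma, nu, n_steps=10_000, mu=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.0                               # deterministic initialization
    for n in range(1, n_steps + 1):
        z = rng.normal(loc=mu, scale=1.0)     # one sample of the random variable Z
        grad = theta - z                      # unbiased gradient of (theta - z)^2 / 2
        theta -= gamma * n ** (-nu) * grad    # SGD step with decaying learning rate
    return theta

# Mean square error estimated over independent runs, for a slow and a fast decay.
for nu in (0.25, 0.75):
    errs = [(sgd_quadratic(1.0, nu, seed=s) - 1.0) ** 2 for s in range(100)]
    print(f"nu = {nu}: estimated MSE after 10_000 steps = {np.mean(errs):.2e}")
```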
A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics
TLDR: In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels.
Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks
TLDR: It is shown that in the limit that the number of parameters $n$ is large, the landscape of the mean-squared error becomes convex and the representation error in the function scales as $O(n^{-1})$.