# Non-convergence of stochastic gradient descent in the training of deep neural networks

```bibtex
@article{Cheridito2021NonconvergenceOS,
  title   = {Non-convergence of stochastic gradient descent in the training of deep neural networks},
  author  = {Patrick Cheridito and Arnulf Jentzen and Florian Rossmannek},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2006.07075}
}
```

## 15 Citations

A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions

- Computer Science, Mathematics · Journal of Complexity
- 2022

Constructive Deep ReLU Neural Network Approximation

- Computer Science, Mathematics · J. Sci. Comput.
- 2022

An efficient, deterministic algorithm is presented for constructing exponentially convergent deep neural network approximations of multivariate, analytic maps f : [−1, 1]^K → ℝ, and exponential convergence of the expression and generalization errors of the constructed ReLU DNNs is proved.

On the Omnipresence of Spurious Local Minima in Certain Neural Network Training Problems

- Computer Science · ArXiv
- 2022

It is shown that the loss landscape of training problems for deep artificial neural networks with a one-dimensional real output, whose activation functions contain an affine segment and whose hidden layers have width at least two, possesses a continuum of spurious local minima for all target functions that are not affine.

Stochastic Weight Averaging Revisited

- Computer Science · ArXiv
- 2022

It is shown that SWA's performance depends strongly on the extent to which the SGD process run before SWA has converged, and that the weight-averaging operation contributes only to variance reduction.
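As a hedged illustration of the variance-reduction effect described above (a minimal toy sketch, not the paper's setup), stochastic weight averaging amounts to averaging the tail iterates of a noisy SGD run:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss L(w) = ||w||^2 / 2 with minimum at w = 0.
# Noisy SGD iterates oscillate around the minimum; averaging the
# tail iterates (SWA) reduces this variance.
w = rng.normal(size=2)
lr = 0.1
snapshots = []
for step in range(500):
    grad = w + 0.1 * rng.normal(size=2)  # noisy gradient of L
    w = w - lr * grad
    if step >= 250:                      # average only after SGD has settled
        snapshots.append(w.copy())

w_swa = np.mean(snapshots, axis=0)       # the averaged (SWA) solution
print(np.linalg.norm(w_swa), np.linalg.norm(w))
```

In this sketch the averaged iterate is typically much closer to the optimum than the final SGD iterate, but only because the preceding SGD run already reached the neighborhood of the minimum, which mirrors the dependence noted in the citation.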

Supply Chain Management Optimization and Prediction Model Based on Projected Stochastic Gradient

- Business · Sustainability
- 2022

Supply chain management (SCM) is at the forefront of many organizations' product delivery. Various optimization methods are applied in SCM to improve efficiency…

A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions

- Computer Science · ArXiv
- 2021

This work studies the stochastic gradient descent (SGD) optimization method in the training of fully connected feedforward artificial neural networks with ReLU activation and proves that the risk of the SGD process converges to zero if the target function under consideration is constant.
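A minimal sketch of this setting (a hypothetical one-hidden-unit network, far smaller than the architectures analysed in the citation): since a constant target can be fitted exactly, the gradient noise vanishes at the optimum and SGD can drive the risk to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
target = 2.0                             # constant target function

# Tiny network f(x) = v * relu(w*x + b) + c (illustrative parameters only).
w, b, v, c = rng.normal(size=4)
lr = 0.05
for _ in range(2000):
    x = rng.uniform(0.0, 1.0)            # sample a training input
    h = max(w * x + b, 0.0)              # hidden ReLU activation
    err = (v * h + c) - target           # derivative of (f(x) - target)^2 / 2
    active = 1.0 if w * x + b > 0 else 0.0
    w -= lr * err * v * active * x       # backprop through the ReLU
    b -= lr * err * v * active
    v -= lr * err * h
    c -= lr * err                        # output bias alone can fit a constant

# Empirical risk on a grid after training.
risk = np.mean([(v * max(w * x + b, 0.0) + c - target) ** 2
                for x in np.linspace(0.0, 1.0, 100)])
print(risk)
```

Note that even if the hidden unit dies, the output bias `c` still receives gradient on every step, so the risk for a constant target decays regardless, which is one intuition behind the convergence result.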

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions

- Mathematics, Computer Science · ArXiv
- 2021

This article proves the conjecture that the risk of the GD optimization method converges to zero in the training of such ANNs as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity, in the situation where the probability distribution of the input data is equivalent to the continuous uniform distribution on a compact interval.

Convergence of stochastic gradient descent schemes for Lojasiewicz-landscapes

- Computer Science · ArXiv
- 2021

It is shown that for neural networks with an analytic activation function, such as softplus, sigmoid, or the hyperbolic tangent, SGD converges on the event of staying local if the random variables modeling the signal and response in the training are compactly supported.

Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions

- Computer Science · ArXiv
- 2021

It is proved, under the assumption that the learning rates of the SGD optimization method are sufficiently small but not L¹-summable, that the expectation of the risk of the considered SGD process converges to zero in the training of such DNNs as the number of SGD steps increases to infinity.

Convergence rates for gradient descent in the training of overparameterized artificial neural networks with biases

- Computer Science · ArXiv
- 2021

This paper presents a meta-modelling system that automates the very labor-intensive, and therefore time-consuming and expensive, process of manually cataloging data elements of a distributed system.

## References

Showing 1–10 of 62 references.

Dying ReLU and Initialization: Theory and Numerical Examples

- Computer Science · ArXiv
- 2019

This paper rigorously proves that a deep ReLU network will eventually die in probability as the depth goes to infinity, and proposes a new initialization procedure, namely a randomized asymmetric initialization, which can effectively prevent the dying ReLU problem.
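The dying-ReLU phenomenon described above can be illustrated with a minimal sketch (a single hypothetical unit, not the deep-network setting of the citation): once the pre-activation is negative for every input in the data domain, the unit outputs zero and receives zero gradient, so gradient-based training can never revive it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(1000, 3))   # toy inputs in [-1, 1]^3

# A ReLU unit "dies" when its pre-activation is negative for every input.
w = rng.normal(size=3)
b = -10.0                                    # bias pushed far negative
pre = x @ w + b                              # pre-activation of the unit
out = np.maximum(pre, 0.0)                   # ReLU output
grad_mask = (pre > 0).astype(float)          # derivative of ReLU

print(out.max(), grad_mask.max())            # both zero: the unit is dead
```

Since |w·x| is bounded on the compact input domain while b = −10 dominates it, the pre-activation stays negative everywhere, which is the one-unit version of the event whose probability the paper analyses as depth grows.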

Trainability and Data-dependent Initialization of Overparameterized ReLU Neural Networks

- 2019

Topological Properties of the Set of Functions Generated by Neural Networks of Fixed Size

- Computer Science, Mathematics · Found. Comput. Math.
- 2021

Overall, the findings identify potential causes for issues in the training procedure of deep learning such as no guaranteed convergence, explosion of parameters, and slow convergence.

A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics

- Computer Science · Science China Mathematics
- 2020

In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels.

Convergence rates for the stochastic gradient descent method for non-convex objective functions

- Computer Science, Mathematics · J. Mach. Learn. Res.
- 2020

We prove the local convergence to minima and estimates on the rate of convergence for the stochastic gradient descent method in the case of not necessarily globally convex nor contracting objective…

Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates

- Computer Science · J. Complex.
- 2020

Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation

- Computer Science · ArXiv
- 2020

This article provides a mathematically rigorous full error analysis of deep learning based empirical risk minimisation with quadratic loss function in the probabilistically strong sense, where the underlying deep neural networks are trained using stochastic gradient descent with random initialisation.

Stochastic Gradient Descent for Nonconvex Learning Without Bounded Gradient Assumptions

- Computer Science · IEEE Transactions on Neural Networks and Learning Systems
- 2020

This article establishes a rigorous theoretical foundation for SGD in nonconvex learning by showing that the bounded-gradient assumption can be removed without affecting convergence rates, and by relaxing the standard smoothness assumption to Hölder continuity of gradients.

Trainability of ReLU Networks and Data-dependent Initialization

- Computer Science
- 2020

This paper studies the trainability of rectified linear unit (ReLU) networks at initialization, and shows that overparameterization is both a necessary and a sufficient condition for achieving a zero training loss.

The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent

- Computer Science · ICML
- 2020

It is observed that the combination of batch normalization and skip connections reduces gradient confusion, which helps reduce the training burden of very deep networks with Gaussian initializations.