Non-convergence of stochastic gradient descent in the training of deep neural networks

@article{Cheridito2021NonconvergenceOS,
  title={Non-convergence of stochastic gradient descent in the training of deep neural networks},
  author={Patrick Cheridito and Arnulf Jentzen and Florian Rossmannek},
  journal={ArXiv},
  year={2021},
  volume={abs/2006.07075}
}
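
As a rough, hypothetical illustration of the non-convergence phenomenon named in the title (a minimal sketch only; the toy 1-D target, He-style initialization, network size, and all hyperparameters below are assumptions of this example, not taken from the paper), one can train deep, narrow ReLU networks with SGD from many independent random initializations and count the runs whose empirical risk never improves on the best constant fit:

import numpy as np

rng = np.random.default_rng(0)

def init_params(widths):
    # He-style random initialization for a fully connected ReLU network (biases start at zero).
    return [[rng.normal(0.0, np.sqrt(2.0 / m), size=(n, m)), np.zeros(n)]
            for m, n in zip(widths[:-1], widths[1:])]

def forward(params, x):
    # Returns pre-activations and activations; x has shape (input_dim, batch).
    pre, acts = [], [x]
    a = x
    for i, (w, b) in enumerate(params):
        z = w @ a + b[:, None]
        pre.append(z)
        a = np.maximum(z, 0.0) if i < len(params) - 1 else z   # ReLU on hidden layers only
        acts.append(a)
    return pre, acts

def sgd_step(params, x, y, lr):
    # One SGD step on the mean squared error; gradients computed by hand.
    pre, acts = forward(params, x)
    delta = 2.0 * (acts[-1] - y) / x.shape[1]                  # gradient w.r.t. output pre-activation
    for i in reversed(range(len(params))):
        w, b = params[i]
        grad_w, grad_b = delta @ acts[i].T, delta.sum(axis=1)
        if i > 0:
            delta = (w.T @ delta) * (pre[i - 1] > 0)           # backpropagate through the ReLU
        params[i][0], params[i][1] = w - lr * grad_w, b - lr * grad_b

def risk(params, x, y):
    return float(np.mean((forward(params, x)[1][-1] - y) ** 2))

x = rng.uniform(-1.0, 1.0, size=(1, 256))
y = np.sin(np.pi * x)                                          # a non-constant toy target
depth, width, runs, steps, lr = 10, 5, 100, 2000, 0.01
const_risk = float(np.var(y))                                  # risk of the best constant approximation
stuck = 0
for _ in range(runs):
    params = init_params([1] + [width] * (depth - 1) + [1])
    for _ in range(steps):
        idx = rng.integers(0, x.shape[1], size=32)             # mini-batch SGD
        sgd_step(params, x[:, idx], y[:, idx], lr)
    stuck += risk(params, x, y) >= 0.99 * const_risk
print(f"runs that never improved on the best constant fit: {stuck}/{runs}")

The count probes how often the random initialization leaves the network effectively inactive on the training data, which is the kind of event that blocks further progress of gradient-based training.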
Constructive Deep ReLU Neural Network Approximation
TLDR
An efficient, deterministic algorithm for constructing exponentially convergent deep neural network approximations of multivariate, analytic maps f : [-1,1]^K → R is presented, and exponential convergence of the expression and generalization errors of the constructed ReLU DNNs is proved.
On the Omnipresence of Spurious Local Minima in Certain Neural Network Training Problems
TLDR
It is shown that the loss landscape of training problems for deep artificial neural networks with a one-dimensional real output, whose activation functions contain an affine segment and whose hidden layers have width at least two, possesses a continuum of spurious local minima for all target functions that are not affine.
Stochastic Weight Averaging Revisited
TLDR
It is shown that SWA's performance depends strongly on how far the SGD process that runs before SWA has converged, and that the weight-averaging operation contributes only to variance reduction.
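
The SWA entry above concerns averaging SGD iterates; a minimal sketch of the idea on a toy least-squares problem (the problem, the batch size, and the burn-in schedule are illustrative assumptions, not taken from the cited paper):

import numpy as np

rng = np.random.default_rng(1)

# Toy linear least-squares problem: minimize mean((A @ w - b)**2) over w.
A = rng.normal(size=(500, 20))
w_true = rng.normal(size=20)
b = A @ w_true + 0.5 * rng.normal(size=500)

def loss(w):
    return float(np.mean((A @ w - b) ** 2))

w = np.zeros(20)
w_avg, n_avg = None, 0
lr, n_steps, burn_in = 0.01, 5000, 2500        # hypothetical schedule

for step in range(n_steps):
    i = rng.integers(0, A.shape[0], size=32)               # mini-batch
    grad = 2.0 * A[i].T @ (A[i] @ w - b[i]) / len(i)       # stochastic gradient
    w = w - lr * grad                                      # plain SGD step
    if step >= burn_in:                                    # SWA: running average of the tail iterates
        n_avg += 1
        w_avg = w.copy() if w_avg is None else w_avg + (w - w_avg) / n_avg

print(f"loss of last SGD iterate: {loss(w):.4f}")
print(f"loss of averaged iterate: {loss(w_avg):.4f}")

Averaging only the tail iterates is meant to show the variance-reduction effect: the averaged point fluctuates less than the last raw SGD iterate.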
Supply Chain Management Optimization and Prediction Model Based on Projected Stochastic Gradient
Supply chain management (SCM) is considered at the forefront of many organizations in the delivery of their products. Various optimization methods are applied in the SCM to improve the efficiency of …
A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions
TLDR
It is proved that, when fully connected feedforward artificial neural networks with ReLU activation are trained with the stochastic gradient descent optimization method, the risk of the SGD process converges to zero if the target function under consideration is constant.
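
A minimal sketch of the setting described above, under assumptions of this example (one hidden layer rather than a general architecture, a fixed constant target c, and arbitrary toy hyperparameters): a ReLU network is trained with SGD on the constant target and the empirical risk is printed as training proceeds.

import numpy as np

rng = np.random.default_rng(2)

# One-hidden-layer ReLU network trained by SGD on the constant target y = c.
c, width, lr = 2.0, 16, 0.05
x = rng.uniform(0.0, 1.0, size=(1, 512))                   # training inputs
w1 = rng.normal(0.0, 1.0, size=(width, 1)); b1 = np.zeros((width, 1))
w2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=(1, width)); b2 = np.zeros((1, 1))

for step in range(5001):
    i = rng.integers(0, x.shape[1], size=32)               # mini-batch
    xb = x[:, i]
    z = w1 @ xb + b1                                       # pre-activations
    h = np.maximum(z, 0.0)                                 # ReLU
    out = w2 @ h + b2
    err = 2.0 * (out - c) / xb.shape[1]                    # d(MSE)/d(out)
    gw2 = err @ h.T; gb2 = err.sum(axis=1, keepdims=True)
    dh = (w2.T @ err) * (z > 0)                            # backpropagate through the ReLU
    gw1 = dh @ xb.T; gb1 = dh.sum(axis=1, keepdims=True)
    w1 -= lr * gw1; b1 -= lr * gb1; w2 -= lr * gw2; b2 -= lr * gb2
    if step % 1000 == 0:
        risk = float(np.mean((w2 @ np.maximum(w1 @ x + b1, 0.0) + b2 - c) ** 2))
        print(f"step {step:5d}  empirical risk {risk:.6f}")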
A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions
TLDR
This article proves the conjecture that the risk of the GD optimization method converges to zero in the training of such ANNs as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity, in the situation where the probability distribution of the input data is equivalent to the continuous uniform distribution on a compact interval.
Convergence of stochastic gradient descent schemes for Łojasiewicz-landscapes
TLDR
It is shown that, for neural networks with analytic activation functions such as softplus, sigmoid, and the hyperbolic tangent, SGD converges on the event of staying local if the random variables modeling the signal and response in the training are compactly supported.
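
For context on the "Łojasiewicz-landscape" terminology above: in one standard form from the general literature (not quoted from the cited paper), the Łojasiewicz gradient inequality states that for a real-analytic f and a critical point x* there exist C > 0, an exponent β ∈ [1/2, 1), and a neighbourhood U of x* with

\[
  \lvert f(x) - f(x^{*}) \rvert^{\beta} \;\le\; C \,\lVert \nabla f(x) \rVert
  \qquad \text{for all } x \in U .
\]

Landscapes satisfying such an inequality exclude flat non-critical plateaus near x*, which is the structural property behind local-convergence statements of this type.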
Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions
TLDR
It is proved, under the assumption that the learning rates of the SGD optimization method are sufficiently small but not L1-summable, that the expectation of the risk of the considered SGD process converges to zero in the training of such DNNs as the number of SGD steps increases to infinity.
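
A standard reading of the "sufficiently small but not L1-summable" learning-rate assumption above (this is the usual meaning of such conditions, not a quotation from the paper): the learning rates (γ_n) satisfy

\[
  \gamma_n \in (0, c] \ \text{ for a sufficiently small } c > 0
  \qquad \text{and} \qquad
  \sum_{n=1}^{\infty} \gamma_n = \infty ,
\]

for example γ_n = c/n; the steps stay uniformly small, but their sum diverges, so the iterates are not frozen prematurely.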
Convergence rates for gradient descent in the training of overparameterized artificial neural networks with biases

References

Showing 1-10 of 62 references
Dying ReLU and Initialization: Theory and Numerical Examples
TLDR
This paper rigorously proves that a deep ReLU network will eventually die in probability as the depth goes to infinity, and proposes a new initialization procedure, namely a randomized asymmetric initialization, which can effectively prevent the dying ReLU.
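
As a quick, hypothetical illustration of the "die in probability as the depth grows" statement above (the widths, the zero-bias symmetric initialization, and the Monte Carlo setup are assumptions of this sketch, not taken from the paper), one can estimate how often a randomly initialized deep, narrow ReLU network is already constant on a sample of inputs:

import numpy as np

rng = np.random.default_rng(3)

def born_dead(widths, x):
    # True if, for one random initialization (zero biases assumed here), every ReLU
    # in some hidden layer is inactive on all sample inputs, so the network is constant.
    a = x
    for fan_in, fan_out in zip(widths[:-2], widths[1:-1]):
        w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
        a = np.maximum(w @ a, 0.0)
        if not np.any(a > 0.0):
            return True
    return False

x = rng.uniform(-1.0, 1.0, size=(1, 200))        # sample inputs on which activity is checked
trials = 2000
for depth in (3, 10, 20, 40):
    widths = [1] + [2] * (depth - 1) + [1]       # very narrow layers make the effect visible
    dead = sum(born_dead(widths, x) for _ in range(trials))
    print(f"depth {depth:2d}: estimated probability of a dead network ≈ {dead / trials:.2f}")

A network flagged here computes a constant function of its input at initialization, which is exactly the kind of event the cited initialization scheme is designed to avoid.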
Trainability and Data-dependent Initialization of Overparameterized ReLU Neural Networks (2019)
Topological Properties of the Set of Functions Generated by Neural Networks of Fixed Size
TLDR
Overall, the findings identify potential causes of issues in deep learning training procedures, such as no guaranteed convergence, explosion of parameters, and slow convergence.
A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics
TLDR
In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels.
Convergence rates for the stochastic gradient descent method for non-convex objective functions
We prove the local convergence to minima and estimates on the rate of convergence for the stochastic gradient descent method in the case of not necessarily globally convex nor contracting objective functions.
Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation
TLDR
This article provides a mathematically rigorous full error analysis of deep learning based empirical risk minimisation with quadratic loss function in the probabilistically strong sense, where the underlying deep neural networks are trained using stochastic gradient descent with random initialisation.
Stochastic Gradient Descent for Nonconvex Learning Without Bounded Gradient Assumptions
TLDR
This article establishes a rigorous theoretical foundation for SGD in nonconvex learning by showing that the bounded-gradient assumption can be removed without affecting convergence rates, and by relaxing the standard smoothness assumption to Hölder continuity of gradients.
Trainability of ReLU Networks and Data-dependent Initialization
TLDR
This paper studies the trainability of rectified linear unit (ReLU) networks at initialization, and shows that overparameterization is both a necessary and a sufficient condition for achieving a zero training loss.
The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
TLDR
It is observed that the combination of batch normalization and skip connections reduces gradient confusion, which helps reduce the training burden of very deep networks with Gaussian initializations.
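
The "gradient confusion" in the entry above refers to how negatively the per-sample gradients of different training examples can correlate; a minimal sketch of one way to measure it on a toy linear least-squares model (the model, the data, and this particular estimator are illustrative assumptions, not taken from the cited paper):

import numpy as np

rng = np.random.default_rng(4)

# Toy linear model with squared loss; per-sample gradients are easy to write down.
n, d = 64, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)                                     # current iterate

# Per-sample gradient of (x_i . w - y_i)^2 with respect to w.
grads = 2.0 * (X @ w - y)[:, None] * X                     # shape (n, d)

# Gradient confusion: how negative the most "conflicting" pairwise inner product is.
inner = grads @ grads.T                                    # all pairwise inner products
np.fill_diagonal(inner, 0.0)                               # ignore i == j
confusion = max(0.0, float(-inner.min()))
print(f"gradient confusion estimate: {confusion:.3f}")

Lower values mean the per-sample gradients rarely point against each other, the regime in which SGD steps on individual samples tend not to undo each other's progress.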