Non-convergence of stochastic gradient descent in the training of deep neural networks

@article{Cheridito2021NonconvergenceOS,
  title={Non-convergence of stochastic gradient descent in the training of deep neural networks},
  author={Patrick Cheridito and Arnulf Jentzen and Florian Rossmannek},
  journal={ArXiv},
  year={2021},
  volume={abs/2006.07075}
}
Discrete Gradient Flow Approximations of High Dimensional Evolution Partial Differential Equations via Deep Neural Networks
TLDR
A series of numerical experiments is presented which showcases the good performance of Dirichlet-type energy approximations in lower space dimensions and the excellent performance of the JKO-type energies in higher spatial dimensions.
Supply Chain Management Optimization and Prediction Model Based on Projected Stochastic Gradient
Supply chain management (SCM) is at the forefront of how many organizations deliver their products. Various optimization methods are applied in SCM to improve the efficiency of
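The projected stochastic gradient method named in the title is not spelled out in this snippet; the following is a minimal sketch of a single projected-SGD iteration in NumPy, where the box constraint, objective, and step size are purely illustrative assumptions rather than anything taken from the paper.

    import numpy as np

    def projected_sgd_step(x, grad, lr, lower, upper):
        # One projected-SGD iteration: take a gradient step, then project
        # the iterate back onto the feasible box [lower, upper]^d via clipping.
        return np.clip(x - lr * grad, lower, upper)

    # Illustrative use: minimize ||x - target||^2 over the box [0, 10]^3
    # using noisy gradients (a stand-in for stochastic gradients).
    rng = np.random.default_rng(0)
    target = np.array([2.0, 12.0, -1.0])
    x = rng.uniform(0.0, 10.0, size=3)
    for _ in range(200):
        grad = 2.0 * (x - target) + rng.normal(scale=0.1, size=3)
        x = projected_sgd_step(x, grad, lr=0.05, lower=0.0, upper=10.0)
    # x ends up near the projection of `target` onto the box, roughly [2, 10, 0].
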
On the Omnipresence of Spurious Local Minima in Certain Neural Network Training Problems
TLDR
It is shown that the loss landscape of training problems for deep artificial neural networks with a one-dimensional real output, whose activation functions contain an affine segment and whose hidden layers have width at least two, possesses a continuum of spurious local minima for all target functions that are not affine.
Stochastic Weight Averaging Revisited
TLDR
It is shown that PSWA remarkably outperforms its backbone SGD during the early stage of the SGD sampling process, supporting the hypothesis that there are global-scale geometric structures in the DNN loss landscape which can be discovered by an SGD agent at the early stage of its working period and exploited by the WA operation.
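The weight-averaging (WA) operation referred to above simply maintains a running average of the parameter vectors visited by SGD. The toy NumPy sketch below illustrates the idea on a quadratic objective with noisy gradients; the averaging period and step size are illustrative assumptions, not the PSWA schedule used in the paper.

    import numpy as np

    # Toy sketch of the weight-averaging (WA) idea: run SGD on a noisy quadratic
    # and keep an incremental average of the iterates visited along the way.
    rng = np.random.default_rng(1)
    w = rng.normal(size=5)            # current SGD iterate
    w_avg, n_avg = np.zeros(5), 0     # running average of sampled iterates

    for step in range(1, 1001):
        grad = w + rng.normal(scale=0.5, size=5)   # noisy gradient of 0.5*||w||^2
        w -= 0.05 * grad                           # plain SGD step
        if step % 10 == 0:                         # sample the iterate periodically
            n_avg += 1
            w_avg += (w - w_avg) / n_avg           # incremental mean update
    # `w_avg` is typically closer to the minimizer (the origin) than the final
    # SGD iterate `w`, because the averaging suppresses the gradient noise.
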
Constructive Deep ReLU Neural Network Approximation
TLDR
An efficient, deterministic algorithm is presented for constructing exponentially convergent deep neural network approximations of multivariate, analytic maps f : [-1,1]^K → R, and exponential convergence of the expression and generalization errors of the constructed ReLU DNNs is proved.
Deep multimodal autoencoder for crack criticality assessment
TLDR
It is shown that latent variables forecast from the images of defects lend themselves to a better understanding of the predictions and enable projection-based model order reduction as proposed in the study of Lee and Carlberg.
Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions
TLDR
It is proved, under the assumption that the learning rates of the SGD optimization method are sufficiently small but not L^1-summable, that the expectation of the risk of the considered SGD process converges to zero in the training of such DNNs as the number of SGD steps increases to infinity.
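For concreteness, a canonical example of learning rates that are "sufficiently small but not L^1-summable" is gamma_n = c/n: the rates tend to zero while their partial sums diverge. The quick numerical check below is purely illustrative and not taken from the paper.

    import numpy as np

    n = np.arange(1, 10**6 + 1)
    gamma = 0.01 / n            # learning rates gamma_n = c / n with c = 0.01
    print(gamma[-1])            # the rates become arbitrarily small (here ~1e-8)
    print(gamma.sum())          # yet the partial sums keep growing (here ~0.14)
    # Since sum_n gamma_n diverges while gamma_n -> 0, this schedule is small
    # but not L^1-summable, matching the hypothesis quoted above.
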
Enhancement of Multilayer Perceptron Model Training Accuracy through the Optimization of Hyperparameters: A Case Study of the Quality Prediction of Injection Molded Parts
TLDR
Stochastic gradient descent (SGD) and SGD with momentum were used to optimize the artificial neural network model, and through optimization of these training hyperparameters the width testing accuracy for the injection-molded product improved.
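The two optimizers compared above differ only by a velocity term; the sketch below shows a generic SGD-with-momentum update on a toy quadratic (the objective, learning rate, and momentum coefficient are illustrative assumptions, not the hyperparameters tuned in the study).

    import numpy as np

    def sgd_momentum(grad_fn, w, lr=0.01, beta=0.9, steps=500):
        # Heavy-ball momentum: v <- beta * v + grad, then w <- w - lr * v.
        # Setting beta = 0 recovers plain SGD, the other optimizer compared above.
        v = np.zeros_like(w)
        for _ in range(steps):
            v = beta * v + grad_fn(w)
            w = w - lr * v
        return w

    # Illustrative use on a toy quadratic with noisy gradients.
    rng = np.random.default_rng(2)
    noisy_grad = lambda w: w + rng.normal(scale=0.1, size=w.shape)
    w_final = sgd_momentum(noisy_grad, rng.normal(size=4))
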
...

References (showing 1-10 of 62)
Dying ReLU and Initialization: Theory and Numerical Examples
TLDR
This paper rigorously proves that a deep ReLU network will eventually die in probability as the depth goes to infinity, and proposes a new initialization procedure, namely a randomized asymmetric initialization, which can effectively prevent the dying ReLU problem.
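The dying-ReLU phenomenon can be observed numerically by checking how often a randomly initialized network outputs zero on every probe input. The sketch below uses a standard symmetric Gaussian initialization with zero biases (so a dead layer propagates exact zeros); it is an illustrative estimate only and does not implement the paper's randomized asymmetric initialization.

    import numpy as np

    def estimate_born_dead_fraction(depth, width, n_nets=200, n_probes=1000):
        # Estimate how often a randomly initialized deep ReLU network is "born
        # dead", i.e. outputs exactly zero on every probe input. Biases are set
        # to zero, so once one layer outputs all zeros the rest stay zero.
        rng = np.random.default_rng(3)
        dead = 0
        for _ in range(n_nets):
            h = rng.normal(size=(n_probes, width))          # probe inputs
            for _ in range(depth):
                W = rng.normal(size=(width, width)) / np.sqrt(width)
                h = np.maximum(h @ W, 0.0)                  # ReLU layer, zero bias
            if np.all(h == 0.0):
                dead += 1
        return dead / n_nets

    # The estimated fraction increases towards 1 as `depth` grows with `width`
    # held fixed, which is the qualitative statement proved in the paper.
    print(estimate_born_dead_fraction(depth=30, width=3))
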
Trainability and Data-dependent Initialization of Overparameterized ReLU Neural Networks
2019
Trainability of ReLU Networks and Data-dependent Initialization
TLDR
This paper studies the trainability of rectified linear unit (ReLU) networks at initialization, and shows that overparameterization is both a necessary and a sufficient condition for achieving a zero training loss.
Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation
TLDR
This article provides a mathematically rigorous full error analysis of deep learning based empirical risk minimisation with quadratic loss function in the probabilistically strong sense, where the underlying deep neural networks are trained using stochastic gradient descent with random initialisation.
How implicit regularization of Neural Networks affects the learned function - Part I
TLDR
One-dimensional ReLU neural networks in which the weights are chosen randomly and only the terminal layer is trained are considered, and it is shown that the resulting solution converges to the smooth spline interpolation of the training data as the number of hidden nodes tends to infinity.
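The setting described above, random fixed hidden weights with only the terminal layer trained, can be reproduced in a few lines. In the sketch below the hidden weights' distributions are illustrative assumptions, and a minimum-norm least-squares fit stands in for training the output layer.

    import numpy as np

    rng = np.random.default_rng(4)
    n_hidden = 2000

    # Random one-dimensional ReLU features phi_k(x) = max(w_k * x + b_k, 0) with
    # w_k, b_k drawn once and kept fixed; only the terminal linear layer is fit.
    w = rng.normal(size=n_hidden)
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    features = lambda x: np.maximum(np.outer(x, w) + b, 0.0)

    # Training data sampled from a smooth target function.
    x_train = np.linspace(-1.0, 1.0, 20)
    y_train = np.sin(3.0 * x_train)

    # Minimum-norm least-squares fit of the terminal layer (a stand-in for
    # training only the output weights).
    coef, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

    # The fitted network interpolates the data; the paper studies how, as
    # n_hidden grows, such interpolants approach a smooth spline interpolation.
    x_test = np.linspace(-1.0, 1.0, 5)
    print(features(x_test) @ coef)
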
Full error analysis for the training of deep neural networks
TLDR
The main contribution of this work is to provide a full error analysis which covers each of the three different sources of error usually emerging in deep learning algorithms and which merges these three sources of error into one overall error estimate for the considered deep learning algorithm.
Trainability and Data-dependent Initialization of Over-parameterized ReLU Neural Networks
TLDR
This paper says a network is trainable if the number of active neurons is sufficiently large for a learning task, and proposes a data-dependent initialization method in the over-parameterized setting.
Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks
TLDR
This work analyzes for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss and proves that two conditions which guarantee efficient convergence from random initializations do in fact hold, under the assumptions of nondegenerate inputs and overparameterization.
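Natural gradient descent preconditions the gradient with the (damped) Fisher information matrix. The sketch below applies it to a linear model with squared-error loss, for which the Fisher matrix coincides with the Gauss-Newton matrix X^T X / n; the model, damping constant, and step size are illustrative assumptions, not the paper's overparameterized nonlinear setting.

    import numpy as np

    def natural_gradient_step(theta, X, y, lr=1.0, damping=1e-3):
        # One natural-gradient step for a linear model with squared-error loss.
        # For this loss the Fisher matrix equals the Gauss-Newton matrix X^T X / n.
        n = X.shape[0]
        grad = X.T @ (X @ theta - y) / n
        fisher = X.T @ X / n
        direction = np.linalg.solve(fisher + damping * np.eye(len(theta)), grad)
        return theta - lr * direction

    # Illustrative use: a handful of steps already recovers the true parameters
    # up to the noise level.
    rng = np.random.default_rng(5)
    X = rng.normal(size=(100, 10))
    theta_true = rng.normal(size=10)
    y = X @ theta_true + 0.01 * rng.normal(size=100)
    theta = np.zeros(10)
    for _ in range(5):
        theta = natural_gradient_step(theta, X, y)
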
How degenerate is the parametrization of neural networks with the ReLU activation function?
TLDR
The pathologies which prevent inverse stability in general are presented, and it is shown that, by optimizing over suitably restricted sets of parameters, it is still possible to learn any function which can be learned by optimization over unrestricted sets.
...