Corpus ID: 233714813

Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis

@article{Wojtowytsch2021StochasticGD,
  title={Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis},
  author={Stephan Wojtowytsch},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.01650}
}
Stochastic gradient descent (SGD) is one of the most popular algorithms in modern machine learning. The noise encountered in these applications is different from that in many theoretical analyses of stochastic gradient algorithms. In this article, we discuss some of the common properties of energy landscapes and stochastic noise encountered in machine learning problems, and how they affect SGD-based optimization. In particular, we show that the learning rate in SGD with machine learning noise…
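The abstract emphasizes that noise "of machine learning type" behaves differently from the homogeneous noise assumed in many classical analyses: for over-parameterized models that can interpolate the data, the minibatch noise vanishes at global minima, so its intensity effectively scales with the objective value. The following minimal sketch is not taken from the paper; the toy regression problem, function names, and hyperparameters are illustrative assumptions. It shows the effect empirically: as minibatch SGD drives the loss toward zero, the variance of the stochastic gradient shrinks with it.

```python
# Illustrative sketch (not from the paper): minibatch SGD on a noiseless
# least-squares problem, where an interpolating minimum exists, so every
# per-sample gradient vanishes at the minimum and the minibatch noise
# vanishes along with the loss.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                       # noiseless targets, so an interpolating minimum exists

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def minibatch_grad(w, batch):
    Xb, yb = X[batch], y[batch]
    return Xb.T @ (Xb @ w - yb) / len(batch)

w = np.zeros(d)
eta, B = 0.05, 10                    # learning rate and batch size (illustrative choices)
for step in range(2001):
    batch = rng.choice(n, size=B, replace=False)
    w = w - eta * minibatch_grad(w, batch)
    if step % 500 == 0:
        full_grad = X.T @ (X @ w - y) / n
        # empirical second moment of the minibatch gradient around the full gradient
        noise = np.mean([
            np.linalg.norm(minibatch_grad(w, rng.choice(n, size=B, replace=False)) - full_grad) ** 2
            for _ in range(50)
        ])
        print(f"step {step:5d}   loss {loss(w):.3e}   gradient noise {noise:.3e}")
```

In this setting the printed gradient-noise estimate decays together with the loss, which is the qualitative behavior the paper contrasts with the constant-variance noise of classical stochastic approximation.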
3 Citations


Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis
In a continuous time model for SGD with noise that follows the ‘machine learning scaling’, it is shown that in a certain noise regime, the optimization algorithm prefers ‘flat’ minima of the objective function in a sense which is different from the flat minimum selection of continuous-time SGD with homogeneous noise.
A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions
This article proves the conjecture that the risk of the GD optimization method converges in the training of such ANNs to zero as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity in the situation where the probability distribution of the input data is equivalent to the continuous uniform distribution on a compact interval.
Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity
The findings highlight the fact that structured noise can induce better generalisation, and they help explain the better performance of stochastic gradient descent over gradient descent observed in practice.

References

SHOWING 1-10 OF 36 REFERENCES
Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis
In a continuous time model for SGD with noise that follows the ‘machine learning scaling’, it is shown that in a certain noise regime, the optimization algorithm prefers ‘flat’ minima of the objective function in a sense which is different from the flat minimum selection of continuous-time SGD with homogeneous noise.
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
This paper investigates the optimality of SGD in a stochastic setting, and shows that for smooth problems the algorithm attains the optimal O(1/T) rate; however, for non-smooth problems the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.
Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
This work provides a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent as well as a simple modification where iterates are averaged, suggesting that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate, is not robust to the lack of strong convexity or the setting of the proportionality constant.
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…
Strong error analysis for stochastic gradient descent optimization algorithms
Stochastic gradient descent (SGD) optimization algorithms are key ingredients in a series of machine learning applications. In this article we perform a rigorous strong error analysis for SGD…
Optimization Methods for Large-Scale Machine Learning
A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion about the next generation of optimization methods for large-scale machine learning.
AdaGrad stepsizes: sharp convergence over nonconvex landscapes
The norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the O(log(N)/√N) rate in the stochastic setting, and at the optimal O(1/N) rate in the batch (non-stochastic) setting – in this sense, the convergence guarantees are “sharp”.
Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron
It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.
On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems
This paper analyzes the trajectories of stochastic gradient descent (SGD) to help understand the algorithm's convergence properties in non-convex problems. We first show that the sequence of iterates…
Linear Convergence of Adaptive Stochastic Gradient Descent
We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions…
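Two of the references above analyze AdaGrad-Norm, the scalar ("norm") variant of AdaGrad in which a single accumulator scales the whole gradient. A minimal sketch of that update rule follows, assuming the standard formulation b_k^2 = b_{k-1}^2 + ||g_k||^2 and x_{k+1} = x_k - (eta/b_k) g_k; the test problem, function names, and hyperparameters are illustrative assumptions, not taken from the cited papers.

```python
# Illustrative AdaGrad-Norm sketch: one scalar step-size accumulator,
# no per-coordinate scaling.
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, b0=1e-8, steps=2000):
    """AdaGrad-Norm: a single scalar accumulator scales every coordinate equally."""
    x = np.asarray(x0, dtype=float)
    b2 = b0 ** 2
    for _ in range(steps):
        g = grad(x)
        b2 += g @ g                      # b_k^2 = b_{k-1}^2 + ||g_k||^2
        x = x - eta * g / np.sqrt(b2)    # x_{k+1} = x_k - (eta / b_k) g_k
    return x

# Illustrative use: minimize f(x) = 0.5 * ||A x - b||^2.
rng = np.random.default_rng(1)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
x_hat = adagrad_norm(lambda x: A.T @ (A @ x - b), np.zeros(5))
print(np.linalg.norm(A.T @ (A @ x_hat - b)))   # gradient norm near a stationary point
```

Because the accumulator grows whenever gradients are large, the effective step size eta/b_k shrinks automatically, which is the mechanism behind the robustness to step-size tuning discussed in the AdaGrad references above.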