Corpus ID: 235391082

A Second look at Exponential and Cosine Step Sizes: Simplicity, Adaptivity, and Performance

@inproceedings{Li2021ASL,
  title={A Second look at Exponential and Cosine Step Sizes: Simplicity, Adaptivity, and Performance},
  author={Xiaoyun Li and Zhenxun Zhuang and Francesco Orabona},
  booktitle={ICML},
  year={2021}
}
Stochastic Gradient Descent (SGD) is a popular tool in training large-scale machine learning models. Its performance, however, is highly variable, depending crucially on the choice of the step sizes. Accordingly, a variety of strategies for tuning the step sizes have been proposed, ranging from coordinate-wise approaches (a.k.a. “adaptive” step sizes) to sophisticated heuristics to change the step size in each iteration. In this paper, we study two step size schedules whose power has been… 
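The title refers to two classic decay rules: the exponential step size and the cosine (annealing) step size. As a rough, minimal sketch of what these schedules look like (the parameter names eta0, alpha, and T are my own; the paper's exact parameterization may differ):

import math

def exponential_step_size(eta0, alpha, t):
    # Exponential decay: eta_t = eta0 * alpha**t, with 0 < alpha < 1.
    return eta0 * alpha ** t

def cosine_step_size(eta0, t, T):
    # Cosine annealing: eta_t = eta0 / 2 * (1 + cos(pi * t / T)).
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / T))

# Example over a horizon of T = 100 iterations:
T = 100
for t in range(0, T + 1, 25):
    print(t, exponential_step_size(0.1, 0.95, t), cosine_step_size(0.1, t, T))

Both schedules decay from eta0 toward zero over the horizon; the exponential rule needs a decay factor alpha, while the cosine rule only needs the horizon T.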

Citations

On the Convergence of Step Decay Step-Size for Stochastic Optimization
TLDR: This work provides convergence results for step decay in the non-convex regime, ensuring that the gradient norm vanishes at an O(ln T/√T) rate, and provides near-optimal convergence guarantees for general, possibly non-smooth, convex and strongly convex problems.
On Uniform Boundedness Properties of SGD and its Momentum Variants
A theoretical, and potentially also practical, problem with stochastic gradient descent is that trajectories may escape to infinity. In this note, we investigate uniform boundedness properties of…
An Optimization Framework for Federated Edge Learning
TLDR: This paper presents a general FL algorithm, namely GenQSGD+, whose parameters include the numbers of global and local iterations, the mini-batch size, and the step size sequence, and analyzes the convergence of GenQSGD+ with arbitrary algorithm parameters.
Bandwidth-based Step-Sizes for Non-Convex Stochastic Optimization
TLDR: This work provides worst-case guarantees for SGD on smooth non-convex problems under several bandwidth-based step sizes, including stagewise 1/√t and the popular step decay (“constant and then drop by a constant”), which is shown to be optimal.
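The “constant and then drop by a constant” rule mentioned above is the usual step-decay schedule. A minimal sketch, with hypothetical parameter names (base step size eta0, drop factor gamma, stage length K):

def step_decay(eta0, gamma, K, t):
    # Keep the step size constant for K iterations, then multiply it by
    # gamma (0 < gamma < 1) at every stage boundary.
    return eta0 * gamma ** (t // K)

# Example: eta0 = 0.1 halved every 30 iterations gives
# 0.1 for t = 0..29, 0.05 for t = 30..59, 0.025 for t = 60..89, ...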

References

SHOWING 1-10 OF 76 REFERENCES
Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
TLDR: This work proposes to use line-search techniques to automatically set the step size when training models that can interpolate the data, and proves that SGD with a stochastic variant of the classic Armijo line search attains the deterministic convergence rates for both convex and strongly convex functions.
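As a rough illustration of a stochastic Armijo line search (a sketch under my own simplifications; the cited paper's procedure includes additional safeguards, such as resetting the trial step size, not shown here), the step size is backtracked on the current mini-batch until a sufficient-decrease condition holds:

import numpy as np

def armijo_sgd_step(w, X, y, loss_fn, grad_fn, eta_max=1.0, c=0.1, beta=0.5, max_backtracks=20):
    # Backtracking line search on the same mini-batch: shrink eta until
    #   loss(w - eta * g) <= loss(w) - c * eta * ||g||^2.
    g = grad_fn(w, X, y)
    f0 = loss_fn(w, X, y)
    eta = eta_max
    for _ in range(max_backtracks):
        if loss_fn(w - eta * g, X, y) <= f0 - c * eta * np.dot(g, g):
            break
        eta *= beta
    return w - eta * g, eta

# Example with a least-squares mini-batch.
def sq_loss(w, X, y):
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

def sq_grad(w, X, y):
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 5)), rng.normal(size=32)
w, eta = armijo_sgd_step(np.zeros(5), X, y, sq_loss, sq_grad)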
Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron
TLDR: It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly convex functions.
Stochastic algorithms with geometric step decay converge linearly on sharp functions
TLDR: For a large class of stochastic, sharp, non-smooth, and non-convex problems, a geometric step-decay schedule endows well-known algorithms with a local linear rate of convergence to global minimizers.
Stagewise Training Accelerates Convergence of Testing Error Over SGD
TLDR: This paper considers a stagewise training strategy for minimizing empirical risk that satisfies the Polyak-Łojasiewicz (PL) condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions.
Adam: A Method for Stochastic Optimization
TLDR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
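For reference, Adam's core update keeps exponential moving averages of the gradient and its elementwise square (the first and second moments) with bias correction; a minimal sketch with the commonly used default hyperparameters:

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # m, v: running first/second moment estimates; t: 1-based iteration count.
    m = beta1 * m + (1 - beta1) * g           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v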
Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
TLDR: This work provides a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent as well as a simple modification where iterates are averaged, suggesting that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate, is not robust to the lack of strong convexity or to the setting of the proportionality constant.
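The averaging referred to above is Polyak-Ruppert averaging: run SGD with a decaying step size and return the mean of the iterates rather than the last one. A minimal sketch (the names and the generic grad_fn interface are my own), with the step-size exponent left as a parameter so that the 1/t and 1/√t schedules can be compared:

import numpy as np

def averaged_sgd(grad_fn, w0, T, eta0=0.1, power=0.5):
    # Step size eta0 / t**power; power=1.0 is the 1/t schedule,
    # power=0.5 the more robust 1/sqrt(t) schedule.
    w = np.array(w0, dtype=float)
    w_avg = np.zeros_like(w)
    for t in range(1, T + 1):
        w = w - (eta0 / t ** power) * grad_fn(w)
        w_avg += (w - w_avg) / t              # running average of iterates
    return w, w_avg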
On the Convergence of Adam and Beyond
TLDR: It is shown that one cause of such failures is the exponential moving average used in the algorithms, and it is suggested that the convergence issues can be fixed by endowing such algorithms with “long-term memory” of past gradients.
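The “long-term memory” fix proposed in that paper (AMSGrad) keeps the elementwise maximum of all past second-moment estimates, so the effective per-coordinate step size never increases; a minimal sketch of the modified update (same notation as the Adam sketch above):

import numpy as np

def amsgrad_step(w, g, m, v, v_max, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_max = np.maximum(v_max, v)              # never forget past large gradients
    w = w - lr * m / (np.sqrt(v_max) + eps)   # divide by the running maximum
    return w, m, v, v_max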
Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition
TLDR: This work shows that the much older Polyak-Łojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years, leading to simple proofs of linear convergence of these methods.
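For concreteness, the PL inequality and the linear rate it yields for gradient descent on an L-smooth function can be stated as follows (a standard textbook form, not quoted from the cited paper):

  \tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr) \quad \text{for all } x, \text{ with } \mu > 0,\; f^* = \min_x f(x),

and gradient descent with step size 1/L then satisfies

  f(x_k) - f^* \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^*\bigr),

i.e. a linear (geometric) rate without requiring strong convexity.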
An Alternative View: When Does SGD Escape Local Minima?
TLDR: SGD will not get stuck at “sharp” local minima with small diameters, as long as the neighborhoods of these regions contain enough gradient information, which helps explain why SGD works so well for neural networks.
Deep Frank-Wolfe For Neural Network Optimization
TLDR: This work presents an optimization method based on a composite proximal framework that exploits the compositional nature of deep neural networks and can leverage powerful convex optimization algorithms by design, and employs the Frank-Wolfe algorithm for SVM, which computes an optimal step size in closed form at each time step.