# Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums

```bibtex
@article{Pan2021EigencurveOL,
  title   = {Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums},
  author  = {Ruimin Pan and Haishan Ye and Tong Zhang},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2110.14109}
}
```

Learning rate schedulers are widely adopted in training deep neural networks. Despite their practical importance, there is a discrepancy between how they are used in practice and what their theoretical analysis supports. For instance, it is not known which SGD schedule achieves the best convergence rate, even for simple problems such as minimizing quadratic objectives. So far, step decay has been one of the strongest candidates in this setting: it is proved to be nearly optimal, up to an O(log T) gap. However, according to…
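To make the setting concrete, the following is a minimal sketch of the step-decay schedule applied to SGD on a toy quadratic with a skewed Hessian spectrum. The specific function `step_decay_lr`, the eigenvalues, the initial step size, and the noise level are illustrative assumptions, not the paper's construction; the paper's proposed Eigencurve schedule is derived from the Hessian's eigenvalue distribution and is not reproduced here.

```python
import numpy as np

def step_decay_lr(eta0, t, T, num_stages=4, decay=0.5):
    """Step decay: hold the step size constant within each of num_stages
    equal-length stages, multiplying it by `decay` at each stage boundary.
    (Illustrative parameter choices, not the paper's.)"""
    stage = int(t * num_stages / T)
    return eta0 * (decay ** stage)

rng = np.random.default_rng(0)
H = np.diag([10.0, 1.0, 0.1])   # skewed Hessian spectrum (condition number 100)
x = np.ones(3)                  # f(x) = 0.5 * x^T H x, minimized at x = 0
T = 1000
for t in range(T):
    grad = H @ x + 0.01 * rng.standard_normal(3)  # unbiased noisy gradient
    x -= step_decay_lr(0.05, t, T) * grad

final_obj = float(0.5 * x @ H @ x)
print(final_obj)
```

The skew matters because a single decay rate must serve both the large eigendirection (which tolerates only small steps) and the small one (which needs many large steps to make progress); this tension is what a spectrum-aware schedule aims to resolve.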

