Corpus ID: 232092331

# Acceleration via Fractal Learning Rate Schedules

@inproceedings{Agarwal2021AccelerationVF,
title={Acceleration via Fractal Learning Rate Schedules},
author={Naman Agarwal and Surbhi Goel and Cyril Zhang},
booktitle={ICML},
year={2021}
}
• Published in ICML 1 March 2021
• Computer Science, Mathematics
In practical applications of iterative first-order optimization, the learning rate schedule remains notoriously difficult to understand and expensive to tune. We demonstrate the presence of these subtleties even in the innocuous case when the objective is a convex quadratic. We reinterpret an iterative algorithm from the numerical analysis literature as what we call the Chebyshev learning rate schedule for accelerating vanilla gradient descent, and show that the problem of mitigating…
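The abstract is truncated here, so what follows is only a minimal sketch of the classical Chebyshev step-size construction from numerical analysis, not necessarily the authors' exact formulation. It assumes a convex quadratic whose Hessian eigenvalues lie in a known interval `[m, M]` and uses as step sizes the reciprocals of the roots of the degree-`T` Chebyshev polynomial shifted to that interval. The function names (`chebyshev_step_sizes`, `gradient_descent`, `norm`) and the diagonal test problem are illustrative assumptions.

```python
import math


def chebyshev_step_sizes(m, M, T):
    """Return T step sizes 1/gamma_t, where gamma_t are the roots of the
    degree-T Chebyshev polynomial shifted from [-1, 1] to [m, M]."""
    return [
        1.0 / ((M + m) / 2 - (M - m) / 2 * math.cos((2 * t - 1) * math.pi / (2 * T)))
        for t in range(1, T + 1)
    ]


def gradient_descent(eigs, x0, step_sizes):
    """Vanilla gradient descent on f(x) = 0.5 * sum(eigs[i] * x[i]**2),
    whose gradient in coordinate i is eigs[i] * x[i]."""
    x = list(x0)
    for eta in step_sizes:
        x = [xi - eta * lam * xi for xi, lam in zip(x, eigs)]
    return x


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


# Hypothetical test problem: 50 eigenvalues spread evenly over [m, M].
m, M, T = 1.0, 100.0, 16
eigs = [m + (M - m) * i / 49 for i in range(50)]
x0 = [1.0] * 50

x_cheb = gradient_descent(eigs, x0, chebyshev_step_sizes(m, M, T))
x_const = gradient_descent(eigs, x0, [2.0 / (M + m)] * T)  # constant-step baseline

# Worst-case contraction after T Chebyshev steps: 1 / T_T((M + m) / (M - m)),
# using T_T(z) = cosh(T * acosh(z)) for z >= 1.
bound = 1.0 / math.cosh(T * math.acosh((M + m) / (M - m)))

print("Chebyshev schedule, final distance to optimum:", norm(x_cheb))
print("Constant step,      final distance to optimum:", norm(x_const))
print("Chebyshev worst-case bound:", bound * norm(x0))
```

The guarantee applies only to the iterate after all `T` steps; intermediate iterates can grow considerably, because some individual steps exceed the usual stable step size. This sketch applies the steps in plain increasing index order, which is adequate for a short run but is not a statement about the ordering studied in the paper.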
2 Citations

Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms
• Computer Science, Mathematics
• ArXiv
• 2021
It is proved that the generalization error of a stochastic optimization algorithm can be bounded based on the ‘complexity’ of the fractal structure that underlies its invariant measure.
Super-Acceleration with Cyclical Step-sizes
• Mathematics
• 2021
We develop a convergence-rate analysis of momentum with cyclical step-sizes. We show that under some assumption on the spectral gap of Hessians in machine learning, cyclical step-sizes are provably…

#### References

Super-Convergence with an Unstable Learning Rate
This note introduces a simple scenario where an unstable learning rate scheme leads to a super fast convergence, with the convergence rate depending only logarithmically on the condition number of the problem.
The order of choice of the iteration parameters in the cyclic Chebyshev iteration method
• Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki,
• 1971
Cyclical Learning Rates for Training Neural Networks
• Leslie N. Smith
• Computer Science
• 2017 IEEE Winter Conference on Applications of Computer Vision (WACV)
• 2017
A new method for setting the learning rate, named cyclical learning rates, is described, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates.
Iterative methods for optimization
• C. Kelley
• Mathematics, Computer Science
• Frontiers in applied mathematics
• 1999
Iterative Methods for Optimization does more than cover traditional gradient-based optimization: it is the first book to treat sampling methods, including the Hooke & Jeeves, implicit filtering, MDS, and Nelder & Mead schemes in a unified way.
Iterative methods for optimization. SIAM
• 1999
Acceleration Methods
• Mathematics, Computer Science
• ArXiv
• 2021
This monograph covers some recent advances on a range of acceleration techniques frequently used in convex optimization, and discusses restart schemes, a set of simple techniques to reach nearly optimal convergence rates while adapting to unobserved regularity parameters.
Characterizing Structural Regularities of Labeled Data in Overparameterized Models
• Computer Science, Mathematics
• ICML
• 2021
Two applications of C-scores are presented, understanding the dynamics of representation learning and filtering out outliers, and other potential applications such as curriculum learning and active data collection are discussed.
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
• Computer Science, Mathematics
• ICLR
• 2021
It is empirically demonstrated that full-batch gradient descent on neural network training objectives typically operates in a regime the authors call the Edge of Stability, which is inconsistent with several widespread presumptions in the field of optimization.
A Fast Anderson-Chebyshev Acceleration for Nonlinear Optimization
• Mathematics, Computer Science
• AISTATS
• 2020
It is shown that Anderson acceleration with Chebyshev polynomials can achieve the optimal convergence rate, which improves on the previous result $O(\kappa\ln\frac{1}{\epsilon})$ of Toth and Kelley (2015) for quadratic functions.
A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent
• Mathematics, Computer Science
• AISTATS
• 2020
A unified analysis is introduced for a large family of variants of proximal stochastic gradient descent, which so far have required different intuitions and convergence analyses, have different applications, and have been developed separately in various communities.