Corpus ID: 235458507

Super-Acceleration with Cyclical Step-sizes

@inproceedings{Goujaud2022SuperAccelerationWC,
  title={Super-Acceleration with Cyclical Step-sizes},
  author={Baptiste Goujaud and Damien Scieur and Aymeric Dieuleveut and Adrien B. Taylor and Fabian Pedregosa},
  booktitle={AISTATS},
  year={2022}
}
We develop a convergence-rate analysis of momentum with cyclical step-sizes. We show that, under an assumption on the spectral gap of Hessians arising in machine learning, cyclical step-sizes are provably faster than constant step-sizes. More precisely, we develop a convergence-rate analysis for quadratic objectives that provides optimal parameters and shows that cyclical learning rates can improve upon traditional lower complexity bounds. We further propose a systematic approach to design optimal… 
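The setting is heavy-ball momentum on quadratics whose Hessian eigenvalues split into clusters. Below is a minimal numerical sketch of the phenomenon under illustrative assumptions: the two-cluster spectrum and the period-2 cycle are hand-picked for this example and are not the paper's closed-form optimal parameters; the sketch only compares worst-case asymptotic rates via the spectral radius of the per-cycle transfer matrix.

```python
# Minimal sketch (hand-picked, illustrative parameters -- not the paper's optimal tuning):
# heavy-ball momentum on a quadratic whose Hessian eigenvalues form two clusters, comparing
# the worst-case asymptotic rate of (i) the classical constant-step tuning for [mu, L] and
# (ii) a period-2 step-size cycle roughly matched to the two clusters.
import numpy as np

def worst_case_rate(steps, momentum, eigenvalues):
    """Per-iteration contraction factor of heavy-ball cycling through `steps`:
    worst spectral radius of the cycle's transfer matrix over the given eigenvalues."""
    worst = 0.0
    for lam in eigenvalues:
        T = np.eye(2)
        for h in steps:  # transfer matrix of one full cycle of step-sizes
            T = np.array([[1 + momentum - h * lam, -momentum], [1.0, 0.0]]) @ T
        worst = max(worst, max(abs(np.linalg.eigvals(T))) ** (1.0 / len(steps)))
    return worst

# Two-cluster spectrum [1, 2] U [9, 10]; the gap between clusters is what cyclical steps exploit.
spectrum = np.concatenate([np.linspace(1.0, 2.0, 25), np.linspace(9.0, 10.0, 25)])
mu, L = 1.0, 10.0

# (i) Polyak's constant-step heavy-ball tuning, which only uses mu and L.
m_const = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
h_const = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
print("constant step :", worst_case_rate([h_const], m_const, spectrum))

# (ii) A hand-tuned period-2 cycle: one step aimed at each cluster's inner edge.
m_cyc = 0.18
print("cyclical steps:", worst_case_rate([(1 + m_cyc) / 2.0, (1 + m_cyc) / 9.0], m_cyc, spectrum))
```

On this toy spectrum the constant-step tuning gives a worst-case rate of about 0.52 per iteration, while the period-2 cycle reaches about 0.42, illustrating the speedup available when the spectrum has a gap.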

Citations

Fast Benchmarking of Accuracy vs. Training Time with Cyclic Learning Rates

TLDR
A multiplicative cyclic learning rate schedule can be used to construct a tradeoff curve in a single training run to evaluate the effects of algorithmic choices on network training efficiency.
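The cited paper defines its own schedule; the sketch below is only one plausible reading of "multiplicative cyclic", with the peak rate, decay factor, and cycle length chosen arbitrarily for illustration.

```python
# Illustrative sketch only: one way a "multiplicative cyclic" learning-rate schedule could
# look, with the rate decaying geometrically within each cycle and resetting at the start of
# the next. peak_lr, decay, and cycle_length are assumptions, not the cited paper's values.
def multiplicative_cyclic_lr(step, peak_lr=0.1, decay=0.9, cycle_length=100):
    """Learning rate at a given global step."""
    position_in_cycle = step % cycle_length
    return peak_lr * (decay ** position_in_cycle)

# Example: the schedule sampled over two cycles.
for t in range(0, 200, 25):
    print(t, round(multiplicative_cyclic_lr(t), 6))
```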

Branch-and-Bound Performance Estimation Programming: A Unified Methodology for Constructing Optimal Optimization Methods

TLDR
The BnB-PEP methodology is applied to several setups for which the prior methodologies do not apply and obtain methods with bounds that improve upon prior state-of-the-art results, thereby systematically generating analytical convergence proofs.

References

SHOWING 1-10 OF 36 REFERENCES

Acceleration via Fractal Learning Rate Schedules

TLDR
This work reinterprets an iterative algorithm from the numerical analysis literature as what it calls the Chebyshev learning rate schedule for accelerating vanilla gradient descent, and shows that the problem of mitigating instability leads to a fractal ordering of step sizes.
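For context, the sketch below computes the (unordered) Chebyshev step-sizes for gradient descent on a quadratic with eigenvalues in [mu, L]: the reciprocals of the Chebyshev nodes mapped to that interval. The cited paper's contribution is the stable, fractal ordering of these steps, which is not reproduced here.

```python
# A small sketch of the classical Chebyshev step-sizes for gradient descent on a quadratic
# with Hessian eigenvalues in [mu, L]. The cited paper shows that the *ordering* of these
# steps (a fractal permutation) is what keeps the iteration numerically stable.
import numpy as np

def chebyshev_steps(mu, L, n):
    """Reciprocals of the n Chebyshev nodes on [mu, L] (returned unordered)."""
    k = np.arange(1, n + 1)
    nodes = 0.5 * (L + mu) + 0.5 * (L - mu) * np.cos((2 * k - 1) * np.pi / (2 * n))
    return 1.0 / nodes

print(chebyshev_steps(mu=0.1, L=10.0, n=8))
```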

Super-Convergence with an Unstable Learning Rate

TLDR
This note introduces a simple scenario where an unstable learning rate scheme leads to super-fast convergence, with the convergence rate depending only logarithmically on the condition number of the problem.

Explaining the Adaptive Generalisation Gap

TLDR
It is demonstrated that typical schedules used for adaptive methods (with low numerical stability or damping constants) serve to bias movement towards flat directions relative to sharp directions, effectively amplifying the noise-to-signal ratio and harming generalisation.

The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size.

TLDR
The results corroborate previous findings, based on small-scale networks, that the Hessian exhibits "spiked" behavior, with several outliers isolated from a continuous bulk.

Cyclical Learning Rates for Training Neural Networks

  • L. Smith
  • 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 2017
TLDR
A new method for setting the learning rate, named cyclical learning rates, is described, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates.
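A sketch of the triangular policy commonly associated with this paper: the learning rate ramps linearly between a base and a maximum value over half-cycles of `step_size` iterations. The boundary values below are illustrative.

```python
# Triangular cyclical learning rate, as commonly implemented for this paper's policy:
# the rate rises from base_lr to max_lr over `step_size` iterations, then falls back.
import math

def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# Example: rate at the start, middle, and end of the first cycle.
print([round(triangular_clr(t), 5) for t in (0, 2000, 4000)])
```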

An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this, we study a…
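The paper's tool estimates the full spectral density with stochastic Lanczos quadrature; the sketch below only illustrates the underlying primitive, a Hessian-vector product (here via finite differences of the gradient of a toy quadratic) driving a power iteration for the top Hessian eigenvalue.

```python
# Not the paper's tool, just the basic primitive it builds on: Hessian-vector products,
# here by central finite differences of the gradient, used in a power iteration to
# estimate the largest Hessian eigenvalue of a toy quadratic loss 0.5 * w' A w.
import numpy as np

def grad(w):
    # Gradient of the toy loss 0.5 * w' A w (A is defined below).
    return A @ w

def hvp(w, v, eps=1e-5):
    """Hessian-vector product via a central finite difference of the gradient."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def top_eigenvalue(w, dim, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(n_iters):          # power iteration on the Hessian
        hv = hvp(w, v)
        v = hv / np.linalg.norm(hv)
    return v @ hvp(w, v)              # Rayleigh quotient at convergence

A = np.diag(np.linspace(0.1, 10.0, 50))   # toy Hessian with known spectrum
print("estimated top eigenvalue:", top_eigenvalue(np.zeros(50), dim=50))
```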

SGDR: Stochastic Gradient Descent with Warm Restarts

TLDR
This paper proposes a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks, and empirically studies its performance on the CIFAR-10 and CIFAR-100 datasets.
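A sketch of the cosine-annealing-with-warm-restarts schedule the paper proposes: within each run the rate follows half a cosine from eta_max down to eta_min, and the run length is multiplied by t_mult after every restart. The constants below are illustrative.

```python
# Cosine annealing with warm restarts: within a run of length T_i the rate is
# eta_min + 0.5*(eta_max - eta_min)*(1 + cos(pi * T_cur / T_i)); after each restart
# the run length is multiplied by t_mult. Constants here are illustrative.
import math

def sgdr_lr(epoch, eta_min=0.0, eta_max=0.1, t_0=10, t_mult=2):
    """Learning rate at a given (integer) epoch under warm restarts."""
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:               # locate the current restart period
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

print([round(sgdr_lr(e), 4) for e in (0, 5, 9, 10, 20, 29)])
```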

Global convergence of the Heavy-ball method for convex optimization

This paper establishes global convergence and provides global bounds of the rate of convergence for the Heavy-ball method for convex optimization. When the objective function has a Lipschitz-continuous gradient, …

Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

TLDR
A case is made that links the two observations: small- and large-batch gradient descent appear to converge to different basins of attraction but are in fact connected through their flat region, and so belong to the same basin.

On the importance of initialization and momentum in deep learning

TLDR
It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
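A sketch of the kind of slowly increasing momentum schedule the paper advocates; the specific formula and constants below are illustrative assumptions, not necessarily the paper's exact schedule.

```python
# Illustrative slowly increasing momentum schedule: the coefficient rises toward 1 over
# training in discrete stages but is capped at mu_max. Formula and constants are
# assumptions for illustration, not necessarily the cited paper's exact schedule.
def momentum_at(t, mu_max=0.99, ramp=250):
    """Momentum coefficient at iteration t: 0.5, 0.75, 0.833, ... capped at mu_max."""
    k = t // ramp
    return min(1.0 - 1.0 / (2 * (k + 1)), mu_max)

print([momentum_at(t) for t in (0, 250, 500, 1000, 10**5)])
```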