Corpus ID: 239998500

Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums

@article{pan2021eigencurve,
  title={Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums},
  author={Ruimin Pan and Haishan Ye and Tong Zhang}
}
Learning rate schedulers have been widely adopted in training deep neural networks. Despite their practical importance, there is a discrepancy between their use in practice and their theoretical analysis. For instance, it is not known which SGD schedules achieve the best convergence, even for simple problems such as optimizing quadratic objectives. So far, step decay has been one of the strongest candidates in this setting, and it has been proved to be nearly optimal, up to an O(log T) gap. However, according to…
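For concreteness, the step-decay schedule mentioned in the abstract can be sketched as follows; the constants are illustrative and not taken from the paper:

```python
def step_decay_lr(t, eta0=0.1, decay=0.5, interval=30):
    """Step-decay schedule: cut the learning rate by a constant factor
    every fixed number of steps.  All default values are illustrative."""
    return eta0 * decay ** (t // interval)
```

Geometric decay at fixed intervals is what the O(log T) near-optimality result for step decay refers to.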


The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure
This work examines the step-decay schedule for the stochastic optimization problem of streaming least squares regression (in both the non-strongly convex and strongly convex cases), and shows that a sharp theoretical characterization of the optimal learning rate schedule is far more nuanced than previous work suggests.
A Second look at Exponential and Cosine Step Sizes: Simplicity, Convergence, and Performance
Results show that the exponential and the cosine step sizes, even if only requiring at most two hyperparameters to tune, best or match the performance of various finely-tuned state-of-the-art strategies.
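The two step-size families compared in this work can be sketched as follows; the hyperparameter values (`eta0`, `alpha`, horizon `T`) are illustrative placeholders, not the tuned values from the paper:

```python
import math

def exponential_step(t, T, eta0=0.1, alpha=0.01):
    """Exponential step size: decays geometrically from eta0 toward alpha
    over a horizon of T steps.  Defaults are illustrative."""
    return eta0 * (alpha / eta0) ** (t / T)

def cosine_step(t, T, eta0=0.1):
    """Cosine step size: decays from eta0 to 0 following half a cosine wave."""
    return 0.5 * eta0 * (1 + math.cos(math.pi * t / T))
```

Each family needs at most two hyperparameters, which is the simplicity advantage the snippet above refers to.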
Competing with the Empirical Risk Minimizer in a Single Pass
This work provides a simple streaming algorithm which, under standard regularity assumptions on the underlying problem, enjoys the following property: the algorithm can be implemented in linear time with a single pass over the observed data, using space linear in the size of a single sample.
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which includes machine learning methods based on the minimization of the empirical risk.
Optimization Methods for Large-Scale Machine Learning
A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion about the next generation of optimization methods for large-scale machine learning.
SGDR: Stochastic Gradient Descent with Warm Restarts
This paper proposes a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks, and empirically studies its performance on the CIFAR-10 and CIFAR-100 datasets.
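The warm-restart idea can be sketched as cosine annealing whose cycle restarts at full learning rate, with the cycle length growing after each restart; the hyperparameters below (`eta_max`, `T0`, `mult`) are illustrative, not the paper's settings:

```python
import math

def sgdr_lr(t, eta_min=0.0, eta_max=0.1, T0=10, mult=2):
    """Cosine annealing with warm restarts (SGDR-style sketch).
    The cycle length starts at T0 and is multiplied by `mult` after
    each restart; within a cycle the rate follows a half cosine."""
    T_cur, T_i = t, T0
    while T_cur >= T_i:      # step past completed cycles
        T_cur -= T_i
        T_i *= mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * T_cur / T_i))
```

At each restart the learning rate jumps back to `eta_max`, which is the "warm restart" that improves anytime performance.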
A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation
Empirical analysis suggests that the reasons often quoted for the success of cosine annealing are not evidenced in practice, and that the effect of learning rate warmup is to prevent the deeper layers from creating training instability.
Parallelizing Stochastic Approximation Through Mini-Batching and Tail-Averaging
This work presents the first tight non-asymptotic generalization error bounds for these schemes for the stochastic approximation problem of least squares regression, and establishes a precise problem-dependent extent to which mini-batching can be used to yield provable near-linear parallelization speedups over SGD with batch size one.
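A toy version of the scheme analyzed here, constant-step-size mini-batch SGD with tail-averaging on a least squares problem, can be sketched as below; the function name, defaults, and problem setup are all illustrative:

```python
import numpy as np

def tail_averaged_minibatch_sgd(A, b, lr=0.01, batch=8, epochs=50,
                                tail_frac=0.5, seed=0):
    """Mini-batch SGD with a constant step size for min_x ||Ax - b||^2,
    returning the average of the last `tail_frac` fraction of iterates
    (tail-averaging).  A toy sketch; all defaults are illustrative."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    iterates = []
    for _ in range(epochs * (n // batch)):
        idx = rng.integers(0, n, size=batch)           # sample a mini-batch
        g = A[idx].T @ (A[idx] @ x - b[idx]) / batch   # stochastic gradient
        x = x - lr * g
        iterates.append(x)
    tail = iterates[int(len(iterates) * (1 - tail_frac)):]
    return np.mean(tail, axis=0)
```

Averaging only the tail of the iterates discards the early transient phase, which is what makes the constant-step-size scheme competitive in the analysis summarized above.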
On Exact Computation with an Infinitely Wide Neural Net
The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which it is called Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.
Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions
This work considers the least-squares regression problem and provides a detailed asymptotic analysis of the performance of averaged constant-step-size stochastic gradient descent, and provides an asymptotic expansion up to explicit exponentially decaying terms.