# Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression

@article{Wu2021LastIR, title={Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression}, author={Jingfeng Wu and Difan Zou and Vladimir Braverman and Quanquan Gu and Sham M. Kakade}, journal={ArXiv}, year={2021}, volume={abs/2110.06198} }

Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications. In practice, one often runs SGD with a geometrically decaying stepsize, i.e., a constant initial stepsize followed by several geometric stepsize decays, and uses the last iterate as the output. This kind of SGD is known to be nearly minimax optimal for classical finite-dimensional linear regression problems (Ge et al., 2019), and provably outperforms SGD with polynomially decaying…
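The schedule described above can be sketched in a few lines: run SGD at a constant stepsize for a phase, multiply the stepsize by a fixed factor after each phase, and report the last iterate rather than an average. The data model, stepsize values, decay factor, and phase lengths below are illustrative assumptions, not the paper's analyzed setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 500                          # overparameterized: d > n
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d)) / np.sqrt(d)  # rows have norm ~1, so eta < 2 is stable
y = X @ w_star + 0.1 * rng.normal(size=n)

def sgd_geometric_decay(X, y, eta0=0.5, decay=0.5, n_phases=5, phase_len=400):
    """SGD on least squares with a geometrically decaying stepsize.

    The stepsize is held constant within each phase and multiplied by
    `decay` between phases; the LAST iterate (not an average) is returned.
    """
    n, d = X.shape
    w = np.zeros(d)
    eta = eta0
    for _ in range(n_phases):
        for _ in range(phase_len):
            i = rng.integers(n)                  # sample one example
            grad = (X[i] @ w - y[i]) * X[i]      # stochastic gradient of 0.5*(x'w - y)^2
            w -= eta * grad
        eta *= decay                             # geometric stepsize decay
    return w

w_last = sgd_geometric_decay(X, y)
train_mse = np.mean((X @ w_last - y) ** 2)
```

In practice the decay factor and phase lengths are tuned; the theory in this line of work characterizes how the excess risk of the last iterate depends on such choices in the overparameterized regime.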

## One Citation

On the Double Descent of Random Features Models Trained with SGD

- Mathematics, Computer Science · ArXiv
- 2021

The theoretical results demonstrate that, with SGD training, RF regression still generalizes well for interpolation learning, and is able to characterize the double descent behavior by the unimodality of variance and monotonic decrease of bias.

## References

Showing 1–10 of 26 references

The Benefits of Implicit Regularization from SGD in Least Squares Problems

- Computer Science, Mathematics · ArXiv
- 2021

The results show that, up to the logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances.

Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification

- Computer Science, Mathematics · J. Mach. Learn. Res.
- 2017

A novel analysis is developed in bounding these operators to characterize the excess risk of communication efficient parallelization schemes such as model-averaging/parameter mixing methods, which are of broader interest in analyzing computational aspects of stochastic approximation.

Last iterate convergence of SGD for Least-Squares in the Interpolation regime

- Computer Science, Mathematics · ArXiv
- 2021

This work studies the noiseless model in the fundamental least-squares setup and gives explicit non-asymptotic convergence rates in the over-parameterized setting and leverage a fine-grained parameterization of the problem to exhibit polynomial rates that can be faster than $O(1/T)$.

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

- Mathematics, Computer Science · ICML
- 2012

This paper investigates the optimality of SGD in a stochastic setting, and shows that for smooth problems, the algorithm attains the optimal O(1/T) rate, however, for non-smooth problems the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

- Mathematics, Computer Science · J. Mach. Learn. Res.
- 2017

We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite…

Stochastic algorithms with geometric step decay converge linearly on sharp functions

- Mathematics, Computer Science · ArXiv
- 2019

For a large class of stochastic, sharp, nonsmooth, and nonconvex problems, a geometric step decay schedule endows well-known algorithms with a local linear rate of convergence to global minimizers.

The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure

- Mathematics, Computer Science · NeurIPS
- 2019

This work examines the step-decay schedule for the stochastic optimization problem of streaming least squares regression (both in the non-strongly convex and strongly convex case), where it shows that a sharp theoretical characterization of an optimal learning rate schedule is far more nuanced than suggested by previous work.

Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

- Computer Science, Mathematics · NIPS
- 2013

We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…

Benign overfitting in ridge regression

- Mathematics
- 2020

Classical learning theory suggests that strong regularization is needed to learn a class with large complexity. This intuition is in contrast with the modern practice of machine learning, in…

Benign overfitting in linear regression

- Computer Science, Mathematics · Proceedings of the National Academy of Sciences
- 2020

A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.