• Corpus ID: 238634521

Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression

@article{Wu2021LastIR,
  title={Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression},
  author={Jingfeng Wu and Difan Zou and Vladimir Braverman and Quanquan Gu and Sham M. Kakade},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.06198}
}
  • Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade
  • Published 12 October 2021
  • Computer Science, Mathematics
  • ArXiv
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications. In practice, one often runs SGD with a geometrically decaying stepsize, i.e., a constant initial stepsize followed by multiple geometric stepsize decays, and uses the last iterate as the output. This kind of SGD is known to be nearly minimax optimal for classical finite-dimensional linear regression problems (Ge et al., 2019), and provably outperforms SGD with polynomially decaying… 
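For context, the following is a minimal Python sketch of the stepsize schedule described in the abstract, not the paper's exact procedure: the stepsize stays constant within each phase, is multiplied by a decay factor between phases, and the last iterate (rather than an average) is returned. The function name, default constants, and the toy overparameterized problem are illustrative assumptions.

```python
import numpy as np


def sgd_geometric_decay(X, y, eta0=0.002, decay=0.5, num_phases=5, seed=0):
    """Sketch of SGD with geometrically decaying stepsize for least squares.

    The stepsize is held constant within each phase and multiplied by
    `decay` between phases; the last iterate is returned as the output.
    The initial stepsize should be small relative to typical ||x_i||^2.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                          # overparameterized regime: d may exceed n
    steps_per_phase = max(1, n // num_phases)
    eta = eta0
    for _ in range(num_phases):
        for _ in range(steps_per_phase):
            i = rng.integers(n)                      # draw one sample
            grad = (X[i] @ w - y[i]) * X[i]          # gradient of 0.5 * (x_i^T w - y_i)^2
            w -= eta * grad
        eta *= decay                                 # geometric stepsize decay between phases
    return w                                         # last iterate, no averaging


# Toy overparameterized problem (d > n) to exercise the sketch.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 200))
w_star = rng.standard_normal(200) / np.sqrt(200)
y = X @ w_star + 0.1 * rng.standard_normal(50)
w_hat = sgd_geometric_decay(X, y)
```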
1 Citation

On the Double Descent of Random Features Models Trained with SGD
TLDR
The theoretical results demonstrate that, with SGD training, RF regression still generalizes well for interpolation learning, and characterize the double descent behavior through the unimodality of the variance and the monotonic decrease of the bias.

References

SHOWING 1-10 OF 26 REFERENCES
The Benefits of Implicit Regularization from SGD in Least Squares Problems
TLDR
The results show that, up to logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances.
Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification
TLDR
A novel analysis is developed for bounding these operators to characterize the excess risk of communication-efficient parallelization schemes such as model-averaging/parameter-mixing methods, which are of broader interest in analyzing computational aspects of stochastic approximation.
Last iterate convergence of SGD for Least-Squares in the Interpolation regime
TLDR
This work studies the noiseless model in the fundamental least-squares setup, gives explicit non-asymptotic convergence rates in the over-parameterized setting, and leverages a fine-grained parameterization of the problem to exhibit polynomial rates that can be faster than $O(1/T)$.
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
TLDR
This paper investigates the optimality of SGD in a stochastic setting, and shows that for smooth problems the algorithm attains the optimal O(1/T) rate; however, for non-smooth problems the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.
Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite…
Stochastic algorithms with geometric step decay converge linearly on sharp functions
TLDR
For a large class of stochastic, sharp, nonsmooth, and nonconvex problems, a geometric step decay schedule endows well-known algorithms with a local linear rate of convergence to global minimizers.
The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure
TLDR
This work examines the step-decay schedule for the stochastic optimization problem of streaming least squares regression (in both the non-strongly convex and the strongly convex case), and shows that a sharp theoretical characterization of an optimal learning rate schedule is far more nuanced than suggested by previous work.
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…
Benign overfitting in ridge regression
Classical learning theory suggests that strong regularization is needed to learn a class with large complexity. This intuition is in contrast with the modern practice of machine learning, in…
Benign overfitting in linear regression
TLDR
A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.