Benign Overfitting of Constant-Stepsize SGD for Linear Regression
@article{Zou2021BenignOO,
  title   = {Benign Overfitting of Constant-Stepsize SGD for Linear Regression},
  author  = {Difan Zou and Jingfeng Wu and Vladimir Braverman and Quanquan Gu and Sham M. Kakade},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2103.12692}
}
There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting: constant-stepsize SGD (with iterate averaging) for linear regression in the overparameterized…
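As a rough illustration of the setting described in the abstract (a minimal sketch, not the authors' code; the dimensions, stepsize, and noise level are placeholder choices, and the data are assumed isotropic Gaussian), the following runs one pass of constant-stepsize SGD with tail averaging on a synthetic overparameterized least-squares instance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 500                              # overparameterized: d > n
w_star = rng.normal(size=d) / np.sqrt(d)     # ground-truth parameter
X = rng.normal(size=(n, d))                  # isotropic Gaussian features
y = X @ w_star + 0.1 * rng.normal(size=n)    # noisy labels

gamma = 0.5 / d        # constant stepsize; kept below 1/tr(H) for stability
w = np.zeros(d)
iterates = []
for t in range(n):     # single pass, each sample used once
    x_t, y_t = X[t], y[t]
    w -= gamma * (x_t @ w - y_t) * x_t       # SGD step on the squared loss
    iterates.append(w.copy())

w_bar = np.mean(iterates[n // 2:], axis=0)   # tail-averaged iterate
# With identity population covariance, the excess risk is ||w_bar - w_star||^2.
print("excess risk:", np.sum((w_bar - w_star) ** 2))
```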
26 Citations
Relaxing the Feature Covariance Assumption: Time-Variant Bounds for Benign Overfitting in Linear Regression
- Computer Science, ArXiv
- 2022
This paper generalizes the idea of benign overfitting from the min-norm solution to the whole training trajectory, derives a time-variant bound based on a trajectory analysis, and identifies a time interval that suffices to guarantee a consistent generalization error for a given feature covariance.
When do Models Generalize? A Perspective from Data-Algorithm Compatibility
- Computer Science
- 2022
This work theoretically studies compatibility in the setting of overparameterized linear regression solved with gradient descent, and demonstrates that, in the sense of compatibility, generalization holds under significantly weaker restrictions on the problem instance than in previous last-iterate analyses.
Provable Generalization of Overparameterized Meta-learning Trained with SGD
- Computer Science, ArXiv
- 2022
This paper studies the generalization of a widely used meta-learning approach, Model-Agnostic Meta-Learning (MAML), which aims to find a good initialization for fast adaptation to new tasks, under a mixed linear regression model.
Accelerated SGD for Non-Strongly-Convex Least Squares
- Computer Science, COLT
- 2022
This work presents the first practical algorithm that achieves the optimal prediction error rate in terms of dependence on the noise of the problem, O(d/t), while accelerating the forgetting of the initial conditions to O(d/t²).
On the Double Descent of Random Features Models Trained with SGD
- Computer Science, ArXiv
- 2021
The theoretical results demonstrate that, with SGD training, random features (RF) regression still generalizes well in the interpolation regime, and they characterize the double descent behavior through the unimodality of the variance and the monotonic decrease of the bias.
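For reference, a random features model of the kind studied there can be written (the normalization convention below is ours, not necessarily the paper's) as

$f(x) = \frac{1}{\sqrt{m}}\sum_{i=1}^{m} a_i\,\sigma(\langle w_i, x\rangle),$

where the features $w_1,\dots,w_m$ are drawn at random and kept fixed, and only the outer weights $a_1,\dots,a_m$ are trained with SGD.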
Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime
- Computer Science, ArXiv
- 2022
The goal of this paper is to sharply characterize the generalization of multi-pass SGD, by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance.
Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression
- Computer Science, ICML
- 2022
This paper provides a problem-dependent analysis of the last-iterate risk bounds of SGD with decaying stepsize for (overparameterized) linear regression problems, and proves nearly matching upper and lower bounds on the excess risk.
The Benefits of Implicit Regularization from SGD in Least Squares Problems
- Computer Science, NeurIPS
- 2021
The results show that, up to logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression for a wide range of overparameterized problems and, in fact, can be much better for some problem instances.
Tight Convergence Rate Bounds for Optimization Under Power Law Spectral Conditions
- Computer Science, ArXiv
- 2022
This paper performs a systematic study of a range of classical single-step and multi-step first-order optimization algorithms, with adaptive and non-adaptive, constant and non-constant learning rates, and proves that a power-law spectral assumption entails a power-law convergence rate for the algorithm.
The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift
- Computer Science, ArXiv
- 2022
It is shown that finetuning, even with only a small amount of target data, could drastically reduce the amount of source data required by pretraining, and the bounds suggest that, for a large class of linear regression instances, transfer learning with O(N²) source data is as effective as supervised learning with N target data.
References
SHOWING 1-10 OF 25 REFERENCES
Benign overfitting in ridge regression
- Computer Science
- 2020
This work provides non-asymptotic generalization bounds for overparametrized ridge regression that depend on the arbitrary covariance structure of the data, and shows that those bounds are tight for a range of regularization parameter values.
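For context (a standard definition, not a bound from that paper), the ridge regression estimator whose generalization such bounds control is

$\hat{w}_\lambda = \arg\min_{w} \|Xw - y\|_2^2 + \lambda \|w\|_2^2 = (X^\top X + \lambda I)^{-1} X^\top y.$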
Benign overfitting in linear regression
- Computer Science, Proceedings of the National Academy of Sciences
- 2020
A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
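As a reminder of the predictor studied there (again a standard definition), the minimum-norm interpolating solution in the overparameterized regime, where $XX^\top$ is invertible, is

$\hat{w} = \arg\min_{w}\{\|w\|_2 : Xw = y\} = X^\top (XX^\top)^{-1} y.$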
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
- Computer Science, Mathematics, NIPS
- 2013
We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…
A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)
- Computer Science, FSTTCS
- 2017
This work provides a simplified proof of the statistical minimax optimality of (iterate-averaged) stochastic gradient descent for the special case of least squares, by analyzing SGD as a stochastic process and sharply characterizing the stationary covariance matrix of this process.
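In that style of analysis, the centered SGD iterates for least squares form a Markov chain; under a standard homoscedastic-noise model (our restatement, not a quote from the paper), with $\eta_t = w_t - w^*$, $H = \mathbb{E}[xx^\top]$, labels $y_t = \langle x_t, w^*\rangle + \xi_t$, and noise variance $\sigma^2$, the recursion $\eta_t = (I - \gamma x_t x_t^\top)\eta_{t-1} + \gamma \xi_t x_t$ has a stationary covariance $S$ satisfying

$S = \mathbb{E}\big[(I - \gamma x x^\top)\, S\, (I - \gamma x x^\top)\big] + \gamma^2 \sigma^2 H.$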
High-Dimensional Probability: An Introduction with Applications in Data Science
- Mathematics
- 2020
Let us summarize our findings: a random projection of a set T in R^n onto an m-dimensional subspace approximately preserves the geometry of T if m ≳ d(T). For…
Finite-sample analysis of interpolating linear classifiers in the overparameterized regime
- Computer Science, Mathematics, J. Mach. Learn. Res.
- 2021
Bounds on the population risk of the maximum-margin algorithm for two-class linear classification are proved, and it is shown that, with sufficient overparameterization, this algorithm trained on noisy data can achieve nearly optimal population risk.
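For reference, the maximum-margin (hard-margin) classifier referred to there is, in its standard formulation,

$\hat{w} = \arg\min_{w} \|w\|_2 \quad \text{s.t. } y_i \langle x_i, w\rangle \ge 1 \text{ for all } i,$

which, for linearly separable data, gives the direction that maximizes the margin.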
Non-parametric Stochastic Approximation with Large Step sizes
- Computer Science, Mathematics
- 2014
In a stochastic approximation framework, it is shown that the averaged unregularized least-mean-squares algorithm, with a sufficiently large step-size, attains optimal rates of convergence across a variety of regimes for the smoothness of the optimal prediction function and of the functions in $\mathcal{H}$.
Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
- Computer Science, J. Mach. Learn. Res.
- 2017
We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite…
Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions
- Mathematics, Computer Science, AISTATS
- 2015
This work considers the least-squares regression problem and provides a detailed asymptotic analysis of the performance of averaged constant-step-size stochastic gradient descent, including an asymptotic expansion up to explicit exponentially decaying terms.
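For reference, the averaged constant-step-size least-mean-squares recursion discussed there can be written (a standard formulation, not copied from the paper) as

$\theta_t = \theta_{t-1} - \gamma\big(\langle x_t, \theta_{t-1}\rangle - y_t\big)x_t, \qquad \bar{\theta}_T = \frac{1}{T}\sum_{t=1}^{T}\theta_t.$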
Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification
- Computer Science, J. Mach. Learn. Res.
- 2017
A novel analysis is developed for bounding the relevant operators in order to characterize the excess risk of communication-efficient parallelization schemes such as model-averaging/parameter-mixing methods, which are of broader interest in analyzing computational aspects of stochastic approximation.