• Corpus ID: 232320373

# Benign Overfitting of Constant-Stepsize SGD for Linear Regression

@article{Zou2021BenignOO,
title={Benign Overfitting of Constant-Stepsize SGD for Linear Regression},
author={Difan Zou and Jingfeng Wu and Vladimir Braverman and Quanquan Gu and Sham M. Kakade},
journal={ArXiv},
year={2021},
volume={abs/2103.12692}
}
• Published 23 March 2021
• Computer Science
• ArXiv
There is an increasing realization that algorithmic inductive biases are central to preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting: constant-stepsize SGD (with iterate averaging) for linear regression in the overparameterized…
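The procedure the abstract names can be illustrated with a minimal NumPy sketch: constant-stepsize SGD on the squared loss, with a running average of the iterates as the final predictor. The problem sizes, stepsize, and noise level below are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic overparameterized linear regression: dimension d exceeds sample size n.
n, d = 50, 200
X = rng.standard_normal((n, d)) / np.sqrt(d)   # rows have roughly unit norm
w_star = rng.standard_normal(d)                # ground-truth parameter
y = X @ w_star + 0.1 * rng.standard_normal(n)  # noisy labels

def sgd_iterate_averaging(X, y, stepsize=0.5, n_iter=5000, seed=1):
    """Constant-stepsize SGD on 0.5*(x^T w - y)^2, returning the average of the iterates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)      # current iterate
    w_bar = np.zeros(d)  # running average of iterates
    for t in range(1, n_iter + 1):
        i = rng.integers(n)                  # sample one example uniformly
        grad = (X[i] @ w - y[i]) * X[i]      # stochastic gradient at (x_i, y_i)
        w -= stepsize * grad                 # constant stepsize, no decay
        w_bar += (w - w_bar) / t             # incremental iterate average
    return w_bar

w_avg = sgd_iterate_averaging(X, y)
train_mse = np.mean((X @ w_avg - y) ** 2)
```

The constant stepsize keeps individual iterates oscillating around a solution; averaging them damps that variance, which is what makes the scheme analyzable without explicit regularization.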

## Citations

• Computer Science • ArXiv • 2022
This paper generalizes the idea of benign overfitting to the whole training trajectory instead of the min-norm solution, derives a time-variant bound based on the trajectory analysis, and derives a time interval that suffices to guarantee a consistent generalization error for a given feature covariance.
• Computer Science • 2022
This work theoretically studies compatibility under the setting of solving overparameterized linear regression with gradient descent, and demonstrates that, in the sense of compatibility, generalization holds with significantly weaker restrictions on the problem instance than the previous last-iterate analysis.
• Yu Huang, Longbo Huang
• Computer Science • ArXiv • 2022
This paper studies the generalization of a widely used meta-learning approach, Model-Agnostic Meta-Learning (MAML), which aims to find a good initialization for fast adaptation to new tasks under a mixed linear regression model.
• Computer Science • COLT • 2022
This work presents the first practical algorithm that achieves the optimal prediction error rate in terms of its dependence on the noise of the problem, as O(d/t), while accelerating the forgetting of the initial conditions to O(d/t²).
• Computer Science • ArXiv • 2021
The theoretical results demonstrate that, with SGD training, RF regression still generalizes well in interpolation learning, and they characterize the double descent behavior through the unimodality of the variance and the monotonic decrease of the bias.
• Computer Science • ArXiv • 2022
The goal of this paper is to sharply characterize the generalization of multi-pass SGD, by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance.
• Computer Science • ICML • 2022
This paper provides a problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize, for (overparameterized) linear regression problems, and proves nearly matching upper and lower bounds on the excess risk.
• Computer Science • NeurIPS • 2021
The results show that, up to the logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances.
• Computer Science • ArXiv • 2022
This paper performs a systematic study of a range of classical single-step and multi-step first-order optimization algorithms, with adaptive and non-adaptive, constant and non-constant learning rates, and proves that a power-law spectral assumption entails a power-law convergence rate for the algorithm.
• Computer Science • ArXiv • 2022
It is shown that fine-tuning, even with only a small amount of target data, can drastically reduce the amount of source data required by pretraining; the bounds suggest that, for a large class of linear regression instances, transfer learning with O(N²) source data is as effective as supervised learning with N target data.

## References

Showing 1–10 of 25 references

• Computer Science • 2020
This work provides non-asymptotic generalization bounds for overparametrized ridge regression that depend on the arbitrary covariance structure of the data, and shows that those bounds are tight for a range of regularization parameter values.
• Computer Science • Proceedings of the National Academy of Sciences • 2020
A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
• Computer Science, Mathematics • NIPS • 2013
We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…
• Computer Science • FSTTCS • 2017
This work provides a simplified proof of the statistical minimax optimality of (iterate-averaged) stochastic gradient descent, for the special case of least squares, by analyzing SGD as a stochastic process and sharply characterizing the stationary covariance matrix of this process.
• © 2018, Cambridge University Press
A random projection of a set T in ℝⁿ onto an m-dimensional subspace approximately preserves the geometry of T if m ⪆ d(T)…
• Computer Science, Mathematics • J. Mach. Learn. Res. • 2021
Bounds on the population risk of the maximum margin algorithm for two-class linear classification are proved, and it is shown that, with sufficient over-parameterization, this algorithm trained on noisy data can achieve nearly optimal population risk.
• Computer Science, Mathematics • 2014
In a stochastic approximation framework, it is shown that the averaged unregularized least-mean-square algorithm, given a sufficiently large step-size, attains optimal rates of convergence for a variety of regimes for the smoothness of the optimal prediction function and the functions in $\mathcal{H}$.
• Computer Science • J. Mach. Learn. Res. • 2017
We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite…
• Mathematics, Computer Science • AISTATS • 2015
This work considers the least-squares regression problem and provides a detailed asymptotic analysis of the performance of averaged constant-step-size stochastic gradient descent, and provides an asymptotic expansion up to explicit exponentially decaying terms.
• Computer Science • J. Mach. Learn. Res. • 2017
A novel analysis is developed in bounding these operators to characterize the excess risk of communication efficient parallelization schemes such as model-averaging/parameter mixing methods, which are of broader interest in analyzing computational aspects of stochastic approximation.