# Benign Underfitting of Stochastic Gradient Descent

```bibtex
@article{Koren2022BenignUO,
  title={Benign Underfitting of Stochastic Gradient Descent},
  author={Tomer Koren and Roi Livni and Y. Mansour and Uri Sherman},
  journal={ArXiv},
  year={2022},
  volume={abs/2202.13361}
}
```

We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one-pass, without-replacement) SGD is classically known to minimize the population risk at rate O(1/√n), and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and…
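For concreteness, the one-pass, without-replacement SGD scheme referenced in the abstract can be sketched as follows. This is an illustrative least-squares instance with iterate averaging and a 1/√n step size, not the paper's construction; all names and parameter choices here are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_pass_sgd(X, y, lr):
    # One-pass, without-replacement SGD on the squared loss:
    # each of the n samples is visited exactly once in a random order,
    # and the running average of the iterates is returned, as in
    # classical O(1/sqrt(n)) analyses for stochastic convex optimization.
    n, d = X.shape
    w = np.zeros(d)
    avg = np.zeros(d)
    for t, i in enumerate(rng.permutation(n), start=1):
        grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x_i . w - y_i)^2
        w -= lr * grad
        avg += (w - avg) / t             # incremental average of iterates
    return avg

# Toy usage: noisy linear data, step size on the order of 1/sqrt(n).
n, d = 2000, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)
w_hat = one_pass_sgd(X, y, lr=1.0 / np.sqrt(n))
```

Returning the averaged rather than the last iterate is the standard choice in the O(1/√n) population-risk analyses the abstract alludes to.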

## 3 Citations

### Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks

- Computer Science, ArXiv
- 2022

This paper considers gradient descent and stochastic gradient descent for training shallow neural networks (SNNs), and for both it develops consistent excess risk bounds that balance optimization and generalization via early stopping, leveraging the concept of algorithmic stability.

### Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence

- Computer Science, ArXiv
- 2022

A new type of margin bound is proved, showing that above a certain signal-to-noise threshold any near-max-margin classifier will achieve almost no test loss in these two settings, providing insight into why memorization can coexist with generalization.

### Making Progress Based on False Discoveries

- Computer Science, ArXiv
- 2022

A generic reduction from the standard setting of statistical queries to the problem of estimating gradients queried by gradient descent is provided, in contrast with classical bounds showing that with O(1/ε²) samples one can optimize the population risk to accuracy O(ε) but, as it turns out, with spurious gradients.

## References

Showing 1-10 of 43 references

### Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses

- Computer Science, NeurIPS
- 2020

This work provides sharp upper and lower bounds for several forms of SGD and full-batch GD on arbitrary Lipschitz nonsmooth convex losses, and obtains the first dimension-independent generalization bounds for multi-pass SGD in the nonsmooth case.

### SGD Generalizes Better Than GD (And Regularization Doesn't Help)

- Computer Science, COLT
- 2021

It is shown that with the same number of steps GD may overfit and emit a solution with Ω(1) generalization error, and that regularizing the empirical risk minimized by GD essentially does not change this result.

### Random Reshuffling: Simple Analysis with Vast Improvements

- Computer Science, NeurIPS
- 2020

The theory for strongly convex objectives tightly matches the known lower bounds for both Random Reshuffling (RR) and Shuffle-Once (SO), substantiates the common practical heuristic of shuffling once or only a few times, and proves fast convergence of the Shuffle-Once algorithm, which shuffles the data only once.
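The two sampling schemes compared in this line of work differ only in when the permutation is drawn. The following sketch (an illustrative least-squares setup of our own, not taken from the cited paper) makes the distinction explicit: Random Reshuffling draws a fresh permutation every epoch, Shuffle-Once reuses a single permutation.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_epochs(X, y, lr, epochs, reshuffle):
    # Multi-epoch without-replacement SGD on the squared loss.
    # reshuffle=True  -> Random Reshuffling (fresh permutation each epoch);
    # reshuffle=False -> Shuffle-Once (one permutation, reused every epoch).
    n, d = X.shape
    w = np.zeros(d)
    order = rng.permutation(n)
    for _ in range(epochs):
        if reshuffle:
            order = rng.permutation(n)
        for i in order:
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

# Toy usage: noiseless linear data, so both variants can recover w_star.
n, d = 500, 4
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star
w_rr = sgd_epochs(X, y, lr=0.01, epochs=10, reshuffle=True)
w_so = sgd_epochs(X, y, lr=0.01, epochs=10, reshuffle=False)
```

On this noiseless instance both variants converge to the same solution; the cited results concern how their worst-case rates compare, not this toy behavior.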

### SGD without Replacement: Sharper Rates for General Smooth Convex Functions

- Computer Science, ICML
- 2019

The first non-asymptotic results for stochastic gradient descent without replacement applied to general smooth, strongly convex functions are provided, showing that SGD without replacement converges at a rate of O(1/K²) while with-replacement SGD is known to converge at an O(1/K) rate.

### Learnability, Stability and Uniform Convergence

- Computer Science, J. Mach. Learn. Res.
- 2010

This paper considers the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases, and identifies stability as the key necessary and sufficient condition for learnability.

### Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned Problems

- Computer Science, NeurIPS
- 2021

It is proved that, when the condition number is taken into account, without-replacement SGD does not significantly improve on with-replacement SGD in terms of worst-case bounds, unless the number of epochs (passes over the data) is larger than the condition number.

### Deep learning: a statistical viewpoint

- Computer Science, Acta Numerica
- 2021

This article surveys recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings, and focuses specifically on the linear regime for neural networks, where the network can be approximated by a linear model.

### Closing the convergence gap of SGD without replacement

- Computer Science, Mathematics, ICML
- 2020

It is shown that SGD without replacement achieves a rate of $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^2}{T^3}\right)$ when the sum of the functions is a quadratic, and a new lower bound of $\Omega\left(\frac{n}{T^2}\right)$ is offered for strongly convex functions that are sums of smooth functions.

### Stability and Generalization

- Computer Science, Mathematics, J. Mach. Learn. Res.
- 2002

These notions of stability for learning algorithms are defined and it is shown how to use these notions to derive generalization error bounds based on the empirical error and the leave-one-out error.

### The Implicit Bias of Benign Overfitting

- Computer Science, COLT
- 2022

It is shown that for regression, benign overfitting is "biased" towards certain types of problems, in the sense that its existence on one learning problem precludes its existence on other learning problems.