Corpus ID: 247158139

Benign Underfitting of Stochastic Gradient Descent

Tomer Koren, Roi Livni, Y. Mansour, Uri Sherman
We study to what extent stochastic gradient descent (SGD) may be understood as a “conventional” learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one-pass, without-replacement) SGD is classically known to minimize the population risk at rate O(1/√n), and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and…
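The abstract's setting is one-pass, without-replacement SGD: each of the n training points is visited exactly once in a random order, with step size on the order of 1/√n. The following is a minimal illustrative sketch of that update rule on a least-squares objective; the problem instance here is generic and is not the paper's construction.

```python
import numpy as np

# One-pass, without-replacement SGD on a least-squares objective.
# Each of the n samples is used exactly once, in a random order;
# the step size eta ~ 1/sqrt(n) matches the classical O(1/sqrt(n)) rate.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
eta = 1.0 / np.sqrt(n)
order = rng.permutation(n)           # visit each sample once, no replacement
for i in order:
    grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x_i . w - y_i)^2
    w = w - eta * grad

train_loss = 0.5 * np.mean((X @ w - y) ** 2)
```

The paper's point is that good population risk for this rule need not come from a good empirical fit, so the training loss of the returned iterate can behave very differently from what "fit the data, then generalize" reasoning would predict.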


Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks

This paper considers gradient descent and stochastic gradient descent for training shallow neural networks (SNNs), and for both develops consistent excess risk bounds by balancing optimization and generalization via early stopping, leveraging the concept of algorithmic stability.

Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence

A new type of margin bound is proved, showing that above a certain signal-to-noise threshold any near-max-margin classifier achieves almost no test loss in these two settings, which provides insight into why memorization can coexist with generalization.

Making Progress Based on False Discoveries

A generic reduction is provided from the standard statistical-queries setting to the problem of estimating the gradients queried by gradient descent; this contrasts with classical bounds showing that with O(1/ε²) samples one can optimize the population risk to accuracy O(ε) but, as it turns out, only with spurious gradients.

Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses

This work provides sharp upper and lower bounds for several forms of SGD and full-batch GD on arbitrary Lipschitz nonsmooth convex losses, and obtains the first dimension-independent generalization bounds for multi-pass SGD in the nonsmooth case.

SGD Generalizes Better Than GD (And Regularization Doesn't Help)

It is shown that, with the same number of steps, GD may overfit and output a solution with Ω(1) generalization error, and that regularizing the empirical risk minimized by GD essentially does not change this result.

Random Reshuffling: Simple Analysis with Vast Improvements

The theory for strongly convex objectives tightly matches the known lower bounds for both RR (Random Reshuffling) and SO (Shuffle-Once), proves fast convergence of the Shuffle-Once algorithm, which shuffles the data only once, and substantiates the common practical heuristic of shuffling once or only a few times.
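The difference between the two schemes compared above is only where the permutation is drawn: Random Reshuffling draws a fresh permutation every epoch, while Shuffle-Once permutes the data once and reuses that order. A minimal sketch on a least-squares problem, with illustrative names not taken from the paper:

```python
import numpy as np

def run_epochs(X, y, epochs, eta, reshuffle, seed=0):
    """Multi-epoch SGD; `reshuffle=True` gives Random Reshuffling (RR),
    `reshuffle=False` gives Shuffle-Once (SO)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    order = rng.permutation(n)        # SO: this order is reused every epoch
    for _ in range(epochs):
        if reshuffle:                 # RR: fresh permutation each epoch
            order = rng.permutation(n)
        for i in order:
            w -= eta * (X[i] @ w - y[i]) * X[i]
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])    # noiseless linear targets
w_rr = run_epochs(X, y, epochs=20, eta=0.05, reshuffle=True)
w_so = run_epochs(X, y, epochs=20, eta=0.05, reshuffle=False)
```

On this easy strongly convex instance both variants converge to the same solution; the cited analysis is about how tightly their worst-case rates match the known lower bounds.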

SGD without Replacement: Sharper Rates for General Smooth Convex Functions

The first non-asymptotic results are provided for stochastic gradient descent without replacement applied to general smooth, strongly convex functions, showing that without-replacement SGD converges at a rate of O(1/K²) while with-replacement SGD is known to converge at an O(1/K) rate.

Learnability, Stability and Uniform Convergence

This paper considers the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases, and identifies stability as the key necessary and sufficient condition for learnability.

Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned Problems

It is proved that, once the condition number is taken into account, without-replacement SGD does not significantly improve on with-replacement SGD in terms of worst-case bounds, unless the number of epochs (passes over the data) is larger than the condition number.

Deep learning: a statistical viewpoint

This article surveys recent progress in statistical learning theory, providing examples that illustrate these principles in simpler settings, and focuses specifically on the linear regime for neural networks, where the network can be approximated by a linear model.

Closing the convergence gap of SGD without replacement

It is shown that SGD without replacement achieves a rate of $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^2}{T^3}\right)$ when the sum of the functions is a quadratic, and a new lower bound of $\Omega\left(\frac{n}{T^2}\right)$ is offered for strongly convex functions that are sums of smooth functions.

Stability and Generalization

Notions of stability for learning algorithms are defined, and it is shown how to use them to derive generalization error bounds based on the empirical error and the leave-one-out error.

The Implicit Bias of Benign Overfitting

It is shown that for regression, benign overfitting is “biased” towards certain types of problems, in the sense that its existence on one learning problem precludes its existence on other learning problems.