# SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs

```bibtex
@article{Kale2021SGDTR,
  title   = {SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs},
  author  = {Satyen Kale and Ayush Sekhari and Karthik Sridharan},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2107.05074}
}
```

Multi-epoch, small-batch Stochastic Gradient Descent (SGD) has been the method of choice for learning with large over-parameterized models. A popular theory for explaining why SGD works well in practice is that the algorithm has an implicit regularization that biases its output towards a good solution. Perhaps the theoretically best understood learning setting for SGD is that of Stochastic Convex Optimization (SCO), where it is well known that SGD learns at a rate of O(1/√n), where n is the number of samples.
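To make the SCO setting concrete, here is a minimal sketch (not from the paper) of one-pass SGD on a toy one-dimensional stochastic convex objective F(w) = E[(w − z)²], using the standard step size of order 1/√n and returning the averaged iterate; the objective and all names are illustrative assumptions.

```python
import random

random.seed(0)

# One-pass SGD on the toy convex objective F(w) = E[(w - z)^2],
# whose population minimizer is E[z]. With step size ~ 1/sqrt(n)
# and iterate averaging, SGD attains O(1/sqrt(n)) excess risk in SCO.
def one_pass_sgd(samples, lr_scale=1.0):
    n = len(samples)
    eta = lr_scale / n ** 0.5          # step size tuned to the horizon n
    w, running_sum = 0.0, 0.0
    for z in samples:
        grad = 2.0 * (w - z)           # gradient of (w - z)^2 at w
        w -= eta * grad
        running_sum += w
    return running_sum / n             # averaged iterate (standard SCO output)

samples = [random.gauss(1.0, 0.5) for _ in range(10_000)]
w_hat = one_pass_sgd(samples)
# w_hat should land near the population minimizer E[z] = 1.0
```

The averaged iterate rather than the last iterate is returned because the classical O(1/√n) guarantee for SGD in SCO is stated for the average.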

#### References

Showing 1–10 of 32 references.

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

- Computer Science, Mathematics
- ICLR
- 2017

This work investigates the cause of the generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization.

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

- Computer Science, Mathematics
- ICML
- 2018

The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero, yet it remains unclear why these interpolating solutions perform well on test data.

An Alternative View: When Does SGD Escape Local Minima?

- Computer Science, Mathematics
- ICML
- 2018

SGD will not get stuck at "sharp" local minima with small diameters, as long as the neighborhoods of these regions contain enough gradient information; this helps explain why SGD works so well for neural networks.

SGD Converges to Global Minimum in Deep Learning via Star-convex Path

- Computer Science, Mathematics
- ICLR
- 2019

This analysis shows that SGD, although it has long been considered a randomized algorithm, converges in an intrinsically deterministic manner to a global minimum.

Stochastic Convex Optimization

- Mathematics, Computer Science
- COLT
- 2009

Stochastic convex optimization is studied, and it is shown that the key ingredient is strong convexity and regularization, which is a sufficient, but not necessary, condition for meaningful non-trivial learnability.

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

- Computer Science, Mathematics
- NeurIPS
- 2019

It is proved that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, using SGD (stochastic gradient descent) or its variants in polynomial time with polynomially many samples.

On the Convergence Rate of Training Recurrent Neural Networks

- Computer Science, Mathematics
- NeurIPS
- 2019

It is shown that when the number of neurons is sufficiently large, meaning polynomial in the training data size, SGD is capable of minimizing the regression loss at a linear convergence rate; this gives theoretical evidence of how RNNs can memorize data.

The Implicit Bias of Gradient Descent on Separable Data

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2018

We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard-margin SVM) solution.
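The implicit-bias phenomenon above can be illustrated with a minimal sketch (not the paper's construction): full-batch gradient descent on unregularized logistic loss over a small linearly separable 2-D dataset. The dataset below is an assumption, chosen symmetric about the x-axis so the max-margin direction through the origin is exactly (1, 0); the weight norm grows without bound while the direction stabilizes.

```python
import math

# Separable 2-D data: the closest pair to any origin-crossing separator
# is (1, 0) vs (-1, 0), so the max-margin direction is (1, 0).
X = [(1.0, 0.0), (3.0, 3.0), (-1.0, 0.0), (-3.0, 3.0)]
y = [1.0, 1.0, -1.0, -1.0]

w = [0.0, 0.0]
lr = 0.1
for _ in range(5_000):
    g = [0.0, 0.0]
    for (x1, x2), label in zip(X, y):
        margin = label * (w[0] * x1 + w[1] * x2)
        s = -label / (1.0 + math.exp(margin))   # d/d(margin) of log(1 + e^{-margin})
        g[0] += s * x1
        g[1] += s * x2
    w[0] -= lr * g[0] / len(X)
    w[1] -= lr * g[1] / len(X)

norm = math.hypot(w[0], w[1])
direction = (w[0] / norm, w[1] / norm)
# norm keeps growing (the loss has no finite minimizer on separable data),
# while direction approaches the max-margin direction (1, 0)
```

Because the loss is strictly decreasing along the max-margin ray, no explicit regularizer is needed: the bias toward the SVM direction comes entirely from the gradient dynamics.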

Understanding deep learning (still) requires rethinking generalization

- Computer Science
- Commun. ACM
- 2021

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and corroborate these experimental findings with a theoretical construction showing that simple depth-two neural networks already have perfect finite-sample expressivity.

Implicit Regularization in Deep Matrix Factorization

- Computer Science, Mathematics
- NeurIPS
- 2019

This work studies the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization, and finds that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions.