Corpus ID: 235794861

SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs

@article{Kale2021SGDTR,
  title={SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs},
  author={Satyen Kale and Ayush Sekhari and Karthik Sridharan},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.05074}
}
Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) has been the method of choice for learning with large over-parameterized models. A popular theory for explaining why SGD works well in practice is that the algorithm has an implicit regularization that biases its output towards a good solution. Perhaps the theoretically most well understood learning setting for SGD is that of Stochastic Convex Optimization (SCO), where it is well known that SGD learns at a rate of O(1/√n), where n is the number of samples.
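As context for the O(1/√n) rate quoted above, here is a minimal illustrative sketch (not code from the paper; the toy least-squares objective, step-size schedule, and all constants are arbitrary choices for the demo) of single-pass averaged SGD on a stochastic convex problem, the setting in which that classical guarantee holds.

# Minimal sketch: one-pass averaged SGD on a toy stochastic least-squares
# problem. Under standard convexity assumptions, the averaged iterate's
# excess risk after one pass over n fresh samples scales roughly as 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 10_000
w_true = rng.normal(size=d)

def sample_batch(batch_size):
    # Fresh samples from the data distribution: y = <w_true, x> + noise.
    X = rng.normal(size=(batch_size, d))
    y = X @ w_true + 0.1 * rng.normal(size=batch_size)
    return X, y

def one_pass_sgd(n_samples, batch_size=1, lr=0.1):
    # Single epoch over n_samples fresh examples with a 1/sqrt(t) step size,
    # returning the running average of the iterates.
    w = np.zeros(d)
    avg = np.zeros(d)
    steps = n_samples // batch_size
    for t in range(1, steps + 1):
        X, y = sample_batch(batch_size)
        grad = X.T @ (X @ w - y) / batch_size   # gradient of 0.5 * mean squared error
        w -= lr / np.sqrt(t) * grad
        avg += (w - avg) / t                     # running average of w_1, ..., w_t
    return avg

w_hat = one_pass_sgd(n, batch_size=1)
X_test, _ = sample_batch(2000)
print("excess risk estimate:", 0.5 * np.mean((X_test @ (w_hat - w_true)) ** 2))

The 1/sqrt(t) step size and iterate averaging are the standard ingredients behind the classical O(1/√n) single-pass guarantee; small-batch, multi-epoch training over a fixed dataset, the regime named in the paper's title, departs from this baseline.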


References

Showing 1-10 of 32 references
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
TLDR
This work investigates the cause of the generalization drop in the large-batch regime and presents numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization.
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
TLDR
The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero, yet it remains unclear why these interpolating solutions perform well on test data.
An Alternative View: When Does SGD Escape Local Minima?
TLDR
SGD will not get stuck at "sharp" local minima with small diameters, as long as the neighborhoods of these regions contain enough gradient information, which helps explain why SGD works so well for neural networks.
SGD Converges to Global Minimum in Deep Learning via Star-convex Path
TLDR
This analysis shows that SGD, although it has long been considered a randomized algorithm, converges in an intrinsically deterministic manner to a global minimum.
Stochastic Convex Optimization
TLDR
Stochastic convex optimization is studied, and it is shown that the key ingredients are strong convexity and regularization, which are sufficient, but not necessary, conditions for meaningful non-trivial learnability.
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
TLDR
It is proved that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, via SGD (stochastic gradient descent) or its variants, in polynomial time and using polynomially many samples.
On the Convergence Rate of Training Recurrent Neural Networks
TLDR
It is shown that when the number of neurons is sufficiently large, meaning polynomial in the training data size, SGD is capable of minimizing the regression loss at a linear convergence rate, which gives theoretical evidence of how RNNs can memorize data.
The Implicit Bias of Gradient Descent on Separable Data
We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution.
Understanding deep learning (still) requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data; these experimental findings are corroborated with a theoretical construction showing that simple depth-two neural networks already have perfect finite-sample expressivity.
Implicit Regularization in Deep Matrix Factorization
TLDR
This work studies the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization, and finds that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions.