• Corpus ID: 212737100

The Implicit Regularization of Stochastic Gradient Flow for Least Squares

@inproceedings{Ali2020TheIR,
title={The Implicit Regularization of Stochastic Gradient Flow for Least Squares},
author={Alnur Ali and Edgar Dobriban and Ryan J. Tibshirani},
booktitle={International Conference on Machine Learning},
year={2020}
}
• Published in International Conference on Machine Learning, 17 March 2020
• Mathematics, Computer Science
We study the implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression. We leverage a continuous-time stochastic differential equation having the same moments as stochastic gradient descent, which we call stochastic gradient flow. We give a bound on the excess risk of stochastic gradient flow at time $t$, over ridge regression with tuning parameter $\lambda = 1/t$. The bound may be computed from explicit constants (e.g…
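The abstract's headline comparison — stochastic gradient flow at time $t$ versus ridge regression with $\lambda = 1/t$ — can be checked numerically for plain (deterministic) gradient flow on least squares, whose solution path has a closed form. This is only an illustrative sketch, not the paper's experiment; the data, dimensions, and time $t$ below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

S = X.T @ X / n                  # sample covariance
b = X.T @ y / n

def grad_flow(t):
    # Closed-form gradient-flow path for least squares, started at zero:
    # beta(t) = V diag((1 - exp(-t*w)) / w) V^T b, where S = V diag(w) V^T.
    w, V = np.linalg.eigh(S)
    return V @ (((1.0 - np.exp(-t * w)) / w) * (V.T @ b))

def ridge(lam):
    # Ridge estimate (S + lam * I)^{-1} b
    return np.linalg.solve(S + lam * np.eye(p), b)

t = 10.0
gap = np.linalg.norm(grad_flow(t) - ridge(1.0 / t))
print(gap)  # the two estimates roughly agree along the path
```

As $t \to \infty$ both paths converge to the least-squares solution; the paper's contribution is a finite-$t$ excess-risk bound for the stochastic version of this flow.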

Figures from this paper

• Computer Science
ICLR
• 2021
This work indicates that backward error analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter regularization interact to determine the properties of overparameterized models optimized with gradient descent.
• Computer Science
NeurIPS
• 2020
The theory is applied to recover, in a clean and elegant manner via rather short proofs, some of the recent results in the implicit regularization literature, while also showing how to improve upon them in some settings.
• Computer Science
ArXiv
• 2020
This paper makes an initial attempt to characterize the particular regularization effect of SGD in the moderate learning rate regime by studying its behavior for optimizing an overparameterized linear regression problem, and shows that directional bias does matter when early stopping is adopted.
By comparing the random learning rate protocol with cyclic and constant protocols, it is suggested that the random choice is generically the best strategy in the small learning rate regime, yielding better regularization without extra computational cost.
• Computer Science
NeurIPS
• 2020
This work applies mirror descent to the unconstrained empirical risk minimization problem (batch setting), using the square loss and square measurements, and provides a convergence analysis of the algorithm in this non-convex setting and proves that, with the hypentropy mirror map, mirror descent recovers any $k$-sparse vector.
• Computer Science
ICML
• 2021
It is claimed that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters $\eta$ and $b$, the SGD iterates will converge to a stationary distribution, and these results are the first of their kind to rigorously characterize the empirically observed heavy-tailed behavior of SGD.
• Wei Huang
• Computer Science
ArXiv
• 2020
This work characterizes the implicit bias of deep linear networks for binary classification with the logistic loss in the large learning rate regime, inspired by the seminal work of Lewkowycz et al. in a regression setting with squared loss, and proves that flatter minima in the space spanned by non-separable data, together with a learning rate in the catapult phase, can lead to better generalization empirically.
• Computer Science
NeurIPS
• 2020
The results suggest that, rather than perceiving the implicit regularization via norms, a potentially more useful interpretation is minimization of rank; it is demonstrated empirically that this interpretation extends to a certain class of non-linear neural networks and hypothesized that it may be key to explaining generalization in deep learning.
• Computer Science
ICML
• 2021
Motivated by tensor rank capturing the implicit regularization of a non-linear neural network, this work empirically explores it as a measure of complexity and finds that it captures the essence of datasets on which neural networks generalize, suggesting that tensor rank may pave the way to explaining both implicit regularization in deep learning and how the properties of real-world data translate this implicit regularization into generalization.
We present some observations about neural networks that are, on the one hand, the result of fairly trivial algebraic manipulations, and on the other hand, potentially noteworthy and deserving of…
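The hypentropy mirror-descent result summarized above lends itself to a short sketch. Everything here is an illustrative assumption — plain linear measurements rather than the cited work's squared measurements, arbitrary problem sizes, and hand-picked `beta` and `eta` — meant only to show the hypentropy mirror map and its $\ell_1$-like bias toward sparse interpolants:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 40, 100, 3                     # m measurements, n dims, k-sparse signal
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_star = np.zeros(n)
x_star[:k] = 1.0
y = A @ x_star                           # noiseless underdetermined system

beta = 1e-3                              # hypentropy parameter; small beta ~ l1-like bias
eta = 0.1
# Hypentropy mirror map: grad phi(x) = arcsinh(x / beta), inverse x = beta * sinh(theta).
theta = np.zeros(n)                      # start at x = 0
for _ in range(20000):
    x = beta * np.sinh(theta)
    theta -= eta * A.T @ (A @ x - y)     # mirror step on f(x) = 0.5 * ||Ax - y||^2
x = beta * np.sinh(theta)

print(np.linalg.norm(A @ x - y))         # residual of the recovered interpolant
print(np.sort(np.abs(x))[-k:])           # largest coordinates land on the true support
```

Shrinking `beta` pushes the implicit bias from $\ell_2$-like toward $\ell_1$-like, which is why the recovered interpolant concentrates on the sparse support.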

References

SHOWING 1-10 OF 104 REFERENCES

• Mathematics, Computer Science
AISTATS
• 2019
The primary focus is to compare the risk of gradient flow to that of ridge regression, and it is proved that the same relative risk bound holds for prediction risk, in an average sense over the underlying signal $\beta_0$.
• Computer Science
J. Mach. Learn. Res.
• 2017
We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite…
• Computer Science, Mathematics
The Annals of Statistics
• 2020
It is shown that Richardson-Romberg extrapolation may be used to get closer to the global optimum and an explicit asymptotic expansion of the moments of the averaged SGD iterates that outlines the dependence on initial conditions, the effect of noise and the step-size, as well as the lack of convergence in the general case.
• Computer Science, Mathematics
NIPS
• 2013
We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…
• Computer Science
J. Mach. Learn. Res.
• 2016
This work considers the square loss and shows that for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite sample bounds can be achieved by early-stopping.
• Computer Science, Mathematics
• 2014
The theoretical analysis provides the first full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds, and suggests that implicit stochastic gradient descent procedures are poised to become a workhorse for approximate inference from large data sets.
• Computer Science, Mathematics
COLT
• 2017
The present work provides a nonasymptotic analysis in the context of non-convex learning problems, giving finite-time guarantees for SGLD to find approximate minimizers of both empirical and population risks.
• Computer Science
Annals of Mathematical Sciences and Applications
• 2019
It is proved rigorously that the diffusion process can approximate the SGD algorithm weakly using the weak form of master equation for probability evolution, and the effects of batch size for the deep neural networks are discussed, finding that small batch size is helpful for SGD algorithms to escape unstable stationary points and sharp minimizers.
• Mathematics
AISTATS
• 2014
This work considers the least-squares regression problem and provides a detailed asymptotic analysis of the performance of averaged constant-step-size stochastic gradient descent (a.k.a. least-mean-squares), giving an asymptotic expansion up to explicit exponentially decaying terms.
• Computer Science
J. Mach. Learn. Res.
• 2017
A novel analysis is developed in bounding these operators to characterize the excess risk of communication efficient parallelization schemes such as model-averaging/parameter mixing methods, which are of broader interest in analyzing computational aspects of stochastic approximation.
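Several of the references above concern averaged constant-step-size SGD for least squares. A minimal sketch of Polyak-Ruppert averaging follows; the problem size, noise level, and step size are arbitrary illustrative choices, not taken from any of the cited analyses:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5000, 10
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

eta = 0.02                       # constant step size
beta_sgd = np.zeros(p)
avg = np.zeros(p)
for i, idx in enumerate(rng.permutation(n)):
    xi, yi = X[idx], y[idx]
    beta_sgd -= eta * (xi @ beta_sgd - yi) * xi   # single-sample gradient step
    avg += (beta_sgd - avg) / (i + 1)             # running Polyak-Ruppert average

err_last = np.linalg.norm(beta_sgd - beta_true)
err_avg = np.linalg.norm(avg - beta_true)
print(err_last, err_avg)
```

With a constant step the last iterate fluctuates around the optimum at a noise floor proportional to the step size, while the averaged iterate damps that fluctuation — the phenomenon the averaged-SGD references above analyze precisely.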