The Implicit Regularization of Stochastic Gradient Flow for Least Squares
@inproceedings{Ali2020TheIR, title={The Implicit Regularization of Stochastic Gradient Flow for Least Squares}, author={Alnur Ali and Edgar Dobriban and Ryan J. Tibshirani}, booktitle={International Conference on Machine Learning}, year={2020} }
We study the implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression. We leverage a continuous-time stochastic differential equation having the same moments as stochastic gradient descent, which we call stochastic gradient flow. We give a bound on the excess risk of stochastic gradient flow at time $t$, over ridge regression with tuning parameter $\lambda = 1/t$. The bound may be computed from explicit constants (e.g…
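As a rough numerical illustration of this correspondence (a sketch, not the paper's experiments; the problem sizes, step size, batch size, and variable names below are arbitrary choices), the following snippet runs mini-batch SGD on a simulated least squares problem for total time t = n_steps * step_size and compares the resulting iterate to ridge regression tuned at $\lambda = 1/t$:

```python
# Hypothetical illustration: mini-batch SGD on least squares vs. ridge at lambda = 1/t.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p)
y = X @ beta_star + rng.standard_normal(n)

step_size, batch_size, n_steps = 0.01, 25, 2000
beta = np.zeros(p)                      # SGD initialized at zero
for _ in range(n_steps):
    idx = rng.choice(n, size=batch_size, replace=False)
    # gradient of (1/(2b)) * ||y_B - X_B beta||^2 on the mini-batch B
    grad = X[idx].T @ (X[idx] @ beta - y[idx]) / batch_size
    beta -= step_size * grad

t = n_steps * step_size                 # elapsed "time" of the stochastic gradient flow
lam = 1.0 / t                           # ridge tuning suggested by the theory
beta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

print("relative distance to ridge:",
      np.linalg.norm(beta - beta_ridge) / np.linalg.norm(beta_ridge))
```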
55 Citations
Implicit Gradient Regularization
- Computer Science, ICLR
- 2021
This work indicates that backward error analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter regularization interact to determine the properties of overparameterized models optimized with gradient descent.
The Statistical Complexity of Early Stopped Mirror Descent
- Computer Science, NeurIPS
- 2020
The theory is applied to recover, in a clean and elegant manner via rather short proofs, some of the recent results in the implicit regularization literature, while also showing how to improve upon them in some settings.
Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate
- Computer Science, arXiv
- 2020
This paper makes an initial attempt to characterize the particular regularization effect of SGD in the moderate learning rate regime by studying its behavior for optimizing an overparameterized linear regression problem, and shows that directional bias does matter when early stopping is adopted.
Stochastic gradient descent with random learning rate
- Computer Science, arXiv
- 2020
By comparing the random learning rate protocol with cyclic and constant protocols, it is suggested that the random choice is generically the best strategy in the small learning rate regime, yielding better regularization without extra computational cost.
A Continuous-Time Mirror Descent Approach to Sparse Phase Retrieval
- Computer Science, NeurIPS
- 2020
This work applies mirror descent to the unconstrained empirical risk minimization problem (batch setting), using the square loss and squared measurements, provides a convergence analysis of the algorithm in this non-convex setting, and proves that, with the hypentropy mirror map, mirror descent recovers any $k$-sparse vector.
The Heavy-Tail Phenomenon in SGD
- Computer Science, ICML
- 2021
Depending on the structure of the Hessian of the loss at the minimum and the choices of the algorithm parameters $\eta$ (step size) and $b$ (batch size), the SGD iterates are shown to converge to a heavy-tailed stationary distribution; these results are the first of their kind to rigorously characterize the empirically observed heavy-tailed behavior of SGD.
Implicit bias of deep linear networks in the large learning rate phase
- Computer Science, arXiv
- 2020
This work characterizes the implicit bias of deep linear networks for binary classification with the logistic loss in the large learning rate regime, inspired by the seminal work of Lewkowycz et al. in a regression setting with squared loss, and shows that flatter minima in the space spanned by the non-separable data, together with learning rates in the catapult phase, can lead to better generalization empirically.
Implicit Regularization in Deep Learning May Not Be Explainable by Norms
- Computer Science, NeurIPS
- 2020
The results suggest that, rather than viewing the implicit regularization through norms, a potentially more useful interpretation is minimization of rank; it is demonstrated empirically that this interpretation extends to a certain class of non-linear neural networks, and it is hypothesized that rank minimization may be key to explaining generalization in deep learning.
Implicit Regularization in Tensor Factorization
- Computer Science, ICML
- 2021
Motivated by tensor rank capturing the implicit regularization of a non-linear neural network, this work empirically explores tensor rank as a measure of complexity and finds that it captures the essence of datasets on which neural networks generalize, suggesting that tensor rank may pave the way to explaining both implicit regularization in deep learning and the properties of real-world data that translate this implicit regularization to generalization.
Equivalences Between Sparse Models and Neural Networks
- Computer Science
- 2021
We present some observations about neural networks that are, on the one hand, the result of fairly trivial algebraic manipulations, and on the other hand, potentially noteworthy and deserving of…
References
Showing 1-10 of 104 references
A Continuous-Time View of Early Stopping for Least Squares Regression
- Mathematics, Computer Science, AISTATS
- 2019
The primary focus is to compare the risk of gradient flow to that of ridge regression under the calibration $\lambda = 1/t$; a relative risk bound is proved for estimation risk, and the same relative bound is shown to hold for prediction risk, in an average sense over the underlying signal $\beta_0$.
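For context on that comparison, the closed forms underlying it (a standard derivation for the least squares loss $f(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2$, stated here under the convention that gradient flow starts at $\beta(0) = 0$) are
$$
\frac{d\beta(t)}{dt} = -\nabla f(\beta(t)) = \frac{1}{n}X^\top\big(y - X\beta(t)\big)
\;\;\Longrightarrow\;\;
\hat\beta^{\mathrm{gf}}(t) = (X^\top X)^{+}\big(I - e^{-tX^\top X/n}\big)X^\top y,
$$
$$
\hat\beta^{\mathrm{ridge}}(\lambda) = (X^\top X + n\lambda I)^{-1}X^\top y,
$$
so that for small $t$ (equivalently, large $\lambda = 1/t$) both estimators are approximately $t\,X^\top y/n$, and the relative risk bounds above compare them along the calibration $\lambda = 1/t$ more generally.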
Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
- Computer Science, J. Mach. Learn. Res.
- 2017
We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite…
Bridging the gap between constant step size stochastic gradient descent and Markov chains
- Computer Science, Mathematics, The Annals of Statistics
- 2020
It is shown that Richardson-Romberg extrapolation may be used to get closer to the global optimum, and an explicit asymptotic expansion of the moments of the averaged SGD iterates is derived, outlining the dependence on initial conditions, the effect of noise and the step size, as well as the lack of convergence in the general case.
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
- Computer Science, Mathematics, NIPS
- 2013
We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…
Optimal Rates for Multi-pass Stochastic Gradient Methods
- Computer Science, J. Mach. Learn. Res.
- 2016
This work considers the square loss and shows that for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite sample bounds can be achieved by early-stopping.
Asymptotic and finite-sample properties of estimators based on stochastic gradients
- Computer Science, Mathematics
- 2014
The theoretical analysis provides the first full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds, and suggests that implicit stochastic gradient descent procedures are poised to become a workhorse for approximate inference from large data sets.
Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis
- Computer Science, Mathematics, COLT
- 2017
The present work provides a nonasymptotic analysis in the context of non-convex learning problems, giving finite-time guarantees for SGLD to find approximate minimizers of both empirical and population risks.
On the diffusion approximation of nonconvex stochastic gradient descent
- Computer Science, Annals of Mathematical Sciences and Applications
- 2019
It is proved rigorously that the diffusion process can approximate the SGD algorithm weakly, using the weak form of the master equation for probability evolution; the effects of batch size for deep neural networks are also discussed, finding that a small batch size helps SGD escape unstable stationary points and sharp minimizers.
Constant Step Size Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions
- Mathematics, AISTATS
- 2014
This work considers the least-squares regression problem and provides a detailed asymptotic analysis of the performance of averaged constant-step-size stochastic gradient descent (a.k.a. least-mean-squares), including an asymptotic expansion up to explicit exponentially decaying terms.
Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification
- Computer Science, J. Mach. Learn. Res.
- 2017
A novel analysis is developed, bounding the operators that govern the bias and variance of mini-batch SGD in order to characterize the excess risk of communication-efficient parallelization schemes such as model-averaging/parameter-mixing methods; these bounds are of broader interest in analyzing computational aspects of stochastic approximation.