Corpus ID: 212737100

The Implicit Regularization of Stochastic Gradient Flow for Least Squares

Alnur Ali, Edgar Dobriban, Ryan J. Tibshirani · International Conference on Machine Learning
We study the implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression. We leverage a continuous-time stochastic differential equation having the same moments as stochastic gradient descent, which we call stochastic gradient flow. We give a bound on the excess risk of stochastic gradient flow at time $t$, over ridge regression with tuning parameter $\lambda = 1/t$. The bound may be computed from explicit constants (e.g… 
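The correspondence in the abstract can be sketched numerically (a minimal numpy illustration of ours, not code from the paper; for simplicity, full-batch gradient descent with a small step size stands in for gradient flow at time $t$, whereas the paper's stochastic gradient flow adds mini-batch noise). The iterate after total elapsed time $t$ is compared against ridge regression with $\lambda = 1/t$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)
y = X @ beta0 + 0.5 * rng.standard_normal(n)

# Small-step full-batch gradient descent as a discretization of
# gradient flow on the least-squares loss (1/2n)||y - X beta||^2.
eta, steps = 0.01, 500
t = eta * steps                      # elapsed "time" t = 5.0
Sigma = X.T @ X / n
b = X.T @ y / n
beta_gf = np.zeros(p)
for _ in range(steps):
    beta_gf += eta * (b - Sigma @ beta_gf)

# Ridge regression with tuning parameter lambda = 1/t.
lam = 1.0 / t
beta_ridge = np.linalg.solve(Sigma + lam * np.eye(p), b)

rel_err = np.linalg.norm(beta_gf - beta_ridge) / np.linalg.norm(beta_ridge)
print(rel_err)   # the two estimates track each other closely
```

The two coefficient vectors agree up to a modest relative error, reflecting the eigenvalue-wise closeness of the shrinkage factors $(1 - e^{-st})/s$ (gradient flow) and $1/(s + 1/t)$ (ridge).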

Figures from this paper

Implicit Gradient Regularization

This work indicates that backward error analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter regularization interact to determine the properties of overparameterized models optimized with gradient descent.

The Statistical Complexity of Early Stopped Mirror Descent

The theory is applied to recover, in a clean and elegant manner via rather short proofs, some of the recent results in the implicit regularization literature, while also showing how to improve upon them in some settings.

Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate

This paper makes an initial attempt to characterize the particular regularization effect of SGD in the moderate learning rate regime by studying its behavior for optimizing an overparameterized linear regression problem, and shows that directional bias does matter when early stopping is adopted.

Stochastic gradient descent with random learning rate

By comparing the random learning rate protocol with cyclic and constant protocols, it is suggested that the random choice is generically the best strategy in the small learning rate regime, yielding better regularization without extra computational cost.

A Continuous-Time Mirror Descent Approach to Sparse Phase Retrieval

This work applies mirror descent to the unconstrained empirical risk minimization problem (batch setting), using the square loss and square measurements, and provides a convergence analysis of the algorithm in this non-convex setting and proves that, with the hypentropy mirror map, mirror descent recovers any $k$-sparse vector.

The Heavy-Tail Phenomenon in SGD

It is claimed that, depending on the structure of the Hessian of the loss at the minimum and the choices of the algorithm parameters $\eta$ and $b$, the SGD iterates will converge to a stationary distribution; these results are the first of their kind to rigorously characterize the empirically observed heavy-tailed behavior of SGD.

Implicit bias of deep linear networks in the large learning rate phase

This work characterizes the implicit bias of deep linear networks for binary classification with the logistic loss in the large learning rate regime, inspired by the seminal work of Lewkowycz et al. in a regression setting with squared loss, and proves that flatter minima in the space spanned by the non-separable data, together with a learning rate in the catapult phase, can lead to better generalization empirically.

Implicit Regularization in Deep Learning May Not Be Explainable by Norms

The results suggest that, rather than perceiving the implicit regularization via norms, a potentially more useful interpretation is minimization of rank, and it is demonstrated empirically that this interpretation extends to a certain class of non-linear neural networks, and hypothesize that it may be key to explaining generalization in deep learning.

Implicit Regularization in Tensor Factorization

Motivated by tensor rank capturing the implicit regularization of a non-linear neural network, this work empirically explores it as a measure of complexity and finds that it captures the essence of datasets on which neural networks generalize, leading to a belief that tensor rank may pave the way to explaining both implicit regularization in deep learning and the properties of real-world data translating this implicit regularization to generalization.

Equivalences Between Sparse Models and Neural Networks

We present some observations about neural networks that are, on the one hand, the result of fairly trivial algebraic manipulations, and on the other hand, potentially noteworthy and deserving of…

A Continuous-Time View of Early Stopping for Least Squares Regression

The primary focus is to compare the risk of gradient flow to that of ridge regression, and it is proved that the same relative risk bound holds for prediction risk, in an average sense over the underlying signal $\beta_0$.

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite

Bridging the gap between constant step size stochastic gradient descent and Markov chains

It is shown that Richardson-Romberg extrapolation may be used to get closer to the global optimum, and an explicit asymptotic expansion of the moments of the averaged SGD iterates is given, outlining the dependence on initial conditions, the effect of noise and the step size, as well as the lack of convergence in the general case.

Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which

Optimal Rates for Multi-pass Stochastic Gradient Methods

This work considers the square loss and shows that for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite sample bounds can be achieved by early-stopping.
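The "number of passes as a regularization parameter" idea can be illustrated with a toy experiment (our own sketch; the step size, pass count, and data are arbitrary choices, not the paper's): run single-sample SGD for several passes over a noisy least-squares problem, record held-out error after each pass, and early-stop at the pass with the lowest validation error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta0 + rng.standard_normal(n)
Xval = rng.standard_normal((500, p))
yval = Xval @ beta0 + rng.standard_normal(500)

eta, n_passes = 0.005, 40
beta = np.zeros(p)
val_errs = []
for _ in range(n_passes):
    for i in rng.permutation(n):               # one pass = one shuffled epoch
        grad = (X[i] @ beta - y[i]) * X[i]     # single-sample gradient
        beta -= eta * grad
    val_errs.append(np.mean((Xval @ beta - yval) ** 2))

best_pass = int(np.argmin(val_errs)) + 1       # early-stopping pass count
print(best_pass, val_errs[best_pass - 1])
```

Running to convergence fits the noise; stopping at the validation-optimal pass plays the role of tuning a regularization parameter.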

Asymptotic and finite-sample properties of estimators based on stochastic gradients

The theoretical analysis provides the first full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds, and suggests that implicit stochastic gradient descent procedures are poised to become a workhorse for approximate inference from large data sets.

Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

The present work provides a nonasymptotic analysis in the context of non-convex learning problems, giving finite-time guarantees for SGLD to find approximate minimizers of both empirical and population risks.

On the diffusion approximation of nonconvex stochastic gradient descent

It is proved rigorously that the diffusion process can approximate the SGD algorithm weakly, using the weak form of the master equation for probability evolution, and the effects of batch size for deep neural networks are discussed, finding that a small batch size helps SGD escape unstable stationary points and sharp minimizers.

Constant Step Size Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions

This work considers the least-squares regression problem and provides a detailed asymptotic analysis of the performance of averaged constant-step-size stochastic gradient descent (a.k.a. least-mean-squares) and provides an asymptotic expansion up to explicit exponentially decaying terms.
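The averaging effect behind this analysis can be sketched as follows (our own toy example, not the paper's derivation; the step size and problem sizes are illustrative): constant-step-size LMS iterates hover in a noise ball around the least-squares solution, while their running (Polyak-Ruppert) average settles much closer to it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2000, 5
X = rng.standard_normal((n, p))
beta0 = np.array([1.0, -1.0, 0.5, 0.0, 2.0])
y = X @ beta0 + 0.5 * rng.standard_normal(n)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # batch least-squares fit

eta = 0.05                                        # constant step size
beta = np.zeros(p)
avg = np.zeros(p)
for k in range(n):                                # one streaming pass (LMS)
    grad = (X[k] @ beta - y[k]) * X[k]
    beta -= eta * grad
    avg += (beta - avg) / (k + 1)                 # running iterate average

dist_last = np.linalg.norm(beta - beta_ols)
dist_avg = np.linalg.norm(avg - beta_ols)
print(dist_last, dist_avg)                        # averaged iterate is closer
```

The last iterate's distance to the least-squares solution stays of order $\sqrt{\eta}$, while averaging drives the error down toward the statistical rate, which is the bias-variance trade-off the paper makes precise.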

Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification

A novel analysis is developed in bounding these operators to characterize the excess risk of communication efficient parallelization schemes such as model-averaging/parameter mixing methods, which are of broader interest in analyzing computational aspects of stochastic approximation.