# Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

@article{PillaudVivien2018StatisticalOO, title={Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes}, author={Loucas Pillaud-Vivien and Alessandro Rudi and Francis R. Bach}, journal={ArXiv}, year={2018}, volume={abs/1805.10074} }

We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while…
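As a rough illustration of the setting in the abstract, the following is a minimal sketch of SGD for least-squares regression run for several passes over a fixed dataset. It shows only the mechanics of cycling over the data more than once and does not reproduce the paper's statistical results. The function names (`multipass_sgd_least_squares`, `make_data`, `mse`), the iterate averaging, the reshuffling at each pass, the constant step size, and the synthetic Gaussian data are all assumptions made for the example, not details taken from the abstract.

```python
import numpy as np


def multipass_sgd_least_squares(X, y, step_size, n_passes, seed=0):
    """Averaged SGD on the squared loss, cycling over the data n_passes times.

    Assumptions for this sketch: constant step size, a fresh random
    permutation of the samples at every pass, and uniform averaging of
    all iterates.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)      # current iterate
    theta_avg = np.zeros(d)  # running average of the iterates
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            # Stochastic gradient of 0.5 * (x_i . theta - y_i)^2
            residual = X[i] @ theta - y[i]
            theta = theta - step_size * residual * X[i]
            t += 1
            theta_avg += (theta - theta_avg) / t
    return theta_avg


def make_data(n=200, d=5, noise=0.1, seed=0):
    """Hypothetical synthetic linear-regression data with Gaussian inputs."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    theta_star = np.ones(d)
    y = X @ theta_star + noise * rng.standard_normal(n)
    return X, y, theta_star


def mse(X, y, theta):
    """Mean squared error of the linear predictor theta on (X, y)."""
    return float(np.mean((X @ theta - y) ** 2))


if __name__ == "__main__":
    X, y, _ = make_data()
    for passes in (1, 10):
        theta = multipass_sgd_least_squares(X, y, step_size=0.01, n_passes=passes)
        print(passes, mse(X, y, theta))
```

The number of passes is an explicit argument here so that it can be varied like any other tuning parameter. The script reports training error only; comparing pass counts on held-out data would be needed to say anything about predictive performance.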

## 63 Citations

### Generalization Performance of Multi-pass Stochastic Gradient Descent with Convex Loss Functions

- Computer Science
- J. Mach. Learn. Res.
- 2021

This paper provides both optimal capacity-independent and capacity-dependent learning rates for SGD with general convex loss functions by providing a refined estimate on the norm of SGD iterates based on a careful martingale analysis and concentration inequalities on empirical processes.

### Sobolev Acceleration and Statistical Optimality for Learning Elliptic Equations via Gradient Descent

- Computer Science, Mathematics
- ArXiv
- 2022

This paper explains the implicit acceleration obtained by using a Sobolev norm as the training objective, and infers that the optimal number of epochs for DRM becomes larger than that for PINN when both the data size and the hardness of the task increase, although both DRM and PINN can achieve statistical optimality.

### On the Benefits of Large Learning Rates for Kernel Methods

- Computer Science
- COLT
- 2022

This paper considers the minimization of a quadratic objective in a separable Hilbert space, and shows that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian’s eigenvectors.

### Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms

- Computer Science
- J. Mach. Learn. Res.
- 2020

The results show that distributed SGM has a smaller theoretical computational complexity than distributed KRR and classic SGM. Even for a general non-distributed SA, they provide optimal, capacity-dependent convergence rates for the case where the regression function may not be in the RKHS.

### Last iterate convergence of SGD for Least-Squares in the Interpolation regime

- Computer Science, Mathematics
- NeurIPS
- 2021

This work studies the noiseless model in the fundamental least-squares setup, gives explicit non-asymptotic convergence rates in the over-parameterized setting, and leverages a fine-grained parameterization of the problem to exhibit polynomial rates that can be faster than $O(1/T)$.

### Learning Curves for SGD on Structured Features

- Computer Science
- ICLR
- 2022

An exactly solvable model of stochastic gradient descent (SGD), which predicts test loss when training on features with arbitrary covariance structure, is studied. It is shown that the optimal batch size at a fixed compute budget is typically small and depends on the feature correlation structure, demonstrating the computational benefits of SGD with small batch sizes.

### Stochastic Gradient Descent Meets Distribution Regression

- Computer Science
- AISTATS
- 2021

This work focuses on distribution regression (DR), involving two stages of sampling, and provides theoretical guarantees for the performance of SGD for DR, which is optimal in a mini-max sense under standard assumptions.

### Interpolation, growth conditions, and stochastic gradient descent

- Computer Science
- 2020

The notion of interpolation is extended to stochastic optimization problems with general, first-order oracles, and a simple extension to $\ell_2$-regularized minimization is provided, which opens the path to proximal-gradient methods and non-smooth optimization under interpolation.

### Stochastic Gradient Descent Meets Distribution Regression

- Computer Science
- 2020

This work focuses on distribution regression (DR), involving two stages of sampling, and provides theoretical guarantees for the performance of SGD for DR, which is optimal in a mini-max sense under standard assumptions.

### Convergences of Regularized Algorithms and Stochastic Gradient Methods with Random Projections

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2020

The least-squares regression problem over a Hilbert space is studied, covering nonparametric regression over a reproducing kernel Hilbert space as a special case, and optimal rates are obtained for regularized algorithms with randomized sketches, provided that the sketch dimension is proportional to the effective dimension up to a logarithmic factor.

## References

### Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

- Computer Science
- J. Mach. Learn. Res.
- 2017

We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite…

### Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

- Computer Science, Mathematics
- NIPS
- 2013

We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…

### Optimal Rates for Regularization of Statistical Inverse Learning Problems

- Mathematics
- Found. Comput. Math.
- 2018

Strong and weak minimax optimal rates of convergence (as the number of observations n grows large) for a large class of spectral regularization methods over regularity classes defined through appropriate source conditions are obtained.

### Optimal Rates for Multi-pass Stochastic Gradient Methods

- Computer Science
- J. Mach. Learn. Res.
- 2016

This work considers the square loss and shows that for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite sample bounds can be achieved by early-stopping.

### Train faster, generalize better: Stability of stochastic gradient descent

- Computer Science
- ICML
- 2016

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically…

### Optimal Rates for the Regularized Least-Squares Algorithm

- Mathematics, Computer Science
- Found. Comput. Math.
- 2007

A complete minimax analysis of the problem is described, showing that the convergence rates obtained by regularized least-squares estimators are indeed optimal over a suitable class of priors defined by the considered kernel.

### Convergence rates of Kernel Conjugate Gradient for random design regression

- Mathematics, Computer Science
- 2016

We prove statistical rates of convergence for kernel-based least squares regression from i.i.d. data using a conjugate gradient algorithm, where regularization against overfitting is obtained by…

### Non-parametric Stochastic Approximation with Large Step sizes

- Computer Science, Mathematics
- 2014

In a stochastic approximation framework, it is shown that the averaged unregularized least-mean-square algorithm, given a sufficiently large step-size, attains optimal rates of convergence for a variety of regimes for the smoothness of the optimal prediction function and of the functions in $\mathcal{H}$.

### Learning with SGD and Random Features

- Computer Science
- NeurIPS
- 2018

This study highlights how different parameters, such as the number of features, iterations, step-size and mini-batch size, control the learning properties of the solutions by deriving optimal finite sample bounds under standard assumptions.

### FALKON: An Optimal Large Scale Kernel Method

- Computer Science
- NIPS
- 2017

This paper proposes FALKON, a novel algorithm that can efficiently process millions of points, derived by combining several algorithmic principles, namely stochastic subsampling, iterative solvers and preconditioning.