Corpus ID: 211011179

The Statistical Complexity of Early Stopped Mirror Descent

@article{Vaskevicius2020TheSC,
  title={The Statistical Complexity of Early Stopped Mirror Descent},
  author={Tomas Vaskevicius and Varun Kanade and Patrick Rebeschini},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.00189}
}
Recently there has been a surge of interest in understanding implicit regularization properties of iterative gradient-based optimization algorithms. In this paper, we study the statistical guarantees on the excess risk achieved by early-stopped unconstrained mirror descent algorithms applied to the unregularized empirical risk with the squared loss for linear models and kernel methods. By completing an inequality that characterizes convexity for the squared loss, we identify an intrinsic link… 
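
As a rough, non-authoritative sketch of the setting described in the abstract (not the paper's exact algorithm or analysis): unconstrained mirror descent run on the unregularized empirical risk with the squared loss for a linear model, stopped early. The mirror map, step size, and held-out stopping criterion below are illustrative assumptions.

import numpy as np

def mirror_descent_early_stopped(X, y, X_val, y_val, grad_psi, grad_psi_inv,
                                 step=0.01, max_iter=5000):
    """Early-stopped mirror descent on the empirical squared loss.

    grad_psi / grad_psi_inv: gradient of the mirror map and its inverse
    (the identity map recovers plain gradient descent). Illustrative only.
    """
    n, d = X.shape
    w = np.zeros(d)                                    # start at zero
    best_w, best_val = w.copy(), np.inf
    for _ in range(max_iter):
        grad = X.T @ (X @ w - y) / n                   # gradient of (1/2n)||Xw - y||^2
        w = grad_psi_inv(grad_psi(w) - step * grad)    # mirror step in the dual space
        val = 0.5 * np.mean((X_val @ w - y_val) ** 2)  # held-out risk proxy
        if val < best_val:
            best_val, best_w = val, w.copy()
    return best_w                                      # iterate with smallest held-out risk

# Example: with the identity mirror map this reduces to early-stopped gradient descent.
# w_hat = mirror_descent_early_stopped(X, y, X_val, y_val, lambda w: w, lambda z: z)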

Citations

Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity
TLDR
The findings highlight that structured noise can induce better generalisation, and they help explain the better performance of stochastic gradient descent over gradient descent observed in practice.
Exponential Tail Local Rademacher Complexity Risk Bounds Without the Bernstein Condition
TLDR
This work builds on the recent approach to localization via offset Rademacher complexities, for which a general high-probability theory has yet to be established, and yields results at least as sharp as those obtainable via the classical theory.
A Continuous-Time Mirror Descent Approach to Sparse Phase Retrieval
TLDR
This work applies mirror descent to the unconstrained empirical risk minimization problem (batch setting) with the square loss and square measurements, provides a convergence analysis of the algorithm in this non-convex setting, and proves that, with the hypentropy mirror map, mirror descent recovers any $k$-sparse vector.
Accelerated Gradient Flow: Risk, Stability, and Implicit Regularization
TLDR
The statistical risk of the iterates generated by Nesterov's accelerated gradient method and Polyak's heavy ball method, when applied to least squares regression, is studied, drawing several connections to explicit penalization.
Implicit Regularization Properties of Variance Reduced Stochastic Mirror Descent
TLDR
It is proved that the discrete variance-reduced stochastic mirror descent (VRSMD) estimator sequence converges to the minimum mirror interpolant in linear regression, establishing the implicit regularization property of VRSMD.
Iterative regularization for low complexity regularizers
TLDR
This work proposes and studies the first iterative regularization procedure able to handle biases described by non-smooth and non-strongly-convex functionals, which are prominent in low-complexity regularization.
On Optimal Early Stopping: Over-informative versus Under-informative Parametrization
TLDR
This work develops theoretical results that reveal the relationship between the optimal early-stopping time, the model dimension, and the sample size for certain linear models, and proposes a model to study this setting.
Sobolev Acceleration and Statistical Optimality for Learning Elliptic Equations via Gradient Descent
TLDR
An implicit acceleration from using a Sobolev norm as the training objective is explained, and it is inferred that the optimal number of epochs for DRM becomes larger than that for PINN as both the data size and the hardness of the task increase, although both DRM and PINN can achieve statistical optimality.
From inexact optimization to learning via gradient concentration
TLDR
This paper shows how probabilistic results, specifically gradient concentration, can be combined with results from inexact optimization to derive sharp test error guarantees and highlights the implicit regularization properties of optimization for learning.
Implicit Regularization in Matrix Sensing via Mirror Descent
We study discrete-time mirror descent applied to the unregularized empirical risk in matrix sensing. In both the general case of rectangular matrices and the particular case of positive semidefinite …

References

Showing 1-10 of 60 references
The Statistical Complexity of Early Stopped Mirror Descent. arXiv preprint arXiv:2002.00189, 2020.
Learning with Square Loss: Localization through Offset Rademacher Complexity
TLDR
A notion of offset Rademacher complexity is introduced that provides a transparent way to study localization both in expectation and in high probability, and the excess loss is shown to be upper bounded by this offset complexity through a novel geometric inequality.
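
For context, the offset Rademacher complexity referred to in this entry penalizes each function by a negative quadratic term; a standard form (the constant $c > 0$ depends on the loss and the setting) is

\[
  \mathcal{R}^{\mathrm{off}}_{n}(\mathcal{F}, c)
  \;=\;
  \mathbb{E}_{\varepsilon}\,
  \sup_{f \in \mathcal{F}}\;
  \frac{1}{n}\sum_{i=1}^{n}
  \bigl( \varepsilon_i\, f(x_i) \;-\; c\, f(x_i)^{2} \bigr),
\]

where $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. Rademacher signs; the quadratic offset is what yields localization without an explicit localization radius.
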
A Continuous-Time View of Early Stopping for Least Squares Regression
TLDR
The primary focus is to compare the risk of gradient flow to that of ridge regression, and it is proved that the same relative risk bound holds for prediction risk, in an average sense over the underlying signal $\beta_0$.
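
A minimal sketch of the comparison described in this entry, assuming the $\tfrac{1}{2n}\|y - X\beta\|_2^2$ scaling, a zero initialization, and the usual calibration $t \leftrightarrow 1/\lambda$ between gradient-flow time and the ridge penalty (illustrative, not the paper's exact statement):

import numpy as np

def gradient_flow_vs_ridge(X, y, t):
    """Closed-form gradient-flow iterate at time t vs. ridge with lambda = 1/t.

    Assumes X has full column rank (all singular values s_i > 0).
    """
    n = X.shape[0]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uy = U.T @ y
    # Gradient flow on (1/2n)||y - X beta||^2 started at zero:
    #   beta(t) = V diag((1 - exp(-t s_i^2 / n)) / s_i) U^T y
    beta_gf = Vt.T @ (((1.0 - np.exp(-t * s**2 / n)) / s) * Uy)
    # Ridge estimator with lambda = 1/t:
    #   beta_lambda = V diag(s_i / (s_i^2 + n * lambda)) U^T y
    lam = 1.0 / t
    beta_ridge = Vt.T @ ((s / (s**2 + n * lam)) * Uy)
    return beta_gf, beta_ridge

# The two paths track each other closely, e.g.:
# beta_gf, beta_ridge = gradient_flow_vs_ridge(X, y, t=10.0)
# np.linalg.norm(beta_gf - beta_ridge)
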
Early Stopping for Kernel Boosting Algorithms: A General Analysis With Localized Complexities
TLDR
This paper exhibits a direct connection between the performance of a stopped iterate and the localized Gaussian complexity of the associated function class, and shows that the local fixed point analysis of Gaussian or Rademacher complexities can be used to derive optimal stopping rules.
Learning without Concentration
We obtain sharp bounds on the estimation error of the Empirical Risk Minimization procedure, performed in a convex class and with respect to the squared loss, without assuming that class members and …
Early stopping for non-parametric regression: An optimal data-dependent stopping rule
TLDR
This paper derives upper bounds on both the $L^2(P_n)$ and $L^2(P)$ error for arbitrary RKHSs, and provides an explicit and easily computable data-dependent stopping rule that depends only on the sum of the step sizes and the eigenvalues of the empirical kernel matrix for the RKHS.
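
A hedged sketch of the flavor of such a stopping rule for kernel gradient descent on the empirical squared loss: it tracks the running sum of step sizes and the eigenvalues of the normalized empirical kernel matrix, and stops when a critical inequality first fails. The exact threshold below follows one published form of such rules; the constant and the noise level sigma should be read as illustrative assumptions.

import numpy as np

def empirical_kernel_complexity(eigvals, eps):
    # R_hat(eps) = sqrt( (1/n) * sum_i min(eigval_i, eps^2) )
    n = eigvals.shape[0]
    return np.sqrt(np.sum(np.minimum(eigvals, eps ** 2)) / n)

def data_dependent_stopping_time(K, sigma, step, max_iter=10_000):
    """First iteration whose running step-size sum eta_t violates
    R_hat(1/sqrt(eta_t)) <= 1 / (2 * e * sigma * eta_t), minus one."""
    n = K.shape[0]
    eigvals = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)  # eigenvalues of K/n
    eta = 0.0
    for t in range(1, max_iter + 1):
        eta += step                                          # eta_t = sum of step sizes
        lhs = empirical_kernel_complexity(eigvals, 1.0 / np.sqrt(eta))
        rhs = 1.0 / (2.0 * np.e * sigma * eta)
        if lhs > rhs:
            return t - 1
    return max_iter
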
Exponentiated Gradient Meets Gradient Descent
TLDR
A new regularization derived from a hyperbolic analogue of the entropy function is introduced, providing a unification of additive and multiplicative updates; it is motivated by a natural extension of the multiplicative update to negative numbers.
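
To make the unification concrete, a hedged sketch of the mirror-descent update induced by a hyperbolic-entropy (hypentropy) mirror map with scale parameter beta, whose gradient is arcsinh(w / beta): large beta behaves like a rescaled gradient descent, while small beta behaves like exponentiated-gradient-style multiplicative updates that also handle negative weights. The loss and parameter values are illustrative.

import numpy as np

def hypentropy_mirror_step(w, grad, step, beta):
    """One mirror-descent step with the hypentropy mirror map.

    The mirror map gradient is arcsinh(w / beta) and its inverse is
    beta * sinh(.), so the dual-space update
        arcsinh(w_next / beta) = arcsinh(w / beta) - step * grad
    gives the closed form below."""
    return beta * np.sinh(np.arcsinh(w / beta) - step * grad)

# Example on a least-squares gradient (illustrative):
# grad = X.T @ (X @ w - y) / len(y)
# w = hypentropy_mirror_step(w, grad, step=0.01, beta=1e-3)
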
Graph-Dependent Implicit Regularisation for Distributed Stochastic Subgradient Descent
TLDR
This work proposes graph-dependent implicit regularisation strategies for distributed stochastic subgradient descent (Distributed SGD) for convex problems in multi-agent learning, avoiding the need for explicit regularisation, such as adding constraints to the empirical risk minimisation rule, in decentralised learning problems.
Kernel and Rich Regimes in Overparametrized Models
TLDR
This work shows how the scale of the initialization controls the transition between the "kernel" and "rich" regimes and affects generalization properties in multilayer homogeneous models, and highlights an interesting role for the width of the model when the predictor is not identically zero at initialization.
The Implicit Regularization of Stochastic Gradient Flow for Least Squares
TLDR
The implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression, is studied; the results hold under no conditions on the data matrix $X$ and across the entire optimization path.