Implicit Regularization Properties of Variance Reduced Stochastic Mirror Descent

  • Yiling Luo, Xiaoming Huo, Yajun Mei
  • Published 29 April 2022
  • Mathematics, Computer Science
  • 2022 IEEE International Symposium on Information Theory (ISIT)
In machine learning and statistical data analysis, we often encounter an objective function that is a summation: the number of terms in the summation may equal the sample size, which can be enormous. In such a setting, the stochastic mirror descent (SMD) algorithm is a numerically efficient method, since each iteration involves only a very small subset of the data. The variance-reduced version of SMD (VRSMD) can further improve on SMD by inducing faster convergence. On the other hand, algorithms… 
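
The idea in the abstract, a mirror descent step driven by a cheap per-sample gradient plus a variance-reduction correction, can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' algorithm: the least-squares loss, the q-norm potential, the SVRG-style snapshot correction, the step size, and every function name here are hypothetical choices that do not appear in the listing.

```python
import numpy as np

def grad_psi(w, q):
    """Gradient of the assumed potential psi(w) = (1/q) * ||w||_q^q, q > 1."""
    return np.sign(w) * np.abs(w) ** (q - 1)

def grad_psi_inv(z, q):
    """Inverse of grad_psi, mapping dual variables back to the primal space."""
    return np.sign(z) * np.abs(z) ** (1.0 / (q - 1))

def mirror_step(w, g, lr, q):
    """One mirror descent step: a gradient step taken in the dual space."""
    return grad_psi_inv(grad_psi(w, q) - lr * g, q)

def sample_grad(w, X, y, idx):
    """Gradient of the average of 0.5 * (x_i @ w - y_i)^2 over rows idx."""
    Xb = X[idx]
    return Xb.T @ (Xb @ w - y[idx]) / len(idx)

def loss(w, X, y):
    """Finite-sum objective: average squared residual over all samples."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def vr_smd(X, y, w0, lr=0.01, q=2.0, epochs=300, seed=0):
    """SMD with an SVRG-style correction (an assumption, see lead-in)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    all_idx = np.arange(n)
    w = w0.copy()
    for _ in range(epochs):
        # Once per epoch: store a snapshot and its full-data gradient.
        snapshot = w.copy()
        full_grad = sample_grad(snapshot, X, y, all_idx)
        for _ in range(n):
            # Each inner step touches a single randomly chosen sample.
            i = rng.integers(n, size=1)
            g = (sample_grad(w, X, y, i)
                 - sample_grad(snapshot, X, y, i)
                 + full_grad)
            w = mirror_step(w, g, lr, q)
    return w
```

With `q=2.0` the mirror map is the identity, so `mirror_step` reduces to an ordinary gradient step; other values of `q` change the geometry of the update, and small `q` close to 1 may need a smaller step size to stay stable.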

The Statistical Complexity of Early Stopped Mirror Descent

The theory is applied to recover, in a clean and elegant manner via rather short proofs, some of the recent results in the implicit regularization literature, while also showing how to improve upon them in some settings.

Stochastic Gradient/Mirror Descent: Minimax Optimality and Implicit Regularization

It is argued how this identity can be used in the so-called "highly over-parameterized" nonlinear setting to provide insights into why SMD (and SGD) may have similar convergence and implicit regularization properties for deep learning.

Exact expressions for double descent and implicit regularization via surrogate random design

This work provides the first exact non-asymptotic expressions for double descent of the minimum norm linear estimator and introduces a new mathematical tool of independent interest: the class of random matrices for which determinant commutes with expectation.

On the Origin of Implicit Regularization in Stochastic Gradient Descent

It is proved that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite, but on a modified loss.

Implicit regularization via Hadamard product over-parametrization in high-dimensional linear regression

It is shown that under certain conditions, this over-parametrization leads to implicit regularization: if the authors directly apply gradient descent to the residual sum of squares with sufficiently small initial values, then under a proper early stopping rule, the iterates converge to a nearly sparse rate-optimal solution with relatively better accuracy than explicitly regularized approaches.

The Implicit Regularization of Stochastic Gradient Flow for Least Squares

The implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression, is studied, finding that the results hold with no conditions on the data matrix $X$ and across the entire optimization path.

Understanding Implicit Regularization in Over-Parameterized Nonlinear Statistical Model

This work constructs an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data, and proposes to estimate the true parameter by applying regularization-free gradient descent to the loss function.

Implicit Regularization for Optimal Sparse Recovery

We investigate implicit regularization schemes for gradient descent methods applied to unpenalized least squares regression to solve the problem of reconstructing a sparse signal from an

Deep learning: a statistical viewpoint

This article surveys recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings, and focuses specifically on the linear regime for neural networks, where the network can be approximated by a linear model.

Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process

This work describes the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the data points, of the squared $\ell_2$ norm of the gradient of the model with respect to the parameter vector, evaluated at each data point.