The Directional Bias Helps Stochastic Gradient Descent to Generalize in Kernel Regression Models

@article{Luo2022TheDB,
  title={The Directional Bias Helps Stochastic Gradient Descent to Generalize in Kernel Regression Models},
  author={Yiling Luo and Xiaoming Huo and Yajun Mei},
  journal={2022 IEEE International Symposium on Information Theory (ISIT)},
  year={2022},
  pages={678-683}
}
  • Yiling Luo, Xiaoming Huo, Yajun Mei
  • Published 29 April 2022
  • Computer Science
  • 2022 IEEE International Symposium on Information Theory (ISIT)
We study the Stochastic Gradient Descent (SGD) algorithm in nonparametric statistics, kernel regression in particular. The directional bias property of SGD, previously known in the linear regression setting, is generalized to kernel regression. More specifically, we prove that SGD with a moderate and annealing step size converges along the direction of the eigenvector corresponding to the largest eigenvalue of the Gram matrix. In addition, Gradient Descent (GD) with a moderate or small…
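As a quick numerical companion to the claimed directional bias, here is a minimal sketch (not the paper's construction): SGD on a toy RBF kernel least-squares problem with a moderate step size that is later annealed, tracking how strongly the training residual aligns with the top eigenvector of the Gram matrix. The data, bandwidth, step-size schedule, and the alignment diagnostic are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D kernel regression data (all constants here are illustrative).
n = 100
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

# RBF Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)).
bandwidth = 0.3
K = np.exp(-((X - X.T) ** 2) / (2 * bandwidth**2))

# Top eigenvector of the Gram matrix (eigh returns ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(K)
v_top = eigvecs[:, -1]

# SGD on the kernel least-squares objective (1/2) sum_i (K[i] @ alpha - y[i])^2,
# sampling one data point per step; a crude "moderate then annealed" schedule.
alpha = np.zeros(n)
step = 1.0 / eigvals[-1]
for t in range(20001):
    if t == 10000:
        step *= 0.1                      # annealing (illustrative choice)
    i = rng.integers(n)
    alpha -= step * (K[i] @ alpha - y[i]) * K[i]
    if t % 5000 == 0:
        resid = K @ alpha - y
        align = abs(resid @ v_top) / (np.linalg.norm(resid) + 1e-12)
        print(f"step {t:6d}  |cos(residual, top eigenvector)| = {align:.3f}")
```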


References

Showing 1–10 of 33 references

The Implicit Regularization of Stochastic Gradient Flow for Least Squares

The implicit regularization of mini-batch stochastic gradient descent, applied to the fundamental problem of least squares regression, is studied; the resulting guarantees hold under no conditions on the data matrix X and along the entire optimization path.
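The flavor of this correspondence can be eyeballed numerically. The sketch below (an illustrative toy, not the paper's construction) runs small-step gradient descent as a stand-in for gradient flow on a least-squares problem and compares the iterate at time t with the ridge estimator at penalty 1/t, the informal pairing usually attached to this line of work.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy least-squares problem (illustrative sizes and noise level).
n, p = 200, 20
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

def ridge(lam):
    """Ridge estimate for the objective (1/2n)||y - X b||^2 + (lam/2)||b||^2."""
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

# Small-step gradient descent as a stand-in for gradient flow.
step = 1e-3
beta = np.zeros(p)
for k in range(1, 5001):
    beta -= step * X.T @ (X @ beta - y) / n
    if k % 1000 == 0:
        t = k * step                          # elapsed "time" along the flow
        b_ridge = ridge(1.0 / t)              # rough pairing t <-> 1/lambda
        gap = np.linalg.norm(beta - b_ridge) / np.linalg.norm(b_ridge)
        print(f"t = {t:4.1f}   relative gap to ridge(1/t) = {gap:.3f}")
```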

Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate

The theory explains several folk practices used in SGD hyperparameter tuning, such as linearly scaling the initial learning rate with the batch size, and continuing to run SGD with a high learning rate even after the loss stops decreasing.
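The linear-scaling heuristic mentioned here is easy to sanity-check on a toy problem. The sketch below (illustrative constants, not from the paper) runs mini-batch SGD on a least-squares objective with the learning rate scaled proportionally to the batch size and prints the loss reached after a fixed number of epochs.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy least-squares problem used only to illustrate the linear-scaling
# heuristic (learning rate proportional to batch size).
n, p = 512, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

def run_sgd(batch_size, lr, epochs=20):
    beta = np.zeros(p)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ beta - y[idx]) / len(idx)
            beta -= lr * grad
    return 0.5 * np.mean((X @ beta - y) ** 2)

base_lr, base_batch = 0.01, 8
for batch in (8, 32, 128):
    lr = base_lr * batch / base_batch        # linear scaling rule (heuristic)
    print(f"batch {batch:4d}  lr {lr:.3f}  final loss {run_sgd(batch, lr):.4f}")
```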

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

Stochastic gradient descent for least-squares regression with potentially several passes is considered for potentially infinite-dimensional models, using notions typically associated with kernel methods, namely the decay of the eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through that covariance matrix.
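The eigenvalue-decay condition this style of analysis relies on can be made concrete in a few lines. The sketch below (illustrative bandwidth and sample size) builds an RBF Gram matrix on synthetic 1-D inputs and prints a few of its eigenvalues to show the rapid decay that corresponds to a favorable capacity condition.

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical eigenvalue decay of a (normalized) RBF Gram matrix: the kind of
# "capacity" quantity this analysis works with; setup is illustrative.
n = 300
X = rng.uniform(-1, 1, size=(n, 1))
K = np.exp(-((X - X.T) ** 2) / (2 * 0.25**2)) / n

eigvals = np.linalg.eigvalsh(K)[::-1]        # descending order
for i in (0, 4, 9, 19, 49):
    print(f"lambda_{i + 1:<3d} = {eigvals[i]:.2e}")
# The roughly geometric decay printed above is the regime in which a small
# number of passes of SGD already suffices.
```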

On Early Stopping in Gradient Descent Learning

A family of gradient descent algorithms for approximating the regression function in reproducing kernel Hilbert spaces (RKHSs) is studied, the family being characterized by a polynomially decreasing step size (or learning rate).
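A minimal sketch of this setup, under assumed constants: kernel gradient descent with a polynomially decaying step size eta_t = eta0 * (t + 1)^(-theta), monitored on a held-out set so that the stopping time can be read off. Only the schedule form follows the family described above; the data, bandwidth, and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf(A, B, bw=0.3):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw**2))

# Toy 1-D regression data, split into train and held-out parts.
n = 150
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(4 * X[:, 0]) + 0.3 * rng.standard_normal(n)
Xtr, ytr, Xva, yva = X[:100], y[:100], X[100:], y[100:]
Ktr, Kva = rbf(Xtr, Xtr), rbf(Xva, Xtr)

# Kernel gradient descent with eta_t = eta0 * (t + 1)**(-theta).
alpha = np.zeros(len(ytr))
eta0, theta = 1.0, 0.5
best_err, best_t = np.inf, 0
for t in range(2000):
    alpha -= eta0 * (t + 1) ** (-theta) * (Ktr @ alpha - ytr) / len(ytr)
    err = np.mean((Kva @ alpha - yva) ** 2)
    if err < best_err:
        best_err, best_t = err, t
print(f"best held-out MSE {best_err:.4f} at iteration {best_t}")
print(f"held-out MSE at the final iteration: {err:.4f}")
```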

Just Interpolate: Kernel "Ridgeless" Regression Can Generalize

This work isolates a phenomenon of implicit regularization for minimum-norm interpolated solutions which is due to a combination of high dimensionality of the input data, curvature of the kernel function, and favorable geometric properties of the data such as an eigenvalue decay of the empirical covariance and kernel matrices.
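A minimal version of "ridgeless" kernel regression is the minimum-norm interpolant alpha = pinv(K) @ y. The sketch below (illustrative dimensions and bandwidth, with a linear target chosen for convenience) fits it and prints train and test errors next to the error of the trivial zero predictor.

```python
import numpy as np

rng = np.random.default_rng(5)

def rbf(A, B, bw):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw**2))

# Minimum-norm ("ridgeless") kernel regression: alpha = pinv(K) @ y fits the
# training labels (near-)exactly; test error is then inspected.
n, d, n_test = 200, 10, 500
Xtr = rng.standard_normal((n, d))
Xte = rng.standard_normal((n_test, d))
w = rng.standard_normal(d) / np.sqrt(d)
ytr = Xtr @ w + 0.2 * rng.standard_normal(n)
yte = Xte @ w

bw = np.sqrt(d)
K = rbf(Xtr, Xtr, bw)
alpha = np.linalg.pinv(K) @ ytr              # minimum-norm interpolant
print("train MSE:", np.mean((K @ alpha - ytr) ** 2))
print("test  MSE:", np.mean((rbf(Xte, Xtr, bw) @ alpha - yte) ** 2))
print("null  MSE:", np.mean(yte ** 2))       # predicting zero, for scale
```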

Implicit regularization via hadamard product over-parametrization in high-dimensional linear regression

It is shown that under certain conditions this over-parametrization leads to implicit regularization: if one directly applies gradient descent to the residual sum of squares with sufficiently small initial values, then under a proper early stopping rule the iterates converge to a nearly sparse, rate-optimal solution that can be more accurate than explicitly regularized approaches.
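To make the construction concrete, the sketch below (illustrative constants; the nonnegative true signal is chosen so the simple symmetric initialization u = v suffices) reparametrizes beta as the elementwise product u * v, runs plain gradient descent on the residual sum of squares from a small initialization, and prints how small the off-support coefficients remain.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hadamard-product over-parametrization beta = u * v (elementwise), with plain
# gradient descent on the residual sum of squares from a small initialization.
# Init scale, step size, and iteration count are illustrative choices.
n, p, s = 100, 50, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = 1.0                          # nonnegative sparse signal
y = X @ beta_true + 0.1 * rng.standard_normal(n)

u = np.full(p, 1e-3)
v = np.full(p, 1e-3)
step = 0.05
for t in range(1500):
    g = X.T @ (X @ (u * v) - y) / n          # gradient of RSS/(2n) w.r.t. beta
    u, v = u - step * g * v, v - step * g * u    # chain rule through beta = u*v
beta_hat = u * v
print("on-support coefficients :", np.round(beta_hat[:s], 2))
print("max |off-support| coeff :", f"{np.max(np.abs(beta_hat[s:])):.4f}")
```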

Towards Understanding the Spectral Bias of Deep Learning

It is proved that the training process of neural networks can be decomposed along different directions defined by the eigenfunctions of the neural tangent kernel, where each direction has its own convergence rate and the rate is determined by the corresponding eigenvalue.
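The decomposition can be visualized with an ordinary kernel standing in for the neural tangent kernel (an illustrative substitution). The sketch below runs gradient descent on a kernel least-squares problem and prints the training residual projected onto the 1st, 5th, and 20th eigenvectors of the Gram matrix; directions with larger eigenvalues should empty out first.

```python
import numpy as np

rng = np.random.default_rng(7)

# Gradient descent on a kernel least-squares problem, with the training
# residual projected onto eigenvectors of the Gram matrix (illustrative setup).
n = 120
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(5 * X[:, 0]) + 0.2 * rng.standard_normal(n)

K = np.exp(-((X - X.T) ** 2) / (2 * 0.3**2))
eigvals, V = np.linalg.eigh(K)                # ascending; V[:, -1] is the top
track = {"1st": -1, "5th": -5, "20th": -20}   # eigendirections to monitor

alpha = np.zeros(n)
eta = 0.5 / eigvals[-1]
for t in range(1, 201):
    # Functional GD step: residual component i shrinks by (1 - eta * eigval_i).
    alpha -= eta * (K @ alpha - y)
    if t in (1, 10, 50, 200):
        r = K @ alpha - y
        msg = "  ".join(f"{name} {abs(V[:, j] @ r):7.3f}"
                        for name, j in track.items())
        print(f"t = {t:3d}   |residual projections|  {msg}")
```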

Optimal Rates for Multi-pass Stochastic Gradient Methods

This work considers the square loss and shows that, for a universal step-size choice, the number of passes acts as a regularization parameter and optimal finite-sample bounds can be achieved by early stopping.
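The sketch below (illustrative setup and step size, not the paper's) runs multi-pass SGD on a kernel least-squares problem with one fixed step size and prints the held-out error after selected passes, which is the quantity one would monitor to choose the number of passes.

```python
import numpy as np

rng = np.random.default_rng(8)

def rbf(A, B, bw=0.25):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw**2))

# Multi-pass SGD on kernel least squares; the number of passes is the knob.
n = 200
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sign(np.sin(6 * X[:, 0])) + 0.4 * rng.standard_normal(n)
Xtr, ytr, Xva, yva = X[:150], y[:150], X[150:], y[150:]
Ktr, Kva = rbf(Xtr, Xtr), rbf(Xva, Xtr)

m = len(ytr)
alpha = np.zeros(m)
step = 0.5
best = (np.inf, 0)
for p in range(1, 41):                       # p passes over the training set
    for i in rng.permutation(m):
        alpha[i] -= step * (Ktr[i] @ alpha - ytr[i])
    err = np.mean((Kva @ alpha - yva) ** 2)
    best = min(best, (err, p))
    if p in (1, 2, 5, 10, 20, 40):
        print(f"pass {p:2d}  held-out MSE {err:.3f}")
print(f"best held-out MSE {best[0]:.3f} after {best[1]} passes")
```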

Exact expressions for double descent and implicit regularization via surrogate random design

This work provides the first exact non-asymptotic expressions for the double descent of the minimum-norm linear estimator and introduces a new mathematical tool of independent interest: the class of random matrices for which the determinant commutes with expectation.
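The double-descent shape itself is easy to reproduce numerically, although the exact expressions require the paper's surrogate random-design machinery. The sketch below (illustrative sizes and signal) computes the minimum-norm least-squares estimator for feature counts p crossing the interpolation threshold p = n and prints the averaged test error.

```python
import numpy as np

rng = np.random.default_rng(9)

# Test error of the minimum-norm least-squares estimator as the feature count
# passes through the interpolation threshold p = n (illustrative random design).
n, n_test, p_max, sigma = 40, 2000, 120, 0.5
w_full = rng.standard_normal(p_max) / np.sqrt(p_max)

def test_error(p, reps=20):
    errs = []
    for _ in range(reps):
        Xtr = rng.standard_normal((n, p))
        Xte = rng.standard_normal((n_test, p))
        w = w_full[:p]
        ytr = Xtr @ w + sigma * rng.standard_normal(n)
        yte = Xte @ w
        w_hat = np.linalg.pinv(Xtr) @ ytr     # minimum-norm solution
        errs.append(np.mean((Xte @ w_hat - yte) ** 2))
    return np.mean(errs)

for p in (10, 20, 35, 40, 45, 60, 90, 120):
    print(f"p = {p:3d}   mean test MSE = {test_error(p):.3f}")
```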

Reducing Kernel Matrix Diagonal Dominance Using Semi-definite Programming

This paper proposes an algorithm for manipulating the diagonal entries of a kernel matrix using semi-definite programming, and gives an analysis based on Rademacher complexity bounds that offers an alternative motivation for the 1-norm SVM in terms of kernel diagonal reduction.
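For orientation only: the sketch below computes a simple diagonal-dominance ratio of an RBF kernel matrix and then applies the crudest possible manipulation, subtracting from the diagonal the largest constant that keeps the matrix positive semi-definite. This is not the paper's semi-definite program; it only makes the quantity being reduced concrete.

```python
import numpy as np

rng = np.random.default_rng(10)

# Diagonal dominance of a kernel matrix, and a crude reduction of it:
# subtract from the diagonal (almost) the smallest eigenvalue, so the result
# stays positive semi-definite. All sizes and the bandwidth are illustrative.
n, d = 80, 30
X = rng.standard_normal((n, d))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * (d / 8.0)))            # narrow bandwidth -> dominant diagonal

def dominance(M):
    """Mean ratio of a diagonal entry to the corresponding off-diagonal row sum."""
    off = np.abs(M).sum(axis=1) - np.abs(np.diag(M))
    return float(np.mean(np.diag(M) / off))

lam_min = np.linalg.eigvalsh(K)[0]
K_reduced = K - 0.99 * lam_min * np.eye(n)   # still positive semi-definite

print(f"dominance ratio before: {dominance(K):.2f}")
print(f"dominance ratio after : {dominance(K_reduced):.2f}")
```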