# To understand deep learning we need to understand kernel learning

@article{Belkin2018ToUD,
  title={To understand deep learning we need to understand kernel learning},
  author={Mikhail Belkin and Siyuan Ma and Soumik Mandal},
  journal={ArXiv},
  year={2018},
  volume={abs/1802.01396}
}

Generalization performance of classifiers in deep learning has recently become a subject of intense study. [...] Since most generalization bounds depend polynomially on the norm of the solution, this result implies that they diverge as data increases. Furthermore, the existing bounds do not apply to interpolated classifiers.
We also show experimentally that (non-smooth) Laplacian kernels easily fit random labels using a version of SGD, a finding that parallels results reported for ReLU neural…


## 281 Citations

Uniform convergence may be unable to explain generalization in deep learning

- Computer Science · NeurIPS · 2019

Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.

Towards an Understanding of Benign Overfitting in Neural Networks

- Computer Science · ArXiv · 2021

It is shown that a two-layer ReLU network interpolator can achieve a near minimax-optimal learning rate, which, to the authors' knowledge, is the first generalization result for such networks.

Binary Classification of Gaussian Mixtures: Abundance of Support Vectors, Benign Overfitting, and Regularization

- Computer Science · SIAM J. Math. Data Sci. · 2022

This paper examines binary linear classification under a generative Gaussian mixture model in which the feature vectors take the form x = ±η + q, and identifies conditions under which the interpolating estimator performs better than corresponding regularized estimates.

On the Generalization Mystery in Deep Learning

- Computer Science · ArXiv · 2022

The theory provides a causal explanation of how over-parameterized neural networks trained with gradient descent generalize well, and motivates a class of simple modifications to GD that attenuate memorization and improve generalization.

Theoretical issues in deep networks

- Computer Science · Proceedings of the National Academy of Sciences · 2020

It is proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality.

Generalization Error of Generalized Linear Models in High Dimensions

- Computer Science · ICML · 2020

This work provides a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems.

Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

- Computer Science · NeurIPS · 2018

A theoretical foundation for interpolated classifiers is developed by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted $k$-nearest-neighbor schemes; consistency or near-consistency is proved for these schemes in both classification and regression problems.
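A singularly weighted nearest-neighbor rule of the kind analyzed in that paper can be sketched in a few lines (illustrative code; the function name and parameter choices here are made up): the weights diverge as the query approaches a training point, so the rule interpolates the noisy labels exactly while still averaging over $k$ neighbors away from the data.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2))                  # training inputs
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=50)  # noisy targets

def singular_knn_predict(x, X, y, k=5, p=4.0, eps=1e-12):
    # Weights ~ 1/dist^p blow up as the query approaches a training
    # point, so the rule interpolates the data while averaging over
    # k neighbours everywhere else.
    d = np.linalg.norm(X - x, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] ** p + eps)
    return np.sum(w * y[idx]) / np.sum(w)

# At a training point the (noisy) label is reproduced almost exactly...
print(singular_knn_predict(X[0], X, y) - y[0])
# ...while off the data the prediction is a local weighted average.
print(singular_knn_predict(np.array([0.0, 0.0]), X, y))
```

Despite fitting the noise pointwise, such rules can still be consistent, because the singular weights only dominate in a vanishing neighborhood of each training point.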

On the Convergence of Deep Networks with Sample Quadratic Overparameterization

- Computer Science · ArXiv · 2021

A tight finite-width Neural Tangent Kernel (NTK) equivalence is derived, suggesting that neural networks trained with this technique generalize at least as well as their NTK, and the equivalence can be used to study generalization as well.

Deep Learning Generalization, Extrapolation, and Over-parameterization

- Computer Science · ArXiv · 2022

The training samples, testing samples, and decision boundaries of the models are shown to be closely related in pixel space, and the locations of the decision boundaries in the domain explain the main functional traits of the deep networks.

Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence

- Computer Science · ArXiv · 2022

This analysis provides insight on why memorization can coexist with generalization: in this challenging regime where generalization occurs but UC fails, near-max-margin classifiers simultaneously contain some generalizable components and some overfitting components that memorize the data.

## References

Showing 1–10 of 61 references

Understanding deep learning requires rethinking generalization

- Computer Science · ICLR · 2017

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

- Computer Science · ICML · 2018

The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero, yet it is still unclear why these interpolated solutions perform well on test data.

Learning Multiple Layers of Features from Tiny Images

- Computer Science · 2009

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

Theory of Deep Learning III: explaining the non-overfitting puzzle

- Computer Science · ArXiv · 2018

It is shown that the dynamics associated with gradient descent minimization of nonlinear networks is topologically equivalent, near the asymptotically stable minima of the empirical error, to a linear gradient system in a quadratic potential with a degenerate (for square loss) or almost degenerate (for logistic or cross-entropy loss) Hessian.

The Implicit Bias of Gradient Descent on Separable Data

- Computer Science · J. Mach. Learn. Res. · 2018

We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the…

Diving into the shallows: a computational perspective on large-scale shallow learning

- Computer Science · NIPS · 2017

EigenPro iteration is introduced, a preconditioning scheme based on a small number of approximately computed eigenvectors; injecting this small (computationally inexpensive and SGD-compatible) amount of approximate second-order information leads to major improvements in convergence.
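A full-batch caricature of this idea (illustrative only; EigenPro itself is a stochastic method, and the names below are made up) preconditions Richardson iteration for the kernel system $K\alpha = y$ by damping the top-$q$ eigendirections, which permits a step size governed by the $(q{+}1)$-th eigenvalue rather than the largest one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, iters = 200, 10, 50
X = rng.standard_normal((n, 3))
y = rng.standard_normal(n)

# Gaussian kernel matrix with a small jitter for numerical safety
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / 2.0) + 1e-8 * np.eye(n)

eigvals, eigvecs = np.linalg.eigh(K)
lam, U = eigvals[::-1], eigvecs[:, ::-1]   # descending eigenvalues

def solve(precondition):
    alpha = np.zeros(n)
    # preconditioning permits a step size set by lam[q] instead of lam[0]
    eta = 1.0 / (lam[q] if precondition else lam[0])
    for _ in range(iters):
        g = K @ alpha - y                  # residual of K alpha = y
        if precondition:
            # P g = g - sum_{i<q} (1 - lam[q]/lam[i]) u_i u_i^T g
            g -= U[:, :q] @ ((1.0 - lam[q] / lam[:q]) * (U[:, :q].T @ g))
        alpha -= eta * g
    return alpha

res_plain = np.linalg.norm(K @ solve(False) - y)
res_pre = np.linalg.norm(K @ solve(True) - y)
print(res_plain, res_pre)  # the preconditioned run leaves a smaller residual
```

Because the top eigendirections are flattened to the scale of `lam[q]`, every error mode contracts at least as fast as in the plain iteration, and the fast decay of kernel eigenvalues is exactly what makes a small $q$ effective.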

Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks

- Computer Science, Mathematics · NIPS · 2017

This work shows that a natural distributional assumption corresponding to {\em eigenvalue decay} of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feed-forward networks of ReLUs).

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

- Computer Science · ICLR · 2017

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.

An Analysis of Deep Neural Network Models for Practical Applications

- Computer Science · ArXiv · 2016

This work presents a comprehensive analysis of metrics important in practical applications: accuracy, memory footprint, parameter count, operation count, inference time, and power consumption, providing information that helps design and engineer efficient DNNs.

Approximation beats concentration? An approximation view on inference with smooth radial kernels

- Computer Science · COLT · 2018

This paper takes the approximation-theory point of view to explore various aspects of smooth kernels related to their inferential properties, showing that the eigenvalues of kernel matrices decay nearly exponentially, with constants depending only on the kernel and the domain.
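The near-exponential eigenvalue decay of smooth-kernel matrices is easy to observe numerically. The sketch below (illustrative, not from the paper) builds a Gaussian-kernel matrix on points from the unit interval and prints successive eigenvalue ratios, which stay roughly constant, i.e. the spectrum decays approximately geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-1, 1, size=(n, 1))

# Smooth (Gaussian) kernel matrix on points drawn from [-1, 1]
K = np.exp(-((X - X.T) ** 2) / 0.5)

eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
ratios = eigs[1:8] / eigs[:7]
print(ratios)  # roughly constant ratios: near-exponential (geometric) decay
```

This fast decay is the property that separates smooth kernels (like the Gaussian) from non-smooth ones (like the Laplacian of the main paper), whose spectra decay only polynomially.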