• Corpus ID: 3617641

# To understand deep learning we need to understand kernel learning

@article{Belkin2018ToUD,
title={To understand deep learning we need to understand kernel learning},
author={Mikhail Belkin and Siyuan Ma and Soumik Mandal},
journal={ArXiv},
year={2018},
volume={abs/1802.01396}
}
• Published 5 February 2018
• Computer Science
• ArXiv
Generalization performance of classifiers in deep learning has recently become a subject of intense study. […] Since most generalization bounds depend polynomially on the norm of the solution, this result implies that they diverge as data increases. Furthermore, the existing bounds do not apply to interpolated classifiers. We also show experimentally that (non-smooth) Laplacian kernels easily fit random labels using a version of SGD, a finding that parallels results reported for ReLU neural…
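The random-label claim in the abstract can be illustrated with a small sketch. This is not the authors' experiment: it assumes synthetic Gaussian inputs, a bandwidth of 1.0, and a direct linear solve in place of the SGD variant the abstract mentions. The names `laplacian_kernel`, `bandwidth`, `n`, and `d` are hypothetical choices made for illustration.

```python
import numpy as np


def laplacian_kernel(A, B, bandwidth=1.0):
    """Laplacian kernel matrix: K[i, j] = exp(-||A[i] - B[j]||_2 / bandwidth)."""
    # Pairwise squared Euclidean distances via the expansion of ||a - b||^2.
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    # Clip tiny negative values caused by floating-point round-off.
    dists = np.sqrt(np.maximum(sq, 0.0))
    return np.exp(-dists / bandwidth)


rng = np.random.default_rng(0)
n, d = 200, 10  # assumed sample size and input dimension

X = rng.standard_normal((n, d))          # synthetic inputs (assumption)
y = rng.choice([-1.0, 1.0], size=n)      # labels drawn independently of X

K = laplacian_kernel(X, X)

# Assumption: solve K @ alpha = y directly instead of running SGD.
# The Laplacian kernel matrix is positive definite for distinct points,
# so an exact interpolating solution exists.
alpha = np.linalg.solve(K, y)

train_pred = np.sign(K @ alpha)
train_acc = float((train_pred == y).mean())
print("training accuracy on random labels:", train_acc)
```

Because the labels carry no information about the inputs, fitting them exactly shows only the capacity of the kernel to interpolate; it says nothing about test performance.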
281 Citations

Uniform convergence may be unable to explain generalization in deep learning
• Computer Science
NeurIPS
• 2019
Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.
Towards an Understanding of Benign Overfitting in Neural Networks
• Computer Science
ArXiv
• 2021
It is shown that the two-layer ReLU network interpolator can achieve a near minimax-optimal learning rate, which, to the authors' knowledge, is the first generalization result for such networks.
Binary Classification of Gaussian Mixtures: Abundance of Support Vectors, Benign Overfitting, and Regularization
• Computer Science
SIAM J. Math. Data Sci.
• 2022
This paper examines binary linear classification under a generative Gaussian mixture model in which the feature vectors take the form x = ±η + q, and identifies conditions under which the interpolating estimator performs better than corresponding regularized estimates.
On the Generalization Mystery in Deep Learning
• Computer Science
ArXiv
• 2022
The theory provides a causal explanation of how over-parameterized neural networks trained with gradient descent generalize well, and motivates a class of simple modifications to GD that attenuate memorization and improve generalization.
Theoretical issues in deep networks
• Computer Science
Proceedings of the National Academy of Sciences
• 2020
It is proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality.
Generalization Error of Generalized Linear Models in High Dimensions
• Computer Science
ICML
• 2020
This work provides a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems.
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
• Computer Science
NeurIPS
• 2018
A step toward a theoretical foundation for interpolated classifiers is taken by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted $k$-nearest neighbor schemes, and consistency or near-consistency is proved for these schemes in classification and regression problems.
On the Convergence of Deep Networks with Sample Quadratic Overparameterization
• Computer Science
ArXiv
• 2021
A tight finite-width Neural Tangent Kernel (NTK) equivalence is derived, suggesting that neural networks trained with this technique generalize at least as well as their NTK, and that the equivalence can also be used to study generalization.
Deep Learning Generalization, Extrapolation, and Over-parameterization
The training samples, testing samples, and decision boundaries of the models are closely related to each other in the pixel space, and the location of decision boundaries in the domain actually explains the main functional traits of the deep networks.
Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence
• Computer Science
ArXiv
• 2022
This analysis provides insight on why memorization can coexist with generalization: in this challenging regime where generalization occurs but UC fails, near-max-margin classifiers simultaneously contain some generalizable components and some overfitting components that memorize the data.

## References

SHOWING 1-10 OF 61 REFERENCES
Understanding deep learning requires rethinking generalization
• Computer Science
ICLR
• 2017
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
• Computer Science
ICML
• 2018
The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero, yet it is still unclear why these interpolated solutions perform well on test data.
Learning Multiple Layers of Features from Tiny Images
It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.
Theory of Deep Learning III: explaining the non-overfitting puzzle
• Computer Science
ArXiv
• 2018
It is shown that the dynamics associated with gradient descent minimization of nonlinear networks is topologically equivalent, near the asymptotically stable minima of the empirical error, to a linear gradient system in a quadratic potential with a degenerate (for square loss) or almost degenerate (for logistic or cross-entropy loss) Hessian.
The Implicit Bias of Gradient Descent on Separable Data
• Computer Science
J. Mach. Learn. Res.
• 2018
We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the
Diving into the shallows: a computational perspective on large-scale shallow learning
• Computer Science
NIPS
• 2017
EigenPro iteration is introduced, based on a preconditioning scheme using a small number of approximately computed eigenvectors; injecting this small (computationally inexpensive and SGD-compatible) amount of approximate second-order information turns out to lead to major improvements in convergence.
Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks
• Computer Science, Mathematics
NIPS
• 2017
This work shows that a natural distributional assumption corresponding to eigenvalue decay of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feed-forward networks of ReLUs).
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
• Computer Science
ICLR
• 2017
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.
An Analysis of Deep Neural Network Models for Practical Applications
• Computer Science
ArXiv
• 2016
This work presents a comprehensive analysis of important metrics in practical applications: accuracy, memory footprint, parameters, operations count, inference time and power consumption. The authors believe it provides a compelling set of information that helps design and engineer efficient DNNs.
Approximation beats concentration? An approximation view on inference with smooth radial kernels
This paper takes the approximation theory point of view to explore various aspects of smooth kernels related to their inferential properties, and shows that the eigenvalues of kernel matrices exhibit nearly exponential decay with constants depending only on the kernel and the domain.