Corpus ID: 3617641

To understand deep learning we need to understand kernel learning

@article{Belkin2018ToUD,
  title={To understand deep learning we need to understand kernel learning},
  author={Mikhail Belkin and Siyuan Ma and Soumik Mandal},
  journal={ArXiv},
  year={2018},
  volume={abs/1802.01396}
}
Generalization performance of classifiers in deep learning has recently become a subject of intense study. [...] Key result: Since most generalization bounds depend polynomially on the norm of the solution, this result implies that they diverge as data increases. Furthermore, the existing bounds do not apply to interpolated classifiers. We also show experimentally that (non-smooth) Laplacian kernels easily fit random labels using a version of SGD, a finding that parallels results reported for ReLU neural networks.
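The random-label experiment mentioned in the abstract is easy to reproduce at toy scale. Below is a minimal sketch, assuming synthetic Gaussian data, a direct ridgeless linear solve in place of the SGD/EigenPro training used in the paper, and a heuristic median-distance bandwidth; it only illustrates that a Laplacian kernel machine can interpolate purely random labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.normal(size=(n, d))              # synthetic inputs
y = rng.choice([-1.0, 1.0], size=n)      # completely random labels

# Laplacian kernel: k(x, z) = exp(-||x - z|| / s), with a heuristic bandwidth.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
s = np.median(dists)
K = np.exp(-dists / s)

# Interpolating (ridgeless) solution of K @ alpha = y; tiny jitter only for numerics.
alpha = np.linalg.solve(K + 1e-10 * np.eye(n), y)

train_acc = np.mean(np.sign(K @ alpha) == y)
print("training accuracy on random labels:", train_acc)   # ~1.0
```

Near-perfect training accuracy on labels that carry no signal is exactly the interpolation regime the paper argues generalization bounds must confront.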
Uniform convergence may be unable to explain generalization in deep learning
TLDR
Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.
Towards an Understanding of Benign Overfitting in Neural Networks
TLDR
It is shown that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate, which, to the authors' knowledge, is the first generalization result for such networks.
Theoretical issues in deep networks
TLDR
It is proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality.
Generalization Error of Generalized Linear Models in High Dimensions
TLDR
This work provides a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems.
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
TLDR
A theoretical foundation for interpolated classifiers is laid by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted k-nearest neighbor schemes, and consistency or near-consistency is proved for these schemes in classification and regression problems.
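For intuition about the interpolating schemes analyzed there, here is a rough numpy sketch of a weighted nearest-neighbor rule whose weights blow up as the query approaches a training point, so it fits the training data exactly while still averaging locally. The exponent and neighborhood size are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, X_query, k=10, delta=2.0):
    """Interpolating weighted k-NN: weight ~ distance**(-delta) blows up at
    training points, so predictions agree with y_train on the training set."""
    preds = []
    for x in X_query:
        dist = np.linalg.norm(X_train - x, axis=1)
        idx = np.argsort(dist)[:k]
        d_k = dist[idx]
        if d_k[0] == 0.0:                      # query coincides with a training point
            preds.append(y_train[idx[0]])      # exact interpolation
            continue
        w = d_k ** (-delta)                    # singular weights
        preds.append(np.dot(w, y_train[idx]) / w.sum())
    return np.array(preds)

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=200)    # noisy regression targets
print(np.allclose(weighted_knn_predict(X, y, X), y))     # True: fits the noise exactly
```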
On the Convergence of Deep Networks with Sample Quadratic Overparameterization
TLDR
A tight finite-width Neural Tangent Kernel (NTK) equivalence is derived, suggesting that neural networks trained with this technique generalize at least as well as their NTK, which can in turn be used to study their generalization.
A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks
TLDR
It is proved that, under a certain assumption on the data distribution that is milder than linear separability, gradient descent with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error, leading to an algorithm-dependent generalization error bound for deep learning.
The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve
Deep learning methods operate in regimes that defy the traditional statistical mindset. The neural network architectures often contain more parameters than training samples, and are so rich that they [...]
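The double descent curve these asymptotics characterize can be reproduced in a few lines of numpy: ridgeless (minimum-norm) random-features regression whose test error typically spikes near the interpolation threshold, where the number of features matches the number of training samples, and then decreases again. The dimensions, noise level, and ReLU nonlinearity below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 20, 200, 2000
beta = rng.normal(size=d) / np.sqrt(d)

def sample(n):
    X = rng.normal(size=(n, d))
    y = X @ beta + 0.1 * rng.normal(size=n)       # noisy linear target
    return X, y

X_tr, y_tr = sample(n_train)
X_te, y_te = sample(n_test)

for p in [20, 50, 100, 190, 200, 210, 400, 1000, 4000]:   # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    F_tr, F_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)   # ReLU random features
    theta = np.linalg.pinv(F_tr) @ y_tr            # minimum-norm least-squares fit
    err = np.mean((F_te @ theta - y_te) ** 2)
    print(f"p={p:5d}  test MSE={err:.3f}")         # typically peaks near p ~ n_train
```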
A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning
The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good [...]
Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel
TLDR
It is proved that for infinite-width two-layer nets, noisy gradient descent optimizes the regularized neural net loss to a global minimum in a polynomial number of iterations.

References

SHOWING 1-10 OF 62 REFERENCES
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
TLDR
The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero, yet it is still unclear why these interpolated solutions perform well on test data.
Learning Multiple Layers of Features from Tiny Images
TLDR
It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks
TLDR
A case is made linking two observations: small-batch and large-batch gradient descent appear to converge to different basins of attraction but are in fact connected through their flat region and so belong to the same basin.
Theory of Deep Learning III: explaining the non-overfitting puzzle
TLDR
It is shown that the dynamics associated with gradient descent minimization of nonlinear networks are topologically equivalent, near the asymptotically stable minima of the empirical error, to a linear gradient system in a quadratic potential with a degenerate (for square loss) or almost degenerate (for logistic or cross-entropy loss) Hessian.
The Implicit Bias of Gradient Descent on Separable Data
We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the [...]
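The result of that paper, that the predictor direction converges to the max-margin (hard-margin SVM) direction, is easy to check numerically. A small sketch, assuming synthetic separable Gaussian clusters, plain full-batch gradient descent, and scikit-learn's SVC with a very large C as a stand-in for the hard-margin reference (all illustrative choices):

```python
import numpy as np
from sklearn.svm import SVC    # used only to obtain a reference max-margin direction

rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(n // 2, 2)),
               rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(n // 2, 2))])
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])   # linearly separable labels

# Plain gradient descent on the unregularized logistic loss.
w, lr = np.zeros(2), 0.5
for _ in range(100_000):
    margins = y * (X @ w)
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n   # gradient of mean logistic loss
    w -= lr * grad

# Hard-margin SVM direction, approximated by a linear SVC with a very large C.
w_svm = SVC(kernel="linear", C=1e6).fit(X, y).coef_.ravel()

cosine = np.dot(w, w_svm) / (np.linalg.norm(w) * np.linalg.norm(w_svm))
print("cosine(GD direction, max-margin direction) =", cosine)   # tends toward 1
```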
Diving into the shallows: a computational perspective on large-scale shallow learning
TLDR
The EigenPro iteration is introduced, based on a preconditioning scheme that uses a small number of approximately computed eigenvectors; it turns out that injecting this small (computationally inexpensive and SGD-compatible) amount of approximate second-order information leads to major improvements in convergence.
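For intuition about the preconditioning idea summarized above, here is a hedged numpy sketch: it computes the top few eigenpairs of the kernel matrix and rescales the gradient along those directions so that the usable step size is set by a much smaller eigenvalue. The Gaussian kernel, full eigendecomposition, and full-batch updates are illustrative simplifications, not the released EigenPro implementation, which estimates the eigenvectors approximately and works with stochastic updates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=n))       # noisy binary targets

# Gaussian kernel matrix (the bandwidth here is an arbitrary illustrative choice).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2.0 * 5.0))

# Top eigenpairs of K define the preconditioner.
q = 20
eigvals, eigvecs = np.linalg.eigh(K)
lam = eigvals[::-1][:q + 1]          # lambda_1 >= ... >= lambda_{q+1}
V = eigvecs[:, ::-1][:, :q]          # top-q eigenvectors
tail = lam[q]                        # lambda_{q+1} becomes the effective top eigenvalue

def precondition(g):
    """Shrink g along the top-q eigendirections so the preconditioned operator's
    largest eigenvalue is lambda_{q+1} instead of lambda_1."""
    coeffs = V.T @ g
    return g - V @ ((1.0 - tail / lam[:q]) * coeffs)

alpha = np.zeros(n)
lr = 1.0 / tail                      # far larger than the 1/lambda_1 plain GD would allow
for _ in range(300):
    grad = K @ alpha - y             # gradient of 0.5*a^T K a - y^T a; minimizer solves K a = y
    alpha -= lr * precondition(grad)

print("train MSE:", np.mean((K @ alpha - y) ** 2))
```

The design point is that after preconditioning, the largest eigenvalue seen by gradient descent is lambda_{q+1} rather than lambda_1, which is what permits the much larger step size.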
Deep Image Prior
TLDR
It is shown that a randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, super-resolution, and inpainting.
Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks
TLDR
This work shows that a natural distributional assumption corresponding to eigenvalue decay of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feed-forward networks of ReLUs).
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
TLDR
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.