Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

  • Mikhail Belkin
  • Published 2021
  • Mathematics, Computer Science
  • Acta Numerica, pages 203-248
In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep learning. The two key themes will be interpolation and its sibling over-parametrization…
The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks
The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data.
A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning
The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good…
How Can Increased Randomness in Stochastic Gradient Descent Improve Generalization?
Recent works report that increasing the learning rate or decreasing the minibatch size in stochastic gradient descent (SGD) can improve test set performance. We argue that this behavior is indeed…


Deep learning: a statistical viewpoint
This article surveys recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings, and focuses specifically on the linear regime for neural networks, where the network can be approximated by a linear model.
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero, so it is still unclear why these interpolated solutions perform well on test data.
Uniform convergence may be unable to explain generalization in deep learning
Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.
The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve
Deep learning methods operate in regimes that defy the traditional statistical mindset. The neural network architectures often contain more parameters than training samples, and are so rich that they…
To understand deep learning we need to understand kernel learning
It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed to understand the properties of classical kernel methods.
Do Deeper Convolutional Networks Perform Better?
This work analyzes the effect of increasing depth on test performance on CIFAR10 and ImageNet32 using ResNets and fully-convolutional networks, and posits an explanation for this phenomenon by drawing intuition from the principle of minimum norm solutions in linear networks.
Harmless interpolation of noisy data in regression
A bound on how well such interpolative solutions can generalize to fresh test data is given, and it is shown that this bound generically decays to zero with the number of extra features, thus characterizing an explicit benefit of overparameterization.
Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning
This work shows that optimization problems corresponding to over-parameterized systems of non-linear equations are not convex, even locally, but instead satisfy the Polyak-Lojasiewicz (PL) condition allowing for efficient optimization by gradient descent or SGD.
Understanding overfitting peaks in generalization error: Analytical risk curves for l2 and l1 penalized interpolation
  • P. Mitra
  • Computer Science, Physics
  • ArXiv
  • 2019
A generative and fitting model pair is introduced, and it is shown that the overfitting peak can be dissociated from the point at which the fitting function gains enough degrees of freedom to match the data generative model and thus provides good generalization.
Understanding deep learning requires rethinking generalization
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
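
As a rough illustration of the interpolation phenomenon described in the entry above, the sketch below fits purely random labels exactly with an over-parameterized random-features model: a depth-two network whose first layer is fixed at random, with only the output layer solved for. This is a toy under stated assumptions, not the experimental setup of any paper listed here. The sizes, the ReLU features, the minimum-norm solve via the pseudoinverse, and the function name `fit_random_labels` are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_random_labels(n_samples=50, n_inputs=10, n_features=500, seed=0):
    """Interpolate random +/-1 labels with a wide random-features model.

    All sizes are illustrative assumptions; the only requirement for exact
    interpolation here is n_features >= n_samples (over-parameterization).
    """
    rng = np.random.default_rng(seed)

    # Random inputs and labels that carry no signal at all.
    X = rng.standard_normal((n_samples, n_inputs))
    y = rng.choice([-1.0, 1.0], size=n_samples)

    # Fixed random first layer followed by a ReLU nonlinearity.
    W = rng.standard_normal((n_inputs, n_features))
    Phi = np.maximum(X @ W, 0.0)  # shape: (n_samples, n_features)

    # Minimum-norm output weights that satisfy Phi @ coef = y.
    coef = np.linalg.pinv(Phi) @ y

    train_mse = float(np.mean((Phi @ coef - y) ** 2))
    return train_mse, coef


if __name__ == "__main__":
    mse, coef = fit_random_labels()
    print(f"training MSE on random labels: {mse:.2e}")
```

With many more features than samples, the feature matrix generically has full row rank, so the training error is zero up to floating-point precision even though the labels are noise. The sketch says nothing about test performance, which is the question the listed papers address.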