Corpus ID: 244729597

The Geometric Occam's Razor Implicit in Deep Learning

Benoit Richard Umbert Dherin, Micheal Munn, David G. T. Barrett
In over-parameterized deep neural networks there can be many possible parameter configurations that fit the training data exactly. However, the properties of these interpolating solutions are poorly understood. We argue that over-parameterized neural networks trained with stochastic gradient descent are subject to a Geometric Occam’s Razor; that is, these networks are implicitly regularized by the geometric model complexity. For one-dimensional regression, the geometric model complexity is… 

Manifold Characteristics That Predict Downstream Task Performance
It is shown that self-supervised methods learn a representation manifold (RM) in which alterations lead to large but constant size changes, indicating a smoother RM than the one learned by fully supervised methods.


Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process
This work describes the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term: the sum, over the data points, of the squared $\ell_2$ norm of the gradient of the model with respect to the parameter vector, evaluated at each data point.
Understanding deep learning requires rethinking generalization
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
Gradient Regularization Improves Accuracy of Discriminative Models
It is demonstrated through experiments on real and synthetic tasks that stochastic gradient descent is unable to find locally optimal but globally unproductive solutions, and is forced to find solutions that generalize well.
Implicit bias of gradient descent for mean squared error regression with wide neural networks
It is shown that the solution of training a width-n shallow ReLU network is within $n^{-1/2}$ of the function which fits the training data and whose difference from initialization has smallest 2-norm of the second derivative weighted by $1/\zeta$.
The Sobolev Regularization Effect of Stochastic Gradient Descent
This work considers high-order moments of the gradient noise, and shows that stochastic gradient descent (SGD) tends to impose constraints on these moments by a linear stability analysis of SGD around global minima.
Degrees of Freedom in Deep Neural Networks
It is shown that the degrees of freedom in these models are related to the expected optimism, which is the expected difference between test error and training error, and it is observed that, for a fixed number of parameters, deeper networks have fewer degrees of freedom, exhibiting a regularization-by-depth.
Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation
Just as a physical prism separates colours mixed within a ray of light, the figurative prism of interpolation helps to disentangle generalization and optimization properties within the complex picture of modern machine learning.
Understanding Deep Neural Networks with Rectified Linear Units
The gap theorems hold for smoothly parametrized families of "hard" functions, contrary to countable, discrete families known in the literature, and a new lower bound on the number of affine pieces is shown, larger than previous constructions in certain regimes of the network architecture.
On the Origin of Implicit Regularization in Stochastic Gradient Descent
It is proved that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite, but on a modified loss.
Training with Noise is Equivalent to Tikhonov Regularization
This paper shows that for the purposes of network training, the regularization term can be reduced to a positive semi-definite form that involves only first derivatives of the network mapping.
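
Several of the entries above describe regularizers built from gradients of the model. As a minimal sketch of the term described in the Ornstein-Uhlenbeck entry (the sum, over the data points, of the squared $\ell_2$ norm of the gradient of the model with respect to the parameters), the following Python computes it for a toy model. The two-parameter model `a * tanh(b * x)` and the names `model`, `grad_theta`, and `implicit_regularizer` are illustrative assumptions, not taken from any of the listed papers.

```python
import math


def model(theta, x):
    """Hypothetical two-parameter model f(x; theta) = a * tanh(b * x)."""
    a, b = theta
    return a * math.tanh(b * x)


def grad_theta(theta, x):
    """Analytic gradient of the model output with respect to (a, b)."""
    a, b = theta
    t = math.tanh(b * x)
    # df/da = tanh(b x);  df/db = a * x * (1 - tanh(b x)^2)
    return (t, a * x * (1.0 - t * t))


def implicit_regularizer(theta, xs):
    """Sum over data points of the squared l2 norm of grad_theta f(x_i; theta)."""
    return sum(g * g for x in xs for g in grad_theta(theta, x))


if __name__ == "__main__":
    # At b = 0 the gradient at each point is (0, a * x), so each point contributes (a * x)^2.
    print(implicit_regularizer((2.0, 0.0), [1.0, -3.0]))
```

For a real network the analytic gradient would be replaced by automatic differentiation; the toy model is used here only so the sketch stays self-contained.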