Deep Double Descent via Smooth Interpolation

@article{Gamba2022DeepDD,
  title={Deep Double Descent via Smooth Interpolation},
  author={Matteo Gamba and Erik Englesson and Mårten Björkman and Hossein Azizpour},
  journal={ArXiv},
  year={2022},
  volume={abs/2209.10080}
}
Overparameterized deep networks are known to be able to perfectly fit the training data while at the same time showing good generalization performance. A common paradigm drawn from intuition on linear regression suggests that large networks are able to interpolate even noisy data, without considerably deviating from the ground-truth signal. At present, a precise characterization of this phenomenon is missing. In this work, we present an empirical study of sharpness of the loss landscape of deep… 
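
As a concrete, hypothetical illustration of what "smooth interpolation" can mean in input space, the sketch below estimates how sharply the training loss of a fitted model reacts to small input perturbations around a training point. The names `f` and `loss` and all parameters are placeholders; this is only a proxy, not the paper's exact sharpness measure.

```python
# Illustrative proxy (not the paper's exact measure): average directional
# sensitivity of the training loss to small input perturbations around x.
import numpy as np

def input_sharpness(f, loss, x, y, eps=1e-3, n_dirs=10, rng=None):
    """f: model callable, loss: loss(pred, target) -> float (user-supplied)."""
    rng = np.random.default_rng() if rng is None else rng
    base = loss(f(x), y)
    sens = []
    for _ in range(n_dirs):
        d = rng.standard_normal(x.shape)
        d /= np.linalg.norm(d)                       # unit-norm random direction
        sens.append(abs(loss(f(x + eps * d), y) - base) / eps)
    return float(np.mean(sens))
```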

Overparameterization Implicitly Regularizes Input-Space Smoothness

  • Matteo Gamba, Hossein Azizpour
  • Computer Science
  • 2022
An empirical study of the Lipschitz constant of networks trained in practice, as the number of model parameters and training epochs vary, highlighting a theoretical shortcoming of modeling input-space smoothness via uniform bounds.
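
A minimal sketch of how a Lipschitz constant can be lower-bounded empirically, assuming a trained model `f` that maps NumPy arrays to NumPy arrays; the estimator actually used in the paper may differ.

```python
# Empirical lower bound on the Lipschitz constant: the largest observed ratio
# ||f(x_i) - f(x_j)|| / ||x_i - x_j|| over sampled input pairs.
import itertools
import numpy as np

def lipschitz_lower_bound(f, xs):
    ratios = []
    for xi, xj in itertools.combinations(xs, 2):
        dx = np.linalg.norm(xi - xj)
        if dx > 0:
            ratios.append(np.linalg.norm(f(xi) - f(xj)) / dx)
    return max(ratios)
```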

References

Showing 1-10 of 36 references

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

This work investigates the cause of the generalization drop observed in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization.
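
For illustration, one simple sharpness proxy consistent with this view measures how much the training loss increases under random parameter perturbations of a fixed radius; `loss_fn`, `rho`, and `n_dirs` below are placeholders, and this is not necessarily the sharpness metric used in the paper.

```python
# Sharpness proxy: average training-loss increase under random perturbations
# of the flattened parameter vector w with radius rho.
import numpy as np

def sharpness_proxy(loss_fn, w, rho=0.05, n_dirs=20, rng=None):
    """loss_fn maps a parameter vector to a scalar training loss (user-supplied)."""
    rng = np.random.default_rng() if rng is None else rng
    base = loss_fn(w)
    increases = []
    for _ in range(n_dirs):
        d = rng.standard_normal(w.shape)
        d *= rho / np.linalg.norm(d)                 # perturbation of radius rho
        increases.append(loss_fn(w + d) - base)
    return float(np.mean(increases))
```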

Harmless interpolation of noisy data in regression

A bound is given on how well interpolating solutions can generalize to fresh test data, and it is shown that this bound generically decays to zero with the number of extra features, thus characterizing an explicit benefit of overparameterization.
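
The toy sketch below illustrates this effect with a minimum-norm interpolator on pure-noise labels: as the number of features grows past the sample size, the excess test error caused by fitting the noise shrinks. The data-generation setup is hypothetical and only meant to convey the idea.

```python
# Minimum-norm interpolation of pure-noise labels: the harm from fitting the
# noise (measured on fresh inputs) decays as the number of features d grows.
import numpy as np

rng = np.random.default_rng(0)
n = 50
y = rng.standard_normal(n)                       # pure-noise labels
for d in (60, 200, 2000):                        # overparameterized: d > n
    X = rng.standard_normal((n, d))
    w_hat = np.linalg.pinv(X) @ y                # minimum-norm interpolating solution
    X_test = rng.standard_normal((5000, d))
    noise_harm = np.mean((X_test @ w_hat) ** 2)  # excess error from fitted noise
    print(d, np.allclose(X @ w_hat, y), round(noise_harm, 4))
```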

Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks

The theory presented addresses the following core question: "Should one train a small model from the beginning, or first train a large model and then prune?" It analytically identifies regimes in which, even if the location of the most informative features is known, one is better off fitting a large model and then pruning rather than simply training with the known informative features.
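
The sketch below shows one common instantiation of the "train large, then prune" step via global magnitude pruning; the paper's analysis concerns linear models and is not tied to this particular pruning rule.

```python
# Global magnitude pruning: zero out the smallest-magnitude entries of a
# flattened weight vector until the target sparsity is reached.
import numpy as np

def magnitude_prune(w, sparsity):
    k = int(sparsity * w.size)                   # number of weights to remove
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w), k - 1)[k - 1]
    pruned = w.copy()
    pruned[np.abs(w) <= threshold] = 0.0
    return pruned
```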

Benign overfitting in linear regression

A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.

Sharpness-Aware Minimization for Efficiently Improving Generalization

This work introduces Sharpness-Aware Minimization (SAM), a novel, effective procedure for simultaneously minimizing loss value and loss sharpness, which improves model generalization across a variety of benchmark datasets and models, yielding new state-of-the-art performance on several of them.
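
A sketch of a single SAM update in its standard form: take an approximate worst-case perturbation within an L2 ball of radius rho, then descend using the gradient computed at the perturbed point. `loss_grad`, `lr`, and `rho` are user-supplied placeholders.

```python
# One SAM step: perturb toward the (first-order) worst-case neighbor, then
# apply a descent step with the gradient evaluated at the perturbed point.
import numpy as np

def sam_step(w, loss_grad, lr=0.1, rho=0.05, eps=1e-12):
    g = loss_grad(w)
    e = rho * g / (np.linalg.norm(g) + eps)      # approximate worst-case perturbation
    g_sharp = loss_grad(w + e)                   # gradient at the perturbed point
    return w - lr * g_sharp                      # sharpness-aware descent step
```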

Sensitivity and Generalization in Neural Networks: an Empirical Study

It is found that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that this sensitivity measure correlates well with generalization.
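
A minimal sketch of this sensitivity measure, assuming a trained PyTorch `model` and a single input tensor `x`: the Frobenius norm of the input-output Jacobian computed with autograd.

```python
# Frobenius norm of the input-output Jacobian at a single input point.
import torch

def jacobian_frobenius_norm(model, x):
    """x: a single input tensor (no batch handling, for simplicity)."""
    jac = torch.autograd.functional.jacobian(model, x)
    return jac.flatten().norm().item()
```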

Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

A theoretical foundation for interpolating classifiers is developed by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted $k$-nearest neighbor schemes, and consistency or near-consistency is proved for these schemes in classification and regression problems.
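
A sketch of a singularly weighted k-nearest-neighbor regressor in the spirit of the schemes analyzed here: the weights diverge as the query approaches a training point, so the estimator interpolates the training data while averaging over k neighbors elsewhere. The specific exponent and k below are illustrative, not the paper's.

```python
# Singularly weighted k-NN regression: inverse-power distance weights blow up
# near training points, so the fitted function passes through the training data.
import numpy as np

def weighted_knn_predict(x_query, X_train, y_train, k=5, a=2.0, eps=1e-12):
    dists = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(dists)[:k]                  # k nearest neighbors
    w = 1.0 / (dists[idx] ** a + eps)            # singular weights as dist -> 0
    return float(np.dot(w, y_train[idx]) / w.sum())
```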

What Happens after SGD Reaches Zero Loss? -A Mathematical Framework

A general framework for analyzing the implicit bias of Stochastic Gradient Descent (SGD) is given, using a stochastic differential equation (SDE) that describes the limiting dynamics of the parameters, which are determined jointly by the loss function and the noise covariance.
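
For context, a commonly used SDE model of SGD dynamics, written here in a generic form that may differ from the paper's exact formulation, is:

```latex
% Generic SDE approximation of SGD with learning rate \eta, loss L, and
% minibatch-noise covariance \Sigma(\theta); the limiting dynamics near zero
% loss are determined jointly by L and \Sigma.
d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\eta\,\Sigma(\theta_t)}\,dW_t
```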

A Universal Law of Robustness via Isoperimetry

It is proved that, for a broad class of data distributions and model classes, overparameterization is necessary if one wants to interpolate the data smoothly; in particular, smooth interpolation requires d times more parameters than mere interpolation, where d is the data dimension.
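
In rough form, up to constants, logarithmic factors, and the paper's isoperimetry and noise assumptions, the law can be summarized as:

```latex
% Rough statement: fitting n noisy samples in dimension d with p parameters
% forces the Lipschitz constant to grow like sqrt(nd/p), so an O(1)-Lipschitz
% (smooth) interpolator needs on the order of nd parameters.
\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{p}}
\qquad\Longrightarrow\qquad
\mathrm{Lip}(f) = O(1) \;\text{ requires }\; p \gtrsim nd
```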

Robust Overfitting may be mitigated by properly learned smoothening

Two empirical means of injecting more learned smoothening during adversarially robust training of deep networks are investigated: one leverages knowledge distillation and self-training to smooth the logits, the other performs stochastic weight averaging (Izmailov et al., 2018) to smooth the weights.
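
A minimal sketch of the weight-averaging component, assuming flattened NumPy weight vectors: stochastic weight averaging maintains a running mean of the weights visited during training and evaluates the averaged model.

```python
# Incremental running mean of flattened weight vectors, as used by stochastic
# weight averaging (SWA).
import numpy as np

def swa_update(w_avg, w_current, n_averaged):
    """Update the running average after seeing `n_averaged` previous checkpoints."""
    return w_avg + (w_current - w_avg) / (n_averaged + 1)

# usage sketch: at each averaging checkpoint during training
# w_swa = swa_update(w_swa, current_weights, num_checkpoints); num_checkpoints += 1
```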