Corpus ID: 208910128

In Defense of Uniform Convergence: Generalization via derandomization with an application to interpolating predictors

@inproceedings{Negrea2020InDO,
  title={In Defense of Uniform Convergence: Generalization via derandomization with an application to interpolating predictors},
  author={Jeffrey Negrea and Gintare Karolina Dziugaite and Daniel M. Roy},
  booktitle={ICML},
  year={2020}
}
We propose to study the generalization error of a learned predictor $\hat h$ in terms of that of a surrogate (potentially randomized) predictor that is coupled to $\hat h$ and designed to trade empirical risk for control of generalization error. In the case where $\hat h$ interpolates the data, it is interesting to consider theoretical surrogate classifiers that are partially derandomized or rerandomized, e.g., fit to the training data but with modified label noise. We also show that replacing… 
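To make the surrogate idea concrete, the following is a minimal sketch of the kind of risk decomposition the abstract alludes to, under assumed notation rather than the paper's own: $\tilde h$ denotes the surrogate predictor coupled to $\hat h$, $R$ the population risk, and $\hat R_n$ the empirical risk on the $n$ training points.

% Sketch only, using the assumed notation above; not the paper's exact statement.
\[
R(\hat h) - \hat R_n(\hat h)
= \underbrace{\bigl(R(\hat h) - R(\tilde h)\bigr)}_{\text{risk gap induced by the coupling}}
+ \underbrace{\bigl(R(\tilde h) - \hat R_n(\tilde h)\bigr)}_{\text{generalization error of } \tilde h}
+ \underbrace{\bigl(\hat R_n(\tilde h) - \hat R_n(\hat h)\bigr)}_{\text{empirical risk traded away}}.
\]

Only the middle term calls for a uniform-convergence style argument, and it concerns the surrogate rather than $\hat h$ itself; the coupling is designed so that the outer two terms stay small, which is the sense in which empirical risk is traded for control of generalization error.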

Citations of this paper

Structure from Randomness in Halfspace Learning with the Zero-One Loss
TLDR
The results suggest that the study of compressive learning can improve the understanding of which benign structural traits – if they are possessed by the data generator – make it easier to learn an effective classifier from a sample.
Optimistic Rates: A Unifying Theory for Interpolation Learning and Regularization in Linear Regression
TLDR
The optimistic rate bound is studied for linear regression with Gaussian data, recovering some classical statistical guarantees for ridge and LASSO regression under random designs and yielding a precise understanding of the excess risk of near-interpolators in the over-parameterized regime.
Generalization bounds for deep learning
TLDR
Desiderata for techniques that predict generalization errors for deep learning models in supervised learning are introduced, and a marginal-likelihood PAC-Bayesian bound is derived that fulfills desiderata 1-3 and 5.
The Implicit Bias of Benign Overfitting
TLDR
This paper proposes a prototypical and rather generic data model for benign overfitting of linear predictors, where an arbitrary input distribution of some fixed dimension k is concatenated with a high-dimensional distribution and proves that the max-margin predictor is asymptotically biased towards minimizing a weighted squared hinge loss.
Towards Understanding Generalization via Decomposing Excess Risk Dynamics
TLDR
Inspired by the observation that neural networks show a slow convergence rate when fitting noise, this work proposes decomposing the excess risk dynamics and applying a stability-based bound only on the variance part (which measures how the model performs on pure noise), and provides two applications of the framework.
Uniform Convergence, Adversarial Spheres and a Simple Remedy
TLDR
It is proved that the Neural Tangent Kernel (NTK) also suffers from the same phenomenon, its origin is uncovered, and the important role of the output bias is highlighted, showing theoretically as well as empirically how a sensible choice completely mitigates the problem.
Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting
We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an…
Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data
TLDR
This work considers the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization and shows that in this setting, neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly matching any noisy training labels, and simultaneously achieve test error close to the Bayes-optimal error.
Generalization of GANs and overparameterized models under Lipschitz continuity
TLDR
Bounds show that penalizing the Lipschitz constant of the GAN loss can improve generalization, and it is shown that, when using Dropout or spectral normalization, both truly deep neural networks and GANs can generalize well without the curse of dimensionality.
The Sample Complexity of One-Hidden-Layer Neural Networks
TLDR
It is proved that in general, controlling the spectral norm of the hidden layer weight matrix is insufficient to get uniform convergence guarantees (independent of the network width), while a stronger Frobenius norm control is sufficient, extending and improving on previous work.

References

Showing 1-10 of 26 references
Uniform convergence may be unable to explain generalization in deep learning
TLDR
Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.
To understand deep learning we need to understand kernel learning
TLDR
It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed to understand the properties of classical kernel methods.
Global Minima of DNNs: The Plenty Pantry
TLDR
This work shows that for certain large hypothesis classes, some interpolating ERMs enjoy very good statistical guarantees while others fail in the worst sense, and shows that the same phenomenon occurs for DNNs with zero training error and sufficiently large architectures.
Benign overfitting in linear regression
TLDR
A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation
TLDR
Inspired by the theory, this work directly regularizes the network's Jacobians during training and empirically demonstrates that this improves test performance.
Surprises in High-Dimensional Ridgeless Least Squares Interpolation
TLDR
This paper recovers, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
Controlling Bias in Adaptive Data Analysis Using Information Theory
TLDR
A general information-theoretic framework to quantify and provably bound the bias and other statistics of an arbitrary adaptive analysis process is proposed, and it is proved that the mutual information based bound is tight in natural models.
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
Reconciling modern machine-learning practice and the classical bias–variance trade-off
TLDR
This work shows how classical theory and modern practice can be reconciled within a single unified performance curve and proposes a mechanism underlying its emergence, and provides evidence for the existence and ubiquity of double descent for a wide spectrum of models and datasets.
High-dimensional dynamics of generalization error in neural networks