Corpus ID: 221761367

Distributional Generalization: A New Kind of Generalization

Preetum Nakkiran and Yamini Bansal
We introduce a new notion of generalization -- Distributional Generalization -- which roughly states that outputs of a classifier at train and test time are close *as distributions*, as opposed to close in just their average error. For example, if we mislabel 30% of dogs as cats in the train set of CIFAR-10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats on the *test set* as well, while leaving other classes unaffected. This behavior is not captured by… 
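The noisy-label phenomenon described above can be reproduced with a minimal sketch on hypothetical synthetic data (not CIFAR-10), using a 1-nearest-neighbour classifier as the interpolating model; all names and parameters below are illustrative. Because 1-NN fits the training labels exactly, the 30% dog-to-cat noise injected at train time reappears at roughly the same rate on clean test points:

```python
import numpy as np

rng = np.random.default_rng(0)
CAT, DOG = 0, 1

def sample(n_per_class):
    # Two well-separated 2D Gaussian blobs, one per class.
    X = np.concatenate([rng.normal(0.0, 1.0, (n_per_class, 2)),
                        rng.normal(4.0, 1.0, (n_per_class, 2))])
    y = np.concatenate([np.full(n_per_class, CAT), np.full(n_per_class, DOG)])
    return X, y

X_tr, y_tr = sample(1000)
X_te, y_te = sample(1000)

# Mislabel 30% of the training dogs as cats; the test labels stay clean.
dogs = np.flatnonzero(y_tr == DOG)
noisy = y_tr.copy()
noisy[rng.choice(dogs, size=int(0.3 * len(dogs)), replace=False)] = CAT

def predict_1nn(X_train, y_train, X):
    # 1-NN interpolates: every training point, noisy label included, is fit exactly.
    d2 = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[d2.argmin(axis=1)]

pred = predict_1nn(X_tr, noisy, X_te)
dog_as_cat = (pred[y_te == DOG] == CAT).mean()
print(f"test-time dog->cat rate: {dog_as_cat:.2f}")  # close to the 0.30 train-time noise
```

The rate matches because the nearest training neighbour of a clean test dog is almost always a training dog, and 30% of those carry the flipped label; the cat class is left essentially untouched, mirroring the paper's "other classes unaffected" observation.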
Deconstructing Distributions: A Pointwise Framework of Learning
This work studies a point’s profile: the relationship between models’ average performance on the test distribution and their pointwise performance on that individual point, and finds that profiles can yield new insights into the structure of both models and data, both in- and out-of-distribution.
Knowledge Distillation: Bad Models Can Be Good Role Models
It is proved that distillation from samplers is guaranteed to produce a student which approximates the Bayes optimal classifier, and it is shown that some common learning algorithms (e.g., Nearest-Neighbours and Kernel Machines) can generate samplers when applied in the overparameterized regime.
A Note on "Assessing Generalization of SGD via Disagreement"
This paper presents empirical and theoretical evidence that the average test error of deep neural networks can be estimated via the prediction disagreement of two separately trained networks, and shows that the approach may be impractical because a deep ensemble’s calibration deteriorates under distribution shift.
Assessing Generalization of SGD via Disagreement
We empirically show that the test error of deep networks can be estimated by training the same architecture on the same training set with two different runs of Stochastic Gradient Descent (SGD) and measuring the disagreement between the two runs' predictions on unlabeled test data.
Datamodels: Predicting Predictions from Training Data
It is shown that even simple linear datamodels can successfully predict model outputs and give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space.
Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting
This work argues that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime.
The Deep Bootstrap: Good Online Learners are Good Offline Generalizers
A new framework for reasoning about generalization in deep learning is proposed, comparing real-world offline training to an idealized online-learning world with fresh samples, together with empirical evidence that the gap between these two worlds can be small in realistic deep learning settings, in particular supervised image classification.
Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift
Whenever accuracy-on-the-line holds, the OOD agreement between the predictions of any pair of neural networks also shows a strong linear correlation with their ID agreement, and a prediction algorithm based on this observation outperforms previous methods both in shifts where agreement-on-the-line holds and when accuracy is not on the line.
Early-stopped neural networks are consistent
This work studies the behavior of shallow ReLU networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and shows that early-stopped networks are consistent, i.e., their test error approaches the Bayes-optimal error.
Predicting Out-of-Distribution Error with the Projection Norm
This work proposes a metric—Projection Norm—to predict a model’s performance on out-of-distribution (OOD) data without access to ground truth labels and finds that Projection Norm is the only approach that achieves non-trivial detection performance on adversarial examples.
Uniform convergence may be unable to explain generalization in deep learning
Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.
To understand deep learning we need to understand kernel learning
It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed to understand the properties of classical kernel methods.
Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss
It is shown that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions.
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
By optimizing a PAC-Bayes bound directly, this work extends the approach of Langford and Caruana (2001) and obtains nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters trained on only tens of thousands of examples.
Learning with Noisy Labels
The problem of binary classification in the presence of random classification noise is theoretically studied—the learner sees labels that have independently been flipped with some small probability, and methods used in practice such as biased SVM and weighted logistic regression are provably noise-tolerant.
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
A theoretical foundation for interpolated classifiers is laid by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted $k$-nearest neighbor schemes, and consistency or near-consistency is proved for these schemes in classification and regression problems.
Learning Not to Learn in the Presence of Noisy Labels
It is shown that a new class of loss functions called the gambler's loss provides strong robustness to label noise across various levels of corruption, resulting in a simple and effective method to improve robustness and generalization.
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
This paper analyzes training and generalization for a simple two-layer ReLU network with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural network with random labels leads to slower training, and a data-dependent complexity measure.
Learning Multiple Layers of Features from Tiny Images
It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
It is proved that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, via SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples.