Corpus ID: 3566398

Stronger generalization bounds for deep nets via a compression approach

@inproceedings{Arora2018StrongerGB,
  title={Stronger generalization bounds for deep nets via a compression approach},
  author={Sanjeev Arora and Rong Ge and Behnam Neyshabur and Yi Zhang},
  booktitle={ICML},
  year={2018}
}
Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting. The current paper shows generalization bounds that are orders of magnitude better in practice. These rely upon new succinct reparametrizations of the trained net --- a compression that is explicit and efficient. These yield… 
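Stated informally, the paper's core guarantee has the following shape (a rough sketch; the precise conditions, the compression algorithm, and the constants are in the paper): if the trained net f can be compressed, with respect to margin \gamma on the m training samples, into a classifier g_A described by q parameters that each take at most r discrete values, then with high probability

L_0(g_A) \;\le\; \hat{L}_\gamma(f) \;+\; O\!\left(\sqrt{\frac{q \log r}{m}}\right),

where L_0(g_A) is the classification error of the compressed net and \hat{L}_\gamma(f) is the empirical margin loss of the original net. The layer-wise noise-stability properties of trained nets are what make q much smaller than the raw parameter count, which is why the resulting bounds can beat naive parameter counting.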
Compression Implies Generalization
TLDR
A compression-based framework is established that is simple and powerful enough to extend the generalization bounds of Arora et al. to also hold for the original network, and that allows simple proofs of the strongest known generalization bounds for other popular machine learning models, namely Support Vector Machines and Boosting.
Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach
TLDR
This paper provides the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem and establishes an absolute limit on expected compressibility as a function of expected generalization error.
Uniform convergence may be unable to explain generalization in deep learning
TLDR
Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.
Generalization bounds via distillation
TLDR
This paper theoretically investigates the following empirical phenomenon: given a high-complexity network with poor generalization bounds, one can distill it into a network with nearly identical predictions but low complexity and vastly smaller generalization bounds; it also presents a variety of experiments demonstrating similar generalization performance between the original network and its distillation.
Towards a Theoretical Understanding of Hashing-Based Neural Nets
TLDR
This paper introduces a neural net compression scheme based on random linear sketching, and shows that the sketched network is able to approximate the original network on all input data coming from any smooth well-conditioned low-dimensional manifold, implying that the parameters in HashedNets can be provably recovered.
Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network
TLDR
A unified framework is given that can convert compression-based bounds into bounds for the non-compressed original network, and yields a data-dependent generalization error bound that is tighter than the data-independent ones.
Generalization bounds for deep learning
TLDR
Desiderata for techniques that predict generalization errors for deep learning models in supervised learning are introduced, and a marginal-likelihood PAC-Bayesian bound is derived that fulfills desiderata 1-3 and 5.
Generalization Bounds for Neural Networks: Kernels, Symmetry, and Sample Compression
TLDR
A reparameterization of DNNs is presented as a linear function of a feature map that is locally independent of the weights; the map transforms depth-dependencies into simple tensor products and maps each input to a discrete subset of the feature space.
Improved Sample Complexities for Deep Networks and Robust Classification via an All-Layer Margin
TLDR
This work analyzes a new notion of margin that is shown to have a clear and direct relationship with generalization for deep models, and presents a theoretically inspired training algorithm for increasing the all-layer margin.
...
...

References

SHOWING 1-10 OF 41 REFERENCES
Sharp Minima Can Generalize For Deep Nets
TLDR
It is argued that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization; when focusing on deep networks with rectifier units, the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit is exploited.
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
TLDR
By optimizing the PAC-Bayes bound directly, the approach of Langford and Caruana (2001) is extended to obtain nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters, trained on only tens of thousands of examples.
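For reference, a standard McAllester-style PAC-Bayes bound of the kind such approaches optimize numerically (a generic form rather than the paper's exact statement, with constants differing between versions): for any prior P chosen before seeing the m samples, with probability at least 1-\delta, every posterior Q over classifiers satisfies

e(Q) \;\le\; \hat{e}(Q) + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(2\sqrt{m}/\delta)}{2m}},

where e(Q) and \hat{e}(Q) are the expected and empirical error of the randomized classifier drawn from Q. Obtaining a nonvacuous guarantee then amounts to finding a posterior (e.g., a Gaussian over the weights) with small empirical error and small KL divergence to the prior.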
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
TLDR
This work investigates the cause of this generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions, and, as is well known, sharp minima lead to poorer generalization.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Relating Data Compression and Learnability
TLDR
It is demonstrated that the existence of a suitable data compression scheme is sufficient to ensure learnability and the introduced compression scheme provides a rigorous model for studying data compression in connection with machine learning.
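The classical sample compression bound underlying this connection can be sketched as follows (a generic statement, with constants varying by version): if the learner's hypothesis can be reconstructed from a compression set of k of the m training examples and is consistent with the remaining m-k examples, then with probability at least 1-\delta its true error satisfies

\mathrm{err} \;\le\; \frac{1}{m-k}\left( k \ln m + \ln\frac{1}{\delta} \right).

Arora et al.'s framework differs in that it compresses the classifier rather than the sample, but the high-level message is the same: a short description of the learned predictor controls its generalization error.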
Train faster, generalize better: Stability of stochastic gradient descent
We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable.
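The link being exploited is the standard relation between uniform stability and generalization in expectation (a classical fact rather than this paper's contribution): if an algorithm is \epsilon-uniformly stable, i.e., replacing any single training example changes its loss on any point by at most \epsilon, then

\big|\, \mathbb{E}[\text{test loss}] - \mathbb{E}[\text{training loss}] \,\big| \;\le\; \epsilon,

so bounding the stability of SGM after a given number of iterations directly bounds its expected generalization gap.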
Fisher-Rao Metric, Geometry, and Complexity of Neural Networks
TLDR
An analytical characterization of the new Fisher-Rao norm is given, through which it is shown that the measure serves as an umbrella for several existing norm-based complexity measures, and norm-comparison inequalities are established.
Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation
In this article we prove sanity-check bounds for the error of the leave-one-out cross-validation estimate of the generalization error: that is, bounds showing that the worst-case error of this estimate is not much worse than that of the training error estimate.
On the importance of single directions for generalization
TLDR
It is found that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.
...
...