Corpus ID: 232170627

Robustness to Pruning Predicts Generalization in Deep Neural Networks

Lorenz Kuhn, Clare Lyle, Aidan N. Gomez, Jonas Rothfuss, Y. Gal

Existing generalization measures that aim to capture a model’s simplicity based on parameter counts or norms fail to explain generalization in overparameterized deep neural networks. In this paper, we introduce a new, theoretically motivated measure of a network’s simplicity which we call prunability: the smallest fraction of the network’s parameters that can be kept while pruning without adversely affecting its training loss. We show that this measure is highly predictive of a model’s…
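
The definition of prunability above can be illustrated with a small sketch. It is not the authors' implementation: it assumes magnitude pruning as the pruning method, an absolute tolerance on the training loss as the meaning of "without adversely affecting", and a toy linear model. The names `magnitude_prune`, `prunability`, `tolerance`, and `toy_loss` are hypothetical and chosen for illustration only.

```python
import random
from typing import Callable, List, Sequence


def magnitude_prune(weights: Sequence[float], keep_fraction: float) -> List[float]:
    """Zero out all but the largest-magnitude `keep_fraction` of the weights."""
    n_keep = round(keep_fraction * len(weights))
    # Indices sorted by decreasing |w|; the first n_keep survive.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:n_keep])
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]


def prunability(
    weights: Sequence[float],
    loss_fn: Callable[[Sequence[float]], float],
    fractions: Sequence[float],
    tolerance: float = 0.01,
) -> float:
    """Smallest keep-fraction whose pruned training loss stays within
    `tolerance` (an assumed absolute slack) of the unpruned loss."""
    budget = loss_fn(weights) + tolerance
    for fraction in sorted(fractions):
        if loss_fn(magnitude_prune(weights, fraction)) <= budget:
            return fraction
    return 1.0  # nothing could be pruned within the budget


# Toy training set: the target depends on only 2 of the 10 input features.
rng = random.Random(0)
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(10)] for _ in range(50)]
targets = [3.0 * x[0] - 2.0 * x[2] for x in inputs]


def toy_loss(weights: Sequence[float]) -> float:
    """Mean squared error of a linear model on the toy training set."""
    errors = [
        sum(w * xi for w, xi in zip(weights, x)) - y
        for x, y in zip(inputs, targets)
    ]
    return sum(e * e for e in errors) / len(errors)


# "Trained" weights: two large useful weights plus small nuisance weights.
trained = [3.0, 0.01, -2.0, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
result = prunability(trained, toy_loss, [i / 10 for i in range(1, 11)])
print(result)
```

In this toy setting a lower value means a more prunable model. The candidate keep-fractions are scanned in ascending order, so the first one that stays within the loss budget is the smallest.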

Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks
This study proves that the networks are guaranteed to be ‘ℓ_p-compressible’, and that the compression errors of different pruning techniques (magnitude, singular value, or node pruning) become arbitrarily small as the network size increases. It also proves generalization bounds, adapted to the theoretical framework, which confirm that the generalization error will be lower for more compressible networks.
Sifting out the features by pruning: Are convolutional networks the winning lottery ticket of fully connected ones?
The ability of such an automatic network-simplifying procedure to recover the key features “hand-crafted” in the design of CNNs suggests interesting applications to other datasets and tasks, in order to discover new and efficient architectural inductive biases.


Predicting the Generalization Gap in Deep Networks with Margin Distributions
This paper proposes a measure based on the margin distribution, that is, the distances of training points to the decision boundary, and finds that it is necessary to use margin distributions at multiple layers of a deep network.
Understanding deep learning requires rethinking generalization
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
SNIP: Single-shot Network Pruning based on Connection Sensitivity
This work presents a new approach that prunes a given network once at initialization prior to training, and introduces a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task.
Fantastic Generalization Measures and Where to Find Them
This work presents the first large-scale study of generalization in deep networks, investigating more than 40 complexity measures taken from both theoretical bounds and empirical studies and showing surprising failures of some measures as well as promising measures for further research.
To prune, or not to prune: exploring the efficacy of pruning for model compression
Across a broad range of neural network architectures, large-sparse models are found to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.
On the importance of single directions for generalization
It is found that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.
Generalization in Deep Networks: The Role of Distance from Initialization
Empirical evidence is provided that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of the distance from initialization, along with theoretical arguments that further highlight the need for initialization-dependent notions of model capacity.
Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach
This paper provides the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem and establishes an absolute limit on expected compressibility as a function of expected generalization error.
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
By optimizing the PAC-Bayes bound directly, this work extends the approach of Langford and Caruana (2001) and obtains nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters trained on only tens of thousands of examples.
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
It is proved that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations, and that the learning can be done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples.