Corpus ID: 235313427

Optimization Variance: Exploring Generalization Properties of DNNs

@article{Zhang2021OptimizationVE,
  title={Optimization Variance: Exploring Generalization Properties of DNNs},
  author={Xiao Zhang and Dongrui Wu and Haoyi Xiong and Bo Dai},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.01714}
}
Unlike the conventional wisdom in statistical learning theory, the test error of a deep neural network (DNN) often demonstrates double descent: as the model complexity increases, it first follows a classical U-shaped curve and then shows a second descent. Through bias-variance decomposition, recent studies revealed that the bell-shaped variance is the major cause of model-wise double descent (when the DNN is widened gradually). This paper investigates epoch-wise double descent, i.e., the test… 
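The bias-variance decomposition the abstract refers to can be sketched roughly as follows. This is only an illustration (not the paper's code): random probability vectors stand in for the softmax outputs of several independently trained DNNs at a fixed training epoch, and the helper name bias_variance_decomposition is ours.

```python
import numpy as np

def bias_variance_decomposition(pred_probs, targets):
    """Estimate squared bias and variance of an ensemble of classifiers'
    predicted class probabilities, via the classical mean-squared-error
    decomposition.

    pred_probs: (n_models, n_samples, n_classes) predictions of models
        trained from different random seeds (hypothetical inputs here).
    targets: (n_samples, n_classes) one-hot labels.
    """
    mean_pred = pred_probs.mean(axis=0)                              # ensemble-average prediction
    bias_sq = ((mean_pred - targets) ** 2).sum(axis=1).mean()        # squared bias
    variance = ((pred_probs - mean_pred) ** 2).sum(axis=2).mean()    # variance across models
    return bias_sq, variance

# Toy usage: 5 "models", 100 samples, 10 classes; random numbers stand in
# for real predictions collected at a given epoch.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=(5, 100))
labels = np.eye(10)[rng.integers(0, 10, size=100)]
print(bias_variance_decomposition(probs, labels))
```

Tracking these two quantities epoch by epoch is one way to see which term drives an epoch-wise double-descent curve.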
Improving Meta-Learning Generalization with Activation-Based Early-Stopping
TLDR
Activation-Based Early-stopping (ABE) is proposed as an alternative to validation-based early-stopping for meta-learning, and results show that validation-based early-stopping, for all three algorithms, can vary significantly.
Regularization-wise double descent: Why it occurs and how to eliminate it
TLDR
It is found that, for linear regression, a double-descent-shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model, and that it can be mitigated by scaling the regularization strength of each part appropriately.
Disparity Between Batches as a Signal for Early Stopping
We propose a metric for evaluating the generalization ability of deep neural networks trained with mini-batch gradient descent. Our metric, called gradient disparity, is the $\ell_2$ norm distance
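The gradient-disparity metric described in this entry (an $\ell_2$ norm distance between the parameter gradients computed on two mini-batches) can be sketched as follows. This is an illustration of the idea, not the authors' implementation; the PyTorch linear model and random batches are placeholders.

```python
import torch

def gradient_disparity(model, loss_fn, batch_a, batch_b):
    """l2 distance between the parameter gradients computed on two
    mini-batches (illustrative sketch of the metric described above)."""
    grads = []
    for x, y in (batch_a, batch_b):
        model.zero_grad()
        loss_fn(model(x), y).backward()
        # Flatten and concatenate the gradients of all parameters.
        g = torch.cat([p.grad.reshape(-1) for p in model.parameters()
                       if p.grad is not None])
        grads.append(g.clone())
    return torch.norm(grads[0] - grads[1]).item()

# Toy usage with a hypothetical linear classifier and random batches.
model = torch.nn.Linear(20, 10)
loss_fn = torch.nn.CrossEntropyLoss()
batch_a = (torch.randn(32, 20), torch.randint(0, 10, (32,)))
batch_b = (torch.randn(32, 20), torch.randint(0, 10, (32,)))
print(gradient_disparity(model, loss_fn, batch_a, batch_b))
```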

References

SHOWING 1-10 OF 40 REFERENCES
Early Stopping in Deep Networks: Double Descent and How to Eliminate it
TLDR
Inspired by this theory, two standard convolutional networks are studied empirically, and it is shown that eliminating epoch-wise double descent by adjusting the step sizes of different layers significantly improves early-stopping performance.
A Modern Take on the Bias-Variance Tradeoff in Neural Networks
TLDR
It is found that both bias and variance can decrease as the number of parameters grows, and a new decomposition of the variance is introduced to disentangle the effects of optimization and data sampling.
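The decomposition mentioned here, separating variance due to optimization (random seed) from variance due to data sampling, can be illustrated with the law of total variance. The sketch below is only schematic, not the paper's estimator: random numbers stand in for the test-time outputs of models trained on different datasets and seeds.

```python
import numpy as np

# Hypothetical outputs of a model at one test input, indexed by
# (training dataset draw, optimization seed). Shape: (n_datasets, n_seeds).
rng = np.random.default_rng(1)
preds = rng.normal(size=(8, 8))

# Law of total variance: total variance = variance over seeds averaged
# over datasets (optimization) + variance over datasets of the
# seed-averaged output (data sampling).
var_optimization = preds.var(axis=1).mean()
var_sampling = preds.mean(axis=1).var()
var_total = preds.var()

print(var_optimization, var_sampling, var_total)
# var_optimization + var_sampling == var_total (up to floating point)
```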
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
TLDR
This work investigates the cause of this generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization.
SGD on Neural Networks Learns Functions of Increasing Complexity
TLDR
Key to the work is a new measure of how well one classifier explains the performance of another, based on conditional mutual information, which can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime.
Rethinking Bias-Variance Trade-off for Generalization of Neural Networks
TLDR
This work measures the bias and variance of neural networks and finds that deeper models decrease bias and increase variance for both in-distribution and out-of-distribution data, and corroborates these empirical results with a theoretical analysis of two-layer linear networks with a random first layer.
High-dimensional dynamics of generalization error in neural networks
Rethink the Connections among Generalization, Memorization and the Spectral Bias of DNNs
TLDR
It is shown that, under the experimental setup of deep double descent, the high-frequency components of DNNs begin to diminish in the second descent while examples with random labels are still being memorized, and that the spectrum of DNNs can be used to monitor test behavior.
Deep double descent: where bigger models and more data hurt
TLDR
A new complexity measure, the effective model complexity, is defined, and a generalized double descent is conjectured with respect to this measure; this notion of model complexity makes it possible to identify regimes where increasing the number of training samples actually hurts test performance.
Sharp Minima Can Generalize For Deep Nets
TLDR
It is argued that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization; focusing on deep networks with rectifier units, the particular geometry of parameter space induced by the inherent symmetries these architectures exhibit is exploited.
Reconciling modern machine-learning practice and the classical bias–variance trade-off
TLDR
This work shows how classical theory and modern practice can be reconciled within a single unified performance curve, proposes a mechanism underlying its emergence, and provides evidence for the existence and ubiquity of double descent across a wide spectrum of models and datasets.
...