Corpus ID: 235313427

Optimization Variance: Exploring Generalization Properties of DNNs

@article{Zhang2021OptimizationVE,
  title={Optimization Variance: Exploring Generalization Properties of DNNs},
  author={Xiao Zhang and Dongrui Wu and Haoyi Xiong and Bo Dai},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.01714}
}
Unlike the conventional wisdom in statistical learning theory, the test error of a deep neural network (DNN) often demonstrates double descent: as the model complexity increases, it first follows a classical U-shaped curve and then shows a second descent. Through bias-variance decomposition, recent studies revealed that the bell-shaped variance is the major cause of model-wise double descent (when the DNN is widened gradually). This paper investigates epoch-wise double descent, i.e., the test… 
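
The decomposition referred to above can be made concrete with a short sketch. The following NumPy snippet (synthetic predictions and targets, not the paper's experiments) computes the squared-loss split of the expected test error into squared bias plus variance across independently trained models.

```python
import numpy as np

# Synthetic stand-in: preds[k, i] is the prediction of the k-th independently
# trained model on test point i; y_true[i] is the corresponding target.
rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
preds = y_true + rng.normal(scale=0.5, size=(10, 200))  # 10 training runs

mean_pred = preds.mean(axis=0)                    # average prediction across runs
bias_sq = np.mean((mean_pred - y_true) ** 2)      # squared bias
variance = np.mean(preds.var(axis=0))             # variance across runs
expected_error = np.mean((preds - y_true) ** 2)   # expected squared error

# Under squared loss: expected_error == bias_sq + variance (up to float error).
print(bias_sq, variance, expected_error)
```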

Regularization-wise double descent: Why it occurs and how to eliminate it

TLDR
It is found that for linear regression, a double descent shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model and can be mitigated by scaling the regularization strength of each part appropriately.
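
As a rough illustration of per-part regularization scaling (a generic generalized-ridge construction, not necessarily the paper's), the coefficient vector can be split into blocks with different penalty strengths; the block sizes and strengths below are placeholders.

```python
import numpy as np

# Generalized ridge: a separate regularization strength for each block of
# coefficients (the split and the strengths below are illustrative only).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=100)

lam = np.concatenate([np.full(10, 0.1), np.full(10, 10.0)])  # two parts, two strengths
w = np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)          # closed-form solution
print(w[:3])
```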

Disparity Between Batches as a Signal for Early Stopping

We propose a metric for evaluating the generalization ability of deep neural networks trained with mini-batch gradient descent. Our metric, called gradient disparity, is the $\ell_2$ norm distance
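
A minimal PyTorch sketch of such a quantity, assuming gradient disparity is taken as the $\ell_2$ distance between the flattened loss gradients of two mini-batches (the model, data, and batch size below are toy placeholders):

```python
import torch
import torch.nn as nn

# Toy model and loss; the architecture and data are placeholders.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()

def batch_grad(x, y):
    """Flattened gradient of the mini-batch loss w.r.t. all parameters."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

x1, y1 = torch.randn(32, 20), torch.randint(0, 2, (32,))
x2, y2 = torch.randn(32, 20), torch.randint(0, 2, (32,))

# l2 distance between the two mini-batch gradients.
disparity = torch.norm(batch_grad(x1, y1) - batch_grad(x2, y2), p=2)
print(disparity.item())
```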

Improving Meta-Learning Generalization with Activation-Based Early-Stopping

TLDR
Activation Based Early-stopping (ABE) is proposed as an alternative to using validation-based early-stopping for meta-learning, and results show that, for all three algorithms, validation-based early-stopping can vary significantly.

References

SHOWING 1-10 OF 40 REFERENCES

Early Stopping in Deep Networks: Double Descent and How to Eliminate it

TLDR
Inspired by this theory, two standard convolutional networks are studied empirically and it is shown that eliminating epoch-wise double descent through adjusting stepsizes of different layers improves the early stopping performance significantly.
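
Layer-wise step sizes of the kind mentioned above can be expressed with standard PyTorch optimizer parameter groups; the sketch below is a generic illustration, and the learning rates are placeholders rather than the adjustment proposed in the paper.

```python
import torch
import torch.nn as nn

# Per-layer step sizes via optimizer parameter groups (learning rates are
# placeholders, not the schedule proposed in the paper).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 32 * 32, 10),
)
optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 0.01},  # early layer: smaller step
        {"params": model[3].parameters(), "lr": 0.1},   # final layer: larger step
    ],
    momentum=0.9,
)
```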

Scaling description of generalization with number of parameters in deep learning

TLDR
This work relies on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations of the neural net output function $f_N$ around its expectation, which affects the generalization error for classification.

A Convergence Theory for Deep Learning via Over-Parameterization

TLDR
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in $\textit{polynomial time}$ and implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.

A Modern Take on the Bias-Variance Tradeoff in Neural Networks

TLDR
It is found that both bias and variance can decrease as the number of parameters grows, and a new decomposition of the variance is introduced to disentangle the effects of optimization and data sampling.
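
One standard way to disentangle these two sources (not necessarily the paper's exact estimator) is a law-of-total-variance split over training sets and random seeds; a NumPy sketch with synthetic predictions:

```python
import numpy as np

# preds[d, s, i]: prediction on test point i after training on dataset d with seed s.
rng = np.random.default_rng(1)
preds = rng.normal(size=(5, 8, 100))  # synthetic placeholder

var_total = preds.reshape(-1, 100).var(axis=0).mean()
var_optimization = preds.var(axis=1).mean()            # across seeds, given the data
var_sampling = preds.mean(axis=1).var(axis=0).mean()   # across datasets, seed-averaged

# Law of total variance: var_total == var_optimization + var_sampling.
print(var_total, var_optimization + var_sampling)
```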

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

TLDR
This work investigates the cause for this generalization drop in the large-batch regime and presents numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization.

SGD on Neural Networks Learns Functions of Increasing Complexity

TLDR
Key to the work is a new measure of how well one classifier explains the performance of another, based on conditional mutual information, which can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime.
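
The paper's measure is not reproduced here, but a plug-in estimate of conditional mutual information for discrete predictions and labels, the quantity such a measure builds on, can be sketched in NumPy as follows; the example data and the classifiers f and g are hypothetical.

```python
import numpy as np

def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X; Y | Z) in nats for discrete 1-D arrays."""
    x, y, z = map(np.asarray, (x, y, z))
    cmi = 0.0
    for zv in np.unique(z):
        mask = z == zv
        pz = mask.mean()
        xs, ys = x[mask], y[mask]
        for xv in np.unique(xs):
            for yv in np.unique(ys):
                pxy = np.mean((xs == xv) & (ys == yv))  # p(x, y | z)
                px, py = np.mean(xs == xv), np.mean(ys == yv)
                if pxy > 0:
                    cmi += pz * pxy * np.log(pxy / (px * py))
    return cmi

# Hypothetical example: labels y, predictions of two classifiers g and f.
rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=1000)
g = np.where(rng.random(1000) < 0.8, y, 1 - y)   # g agrees with y 80% of the time
f = np.where(rng.random(1000) < 0.9, g, 1 - g)   # f mostly follows g
print(conditional_mutual_information(y, f, g))   # near 0: g "explains" f's fit to y
```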

Rethinking Bias-Variance Trade-off for Generalization of Neural Networks

TLDR
This work measures the bias and variance of neural networks and finds that deeper models decrease bias and increase variance for both in-distribution and out-of-distribution data, and corroborates these empirical results with a theoretical analysis of two-layer linear networks with random first layer.

High-dimensional dynamics of generalization error in neural networks

Rethink the Connections among Generalization, Memorization and the Spectral Bias of DNNs

TLDR
It is shown that under the experimental setup of deep double descent, the high-frequency components of DNNs begin to diminish in the second descent, whereas the examples with random labels are still being memorized, and that the spectrum of DNNs can be applied to monitor the test behavior.
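
As a rough illustration of monitoring high-frequency components, the sketch below computes the high-frequency energy fraction of a 1-D function on a grid via an FFT; the function stands in for a network's output, and the frequency cutoff of 20 is arbitrary.

```python
import numpy as np

# Stand-in for a network's output on a 1-D grid: low- plus high-frequency parts.
xs = np.linspace(0, 1, 512, endpoint=False)
f = np.sin(2 * np.pi * 3 * xs) + 0.2 * np.sin(2 * np.pi * 60 * xs)

spectrum = np.abs(np.fft.rfft(f))
freqs = np.fft.rfftfreq(len(xs), d=xs[1] - xs[0])
high_freq_fraction = spectrum[freqs > 20].sum() / spectrum.sum()  # cutoff is arbitrary
print(high_freq_fraction)
```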

Understanding deep learning requires rethinking generalization

TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.