• Corpus ID: 235313427

# Optimization Variance: Exploring Generalization Properties of DNNs

@article{Zhang2021OptimizationVE,
title={Optimization Variance: Exploring Generalization Properties of DNNs},
author={Xiao Zhang and Dongrui Wu and Haoyi Xiong and Bo Dai},
journal={ArXiv},
year={2021},
volume={abs/2106.01714}
}
• Published 3 June 2021
• Computer Science
• ArXiv
Unlike the conventional wisdom in statistical learning theory, the test error of a deep neural network (DNN) often demonstrates double descent: as the model complexity increases, it first follows a classical U-shaped curve and then shows a second descent. Through bias-variance decomposition, recent studies revealed that the bell-shaped variance is the major cause of model-wise double descent (when the DNN is widened gradually). This paper investigates epoch-wise double descent, i.e., the test…
3 Citations

Improving Meta-Learning Generalization with Activation-Based Early-Stopping
• Computer Science
ArXiv
• 2022
Activation-Based Early-Stopping (ABE) is proposed as an alternative to validation-based early-stopping for meta-learning, and results show that the method validation-based early-stopping, for all three algorithms, can vary significantly.
Regularization-wise double descent: Why it occurs and how to eliminate it
• Computer Science
ArXiv
• 2022
It is found that for linear regression, a double descent shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model and can be mitigated by scaling the regularization strength of each part appropriately.
Disparity Between Batches as a Signal for Early Stopping
• Computer Science
ECML/PKDD
• 2021
We propose a metric for evaluating the generalization ability of deep neural networks trained with mini-batch gradient descent. Our metric, called gradient disparity, is the $\ell_2$ norm distance

## References

Early Stopping in Deep Networks: Double Descent and How to Eliminate it
• Computer Science
ICLR
• 2021
Inspired by this theory, two standard convolutional networks are studied empirically and it is shown that eliminating epoch-wise double descent through adjusting stepsizes of different layers improves the early stopping performance significantly.
A Modern Take on the Bias-Variance Tradeoff in Neural Networks
• Computer Science
ArXiv
• 2018
It is found that both bias and variance can decrease as the number of parameters grows, and a new decomposition of the variance is introduced to disentangle the effects of optimization and data sampling.
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
• Computer Science
ICLR
• 2017
This work investigates the cause for this generalization drop in the large-batch regime and presents numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization.
SGD on Neural Networks Learns Functions of Increasing Complexity
• Computer Science
NeurIPS
• 2019
Key to the work is a new measure of how well one classifier explains the performance of another, based on conditional mutual information, which can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime.
Rethinking Bias-Variance Trade-off for Generalization of Neural Networks
• Computer Science
ICML
• 2020
This work measures the bias and variance of neural networks and finds that deeper models decrease bias and increase variance for both in-distribution and out-of-distribution data, and corroborates these empirical results with a theoretical analysis of two-layer linear networks with random first layer.
Rethink the Connections among Generalization, Memorization and the Spectral Bias of DNNs
• Computer Science
IJCAI
• 2021
It is shown that under the experimental setup of deep double descent, the high-frequency components of DNNs begin to diminish in the second descent, whereas the examples with random labels are still being memorized, and the spectrum of DNNs can be applied to monitoring the test behavior.
Deep double descent: where bigger models and more data hurt
• Computer Science
ICLR
• 2020
The notion of model complexity allows us to identify certain regimes where increasing the number of train samples actually hurts test performance; the work defines a new complexity measure called the effective model complexity and conjectures a generalized double descent with respect to this measure.
Sharp Minima Can Generalize For Deep Nets
• Computer Science
ICML
• 2017
It is argued that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization, and when focusing on deep networks with rectifier units, the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit is exploited.
Reconciling modern machine-learning practice and the classical bias–variance trade-off
• Computer Science
Proceedings of the National Academy of Sciences
• 2019
This work shows how classical theory and modern practice can be reconciled within a single unified performance curve and proposes a mechanism underlying its emergence, and provides evidence for the existence and ubiquity of double descent for a wide spectrum of models and datasets.