Corpus ID: 244920772

A generalization gap estimation for overparameterized models via Langevin functional variance

@article{Okuno2021AGG,
  title={A generalization gap estimation for overparameterized models via Langevin functional variance},
  author={Akifumi Okuno and Keisuke Yano},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.03660}
}
This paper discusses the estimation of the generalization gap, the difference between a generalization error and an empirical error, for overparameterized models (e.g., neural networks). We first show that the functional variance, a key concept in defining the widely applicable information criterion (WAIC), characterizes the generalization gap even in overparameterized settings where conventional theory cannot be applied. We also propose a computationally efficient approximation of the functional variance…
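The abstract describes the estimator only at a high level, so the following is a minimal sketch, not the authors' implementation, of how a WAIC-style functional variance could be computed from approximate posterior samples. The array shapes, the gaussian_log_lik helper, and the assumption of a known noise scale are illustrative choices, not details taken from the paper.

import numpy as np

def functional_variance(log_lik):
    """WAIC-style functional variance (a sketch).

    log_lik: array of shape (S, n), where log_lik[s, i] is the
    log-likelihood of observation i under the s-th (approximate)
    posterior sample of the parameters.

    Returns the sum over observations of the posterior variance of the
    per-observation log-likelihood; divided by n, it serves as a
    plug-in estimate of the generalization gap.
    """
    per_obs_var = log_lik.var(axis=0, ddof=1)  # variance across samples, per observation
    return float(per_obs_var.sum())

# Hypothetical usage: Gaussian linear regression with posterior samples
# `thetas` of shape (S, d) drawn by some Langevin-type sampler, design
# matrix X of shape (n, d), targets y of shape (n,), known noise scale sigma.
def gaussian_log_lik(thetas, X, y, sigma=1.0):
    preds = thetas @ X.T                      # (S, n) predicted means
    resid = y[None, :] - preds                # (S, n) residuals
    return -0.5 * (resid / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# gap_estimate = functional_variance(gaussian_log_lik(thetas, X, y)) / len(y)

In the paper's setting the samples would come from a Langevin-type dynamics rather than an exact posterior sampler; a sketch of the corresponding update rule is given after the reference list below.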


References

Showing 1-10 of 49 references
Asymptotic Risk of Overparameterized Likelihood Models: Double Descent Theory for Deep Neural Networks
TLDR: This study considers a likelihood maximization problem without model constraints, analyzes an upper bound on the asymptotic risk of a penalized estimator, and demonstrates that several explicit models, such as parallel deep neural networks and ensemble learning, agree with the theory.
The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization
TLDR: This work provides a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent.
Stochastic Gradient Descent as Approximate Bayesian Inference
TLDR: It is demonstrated that constant-stepsize SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models, and a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler, is proposed.
Benign overfitting in linear regression
TLDR: A characterization of linear regression problems for which the minimum-norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach
TLDR: Novel statistics of the Fisher information matrix (FIM) are revealed that are universal among a wide class of DNNs and can be connected to a norm-based capacity measure of generalization ability, and the potential use of the derived statistics in learning strategies is demonstrated.
Understanding deep learning requires rethinking generalization
TLDR: These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite-sample expressivity.
Approximation Analysis of Stochastic Gradient Langevin Dynamics by using Fokker-Planck Equation and Ito Process
TLDR: This work theoretically analyzes the SGLD algorithm with constant stepsize and shows, using the Fokker-Planck equation, that the probability distribution of random variables generated by the SGLD algorithm converges to the Bayesian posterior.
Scaling description of generalization with number of parameters in deep learning
TLDR: This work relies on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations that affect the generalization error of neural networks.
Theoretical guarantees for approximate sampling from smooth and log-concave densities
TLDR: This work establishes non-asymptotic bounds on the error of approximating the target distribution by the distribution produced by the Langevin Monte Carlo method and its variants, and illustrates the effectiveness of the established guarantees; a minimal sketch of the underlying Langevin update is given after this list.
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
TLDR: A theoretical foundation for interpolated classifiers is laid by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted k-nearest-neighbor schemes, and consistency or near-consistency is proved for these schemes in classification and regression problems.
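The two sampling-related entries above (the SGLD analysis and the Langevin Monte Carlo guarantees) concern variants of the same discretized Langevin update. As a point of reference only, here is a minimal sketch of the unadjusted Langevin algorithm; the function names and the fixed step size are illustrative assumptions, and supplying a mini-batch stochastic gradient in place of grad_log_post would give an SGLD-style sampler.

import numpy as np

def ula_samples(grad_log_post, theta0, step=1e-3, n_steps=5000, rng=None):
    """Unadjusted Langevin algorithm (a sketch).

    grad_log_post: callable returning the gradient of the log-posterior
    at a parameter vector.
    theta0: initial parameter vector (1-D array).
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float).copy()
    samples = np.empty((n_steps, theta.size))
    for t in range(n_steps):
        noise = rng.standard_normal(theta.size)
        # theta <- theta + step * grad log pi(theta) + sqrt(2 * step) * N(0, I)
        theta = theta + step * grad_log_post(theta) + np.sqrt(2.0 * step) * noise
        samples[t] = theta
    return samples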