# A Priori Estimates of the Generalization Error for Two-layer Neural Networks

```bibtex
@article{Weinan2018APE,
  title   = {A Priori Estimates of the Generalization Error for Two-layer Neural Networks},
  author  = {E Weinan and Chao Ma and Lei Wu},
  journal = {ArXiv},
  year    = {2018},
  volume  = {abs/1810.06397}
}
```
• Published 27 September 2018
• Computer Science
• ArXiv
New estimates for the generalization error are established for the two-layer neural network model. […] Moreover, these bounds are equally effective in the over-parametrized regime when the network size is much larger than the size of the dataset.
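
For orientation, the result has the flavor of a width-plus-sample-size estimate. The following schematic is a paraphrase rather than a quotation of the paper's theorem: γ(f*) stands for a Barron/spectral-type norm of the target function f*, m for the network width, n for the sample size, and d for the input dimension, and constants and logarithmic factors are omitted.

```latex
% Schematic form of an a priori generalization bound for a suitably
% regularized two-layer network.  Paraphrased notation: gamma(f^*) is a
% Barron-type norm of the target, m the width, n the sample size, d the
% input dimension; constants and log factors are omitted.
\[
  L(\hat{\theta}_n) \;\lesssim\; \frac{\gamma(f^*)^2}{m}
  \;+\; \gamma(f^*)\sqrt{\frac{\log(2d)}{n}}
\]
```

The width m enters only through the first term, which shrinks as the network grows, so the bound does not degrade when the network size far exceeds the dataset size; this is the sense in which such bounds remain effective in the over-parametrized regime.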

## Citations of this paper

• ArXiv, 2019 (Mathematics, Computer Science). Optimal a priori estimates are derived for the population risk, also known as the generalization error, of a regularized residual network model, which treats the skip connections and the nonlinearities differently so that paths with more nonlinearities are regularized by larger weights.
• ArXiv, 2019 (Computer Science, Mathematics). It is proved that for all three models, the generalization error for the minimum-norm solution is comparable to the Monte Carlo rate, up to some logarithmic terms, as long as the models are sufficiently over-parametrized.
• Science China Mathematics, 2020 (Computer Science). In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels.
• 2020 (Computer Science, Mathematics). The obtained approximation and estimation rates are independent of the dimension of the input, showing that the curse of dimensionality can be overcome in this setting; in fact, the input dimension only enters in the form of a polynomial factor.
• ArXiv, 2020 (Computer Science). It is shown that the depth of the neural network only needs to increase much more slowly in order to obtain the same rate of approximation as an arbitrary stochastic optimization algorithm with i.i.d. random initializations.
• ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Computer Science). Assuming the existence of the underlying ground-truth encoder and decoder, a priori estimates of the generalization error are established for autoencoders when an appropriately chosen regularization term is applied.
• We describe a necessary and sufficient condition for the convergence to minimum Bayes risk when training two-layer ReLU networks by gradient descent in the mean field regime with omni-directional […]
• ArXiv, 2019 (Computer Science). It is proved that in the over-parametrized regime, for a suitable initialization, with high probability GD can find a global minimum exponentially fast, and it is shown that the GD path is uniformly close to the functions given by the related random feature model.
• ArXiv, 2019 (Computer Science). An effective model of linear F-Principle (LFP) dynamics is proposed, which accurately predicts the learning results of two-layer ReLU neural networks (NNs) of large widths and is rationalized by a linearized mean-field residual dynamics of NNs.
• ArXiv, 2019 (Computer Science). This work proves an *a priori* generalization error bound for two-layer ReLU NNs, which implies that NNs do not suffer from the curse of dimensionality and that a small generalization error can be achieved without requiring an exponentially large number of neurons.
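
Several of the a priori bounds summarized above are phrased in terms of a path norm of the two-layer network. The sketch below shows one common variant of that quantity for a network of the form f(x) = Σ_k a_k ReLU(b_k · x + c_k); the helper name `path_norm` and the choice to include the bias in the norm are illustrative assumptions, not definitions taken from any particular paper listed here.

```python
import numpy as np

def path_norm(a, B, c=None):
    """Path norm of a two-layer ReLU network f(x) = sum_k a_k * relu(B_k . x + c_k).

    One common variant: sum_k |a_k| * (||B_k||_1 + |c_k|).  Whether the bias
    enters the norm differs between papers; including it here is an
    illustrative choice.
    """
    a = np.asarray(a)            # shape (m,)   output-layer weights
    B = np.asarray(B)            # shape (m, d) input-layer weights
    inner = np.abs(B).sum(axis=1)
    if c is not None:
        inner = inner + np.abs(np.asarray(c))
    return float(np.sum(np.abs(a) * inner))

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
m, d = 8, 3
a, B, c = rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)
print(path_norm(a, B, c))
```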

## References

Showing 1-10 of 59 references.

• ArXiv, 2019 (Mathematics, Computer Science). Optimal a priori estimates are derived for the population risk, also known as the generalization error, of a regularized residual network model, which treats the skip connections and the nonlinearities differently so that paths with more nonlinearities are regularized by larger weights.
• Science China Mathematics, 2020 (Computer Science). In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels.
• ICLR, 2019 (Computer Science). A novel complexity measure based on unit-wise capacities is presented, resulting in a tighter generalization bound for two-layer ReLU networks, together with a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks.
• 2018 (Computer Science, Mathematics). A generalization bound is presented for feedforward neural networks with ReLU activations in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights, thereby bounding the sharpness of the network.
• ArXiv, 2019 (Computer Science). It is proved that in the over-parametrized regime, for a suitable initialization, with high probability GD can find a global minimum exponentially fast, and it is shown that the GD path is uniformly close to the functions given by the related random feature model.
• ArXiv, 2019 (Computer Science). It is proved that, under certain assumptions on the data distribution that are milder than linear separability, gradient descent with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error, leading to an algorithm-dependent generalization error bound for deep learning.
• NIPS, 2017 (Computer Science). This bound is empirically investigated for a standard AlexNet network trained with SGD on the MNIST and CIFAR-10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and that the presented bound is sensitive to this complexity.
• ICLR, 2018 (Computer Science). This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.
• NeurIPS, 2018 (Computer Science). It is proved that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels, when the data comes from mixtures of well-separated distributions.
• NeurIPS, 2018 (Computer Science). It is shown that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers; the analysis involves Wasserstein gradient flows, a by-product of optimal transport theory.