# A Priori Estimates of the Generalization Error for Two-layer Neural Networks

@article{Weinan2018APE, title={A Priori Estimates of the Generalization Error for Two-layer Neural Networks}, author={E Weinan and Chao Ma and Lei Wu}, journal={ArXiv}, year={2018}, volume={abs/1810.06397} }

New estimates of the generalization error are established for the two-layer neural network model. A key result is that these bounds remain equally effective in the over-parametrized regime, when the network size is much larger than the size of the dataset.
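To convey the flavor of such an a priori estimate, the following is an informal sketch (constants, the exact norm definitions, and the precise regularization are simplified relative to the paper's statement):

```latex
% Informal flavor of an a priori generalization bound for a suitably
% regularized two-layer network with m neurons, n samples in dimension d,
% and a target function f^* with Barron-type norm \|f^*\|_B:
\[
  R(\hat{\theta}) \;\lesssim\; \frac{\|f^*\|_B^2}{m}
  \;+\; \|f^*\|_B \sqrt{\frac{\log(2d)}{n}} .
\]
% The right-hand side depends on the target function rather than on the
% network size, which is why the bound stays meaningful when m >> n.
```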

## 45 Citations

### A Priori Estimates of the Population Risk for Residual Networks

- Mathematics, Computer Science
- ArXiv
- 2019

Optimal a priori estimates are derived for the population risk, also known as the generalization error, of a regularized residual network model, which treats the skip connections and the nonlinearities differently so that paths with more nonlinearities are regularized by larger weights.

### On the Generalization Properties of Minimum-norm Solutions for Over-parameterized Neural Network Models

- Computer Science, Mathematics
- ArXiv
- 2019

It is proved that for all three models, the generalization error for the minimum-norm solution is comparable to the Monte Carlo rate, up to some logarithmic terms, as long as the models are sufficiently over-parametrized.
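For readers unfamiliar with the term, the "Monte Carlo rate" is the dimension-independent $O(n^{-1/2})$ rate familiar from Monte Carlo integration; schematically:

```latex
% The Monte Carlo rate: averaging n i.i.d. samples estimates an
% expectation with error decaying like n^{-1/2}, independently of
% the input dimension:
\[
  \Big| \frac{1}{n}\sum_{i=1}^{n} g(x_i) - \mathbb{E}[g] \Big|
  = O\!\big(n^{-1/2}\big).
\]
% The cited result says the minimum-norm solution generalizes at this
% rate, up to logarithmic factors, once the model is sufficiently
% over-parametrized.
```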

### A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics

- Computer Science
- Science China Mathematics
- 2020

In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels.

### Neural network approximation and estimation of classifiers with classification boundary in a Barron class

- Computer Science, Mathematics
- 2020

The obtained approximation and estimation rates are independent of the dimension of the input, showing that the curse of dimension can be overcome in this setting; in fact, the input dimension only enters in the form of a polynomial factor.

### Strong overall error analysis for the training of artificial neural networks via random initializations

- Computer Science
- ArXiv
- 2020

It is shown that the depth of the neural network only needs to increase much more slowly in order to obtain the same rate of approximation as an arbitrary stochastic optimization algorithm with i.i.d. random initializations.

### A Priori Estimates of the Generalization Error for Autoencoders

- Computer Science
- ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020

Assuming the existence of an underlying ground-truth encoder and decoder, a priori estimates of the generalization error are established for autoencoders when an appropriately chosen regularization term is applied.

### On the Convergence of Gradient Descent Training for Two-layer ReLU-networks in the Mean Field Regime

- Computer Science
- ArXiv
- 2020

We describe a necessary and sufficient condition for the convergence to minimum Bayes risk when training two-layer ReLU-networks by gradient descent in the mean field regime with omni-directional…

### Analysis of the Gradient Descent Algorithm for a Deep Neural Network Model with Skip-connections

- Computer Science
- ArXiv
- 2019

It is proved that in the over-parametrized regime, for a suitable initialization, with high probability GD can find a global minimum exponentially fast and it is shown that the GD path is uniformly close to the functions given by the related random feature model.

### Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks

- Computer Science
- ArXiv
- 2019

An effective model of linear F-Principle (LFP) dynamics is proposed which accurately predicts the learning results of two-layer ReLU neural networks (NNs) of large widths and is rationalized by a linearized mean field residual dynamics of NNs.

### A priori generalization error for two-layer ReLU neural network through minimum norm solution

- Computer Science
- ArXiv
- 2019

This work proves an *a priori* generalization error bound for two-layer ReLU NNs, which implies that NNs do not suffer from the curse of dimensionality and that a small generalization error can be achieved without requiring an exponentially large number of neurons.

## References

Showing 1-10 of 59 references

### A Priori Estimates of the Population Risk for Residual Networks

- Mathematics, Computer Science
- ArXiv
- 2019

Optimal a priori estimates are derived for the population risk, also known as the generalization error, of a regularized residual network model, which treats the skip connections and the nonlinearities differently so that paths with more nonlinearities are regularized by larger weights.

### A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics

- Computer Science
- Science China Mathematics
- 2020

In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels.

### Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks

- Computer Science
- ICLR
- 2019

A novel complexity measure based on unit-wise capacities, resulting in a tighter generalization bound for two-layer ReLU networks, and a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks, are presented.

### Spectrally-Normalized Margin Bounds for Neural Networks

- Computer Science, Mathematics
- 2018

A generalization bound is presented for feedforward neural networks with ReLU activations, in terms of the product of the spectral norms of the layers and the Frobenius norms of their weights, thereby bounding the sharpness of the network.
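As a rough illustration of the kind of norm-based capacity measure involved, the sketch below computes only the product of per-layer spectral norms for some randomly initialized weight matrices; the layer sizes are arbitrary assumptions, and the remaining correction factors and constants of the actual bounds are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

# Weight matrices of a hypothetical 3-layer feedforward net
# (shapes chosen so the layers compose: 16 -> 32 -> 32 -> 1).
layers = [
    rng.standard_normal((32, 16)) / np.sqrt(16),
    rng.standard_normal((32, 32)) / np.sqrt(32),
    rng.standard_normal((1, 32)) / np.sqrt(32),
]

# Product of per-layer spectral norms: one factor appearing in
# spectrally-normalized margin bounds.
spectral_product = 1.0
for W in layers:
    spectral_product *= np.linalg.norm(W, ord=2)  # largest singular value

print(spectral_product)
```

The spectral norm of each layer controls how much that layer can stretch any input direction, so the product bounds the Lipschitz constant of the whole network, which is why it appears in such capacity measures.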

### Analysis of the Gradient Descent Algorithm for a Deep Neural Network Model with Skip-connections

- Computer Science
- ArXiv
- 2019

It is proved that in the over-parametrized regime, for a suitable initialization, with high probability GD can find a global minimum exponentially fast and it is shown that the GD path is uniformly close to the functions given by the related random feature model.

### A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks

- Computer Science
- ArXiv
- 2019

It is proved that, under a certain assumption on the data distribution that is milder than linear separability, gradient descent with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error, leading to an algorithm-dependent generalization error bound for deep learning.

### Spectrally-normalized margin bounds for neural networks

- Computer Science
- NIPS
- 2017

This bound is empirically investigated for a standard AlexNet network trained with SGD on the MNIST and CIFAR-10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and that the presented bound is sensitive to this complexity.

### SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

- Computer Science
- ICLR
- 2018

This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.

### Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data

- Computer Science
- NeurIPS
- 2018

It is proved that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels, when the data come from mixtures of well-separated distributions.

### On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

- Computer Science
- NeurIPS
- 2018

It is shown that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers; the analysis involves Wasserstein gradient flows, a by-product of optimal transport theory.