# Smaller generalization error derived for deep compared to shallow residual neural networks

@article{Kammonen2020SmallerGE, title={Smaller generalization error derived for deep compared to shallow residual neural networks}, author={Aku Kammonen and Jonas Kiessling and P. Plech{\'a}{\v{c}} and M. Sandberg and A. Szepessy and R. Tempone}, journal={ArXiv}, year={2020}, volume={abs/2010.01887} }

Estimates of the generalization error are proved for a residual neural network with $L$ random Fourier features layers
$\bar z_{\ell+1}=\bar z_\ell + \text{Re}\sum_{k=1}^K\bar b_{\ell k}e^{{\rm i}\omega_{\ell k}\bar z_\ell}+ \text{Re}\sum_{k=1}^K\bar c_{\ell k}e^{{\rm i}\omega'_{\ell k}\cdot x}$. An optimal distribution for the frequencies $(\omega_{\ell k},\omega'_{\ell k})$ of the random Fourier features $e^{{\rm i}\omega_{\ell k}\bar z_\ell}$ and $e^{{\rm i}\omega'_{\ell k}\cdot x}$ is…
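The layer recursion in the abstract can be illustrated with a minimal NumPy sketch of the forward pass only. Several things here are assumptions, not taken from the paper: the function name `residual_rff_forward`, the starting value $\bar z_0 = 0$, a scalar state $\bar z_\ell$ (suggested by the product $\omega_{\ell k}\bar z_\ell$ being written without a dot), and the standard normal draws used for the frequencies and amplitudes in the demo. In particular, the demo does not use the optimal frequency distribution, which the truncated abstract does not state.

```python
import numpy as np


def residual_rff_forward(x, omega, omega_x, b, c, z0=0.0):
    """Evaluate z_L for the recursion
        z_{l+1} = z_l + Re sum_k b[l,k] exp(i omega[l,k] z_l)
                      + Re sum_k c[l,k] exp(i omega_x[l,k] . x)

    x       : (d,)      real input
    omega   : (L, K)    real frequencies acting on the scalar state z_l
    omega_x : (L, K, d) real frequencies acting on the input x
    b, c    : (L, K)    complex amplitudes
    z0      : assumed initial state (not specified in the abstract)
    """
    z = float(z0)
    for l in range(omega.shape[0]):
        # Feature sum that depends on the current state z_l.
        z_term = np.real(np.sum(b[l] * np.exp(1j * omega[l] * z)))
        # Feature sum that depends directly on the input x.
        x_term = np.real(np.sum(c[l] * np.exp(1j * (omega_x[l] @ x))))
        z = z + z_term + x_term
    return z


# Demo with placeholder parameters (standard normal draws are an assumption).
rng = np.random.default_rng(0)
L, K, d = 4, 8, 3
omega = rng.standard_normal((L, K))
omega_x = rng.standard_normal((L, K, d))
b = (rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))) / K
c = (rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))) / K
x = rng.standard_normal(d)
z_L = residual_rff_forward(x, omega, omega_x, b, c)
```

With all amplitudes set to zero the state never changes, so the output equals `z0`; this is the residual (identity) path of each layer.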

#### References

Adaptive random Fourier features with Metropolis sampling

- Mathematics, Computer Science
- ArXiv
- 2020

This adaptive, non-parametric stochastic method leads asymptotically, as $K\to\infty$, to equidistributed amplitudes $|\hat\beta_k|$, analogous to deterministic adaptive algorithms for differential equations.

Benefits of Depth in Neural Networks

- Mathematics, Computer Science
- COLT
- 2016

This result is proved here for a class of nodes termed "semi-algebraic gates", which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, therefore establishing benefits of depth against not just standard networks with ReLU gates, but also convolutional networks with ReLU and maximization gates, sum-product networks, and boosted decision trees.

The power of deeper networks for expressing natural functions

- Mathematics, Computer Science
- ICLR
- 2018

It is proved that the total number of neurons required to approximate natural classes of multivariate polynomials of $n$ variables grows only linearly with $n$ for deep neural networks, but grows exponentially when merely a single hidden layer is allowed.

Understanding the difficulty of training deep feedforward neural networks

- Computer Science, Mathematics
- AISTATS
- 2010

The objective here is to understand better why standard gradient descent from random initialization performs so poorly with deep neural networks, to better understand these recent relative successes, and to help design better algorithms in the future.

Adam: A Method for Stochastic Optimization

- Computer Science, Mathematics
- ICLR
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Understanding Machine Learning - From Theory to Algorithms

- Computer Science
- 2014

The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way in an advanced undergraduate or beginning graduate course.

Densely Connected Convolutional Networks

- Computer Science
- 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017

The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion and has several compelling advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.

Universal approximation bounds for superpositions of a sigmoidal function

- Mathematics, Computer Science
- IEEE Trans. Inf. Theory
- 1993

The approximation rate and the parsimony of the parameterization of the networks are shown to be advantageous in high-dimensional settings, and the integrated squared approximation error cannot be made smaller than order $1/n^{2/d}$ uniformly for functions satisfying the same smoothness assumption.

On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2017

Theoretical analysis of the number of required samples for a given approximation error leads to both upper and lower bounds that are based solely on the eigenvalues of the associated integral operator and match up to logarithmic terms.

A mean-field optimal control formulation of deep learning

- Mathematics, Computer Science
- ArXiv
- 2018

This paper introduces the mathematical formulation of the population risk minimization problem in deep learning as a mean-field optimal control problem, and states and proves optimality conditions of both the Hamilton–Jacobi–Bellman type and the Pontryagin type.