Corpus ID: 222133914

Smaller generalization error derived for deep compared to shallow residual neural networks

Aku Kammonen, Jonas Kiessling, P. Plecháč, M. Sandberg, A. Szepessy, R. Tempone
Estimates of the generalization error are proved for a residual neural network with $L$ random Fourier feature layers $\bar z_{\ell+1}=\bar z_\ell + \text{Re}\sum_{k=1}^K\bar b_{\ell k}e^{{\rm i}\omega_{\ell k}\bar z_\ell}+ \text{Re}\sum_{k=1}^K\bar c_{\ell k}e^{{\rm i}\omega'_{\ell k}\cdot x}$. An optimal distribution for the frequencies $(\omega_{\ell k},\omega'_{\ell k})$ of the random Fourier features $e^{{\rm i}\omega_{\ell k}\bar z_\ell}$ and $e^{{\rm i}\omega'_{\ell k}\cdot x}$ is…
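The layer update quoted above can be sketched in NumPy for a scalar hidden state. This is a hypothetical illustration only: the function name, the scalar-state assumption, and the argument layout are mine, not taken from the paper.

```python
import numpy as np

def rff_residual_layer(z, x, b, c, omega, omega_x):
    """One residual random Fourier feature layer (illustrative sketch):
    z_{l+1} = z_l + Re sum_k b_k e^{i omega_k z_l} + Re sum_k c_k e^{i omega'_k . x}

    z       : scalar hidden state
    x       : (d,) input vector
    b, c    : (K,) complex amplitudes
    omega   : (K,) real frequencies acting on the hidden state z
    omega_x : (K, d) real frequencies acting on the input x
    """
    hidden = np.real(np.sum(b * np.exp(1j * omega * z)))
    inp = np.real(np.sum(c * np.exp(1j * (omega_x @ x))))
    return z + hidden + inp
```

With all amplitudes set to zero the layer reduces to the identity map, which is the residual-network property the generalization analysis relies on.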

Adaptive random Fourier features with Metropolis sampling
This adaptive, non-parametric stochastic method leads asymptotically, as $K\to\infty$, to equidistributed amplitudes $|\hat\beta_k|$, analogous to deterministic adaptive algorithms for differential equations.
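The idea of resampling frequencies by a Metropolis step on the fitted amplitudes can be sketched as follows. This is a loose sketch under my own assumptions, not the cited algorithm: the proposal step `delta`, the exponent `gamma`, and the one-dimensional setting are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_amplitudes(omega, x, y):
    # Least-squares amplitudes beta_k for y ≈ Re sum_k beta_k e^{i omega_k x}
    S = np.exp(1j * np.outer(x, omega))              # (N, K) feature matrix
    beta, *_ = np.linalg.lstsq(S, y.astype(complex), rcond=None)
    return beta

def metropolis_step(omega, x, y, delta=0.5, gamma=3.0):
    """Propose perturbed frequencies and accept each one with probability
    min(1, |beta'_k / beta_k|^gamma), favoring large-amplitude features."""
    beta = fit_amplitudes(omega, x, y)
    prop = omega + delta * rng.standard_normal(omega.shape)
    beta_p = fit_amplitudes(prop, x, y)
    accept = rng.random(omega.shape) < np.minimum(1.0, np.abs(beta_p / beta) ** gamma)
    return np.where(accept, prop, omega)
```

Iterating such steps biases the frequency sample toward regions where the fitted amplitudes are large, which is the mechanism behind the equidistribution statement above.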
Benefits of Depth in Neural Networks
This result is proved here for a class of nodes termed "semi-algebraic gates", which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, therefore establishing benefits of depth against not just standard networks with ReLU gates, but also convolutional networks with ReLU and maximization gates, sum-product networks, and boosted decision trees.
The power of deeper networks for expressing natural functions
It is proved that the total number of neurons required to approximate natural classes of multivariate polynomials of $n$ variables grows only linearly with $n$ for deep neural networks, but grows exponentially when merely a single hidden layer is allowed.
Understanding the difficulty of training deep feedforward neural networks
The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
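The moment-based update Adam performs can be written compactly for a single parameter. This is a minimal sketch of the published update rule, not the authors' implementation; the helper name and default hyperparameters follow common convention.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moment estimates (t >= 1)."""
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Applied repeatedly to the gradient of $\theta^2$, the iterate is driven toward the minimizer at $0$, with per-step displacement on the order of the learning rate.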
Understanding Machine Learning - From Theory to Algorithms
The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way in an advanced undergraduate or beginning graduate course. Expand
Densely Connected Convolutional Networks
The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion and has several compelling advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.
Universal approximation bounds for superpositions of a sigmoidal function
  • A. Barron
  • Mathematics, Computer Science
  • IEEE Trans. Inf. Theory
  • 1993
The approximation rate and the parsimony of the parameterization of the networks are shown to be advantageous in high-dimensional settings, and the integrated squared approximation error cannot be made smaller than order $1/n^{2/d}$ uniformly for functions satisfying the same smoothness assumption.
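For reference, the upper bound behind this summary can be stated as follows (a sketch of Barron's bound with simplified constants; notation is mine):

```latex
\[
  \min_{f_n}\, \int \bigl(f(x)-f_n(x)\bigr)^2\,\mu(\mathrm{d}x)
  \;\le\; \frac{(2C_f)^2}{n},
  \qquad
  C_f \;=\; \int_{\mathbb{R}^d} |\omega|\,\bigl|\hat f(\omega)\bigr|\,\mathrm{d}\omega,
\]
```

where $f_n$ ranges over superpositions of $n$ sigmoidal nodes, so the rate $1/n$ is dimension-free for functions with finite $C_f$, in contrast to the $1/n^{2/d}$ lower bound for linear approximation quoted above.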
On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions
  • F. Bach
  • Mathematics, Computer Science
  • J. Mach. Learn. Res.
  • 2017
Theoretical analysis of the number of required samples for a given approximation error leads to both upper and lower bounds that are based solely on the eigenvalues of the associated integral operator and match up to logarithmic terms.
A mean-field optimal control formulation of deep learning
This paper introduces the mathematical formulation of the population risk minimization problem in deep learning as a mean-field optimal control problem, and state and prove optimality conditions of both the Hamilton–Jacobi–Bellman type and the Pontryagin type. Expand