Corpus ID: 222066778

Deep Equals Shallow for ReLU Networks in Kernel Regimes

@article{Bietti2021DeepES,
  title={Deep Equals Shallow for ReLU Networks in Kernel Regimes},
  author={Alberto Bietti and Francis R. Bach},
  journal={ArXiv},
  year={2021},
  volume={abs/2009.14397}
}
Deep networks are often considered to be more expressive than shallow ones in terms of approximation. Indeed, certain functions can be approximated by deep networks provably more efficiently than by shallow ones; however, no tractable algorithms are known for learning such deep models. Separately, a recent line of work has shown that deep networks trained with gradient descent may behave like (tractable) kernel methods in a certain over-parameterized regime, where the kernel is determined by…
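As a rough illustration of the kernel regime discussed in the abstract, the NTK of a fully connected ReLU network on unit-norm inputs can be computed with the standard layer-wise recursion based on arc-cosine functions. The NumPy sketch below is not the authors' code; the function names (kappa0, kappa1, relu_ntk) and the depth choices are illustrative only. It compares a two-layer ("shallow") NTK with a deeper one: the two kernels differ pointwise, but they share the same non-smooth behavior near perfectly aligned inputs, which is what governs the eigenvalue decay and hence the RKHS.

import numpy as np

def kappa0(u):
    # Normalized arc-cosine kernel of degree 0 (from the ReLU derivative); kappa0(1) = 1.
    u = np.clip(u, -1.0, 1.0)
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    # Normalized arc-cosine kernel of degree 1 (from the ReLU activation); kappa1(1) = 1.
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u ** 2)) / np.pi

def relu_ntk(u, depth):
    # NTK of a fully connected ReLU network with `depth` layers, evaluated on
    # unit-norm inputs with cosine similarity u, via the usual layer-wise recursion.
    sigma, theta = u, u
    for _ in range(depth - 1):
        sigma_dot = kappa0(sigma)
        sigma = kappa1(sigma)
        theta = theta * sigma_dot + sigma
    return theta

u = np.linspace(-1.0, 1.0, 201)
shallow = relu_ntk(u, depth=2) / relu_ntk(1.0, depth=2)  # two-layer ("shallow") NTK
deep = relu_ntk(u, depth=5) / relu_ntk(1.0, depth=5)     # deeper NTK
# The normalized kernels are not equal pointwise, but both are non-smooth at u = 1
# in the same way, which drives the same polynomial eigenvalue decay.
print(float(np.max(np.abs(shallow - deep))))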

Citations

Spectral Analysis of the Neural Tangent Kernel for Deep Residual Networks

TLDR
A spectral analysis of massively over-parameterized, fully connected residual networks with ReLU activation through their respective neural tangent kernels (ResNTK) shows that, much like the NTK for fully connected networks (FC-NTK), for input distributed uniformly on the hypersphere S^{d-1}, the eigenfunctions of ResNTK are the spherical harmonics and the eigenvalues decay polynomially with frequency k as k^{-d}.
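For reference, the decay statement can be read through the usual Mercer decomposition of a dot-product kernel on the sphere; the display below is a generic sketch of that setup (the notation \mu_k, Y_{k,j}, N(d,k) is not taken from the cited paper):

K(x, y) = \sum_{k \ge 0} \mu_k \sum_{j=1}^{N(d,k)} Y_{k,j}(x)\, Y_{k,j}(y), \qquad x, y \in S^{d-1},

where Y_{k,j} are the spherical harmonics of degree k and N(d,k) is their number; the cited spectral result corresponds to \mu_k decaying as k^{-d}.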

On Approximation in Deep Convolutional Networks: a Kernel Perspective

TLDR
It is found that while expressive kernels operating on input patches are important at the first layer, simpler polynomial kernels can suffice in higher layers for good performance, and a precise functional description of the RKHS and its regularization properties is provided.

How Wide Convolutional Neural Networks Learn Hierarchical Tasks

TLDR
It is shown that the spectrum of the corresponding kernel and its asymptotics inherit the hierarchical structure of the network, which implies that despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.

Graph Neural Network Bandits

TLDR
It is shown that graph neural networks (GNNs) can be used to estimate the reward function, assuming it resides in the Reproducing Kernel Hilbert Space of a permutation-invariant additive kernel, and a novel connection between such kernels and the graph neural tangent kernel is established.

Learning sparse features can lead to overfitting in neural networks

TLDR
It is shown that feature learning can perform worse than lazy training (via a random feature kernel or the NTK), as the former can lead to a sparser neural representation, and it is empirically shown that learning features can indeed lead to sparse and thereby less smooth representations of image predictors.

Uniform Generalization Bounds for Overparameterized Neural Networks

TLDR
Adopting the recently developed Neural Tangent (NT) kernel theory, this work proves uniform generalization bounds for overparameterized neural networks in kernel regimes when the true data-generating model belongs to the reproducing kernel Hilbert space (RKHS) corresponding to the NT kernel.

What can be learnt with wide convolutional networks?

TLDR
Interestingly, it is found that despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.

The Curse of Depth in Kernel Regime

TLDR
It is shown that the large depth limit of this regime is unexpectedly trivial, and the convergence rate to this trivial regime is fully characterized.

Generalization Properties of NAS under Activation and Skip Connection Search

TLDR
This work derives lower (and upper) bounds on the minimum eigenvalue of the Neural Tangent Kernel in the (in)finite-width regime, for a search space including mixed activation functions, fully connected, and residual neural networks, and leverages the eigenvalue bounds to establish generalization error bounds for NAS under stochastic gradient descent training.

Deep learning theory (DRAFT)

References

Showing 1-10 of 61 references

Deep vs. shallow networks: An approximation theory perspective

TLDR
A new definition of relative dimension is proposed to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.

A Convergence Theory for Deep Learning via Over-Parameterization

TLDR
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.

To understand deep learning we need to understand kernel learning

TLDR
It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed for understanding the properties of classical kernel methods.

Towards Understanding Hierarchical Learning: Benefits of Neural Representations

TLDR
This work demonstrates that intermediate neural representations add more flexibility to neural networks and can be advantageous over raw inputs, and may provide a new perspective on why depth is important in deep learning.

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

TLDR
Over-parameterization and random initialization jointly restrict every weight vector to remain close to its initialization for all iterations, which yields a strong-convexity-like property used to show that gradient descent converges at a linear rate to the global optimum.

On Lazy Training in Differentiable Programming

TLDR
This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
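The linearization behind lazy training can be made explicit; the display below is a generic sketch of that idea (the scaling \alpha and the notation f, w_0 are not taken from the cited paper):

\alpha f(w, x) \;\approx\; \alpha f(w_0, x) + \alpha \,\langle \nabla_w f(w_0, x),\, w - w_0 \rangle,

so that for a large scaling \alpha training stays close to this linear model, i.e., kernel learning with the tangent kernel K(x, x') = \langle \nabla_w f(w_0, x), \nabla_w f(w_0, x') \rangle.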

Gradient descent optimizes over-parameterized deep ReLU networks

TLDR
The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.

Deep Neural Networks as Gaussian Processes

TLDR
The exact equivalence between infinitely wide deep networks and GPs is derived, and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.
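The network-GP equivalence is usually expressed through a layer-wise covariance recursion; the display below is a generic sketch of that standard form (the weight and bias variances \sigma_w^2, \sigma_b^2 and the activation \varphi are placeholder notation):

K^{(\ell+1)}(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{f \sim \mathcal{N}(0,\, K^{(\ell)})}\big[\varphi(f(x))\, \varphi(f(x'))\big],

with the infinitely wide network's output distributed as a Gaussian process whose covariance is given by the last layer's K.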

On the Inductive Bias of Neural Tangent Kernels

TLDR
This work studies smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks, and compares to other known kernels for similar architectures.

On the Similarity between the Laplace and Neural Tangent Kernels

TLDR
It is shown that the NTK for fully connected networks is closely related to the standard Laplace kernel, and it is proved theoretically that, for normalized data on the hypersphere, both kernels have the same eigenfunctions and their eigenvalues decay polynomially at the same rate, implying that their Reproducing Kernel Hilbert Spaces (RKHS) include the same sets of functions.
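Concretely, the comparison involves the Laplace kernel restricted to the sphere; written out as a generic sketch (the bandwidth c > 0 is placeholder notation, not taken from the cited paper):

K_{\mathrm{Lap}}(x, x') = \exp(-c\, \lVert x - x' \rVert), \qquad x, x' \in S^{d-1},

whose spherical-harmonic eigenvalues decay at the same polynomial rate in the frequency as those of the fully connected NTK, which is what makes the two RKHSs contain the same sets of functions.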
...