Neural tangent kernel: convergence and generalization in neural networks (invited paper)

@article{Jacot2018NeuralTK,
  title={Neural tangent kernel: convergence and generalization in neural networks (invited paper)},
  author={Arthur Jacot and Franck Gabriel and Cl{\'e}ment Hongler},
  journal={Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing},
  year={2021}
}
The Neural Tangent Kernel is a new way to understand gradient descent in deep neural networks, connecting them with kernel methods. In this talk, I'll introduce this formalism, give a number of results on the Neural Tangent Kernel, and explain how they give us insight into the dynamics of neural networks during training and into their generalization features.
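As a concrete illustration of the kernel connection described in the abstract, the short sketch below (not from the paper; the names init_params, f, and empirical_ntk, the architecture, and the widths are illustrative assumptions) computes the empirical Neural Tangent Kernel Theta(x, x') = <df(x)/dtheta, df(x')/dtheta> of a small fully-connected network with plain JAX autodiff. In the infinite-width limit studied in the paper, this kernel stays fixed during training, and gradient descent on the squared loss reduces to kernel gradient flow with Theta.

import jax
import jax.numpy as jnp

def init_params(key, sizes=(3, 512, 512, 1)):
    # Standard-Gaussian weights; the 1/sqrt(fan-in) NTK scaling is applied in the forward pass.
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (d_in, d_out)), jnp.zeros(d_out))
            for k, d_in, d_out in zip(keys, sizes[:-1], sizes[1:])]

def f(params, x):
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W / jnp.sqrt(W.shape[0]) + b
        if i < len(params) - 1:
            h = jax.nn.relu(h)
    return h.squeeze(-1)  # one scalar output per input row

def empirical_ntk(params, x1, x2):
    # Per-example Jacobians of the output w.r.t. every parameter tensor, flattened and contracted.
    def jac(x):
        j = jax.jacobian(f)(params, x)
        return jnp.concatenate(
            [leaf.reshape(x.shape[0], -1) for leaf in jax.tree_util.tree_leaves(j)], axis=1)
    return jac(x1) @ jac(x2).T  # (n1, n2) kernel matrix

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (5, 3))
print(empirical_ntk(init_params(key), x, x).shape)  # (5, 5)

At finite width this matrix fluctuates with the random initialization; the paper's results concern its deterministic infinite-width limit and its constancy along the training trajectory.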

Citations

Spurious Local Minima of Deep ReLU Neural Networks in the Neural Tangent Kernel Regime
In this paper, the authors theoretically prove that deep ReLU neural networks do not lie in spurious local minima of the loss landscape under the Neural Tangent Kernel (NTK) regime, that is, in the …
A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks
TLDR
A generalized neural tangent kernel analysis is provided and it is shown that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior, which implies that the training loss converges linearly up to a certain accuracy.
On the Inductive Bias of Neural Tangent Kernels
TLDR
This work studies smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks, and compares to other known kernels for similar architectures.
Dynamics of Deep Neural Networks and Neural Tangent Hierarchy
TLDR
An infinite hierarchy of ordinary differential equations, the neural tangent hierarchy (NTH), is derived which captures the gradient descent dynamics of the deep neural network, and it is proved that the truncated NTH approximates the dynamics of the NTK up to arbitrary precision.
Weighted Neural Tangent Kernel: A Generalized and Improved Network-Induced Kernel
TLDR
The Weighted Neural Tangent Kernel (WNTK) is introduced, a generalized and improved tool that can capture an over-parameterized NN’s training dynamics under different optimizers; the stability of the WNTK at initialization and during training is proved.
The Recurrent Neural Tangent Kernel
TLDR
This paper introduces and studies the Recurrent Neural Tangent Kernel (RNTK), which sheds new light on the behavior of overparametrized RNNs, including how different time steps are weighted by the RNTK to form the output under different initialization parameters and nonlinearity choices, and how inputs of different lengths are treated.
Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH)
TLDR
The dynamics of the NTK for finite-width deep residual networks (ResNets) are studied using the neural tangent hierarchy (NTH) proposed in Huang (2019); the dynamics strongly suggest that the particular skip-connection structure of ResNet is the main reason for its triumph over fully-connected networks.
Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks
TLDR
It is shown that the performance of wide deep neural networks cannot be explained by the NTK regime, and the impact of the initialization and the activation function on the NTK when the network depth becomes large is quantified.
Analysis of the Gradient Descent Algorithm for a Deep Neural Network Model with Skip-connections
TLDR
It is proved that in the over-parametrized regime, for a suitable initialization, with high probability GD can find a global minimum exponentially fast, and it is shown that the GD path is uniformly close to the functions given by the related random feature model.
Neural Networks as Inter-Domain Inducing Points
TLDR
This paper casts the hidden units of finite-width neural networks as the inter-domain inducing points of a kernel; a one-hidden-layer network then becomes a kernel regression model.
...
...

References

SHOWING 1-10 OF 27 REFERENCES
Kernel Methods for Deep Learning
TLDR
A new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets is introduced; these kernels can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that the authors call multilayer kernel machines (MKMs).
Deep Neural Networks as Gaussian Processes
TLDR
The exact equivalence between infinitely wide deep networks and GPs is derived and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.
Understanding the difficulty of training deep feedforward neural networks
TLDR
The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
Gaussian Process Behaviour in Wide Deep Neural Networks
TLDR
It is shown that, under broad conditions, as the authors make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.
Random Features for Large-Scale Kernel Machines
TLDR
Two sets of random features are explored, convergence bounds on their ability to approximate various radial basis kernels are provided, and it is shown that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines.
Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity
TLDR
It is shown that initial representations generated by common random initializations are sufficiently rich to express all functions in the dual kernel space, and though the training objective is hard to optimize in the worst case, the initial weights form a good starting point for optimization.
Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach
TLDR
Novel statistics of the FIM are revealed that are universal among a wide class of DNNs, can be connected to a norm-based capacity measure of generalization ability, and are used to quantitatively estimate an appropriately sized learning rate for gradient methods to converge.
A mean field view of the landscape of two-layer neural networks
TLDR
A compact description of the SGD dynamics is derived in terms of a limiting partial differential equation that allows for “averaging out” some of the complexities of the landscape of neural networks and can be used to prove a general convergence result for noisy SGD.
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
Multilayer feedforward networks are universal approximators
...
...