This paper analyzes training and generalization for a simple two-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.
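A central object in this line of work is the limiting Gram matrix $H^\infty$ of a two-layer ReLU net. Under the common simplification that only the first-layer weights are trained and inputs have unit norm, it has the closed form $H^\infty_{ij} = \mathbf{x}_i^\top \mathbf{x}_j \,(\pi - \theta_{ij})/(2\pi)$, where $\theta_{ij}$ is the angle between $\mathbf{x}_i$ and $\mathbf{x}_j$. A minimal NumPy sketch (function name is mine):

```python
import numpy as np

def two_layer_relu_ntk(X):
    """Limiting Gram matrix H^infty of a two-layer ReLU net in which only
    the first-layer weights are trained; rows of X are unit-norm inputs."""
    G = X @ X.T                      # inner products x_i^T x_j
    G = np.clip(G, -1.0, 1.0)        # guard arccos against rounding error
    theta = np.arccos(G)             # angles between inputs
    return G * (np.pi - theta) / (2 * np.pi)

# usage: random unit-norm inputs
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
H = two_layer_relu_ntk(X)
```

Each diagonal entry equals $1 \cdot (\pi - 0)/(2\pi) = 0.5$, and the matrix is symmetric positive semidefinite, which is what the convergence analyses rely on.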

The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, called the Convolutional Neural Tangent Kernel (CNTK), as well as an efficient GPU implementation of this algorithm.

Over-parameterization and random initialization jointly restrict every weight vector to stay close to its initialization for all iterations, which yields a strong-convexity-like property showing that gradient descent converges to a global optimum at a linear rate.
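A small NumPy experiment (setup is mine, not any paper's exact construction) illustrating the claim: with a very wide hidden layer, gradient descent drives the loss down while the first-layer weights barely move, relatively, from their random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 10, 8192                          # few samples, very wide layer
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
y = rng.standard_normal(n)

W0 = rng.standard_normal((m, d))                # trained first layer, N(0,1) init
a = rng.choice([-1.0, 1.0], size=m)             # fixed second layer, +-1
W = W0.copy()

def predict(W):
    # f(x) = (1/sqrt(m)) * sum_j a_j * ReLU(w_j . x)
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def loss(W):
    return 0.5 * np.sum((predict(W) - y) ** 2)

loss0 = loss(W)
lr = 0.1
for _ in range(300):
    resid = predict(W) - y
    mask = (X @ W.T > 0).astype(float)          # ReLU gates
    # gradient of 0.5 * ||f - y||^2 with respect to W
    grad = ((resid[:, None] * mask) * a[None, :]).T @ X / np.sqrt(m)
    W -= lr * grad

final_loss = loss(W)
movement = np.linalg.norm(W - W0) / np.linalg.norm(W0)   # relative weight change
```

The relative Frobenius movement stays small because the per-neuron gradient scales like $1/\sqrt{m}$, while $\|W_0\|_F$ grows like $\sqrt{md}$; this is the lazy-training regime the convergence proofs exploit.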

The current paper proves that gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet), and extends the analysis to deep residual convolutional neural networks, obtaining a similar convergence result.

We consider the problem of learning a one-hidden-layer neural network with a non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j…

A new class of graph kernels, Graph Neural Tangent Kernels (GNTKs), is presented; GNTKs correspond to infinitely wide multi-layer GNNs trained by gradient descent, enjoy the full expressive power of GNNs, and inherit the advantages of graph kernels (GKs).

It is rigorously proved that gradient flow keeps the differences between squared norms across different layers invariant, without any explicit regularization, which implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers.
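This invariant is easy to check numerically. The sketch below (setup is mine) takes a two-layer *linear* net, a setting covered by this line of work, and simulates gradient flow with tiny Euler steps; the difference of squared Frobenius norms between the layers stays constant up to discretization error while the loss decreases.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, n = 3, 4, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

W1 = 0.1 * rng.standard_normal((h, d))   # layer 1, small init
w2 = 0.1 * rng.standard_normal(h)        # layer 2, small init

def balance_gap():
    # difference of squared Frobenius norms across the two layers
    return np.linalg.norm(W1) ** 2 - np.linalg.norm(w2) ** 2

def loss():
    return 0.5 * np.sum(((X @ W1.T) @ w2 - y) ** 2)

gap0, loss0 = balance_gap(), loss()
eta = 5e-5                               # tiny step to approximate gradient flow
for _ in range(2000):
    resid = (X @ W1.T) @ w2 - y          # n-vector of residuals
    g1 = np.outer(w2, resid @ X)         # dL/dW1
    g2 = W1 @ (X.T @ resid)              # dL/dw2
    W1 -= eta * g1
    w2 -= eta * g2

final_gap, final_loss = balance_gap(), loss()
```

The cancellation is exact in continuous time: $\langle W_1, \nabla_{W_1} L\rangle = \langle w_2, \nabla_{w_2} L\rangle = w_2^\top W_1 X^\top r$, so the time derivative of the gap is zero; only the $O(\eta^2)$ Euler error remains.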

This work demonstrates how to inductively estimate a mapping from observations to latent states through a sequence of regression and clustering steps, and uses it to construct good exploration policies.

Although the number of parameters may exceed the sample size, it is shown via the theory of Rademacher complexity that, with weight decay, the solution also generalizes well if the data is sampled from a regular distribution such as a Gaussian.

Results suggesting that neural tangent kernels perform strongly on low-data tasks are reported; comparing the NTK with the finite-width net it was derived from shows that NTK behavior sets in at smaller net widths than theoretical analysis suggests.