• Publications
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
This paper analyzes training and generalization for a simple two-layer ReLU network with random initialization, and provides the following improvements over recent work: a tighter characterization of training speed, an explanation for why training a neural network on random labels is slower, and a data-dependent complexity measure.
On Exact Computation with an Infinitely Wide Neural Net
The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, called the Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.
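To give a flavor of such exact infinite-width computations, the NTK of a two-layer fully-connected ReLU network has a simple closed form via arc-cosine kernels. The sketch below is my own illustration (the function name `relu_ntk` is mine), assumes unit-norm inputs and the standard NTK parameterization with both layers trained, and is not the paper's CNTK algorithm:

```python
import numpy as np

def relu_ntk(x1, x2):
    """Infinite-width NTK of a two-layer ReLU net (both layers trained),
    for unit-norm inputs, via the arc-cosine kernel formulas.

    This is an illustrative sketch of exact NTK computation for the
    fully-connected case, not the paper's convolutional (CNTK) algorithm.
    """
    u = np.clip(np.dot(x1, x2), -1.0, 1.0)          # cosine similarity
    theta = np.arccos(u)
    k0 = (np.pi - theta) / (2 * np.pi)              # E[relu'(w.x1) relu'(w.x2)]
    k1 = (np.sqrt(1 - u**2) + u * (np.pi - theta)) / (2 * np.pi)  # E[relu relu]
    return u * k0 + k1   # first-layer term + second-layer term

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
print(relu_ntk(e1, e1))   # kernel on identical unit inputs
print(relu_ntk(e1, e2))   # kernel on orthogonal unit inputs
```

On identical unit inputs both expectations evaluate to 1/2, so the kernel is 1; on orthogonal inputs only the second-layer term survives, giving 1/(2π).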
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
Over-parameterization and random initialization jointly restrict every weight vector to stay close to its initialization throughout training; this yields a strong-convexity-like property showing that gradient descent converges to the global optimum at a linear rate.
Gradient Descent Finds Global Minima of Deep Neural Networks
The current paper proves that gradient descent achieves zero training loss in polynomial time for deep over-parameterized neural networks with residual connections (ResNets), and extends the analysis to deep residual convolutional neural networks, obtaining a similar convergence result.
Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima
We consider the problem of learning a one-hidden-layer neural network with a non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j \sigma(\mathbf{w}^\top \mathbf{Z}_j)$, in which both the convolutional weights $\mathbf{w}$ and the output weights $\mathbf{a}$ are parameters to be learned.
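The model class in this summary can be written down directly. A toy sketch (the function name and concrete shapes are my own, with $\sigma$ taken as ReLU): each row of $\mathbf{Z}$ is one non-overlapping input patch, a single shared filter $\mathbf{w}$ is applied to every patch, and the ReLU outputs are combined with weights $\mathbf{a}$:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def one_hidden_layer_cnn(Z, w, a):
    """f(Z, w, a) = sum_j a_j * relu(w . Z_j).

    Z : (p, k) matrix whose rows Z_j are the p non-overlapping patches,
    w : (k,) shared convolutional filter (trained),
    a : (p,) output-layer weights (trained).
    Toy sketch of the model class described in the summary.
    """
    return a @ relu(Z @ w)

rng = np.random.default_rng(1)
Z = rng.standard_normal((4, 3))   # 4 patches of size 3
w = rng.standard_normal(3)
a = rng.standard_normal(4)
print(one_hidden_layer_cnn(Z, w, a))
```

Both `w` and `a` would be optimized jointly; the paper's point is that despite the resulting non-convexity, gradient descent can still succeed.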
Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels
A new class of graph kernels, Graph Neural Tangent Kernels (GNTKs), corresponding to infinitely wide multi-layer GNNs trained by gradient descent, is presented; GNTKs enjoy the full expressive power of GNNs while inheriting the advantages of graph kernels.
Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced
It is rigorously proved that, without any explicit regularization, gradient flow keeps the differences between the squared norms of different layers invariant; this implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers.
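The invariance is easy to check numerically. In the toy sketch below (my own setup, not the paper's experiment), a two-layer linear net is trained with a small step size, approximating gradient flow, and the balance gap between the squared layer norms barely moves:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 4, 6, 30
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

W = 0.1 * rng.standard_normal((m, d))   # first-layer weights
v = 0.1 * rng.standard_normal(m)        # second-layer weights

def balance_gap(W, v):
    # Difference of squared norms across the two layers; conserved
    # exactly under gradient flow for homogeneous models.
    return np.linalg.norm(W) ** 2 - np.linalg.norm(v) ** 2

gap0 = balance_gap(W, v)
lr = 1e-3                                # small step approximates gradient flow
for _ in range(2000):
    pred = X @ W.T @ v                   # linear net f(x) = v . (W x)
    err = pred - y
    gW = np.outer(v, err @ X) / n        # dL/dW for L = (1/2n)||pred - y||^2
    gv = (W @ (X.T @ err)) / n           # dL/dv
    W -= lr * gW
    v -= lr * gv

print(gap0, balance_gap(W, v))           # nearly identical
```

The key identity is that the two layers' gradients satisfy ⟨W, dL/dW⟩ = ⟨v, dL/dv⟩, so the gap changes only at order `lr**2` per discrete step.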
Provably efficient RL with Rich Observations via Latent State Decoding
This work demonstrates how to inductively estimate a mapping from observations to latent states through a sequence of regression and clustering steps, and uses it to construct good exploration policies.
On the Power of Over-parametrization in Neural Networks with Quadratic Activation
  • S. Du, J. Lee
  • Computer Science, Mathematics
  • 3 March 2018
Although the number of parameters may exceed the sample size, it is shown via the theory of Rademacher complexity that, with weight decay, the solution also generalizes well when the data is sampled from a regular distribution such as a Gaussian.
Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks
Results are reported suggesting that neural tangent kernels perform strongly on low-data tasks; comparing the performance of NTK with the finite-width net it was derived from shows that NTK-like behavior starts at lower net widths than suggested by theoretical analysis.