# Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK

@article{Li2020LearningOT,
  title   = {Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK},
  author  = {Yuanzhi Li and Tengyu Ma and Hongyang R. Zhang},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2007.04596}
}

We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably…
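The teacher model in the abstract can be sketched end to end. This is a minimal illustration, not the paper's experimental setup: the dimensions, width, learning rate, and iteration count below are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 8, 512, 64  # input dim, samples, hidden width (m >> d: over-parametrization)

# Teacher from the abstract: f*(x) = a^T |W* x|, a nonnegative, W* orthonormal.
W_star, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthonormal matrix
a_star = rng.uniform(0.1, 1.0, size=d)                 # nonnegative second-layer vector

X = rng.standard_normal((n, d))       # x ~ N(0, I_d)
y = np.abs(X @ W_star.T) @ a_star     # teacher labels

# Student: over-parametrized two-layer ReLU network trained by full-batch GD.
W = rng.standard_normal((m, d)) / np.sqrt(d)
v = rng.standard_normal(m) / np.sqrt(m)

def predict(X):
    return np.maximum(X @ W.T, 0.0) @ v

mse0 = np.mean((predict(X) - y) ** 2)  # loss at initialization
lr = 0.05
for _ in range(500):
    H = np.maximum(X @ W.T, 0.0)      # hidden ReLU activations, shape (n, m)
    err = H @ v - y                   # residuals
    grad_v = H.T @ err / n            # gradient of 0.5 * MSE w.r.t. v
    grad_W = ((err[:, None] * (H > 0)) * v).T @ X / n  # ... and w.r.t. W
    v -= lr * grad_v
    W -= lr * grad_W

mse = np.mean((predict(X) - y) ** 2)  # training loss after gradient descent
```

Note that $|W^{\star}x|$ applies the absolute value entrywise, so each teacher unit is realizable as a pair of student ReLU units ($|z| = \mathrm{relu}(z) + \mathrm{relu}(-z)$), which is one reason over-parametrization helps here.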

## 29 Citations

On the Provable Generalization of Recurrent Neural Networks

- Computer Science, NeurIPS
- 2021

A generalization error bound is proved for learning functions of input sequences of the form $f(\beta^{\top}[X_{l_1}, \dots, X_{l_N}])$, which do not belong to the "additive" concept class.

ON LEARNING READ-ONCE DNFS WITH NEURAL NETWORKS

- Computer Science
- 2020

A computer-assisted proof establishes the inductive bias for relatively small DNFs, and a process is designed for reconstructing the DNF from the learned network to better understand the resulting inductive bias.

Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods

- Computer Science, ICLR
- 2021

It is shown that any linear estimator can be outperformed by deep learning in the sense of the minimax optimal rate, especially in high-dimensional settings, and a so-called fast learning rate is obtained.

Local Signal Adaptivity: Provable Feature Learning in Neural Networks Beyond Kernels

- Computer Science, NeurIPS
- 2021

The local signal adaptivity (LSA) phenomenon is proposed as one explanation for the superiority of neural networks over kernel methods in the image classification setting, based on finding a sparse signal in the presence of noise.

A Convergence Analysis of Gradient Descent on Graph Neural Networks

- Computer Science, NeurIPS
- 2021

It is proved that, for deep linear GNNs, gradient descent provably recovers solutions up to error $\epsilon$ in $O(\log(1/\epsilon))$ iterations, under natural assumptions on the data distribution.

Efficiently Learning Any One Hidden Layer ReLU Network From Queries

- Computer Science, ArXiv
- 2021

This work gives the first polynomial-time algorithm for learning one-hidden-layer neural networks given black-box access to the network: if $F$ is an arbitrary one-hidden-layer neural network with ReLU activations, there is an algorithm with polynomial query complexity and running time that outputs a network $\hat{F}$ achieving low square loss relative to $F$ with respect to the Gaussian measure.

Provable Acceleration of Neural Net Training via Polyak's Momentum

- Computer Science, ArXiv
- 2020

It is shown that Polyak's momentum, in combination with over-parameterization of the model, helps achieve faster convergence in training a one-layer ReLU network on $n$ examples.

Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent

- Computer Science, NeurIPS
- 2021

A unified non-convex optimization framework for the analysis of neural network training is proposed and it is shown that stochastic gradient descent on objectives satisfying proxy convexity or the proxy Polyak-Lojasiewicz inequality leads to efficient guarantees for proxy objective functions.

Efficiently Learning One Hidden Layer Neural Networks From Queries

- Computer Science
- 2021

This work gives the first polynomial-time algorithm for learning one-hidden-layer neural networks given black-box access to the network: if $F$ is an arbitrary one-hidden-layer neural network with ReLU activations, there is an algorithm with polynomial query complexity and running time that outputs a network $\hat{F}$ achieving low square loss relative to $F$ with respect to the Gaussian measure.

## References

Showing 1–10 of 66 references.

Learning Two Layer Rectified Neural Networks in Polynomial Time

- Computer Science, Mathematics, COLT
- 2019

This work develops algorithms and hardness results under varying assumptions on the input and noise of two-layer networks, and gives the first polynomial-time algorithm that approximately recovers the weights in the presence of mean-zero noise.

Learning One-hidden-layer Neural Networks with Landscape Design

- Computer Science, ICLR
- 2018

A non-convex objective function $G(\cdot)$ is designed whose landscape is guaranteed to have the following properties: all local minima of $G$ are also global minima, and stochastic gradient descent provably converges to the global minimum and learns the ground-truth parameters.

Linearized two-layers neural networks in high dimension

- Computer Science, ArXiv
- 2019

It is proved that, if both $d$ and $N$ are large, the behavior of these models is instead remarkably simpler, and an equally simple bound on the generalization error of Kernel Ridge Regression is obtained.

Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations

- Computer Science, COLT
- 2018

The gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations, and the results resolve a conjecture of Gunasekar et al.

Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima

- Computer Science, ICML
- 2018

We consider the problem of learning a one-hidden-layer neural network with non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j…

An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis

- Computer Science, ICML
- 2017

It is proved that critical points outside the hyperplane spanned by the teacher parameters ("out-of-plane") are not isolated and form manifolds, and in-plane critical-point-free regions are characterized for the two-ReLU case.

Recovery Guarantees for One-hidden-layer Neural Networks

- Computer Science, ICML
- 2017

This work distills properties of activation functions that lead to local strong convexity in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective, and provides recovery guarantees for 1NNs with both sample complexity and computational complexity $\mathit{linear}$ in the input dimension and $\mathit{logarithmic}$ in the precision.

A Convergence Theory for Deep Learning via Over-Parameterization

- Computer Science, ICML
- 2019

This work proves that stochastic gradient descent can find global minima of the training objective of DNNs in $\textit{polynomial time}$, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
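For context, the NTK referenced here has a simple closed form in the two-layer ReLU case; this is a standard computation, not a result specific to the cited paper. The contribution of the first-layer gradients to the kernel, in expectation over random initialization $w\sim\mathcal{N}(0, I_d)$, is

$$K(x, x') \;=\; \mathbb{E}_{w}\big[\mathbb{1}\{w^{\top}x>0\}\,\mathbb{1}\{w^{\top}x'>0\}\big]\,\langle x, x'\rangle \;=\; \frac{\pi-\theta}{2\pi}\,\langle x, x'\rangle,$$

where $\theta = \arccos\!\big(\langle x, x'\rangle / (\|x\|\,\|x'\|)\big)$ is the angle between the inputs; the indicator expectation depends only on $\theta$ because the Gaussian measure is rotationally invariant.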

Learning Two-layer Neural Networks with Symmetric Inputs

- Computer Science, ICLR
- 2019

A new algorithm is given for learning a two-layer neural network under a general class of input distributions, based on the method-of-moments framework; it extends several results in tensor decomposition to avoid the complicated non-convex optimization in learning neural networks.

Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

- Computer Science, NeurIPS
- 2019

The expected $0$-$1$ loss of a wide enough ReLU network trained with stochastic gradient descent and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, called the neural tangent random feature (NTRF) model.
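The NTRF model above can be sketched concretely: take the gradient of a randomly initialized two-layer ReLU network with respect to its first-layer weights as a fixed feature map, then fit a linear model on those features. The toy data, sizes, and the use of least squares in place of SGD below are illustrative assumptions, not the cited paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 5, 200, 300  # input dim, width, samples (toy sizes)

# Two-layer ReLU network at random initialization: f(x) = sum_k a_k * relu(w_k . x)
W0 = rng.standard_normal((m, d)) / np.sqrt(d)
a0 = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def ntrf_features(X):
    """Gradient of f with respect to W at initialization: the NTRF feature map."""
    gates = (X @ W0.T > 0).astype(float)             # ReLU gates, shape (n, m)
    # d f / d w_k = a_k * 1[w_k . x > 0] * x, stacked into phi(x) in R^{m*d}
    return ((gates * a0)[:, :, None] * X[:, None, :]).reshape(len(X), -1)

X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                                 # toy binary labels

Phi = ntrf_features(X)
# Linear model on the fixed NTRF features (least squares stands in for SGD
# on the network's linearization around initialization)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
train_acc = np.mean(np.sign(Phi @ theta) == y)
```

Because the features are frozen at initialization, this is a random feature (kernel) method; the cited bound controls the trained network's $0$-$1$ loss by the training loss of such a model.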