• Corpus ID: 220424706

# Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK

@article{Li2020LearningOT,
title={Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK},
author={Yuanzhi Li and Tengyu Ma and Hongyang R. Zhang},
journal={ArXiv},
year={2020},
volume={abs/2007.04596}
}
• Published 9 July 2020
• Computer Science
• ArXiv
We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably…
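The teacher/student setup in the abstract is concrete enough to sketch in code. The snippet below generates data from the target $f^{\star}(x) = a^{\top}|W^{\star}x|$ (orthonormal $W^{\star}$, nonnegative $a$, Gaussian inputs) and runs gradient descent on an over-parametrized two-layer ReLU student. This is only an illustrative sketch: the dimensions, width, learning rate, and the simplification of training only the first layer with fixed output signs are choices made here, not the paper's algorithm or its analyzed regime.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 128        # input dimension; hidden width (over-parametrized: m >> d)

# Teacher from the abstract: f*(x) = a^T |W* x|, with W* orthonormal
# and a entrywise nonnegative (|.| is applied coordinatewise).
W_star, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthonormal matrix
a = np.abs(rng.standard_normal(d))                     # nonnegative vector

def f_star(X):
    return np.abs(X @ W_star.T) @ a                    # X: (n, d) -> (n,)

# Student: over-parametrized two-layer ReLU network. For simplicity
# only the first layer is trained here; the output signs stay fixed.
W = rng.standard_normal((m, d)) / np.sqrt(d)           # random initialization
v = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def f_net(X, W):
    return np.maximum(X @ W.T, 0.0) @ v

def sq_loss(X, W):
    return np.mean((f_net(X, W) - f_star(X)) ** 2) / 2

X_test = rng.standard_normal((4096, d))
init_loss = sq_loss(X_test, W)

lr, n = 0.05, 1024
for _ in range(300):
    X = rng.standard_normal((n, d))    # fresh Gaussian batch
    H = np.maximum(X @ W.T, 0.0)       # hidden activations, (n, m)
    resid = H @ v - f_star(X)          # prediction error, (n,)
    # d(loss)/dW_j = mean_i resid_i * v_j * 1[h_ij > 0] * x_i
    grad_W = (resid[:, None] * (H > 0.0) * v).T @ X / n
    W -= lr * grad_W

final_loss = sq_loss(X_test, W)
```

Note that the target is itself representable by a two-layer ReLU network, since $|z| = \sigma(z) + \sigma(-z)$; whether gradient descent from random initialization finds such a representation, beyond what the NTK linearization can express, is exactly the question the paper studies.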

## Citations

On the Provable Generalization of Recurrent Neural Networks
• Computer Science
NeurIPS
• 2021
A generalization error bound is proved for learning functions of an input sequence with the form f(β [Xl1 , ..., XlN ]), which do not belong to the “additive” concept class.
ON LEARNING READ-ONCE DNFS WITH NEURAL NETWORKS
• Computer Science
• 2020
A computer-assisted proof is used to establish the inductive bias for relatively small DNFs, and a process is designed for reconstructing the DNF from the learned network to better understand the resulting inductive bias.
Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods
• Computer Science
ICLR
• 2021
It is shown that deep learning can outperform any linear estimator in the sense of the minimax optimal rate, especially in the high-dimensional setting, and a so-called fast learning rate is obtained.
Local Signal Adaptivity: Provable Feature Learning in Neural Networks Beyond Kernels
• Computer Science
NeurIPS
• 2021
The local signal adaptivity (LSA) phenomenon is proposed as one explanation for the superiority of neural networks over kernel methods in the image classification setting, based on finding a sparse signal in the presence of noise.
A Convergence Analysis of Gradient Descent on Graph Neural Networks
• Computer Science
NeurIPS
• 2021
It is proved that for the case of deep linear GNNs, gradient descent provably recovers solutions up to error $\epsilon$ in $O(\log(1/\epsilon))$ iterations, under natural assumptions on the data distribution.
Efficiently Learning Any One Hidden Layer ReLU Network From Queries
• Computer Science
ArXiv
• 2021
This work gives the first polynomial-time algorithm for learning one-hidden-layer neural networks provided black-box access to the network, and it is shown that if F is an arbitrary one-hidden-layer neural network with ReLU activations, there is an algorithm with polynomial query complexity and running time that outputs a network achieving low square loss relative to F with respect to the Gaussian measure.
Provable Acceleration of Neural Net Training via Polyak's Momentum
• Computer Science
ArXiv
• 2020
It is shown that Polyak's momentum, in combination with over-parameterization of the model, helps achieve faster convergence in training a one-layer ReLU network on $n$ examples.
Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent
• Computer Science
NeurIPS
• 2021
A unified non-convex optimization framework for the analysis of neural network training is proposed and it is shown that stochastic gradient descent on objectives satisfying proxy convexity or the proxy Polyak-Lojasiewicz inequality leads to efficient guarantees for proxy objective functions.

## References

SHOWING 1-10 OF 66 REFERENCES
Learning Two Layer Rectified Neural Networks in Polynomial Time
• Computer Science, Mathematics
COLT
• 2019
This work develops algorithms and hardness results under varying assumptions on the input and noise of two-layer networks, and gives the first polynomial time algorithm that approximately recovers the weights in the presence of mean-zero noise.
Learning One-hidden-layer Neural Networks with Landscape Design
• Computer Science
ICLR
• 2018
A non-convex objective function $G(\cdot)$ is designed whose landscape is guaranteed to have the following properties: all local minima of $G$ are also global minima, and stochastic gradient descent provably converges to the global minimum and learns the ground-truth parameters.
Linearized two-layers neural networks in high dimension
• Computer Science
ArXiv
• 2019
It is proved that, if both $d$ and $N$ are large, the behavior of these models is instead remarkably simpler, and an equally simple bound on the generalization error of Kernel Ridge Regression is obtained.
Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
• Computer Science
COLT
• 2018
The gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations, and the results resolve the conjecture of Gunasekar et al.
Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima
• Computer Science
ICML
• 2018
We consider the problem of learning a one-hidden-layer neural network with non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^{\top}\mathbf{Z}_j)$…
An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis
• Computer Science
• ICML
• 2017
It is proved that critical points outside the hyperplane spanned by the teacher parameters (“out-of-plane”) are not isolated and form manifolds, and in-plane critical-point-free regions are characterized for the two-ReLU case.
Recovery Guarantees for One-hidden-layer Neural Networks
• Computer Science
• ICML
• 2017
This work distills some properties of activation functions that lead to local strong convexity in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective, and provides recovery guarantees for 1NNs with both sample complexity and computational complexity $\mathit{linear}$ in the input dimension and $\mathit{logarithmic}$ in the precision.
A Convergence Theory for Deep Learning via Over-Parameterization
• Computer Science
• ICML
• 2019
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in $\textit{polynomial time}$, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Learning Two-layer Neural Networks with Symmetric Inputs
• Computer Science
• ICLR
• 2019
A new algorithm for learning a two-layer neural network under a general class of input distributions, based on the method-of-moments framework, which extends several results in tensor decompositions to avoid the complicated non-convex optimization in learning neural networks.
Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
• Computer Science
• NeurIPS
• 2019
The expected $0$-$1$ loss of a wide enough ReLU network trained with stochastic gradient descent and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which is called a neural tangent random feature (NTRF) model.