Corpus ID: 220424706

Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK

@article{Li2020LearningOT,
  title={Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK},
  author={Yuanzhi Li and Tengyu Ma and Hongyang R. Zhang},
  journal={ArXiv},
  year={2020},
  volume={abs/2007.04596}
}
We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably… 
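As a concrete illustration of this setting, here is a minimal NumPy sketch, not the authors' code: it samples Gaussian inputs, generates labels from a teacher $f^{\star}(x) = a^{\top}|W^{\star}x|$ with nonnegative $a$ and orthonormal $W^{\star}$, and runs plain gradient descent on the square loss of an over-parametrized two-layer ReLU student. The width, step size, step count, and the choice to train both layers are placeholder assumptions; the paper's exact parameterization and schedule differ.

import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 200, 2000                              # input dim, student width, sample size (placeholders)

# Teacher from the abstract: f*(x) = a^T |W* x|, a >= 0, W* orthonormal.
a_star = np.abs(rng.normal(size=d))                  # nonnegative second-layer weights
W_star, _ = np.linalg.qr(rng.normal(size=(d, d)))    # orthonormal first-layer weights

X = rng.normal(size=(n, d))                          # Gaussian inputs
y = np.abs(X @ W_star.T) @ a_star                    # labels a^T |W* x|

# Over-parametrized two-layer ReLU student: f(x) = sum_j b_j * relu(w_j^T x).
W = rng.normal(size=(m, d)) / np.sqrt(d)             # random initialization
b = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

lr, steps = 5e-3, 1000                               # placeholder hyperparameters
for t in range(steps):
    pre = X @ W.T                                    # (n, m) pre-activations
    act = np.maximum(pre, 0.0)                       # ReLU activations
    pred = act @ b
    err = pred - y                                   # residuals of the square loss
    # Gradients of (1/2n) * ||pred - y||^2 with respect to b and W.
    grad_b = act.T @ err / n
    grad_W = ((err[:, None] * (pre > 0)) * b).T @ X / n
    b -= lr * grad_b
    W -= lr * grad_W
    if t % 200 == 0:
        print(f"step {t:4d}  train MSE {np.mean(err**2):.4f}")

With these placeholder values the training error typically decreases; the paper's contribution is characterizing when such gradient-descent dynamics provably achieve low error beyond the NTK regime.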


On the Provable Generalization of Recurrent Neural Networks
TLDR
A generalization error bound is proved for learning functions of the input sequence of the form $f(\beta^{\top}[X_{l_1}, \ldots, X_{l_N}])$, which do not belong to the “additive” concept class.
On Learning Read-Once DNFs with Neural Networks
TLDR
A computer-assisted proof is used to establish the inductive bias for relatively small DNFs, and a process is designed for reconstructing the DNF from the learned network to better understand the resulting inductive bias.
Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods
TLDR
It is shown that any linear estimator can be outperformed by deep learning in the sense of the minimax optimal rate, especially in a high-dimensional setting, and a so-called fast learning rate is obtained.
Local Signal Adaptivity: Provable Feature Learning in Neural Networks Beyond Kernels
TLDR
The local signal adaptivity (LSA) phenomenon is proposed as one explanation for the superiority of neural networks over kernel methods in the image classification setting, based on finding a sparse signal in the presence of noise.
A Convergence Analysis of Gradient Descent on Graph Neural Networks
TLDR
It is proved that, for the case of deep linear GNNs, gradient descent provably recovers solutions up to error $\epsilon$ in $O(\log(1/\epsilon))$ iterations, under natural assumptions on the data distribution.
Efficiently Learning Any One Hidden Layer ReLU Network From Queries
TLDR
This work gives the first polynomial-time algorithm for learning one-hidden-layer neural networks given black-box access to the network; it is shown that if $F$ is an arbitrary one-hidden-layer ReLU network, there is an algorithm with polynomial query complexity and running time that outputs a network achieving low square loss relative to $F$ with respect to the Gaussian measure.
Provable Acceleration of Neural Net Training via Polyak's Momentum
TLDR
It is shown that Polyak's momentum, in combination with over-parameterization of the model, helps achieve faster convergence in training a one-layer ReLU network on $n$ examples.
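For reference, Polyak's (heavy-ball) momentum update discussed in the entry above can be written as $w_{t+1} = w_t - \eta \nabla L(w_t) + \beta (w_t - w_{t-1})$. Below is a minimal sketch on a toy quadratic objective, with illustrative step size and momentum coefficient rather than the paper's choices.

import numpy as np

def heavy_ball_step(w, w_prev, grad, lr=0.1, beta=0.9):
    # One Polyak (heavy-ball) step: w_new = w - lr * grad + beta * (w - w_prev).
    return w - lr * grad + beta * (w - w_prev)

# Toy objective 0.5 * ||w||^2, whose gradient at w is simply w.
w = w_prev = np.ones(3)
for _ in range(200):
    w, w_prev = heavy_ball_step(w, w_prev, grad=w), w
print(w)  # close to the minimizer at the origin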
Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent
TLDR
A unified non-convex optimization framework for the analysis of neural network training is proposed and it is shown that stochastic gradient descent on objectives satisfying proxy convexity or the proxy Polyak-Łojasiewicz inequality leads to efficient guarantees for proxy objective functions.
Efficiently Learning One Hidden Layer Neural Networks From Queries
TLDR
This work gives the first polynomial-time algorithm for learning one-hidden-layer neural networks given black-box access to the network; it is shown that if $F$ is an arbitrary one-hidden-layer ReLU network, there is an algorithm with polynomial query complexity and running time that outputs a network achieving low square loss relative to $F$ with respect to the Gaussian measure.

References

SHOWING 1-10 OF 66 REFERENCES
Learning Two Layer Rectified Neural Networks in Polynomial Time
TLDR
This work develops algorithms and hardness results under varying assumptions on the input and noise of two-layer networks, and gives the first polynomial time algorithm that approximately recovers the weights in the presence of mean-zero noise.
Learning One-hidden-layer Neural Networks with Landscape Design
TLDR
A non-convex objective function $G(\cdot)$ is designed whose landscape is guaranteed to have the following properties: all local minima of $G$ are also global minima, and stochastic gradient descent provably converges to the global minimum and learns the ground-truth parameters.
Linearized two-layers neural networks in high dimension
TLDR
It is proved that, if both $d$ and $N$ are large, the behavior of these models is instead remarkably simpler, and an equally simple bound on the generalization error of Kernel Ridge Regression is obtained.
Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
TLDR
The gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations, and the results resolve the conjecture of Gunasekar et al.
Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima
We consider the problem of learning a one-hidden-layer neural network with a non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j$ …
An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis
TLDR
It is proved that critical points outside the hyperplane spanned by the teacher parameters ("out-of-plane") are not isolated and form manifolds, and in-plane critical-point-free regions are characterized for the two-ReLU case.
Recovery Guarantees for One-hidden-layer Neural Networks
TLDR
This work distills some properties of activation functions that lead to local strong convexity in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective, and provides recovery guarantees for 1NNs with both sample complexity and computational complexity $\mathit{linear}$ in the input dimension and $\mathit{logarithmic}$ in the precision.
A Convergence Theory for Deep Learning via Over-Parameterization
TLDR
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in $\textit{polynomial time}$ and implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Learning Two-layer Neural Networks with Symmetric Inputs
TLDR
A new algorithm is given for learning a two-layer neural network under a general class of input distributions, based on the method-of-moments framework; it extends several results in tensor decomposition to avoid the complicated non-convex optimization in learning neural networks.
Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
TLDR
The expected $0$-$1$ loss of a wide enough ReLU network trained with stochastic gradient descent and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which is called a neural tangent random feature (NTRF) model.
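The NTRF construction described in the summary above uses, as the feature map of an input, the gradient of the network output with respect to its parameters at random initialization. A minimal PyTorch sketch of that feature map follows; the small two-layer ReLU network and the sizes are placeholder assumptions, not the paper's code.

import torch

torch.manual_seed(0)
d, m, n = 10, 256, 512  # input dimension, width, number of samples (placeholders)

# A two-layer ReLU network at its random initialization.
net = torch.nn.Sequential(
    torch.nn.Linear(d, m),
    torch.nn.ReLU(),
    torch.nn.Linear(m, 1),
)

def ntrf_features(x):
    # NTRF feature map: gradient of the scalar network output w.r.t. all parameters at init.
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, tuple(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

X = torch.randn(n, d)
Phi = torch.stack([ntrf_features(x) for x in X])  # (n, #parameters) feature matrix
print(Phi.shape)
# A linear model trained on Phi is the random feature model whose training loss,
# per the summary above, bounds the 0-1 loss of the SGD-trained network.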