# Learning Depth-Three Neural Networks in Polynomial Time

@article{Goel2017LearningDN, title={Learning Depth-Three Neural Networks in Polynomial Time}, author={Surbhi Goel and Adam R. Klivans}, journal={ArXiv}, year={2017}, volume={abs/1709.06010} }

We give a polynomial-time algorithm for learning neural networks with one hidden layer of sigmoids feeding into any smooth, monotone activation function (e.g., sigmoid or ReLU). We make no assumptions on the structure of the network, and the algorithm succeeds with respect to *any* distribution on the unit ball in $n$ dimensions (hidden weight vectors also have unit norm). This is the first assumption-free, provably efficient algorithm for learning neural networks with more than one hidden…
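To make the function class in the abstract concrete, the following is a minimal sketch of the forward pass only: a hidden layer of sigmoids with unit-norm weight vectors, fed through an outer monotone activation. It is not the paper's learning algorithm. The names `depth_three_network`, `normalize`, and `output_weights` are hypothetical, and the linear combination of hidden units before the outer activation is an assumption, since the abstract does not spell it out.

```python
import math


def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def normalize(v):
    """Scale a nonzero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def depth_three_network(x, hidden_weights, output_weights, outer=sigmoid):
    """Evaluate outer(sum_i a_i * sigmoid(w_i . x)).

    x              -- input point, assumed to lie in the unit ball
    hidden_weights -- list of weight vectors w_i, assumed to have unit norm
    output_weights -- coefficients a_i (an assumption of this sketch)
    outer          -- outer monotone activation; sigmoid by default
    """
    hidden = [
        sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for w in hidden_weights
    ]
    return outer(sum(a * h for a, h in zip(output_weights, hidden)))


# Example: two hidden units in two dimensions, input inside the unit ball.
weights = [normalize([3.0, 4.0]), normalize([1.0, -1.0])]
value = depth_three_network([0.3, 0.2], weights, [1.0, -1.0])
```

Any other monotone function of one real argument can be passed as `outer` in place of the default sigmoid.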

## 49 Citations

Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds

- Computer Science
- COLT
- 2019

An agnostic learning guarantee is given for GD: starting from a randomly initialized network, it converges in mean squared loss to the minimum error of the best approximation of the target function using a polynomial of degree at most $k$.

Learning Two-layer Neural Networks with Symmetric Inputs

- Computer Science
- ICLR
- 2019

A new algorithm is given for learning a two-layer neural network under a general class of input distributions. It is based on the method-of-moments framework and extends several results in tensor decompositions to avoid the complicated non-convex optimization in learning neural networks.

Polynomial Convergence of Gradient Descent for Training One-Hidden-Layer Neural Networks

- Computer Science, Mathematics
- ArXiv
- 2018

We analyze Gradient Descent applied to learning a bounded target function on $n$ real-valued inputs by training a neural network with a single hidden layer of nonlinear gates. Our main finding is…

A Deep Conditioning Treatment of Neural Networks

- Computer Science
- ALT
- 2021

It is shown that depth improves trainability of neural networks by improving the conditioning of certain kernel matrices of the input data and how benign overfitting can occur in deep neural networks via the results of Bartlett et al. (2019b).

Hardness of Learning Neural Networks with Natural Weights

- Computer Science
- NeurIPS
- 2020

It is shown that for depth-$2$ networks and many "natural" weight distributions, such as the normal and the uniform distribution, most networks are hard to learn. This implies that there is no generic property that holds with high probability in such random networks and allows efficient learning.

Learning One-hidden-layer Neural Networks under General Input Distributions

- Computer Science
- AISTATS
- 2019

A novel unified framework is proposed for designing loss functions with desirable landscape properties for a wide range of general input distributions. It uses statistical methods of local likelihood to design a novel estimator of score functions that provably adapts to the local geometry of the unknown density.

Depth separation and weight-width trade-offs for sigmoidal neural networks

- Computer Science, Mathematics
- ICLR
- 2018

This work provides a simple proof of L2-norm separation between the expressive power of depth-2 and depth-3 sigmoidal neural networks for a large class of input distributions, assuming their weights are polynomially bounded.

A Study of Neural Training with Iterative Non-Gradient Methods

- Computer Science
- SSRN Electronic Journal
- 2021

A simple stochastic algorithm is given that can train a ReLU gate in the realizable setting in linear time while using significantly milder conditions on the data distribution than previous results, and approximate recovery of the true label generating parameters is shown.

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

- Computer Science
- NIPS
- 2017

A convergence analysis for SGD is provided on a rich subset of two-layer feedforward networks with ReLU activations, characterized by a special structure called "identity mapping". It proves that, if the input follows a Gaussian distribution, then with standard $O(1/\sqrt{d})$ initialization of the weights, SGD converges to the global minimum in a polynomial number of steps.

Towards a Theoretical Understanding of Hashing-Based Neural Nets

- Computer Science
- AISTATS
- 2019

This paper introduces a neural net compression scheme based on random linear sketching, and shows that the sketched network is able to approximate the original network on all input data coming from any smooth well-conditioned low-dimensional manifold, implying that the parameters in HashedNets can be provably recovered.

## References

On the Complexity of Learning Neural Networks

- Computer Science
- NIPS
- 2017

A comprehensive lower bound is demonstrated, ruling out the possibility that data generated by neural networks with a single hidden layer, smooth activation functions and benign input distributions can be learned efficiently; the bound is robust to small perturbations of the true weights.

Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks

- Computer Science, Mathematics
- NIPS
- 2017

This work shows that a natural distributional assumption corresponding to *eigenvalue decay* of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feed-forward networks of ReLUs).

Recovery Guarantees for One-hidden-layer Neural Networks

- Computer Science
- ICML
- 2017

This work distills some properties of activation functions that lead to local strong convexity in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective, and provides recovery guarantees for 1NNs with both sample complexity and computational complexity *linear* in the input dimension and *logarithmic* in the precision.

Distribution-Specific Hardness of Learning Neural Networks

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2018

This paper identifies a family of simple target functions, which are difficult to learn even if the input distribution is "nice", and provides evidence that neither class of assumptions alone is sufficient.

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

- Computer Science
- NIPS
- 2017

A convergence analysis for SGD is provided on a rich subset of two-layer feedforward networks with ReLU activations, characterized by a special structure called "identity mapping". It proves that, if the input follows a Gaussian distribution, then with standard $O(1/\sqrt{d})$ initialization of the weights, SGD converges to the global minimum in a polynomial number of steps.

Agnostically learning decision trees

- Computer Science
- STOC
- 2008

This is the first polynomial-time algorithm for learning decision trees in a harsh noise model and a *proper* agnostic learning algorithm for juntas, a sub-class of decision trees, again using membership queries.

Neural Network Learning - Theoretical Foundations

- Computer Science
- 1999

The authors explain the role of scale-sensitive versions of the Vapnik-Chervonenkis dimension in large-margin classification and in real-valued prediction, and discuss the computational complexity of neural network learning.

Learning decision trees using the Fourier spectrum

- Computer Science, Mathematics
- STOC '91
- 1991

The authors demonstrate that any function $f$ whose $L_1$-norm is polynomial can be approximated by a polynomially sparse function, and prove that boolean decision trees with linear operations are a subset of this class of functions.

Making polynomials robust to noise

- Mathematics, Computer Science
- STOC '12
- 2012

A complete solution to the noisy computation problem for real polynomials is given by constructing a polynomial $p_{\text{robust}}$ explicitly for each $p$. The work also contributes a technique of independent interest, which allows one to force partial cancellation of error terms in a polynomial.

Reliably Learning the ReLU in Polynomial Time

- Computer Science
- COLT
- 2017

A hypothesis is constructed that simultaneously minimizes the false-positive rate and the loss on inputs given positive labels by $\mathcal{D}$, for any convex, bounded, and Lipschitz loss function.