# Learning Halfspaces and Neural Networks with Random Initialization

@article{Zhang2015LearningHA, title={Learning Halfspaces and Neural Networks with Random Initialization}, author={Yuchen Zhang and J. Lee and M. Wainwright and Michael I. Jordan}, journal={ArXiv}, year={2015}, volume={abs/1511.07948} }

We study non-convex empirical risk minimization for learning halfspaces and neural networks. For loss functions that are $L$-Lipschitz continuous, we present algorithms to learn halfspaces and multi-layer neural networks that achieve arbitrarily small excess risk $\epsilon>0$. The time complexity is polynomial in the input dimension $d$ and the sample size $n$, but exponential in the quantity $(L/\epsilon^2)\log(L/\epsilon)$. These algorithms run multiple rounds of random initialization…
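As a rough illustration of the "multiple rounds of random initialization" idea in the abstract, the sketch below draws several random unit vectors as candidate halfspaces and keeps the one with the lowest empirical risk. It is not the paper's algorithm: the abstract is truncated before it says what follows each initialization, so that step is left out. The ramp loss (a 1-Lipschitz, non-convex surrogate), the function names, and the toy data are all assumptions made for this example.

```python
import math
import random


def ramp_loss(margin):
    """Assumed 1-Lipschitz loss: 1 for margin <= 0, 0 for margin >= 1, linear between."""
    return min(1.0, max(0.0, 1.0 - margin))


def empirical_risk(w, data):
    """Average ramp loss of the halfspace x -> sign(<w, x>) over (x, y) pairs, y in {-1, +1}."""
    total = 0.0
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        total += ramp_loss(margin)
    return total / len(data)


def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(vi * vi for vi in v)) or 1.0
    return [vi / norm for vi in v]


def best_of_random_inits(data, d, rounds, rng):
    """Hypothetical helper: try `rounds` random initializations, keep the lowest-risk one."""
    best_w, best_risk = None, float("inf")
    for _ in range(rounds):
        w = random_unit_vector(d, rng)
        risk = empirical_risk(w, data)
        if risk < best_risk:
            best_w, best_risk = w, risk
    return best_w, best_risk


if __name__ == "__main__":
    # Toy data (illustrative only): labels follow the sign of the first coordinate.
    toy = [([1.0, 0.2], 1), ([2.0, -0.5], 1), ([-1.0, 0.3], -1), ([-1.5, -0.4], -1)]
    w, risk = best_of_random_inits(toy, d=2, rounds=50, rng=random.Random(0))
    print("best weight vector:", w, "empirical risk:", risk)
```

With a fixed seed, increasing `rounds` can only lower (or leave unchanged) the returned risk, because the earlier candidates are re-drawn identically. This mirrors in miniature why more rounds of initialization help.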

#### 35 Citations

Reliably Learning the ReLU in Polynomial Time

- Computer Science, Mathematics
- COLT
- 2017

A hypothesis is constructed that simultaneously minimizes the false-positive rate and the loss on inputs given positive labels by $\mathcal{D}$, for any convex, bounded, and Lipschitz loss function.

Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks

- Computer Science, Mathematics
- NIPS
- 2017

This work shows that a natural distributional assumption corresponding to *eigenvalue decay* of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feed-forward networks of ReLUs).

Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks

- Mathematics
- 2016

Let $ f^{\star} $ be a function on $ \mathbb{R}^d $ with an assumption of a spectral norm $ v_{f^{\star}} $. For various noise settings, we show that $ \mathbb{E}\|\hat{f} - f^{\star} \|^2 \leq…

On the Quality of the Initial Basin in Overspecified Neural Networks

- Computer Science, Mathematics
- ICML
- 2016

This work studies the geometric structure of the associated non-convex objective function, in the context of ReLU networks and starting from a random initialization of the network parameters, and identifies some conditions under which it becomes more favorable to optimization.

How Many Samples are Needed to Learn a Convolutional Neural Network?

- Computer Science, Mathematics
- NIPS 2018
- 2018

It is shown that for learning an $m$-dimensional convolutional filter with linear activation acting on a $d$-dimensional input, the sample complexity of achieving population prediction error of $\epsilon$ is $\widetilde{O}(m/\epsilon^2)$, whereas its FNN counterpart needs at least $\Omega(d/\epsilon)$ samples.

Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima

- Computer Science, Mathematics
- ICML
- 2018

We consider the problem of learning a one-hidden-layer neural network with non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j…

Distribution-Specific Hardness of Learning Neural Networks

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2018

This paper identifies a family of simple target functions, which are difficult to learn even if the input distribution is "nice", and provides evidence that neither class of assumptions alone is sufficient.

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

- Computer Science, Mathematics
- NIPS
- 2017

A convergence analysis for SGD is provided on a rich subset of two-layer feedforward networks with ReLU activations, characterized by a special structure called "identity mapping". It is proved that, if the input follows a Gaussian distribution, then with standard $O(1/\sqrt{d})$ initialization of the weights, SGD converges to the global minimum in a polynomial number of steps.

SGD Learns the Conjugate Kernel Class of the Network

- Computer Science, Mathematics
- NIPS
- 2017

We show that the standard stochastic gradient descent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space of…

How Many Samples are Needed to Estimate a Convolutional Neural Network?

- Computer Science, Mathematics
- NeurIPS
- 2018

A widespread folklore for explaining the success of Convolutional Neural Networks (CNNs) is that CNNs use a more compact representation than the Fully-connected Neural Network (FNN) and thus require…

#### References


Learning Kernel-Based Halfspaces with the 0-1 Loss

- Mathematics, Computer Science
- SIAM J. Comput.
- 2011

A new algorithm for agnostically learning kernel-based halfspaces with respect to the 0-1 loss function is described and analyzed. A hardness result is also proved, showing that under a certain cryptographic assumption, no algorithm can learn kernel-based halfspaces in time polynomial in $L$.

L1-regularized Neural Networks are Improperly Learnable in Polynomial Time

- Mathematics, Computer Science
- ICML
- 2016

A kernel-based method is presented such that, with probability at least 1 - δ, it learns a predictor whose generalization error is at most ε worse than that of the neural network. The result implies that any sufficiently sparse neural network is learnable in polynomial time.

Efficient Learning of Linear Separators under Bounded Noise

- Computer Science, Mathematics
- COLT
- 2015

This work provides the first evidence that one can indeed design algorithms achieving arbitrarily small excess error in polynomial time under this realistic noise model and thus opens up a new and exciting line of research.

Agnostically learning halfspaces

- Mathematics, Computer Science
- 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05)
- 2005

We give the first algorithm that (under distributional assumptions) efficiently learns halfspaces in the notoriously difficult agnostic framework of Kearns, Schapire, & Sellie, where a learner is…

Efficient Learning of Linear Perceptrons

- Computer Science, Mathematics
- NIPS
- 2000

It is proved that unless P=NP, there is no algorithm that runs in time polynomial in the sample size and in 1/µ that is µ-margin successful for all µ > 0.

Generalization Bounds for Neural Networks through Tensor Factorization

- Computer Science, Mathematics
- ArXiv
- 2015

This work proposes a novel algorithm based on tensor decomposition for training a two-layer neural network, and proves efficient generalization bounds for this method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons.

Hardness of Learning Halfspaces with Noise

- Computer Science, Mathematics
- 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06)
- 2006

It is proved that even a tiny amount of worst-case noise makes the problem of learning halfspaces intractable in a strong sense, and a strong hardness is obtained for another basic computational problem: solving a linear system over the rationals.

Learning Halfspaces with Malicious Noise

- Mathematics, Computer Science
- ICALP
- 2009

New algorithms for learning halfspaces in the challenging malicious noise model can tolerate malicious noise rates exponentially larger than previous work in terms of the dependence on the dimension n, and succeed for the fairly broad class of all isotropic log-concave distributions.

Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods

- Computer Science
- 2017

This work proposes a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions.

Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

- Computer Science, Mathematics
- NIPS
- 2012

It is shown that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time, and that these results naturally extend to the adversarial online learning model and to PAC learning with the malicious noise model.