Corpus ID: 4797041

Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks

@inproceedings{Goel2017EigenvalueDI,
  title={Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks},
  author={Surbhi Goel and Adam R. Klivans},
  booktitle={NIPS},
  year={2017}
}
We consider the problem of learning function classes computed by neural networks with various activations (e.g. ReLU or Sigmoid), a task believed to be computationally intractable in the worst-case. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to eigenvalue decay of the Gram matrix yields polynomial-time algorithms in the non…
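To make the eigenvalue-decay condition concrete, here is a minimal sketch (not the paper's algorithm) that computes the empirical eigenvalue spectrum of a kernel Gram matrix on Gaussian data; the RBF kernel, bandwidth, sample size, and dimension below are illustrative assumptions.

import numpy as np

# Minimal illustration of "eigenvalue decay of the Gram matrix":
# draw Gaussian inputs, build an RBF kernel Gram matrix, and inspect
# how quickly its eigenvalues fall off. The kernel choice, bandwidth,
# and sample size are assumptions for the example, not the paper's setup.
rng = np.random.default_rng(0)
n, d = 500, 20                        # sample size and input dimension (arbitrary)
X = rng.standard_normal((n, d))       # inputs drawn from a standard Gaussian

sigma = np.sqrt(d)                    # RBF bandwidth (assumed)
sq_norms = np.sum(X ** 2, axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
K = np.exp(-sq_dists / (2.0 * sigma ** 2))   # Gram matrix K_ij = k(x_i, x_j)

# Eigenvalues of the symmetric PSD Gram matrix, largest first.
eigvals = np.linalg.eigvalsh(K)[::-1]
for i in (0, 1, 4, 9, 49, 99):
    print(f"lambda_{i + 1:3d} = {eigvals[i]:.4g}")

Rapid decay of these eigenvalues with the index is the kind of distributional property the paper's polynomial-time guarantees rely on; slow decay corresponds to the hard worst-case regime.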
Learning Neural Networks with Two Nonlinear Layers in Polynomial Time
TLDR: This work gives a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU), and suggests a new approach to Boolean learning problems via real-valued conditional-mean functions, sidestepping traditional hardness results from computational learning theory.
Learning Depth-Three Neural Networks in Polynomial Time
TLDR: This work gives the first polynomial-time algorithm for learning intersections of halfspaces with a margin (distribution-free) and the first generalization of DNF learning to the setting of probabilistic concepts (queries; uniform distribution).
From Boltzmann Machines to Neural Networks and Back Again
TLDR: This work gives new results for learning Restricted Boltzmann Machines, probably the most well-studied class of latent variable models, and an algorithm for learning a natural class of supervised RBMs with better runtime than is possible for the related class of networks without distributional assumptions.
Convergence Analysis of Two-layer Neural Networks with ReLU Activation
TLDR: A convergence analysis for SGD is provided on a rich subset of two-layer feedforward networks with ReLU activations, characterized by a special structure called "identity mapping"; it is proved that if the input follows a Gaussian distribution, then with standard $O(1/\sqrt{d})$ initialization of the weights, SGD converges to the global minimum in a polynomial number of steps.
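As a rough, hedged illustration of that setting (not the paper's analysis or exact architecture), the sketch below runs SGD on a two-layer ReLU network of an identity-mapping form $f_W(x) = \mathbf{1}^\top \mathrm{ReLU}((I + W)x)$ with standard $O(1/\sqrt{d})$ weight initialization on Gaussian inputs; the teacher network, learning rate, and step count are assumptions for the example.

import numpy as np

# Hedged sketch: SGD on a two-layer ReLU network with an identity-mapping
# structure and O(1/sqrt(d)) initialization. The exact architecture, teacher,
# learning rate, and iteration count are illustrative assumptions.
rng = np.random.default_rng(1)
d = 50
W_true = rng.standard_normal((d, d)) / np.sqrt(d)   # teacher perturbation (assumed)
W = rng.standard_normal((d, d)) / np.sqrt(d)        # standard O(1/sqrt(d)) init

def f(W, x):
    # Scalar output: sum of the entries of ReLU((I + W) x)
    return np.maximum(x + W @ x, 0.0).sum()

lr = 1e-3
for step in range(20001):
    x = rng.standard_normal(d)                      # Gaussian input
    y = f(W_true, x)                                # noiseless teacher label
    pred = f(W, x)
    active = (x + W @ x) > 0                        # ReLU activation pattern
    grad = (pred - y) * np.outer(active.astype(float), x)   # grad of 0.5*(pred - y)^2 w.r.t. W
    W -= lr * grad
    if step % 5000 == 0:
        print(f"step {step:6d}   ||W - W_true||_F = {np.linalg.norm(W - W_true):.4f}")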
Frequency Bias in Neural Networks for Input of Non-Uniform Density
TLDR: The Neural Tangent Kernel (NTK) model is used to explore the effect of variable input density on training dynamics, and convergence results for deep, fully connected networks are proved with respect to the spectral decomposition of the NTK.
Computational hardness of fast rates for online sparse PCA : improperness does not help
One of the most frequent and successful techniques to deal with computational intractability in statistics and machine learning is improper learning by convex relaxation: namely, enlarging the class…
Spectral Analysis and Stability of Deep Neural Dynamics
TLDR: The view of neural networks as affine parameter-varying maps makes it possible to "crack open the black box" of global neural-network dynamical behavior through visualization of stationary points, regions of attraction, state-space partitioning, eigenvalue spectra, and stability properties.
To understand deep learning we need to understand kernel learning
TLDR: It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed for understanding the properties of classical kernel methods.
Learning Graph Neural Networks with Approximate Gradient Descent
TLDR: The first provably efficient algorithm for learning graph neural networks (GNNs) with one hidden layer for node information convolution is provided, and it is shown that the proposed algorithm guarantees a linear convergence rate to the underlying true parameters of GNNs.
Learning One Convolutional Layer with Overlapping Patches
TLDR: This work gives Convotron, the first provably efficient algorithm for learning a one-hidden-layer convolutional network with respect to a general class of (potentially overlapping) patches, and proves that the framework captures commonly used schemes from computer vision, including one-dimensional and two-dimensional "patch and stride" convolutions.

References

SHOWING 1-10 OF 51 REFERENCES
On the Complexity of Learning Neural Networks
TLDR: A comprehensive lower bound is demonstrated ruling out the possibility that data generated by neural networks with a single hidden layer, smooth activation functions, and benign input distributions can be learned efficiently; the lower bound is robust to small perturbations of the true weights.
L1-regularized Neural Networks are Improperly Learnable in Polynomial Time
TLDR: A kernel-based method is given such that, with probability at least 1 − δ, it learns a predictor whose generalization error is at most ε worse than that of the neural network; this implies that any sufficiently sparse neural network is learnable in polynomial time.
Diversity Leads to Generalization in Neural Networks
TLDR: It is shown that, despite the non-convexity of the loss function, neural networks with diverse units can learn the target function, and a novel regularization function is suggested to promote unit diversity for potentially better generalization ability.
Learning Halfspaces and Neural Networks with Random Initialization
TLDR: It is shown that if the data is separable by some neural network with constant margin $\gamma>0$, then there is a polynomial-time algorithm for learning a neural network that separates the training data with margin $\Omega(\gamma)$.
SGD Learns the Conjugate Kernel Class of the Network
We show that the standard stochastic gradient descent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space of…
Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
TLDR: This work proposes a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions.
Distribution-Specific Hardness of Learning Neural Networks
O. Shamir, J. Mach. Learn. Res., 2018
TLDR: This paper identifies a family of simple target functions, which are difficult to learn even if the input distribution is "nice", and provides evidence that neither class of assumptions alone is sufficient.
Reliably Learning the ReLU in Polynomial Time
TLDR: A hypothesis is constructed that simultaneously minimizes the false-positive rate and the loss on inputs given positive labels by $\mathcal{D}$, for any convex, bounded, and Lipschitz loss function.
Embedding Hard Learning Problems into Gaussian Space
TLDR: The first representation-independent hardness result for agnostically learning halfspaces with respect to the Gaussian distribution is given, showing the inherent difficulty of designing supervised learning algorithms in Euclidean space even in the presence of strong distributional assumptions.
No bad local minima: Data independent training error guarantees for multilayer neural networks
TLDR: It is proved that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization, and this result is extended to the case of more than one hidden layer.