Corpus ID: 236155045

Efficient Algorithms for Learning Depth-2 Neural Networks with General ReLU Activations

  title={Efficient Algorithms for Learning Depth-2 Neural Networks with General ReLU Activations},
  author={Pranjal Awasthi and Alex K. Tang and Aravindan Vijayaraghavan},
We present polynomial time and sample efficient algorithms for learning an unknown depth-2 feedforward neural network with general ReLU activations, under mild non-degeneracy assumptions. In particular, we consider learning an unknown network of the form f(x) = aσ(Wx+b), where x is drawn from the Gaussian distribution, and σ(t) := max(t, 0) is the ReLU activation. Prior works for learning networks with ReLU activations assume that the bias b is zero. In order to deal with the presence of the… Expand


Learning Deep ReLU Networks Is Fixed-Parameter Tractable
An algorithm whose running time is a fixed polynomial in the ambient dimension and some (exponentially large) function of only the network's parameters is given, whose bounds depend on the number of hidden units, depth, spectral norm of the weight matrices, and Lipschitz constant of the overall network. Expand
Learning Two-layer Neural Networks with Symmetric Inputs
A new algorithm for learning a two-layer neural network under a general class of input distributions based on the method-of-moments framework and extends several results in tensor decompositions to avoid the complicated non-convex optimization in learning neural networks. Expand
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Expand
Learning Neural Networks with Two Nonlinear Layers in Polynomial Time
This work gives a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU), and suggests a new approach to Boolean learning problems via real-valued conditional-mean functions, sidestepping traditional hardness results from computational learning theory. Expand
Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
This work proposes a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. Expand
A Convergence Theory for Deep Learning via Over-Parameterization
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in $\textit{polynomial time}$ and implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. Expand
SGD Learns the Conjugate Kernel Class of the Network
We show that the standard stochastic gradient decent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space ofExpand
Learning Polynomials with Neural Networks
This paper shows that for a randomly initialized neural network with sufficiently many hidden units, the generic gradient descent algorithm learns any low degree polynomial, assuming the authors initialize the weights randomly, and shows that if they use complex-valued weights, there are no "robust local minima". Expand
On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport
It is shown that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers and involves Wasserstein gradient flows, a by-product of optimal transport theory. Expand
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neuralNet with random labels leads to slower training, and a data-dependent complexity measure. Expand