Learning Deep ReLU Networks Is Fixed-Parameter Tractable

@article{Chen2022LearningDR,
  title={Learning Deep ReLU Networks Is Fixed-Parameter Tractable},
  author={Sitan Chen and Adam R. Klivans and Raghu Meka},
  journal={2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS)},
  year={2022},
  pages={696-707}
}
  • Sitan Chen, Adam R. Klivans, Raghu Meka
  • Published 28 September 2020
  • Computer Science, Mathematics
  • 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS)
We consider the problem of learning an unknown ReLU network with respect to Gaussian inputs and obtain the first nontrivial results for networks of depth more than two. We give an algorithm whose running time is a fixed polynomial in the ambient dimension and some (exponentially large) function of only the network's parameters. Our results provably cannot be obtained using gradient-based methods and give the first example of a class of efficiently learnable neural networks that gradient descent… 
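As a concrete illustration of this setting, the following minimal sketch (all sizes, weight choices, and names below are illustrative assumptions, not taken from the paper) draws Gaussian inputs and labels them with a fixed ReLU network of depth more than two; the learning problem studied in the paper is to recover such a network from samples of this form.

import numpy as np

# Illustrative data-generating process: an "unknown" ReLU network of depth 3
# evaluated on standard Gaussian inputs (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

d = 50                   # ambient input dimension
widths = [d, 10, 10, 1]  # two hidden layers, i.e. depth more than two
weights = [rng.standard_normal((m, n)) / np.sqrt(n)
           for n, m in zip(widths[:-1], widths[1:])]

def network(x):
    # Forward pass through the hidden ReLU layers, then a linear output layer.
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

# Training data: Gaussian inputs labeled by the unknown network.
X = rng.standard_normal((1000, d))
y = np.array([network(x) for x in X]).ravel()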

The Computational Complexity of ReLU Network Training Parameterized by Data Dimensionality

This work provides running time lower bounds in terms of W[1]-hardness for parameter d and proves that known brute-force strategies are essentially optimal (assuming the Exponential Time Hypothesis).

Learning (Very) Simple Generative Models Is Hard

The key ingredient in the proof is an ODE-based construction of a compactly supported, piecewise-linear function f with polynomially-bounded slopes such that the pushforward of N(0, 1) under f matches all low-degree moments of N(0, 1).
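For concreteness, the moment-matching property can be written as follows (the degree bound $k$ is notation assumed here for illustration, not taken from the paper): for $z \sim \mathcal{N}(0, 1)$ and every degree $j \le k$, $\mathbb{E}[f(z)^j] = \mathbb{E}[z^j]$.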

Bounding the Width of Neural Networks via Coupled Initialization - A Worst Case Analysis

This work shows how to significantly reduce the number of neurons required for two-layer ReLU networks, both in the under-parameterized setting with logistic loss and with squared loss, and proves new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.

Algorithms for Efficiently Learning Low-Rank Neural Networks

This work presents a provably efficient algorithm that learns an optimal low-rank approximation to a single-hidden-layer ReLU network up to additive error with probability ≥ 1 − δ, given access to noiseless samples with Gaussian marginals, using polynomial time and polynomially many samples.

Training Fully Connected Neural Networks is ∃R-Complete

The algorithmic problem of finding the optimal weights and biases for a two-layer fully connected neural network to fit a given set of data points is considered, and it is shown that even very simple networks are difficult to train.

Hardness of Noise-Free Learning for Two-Hidden-Layer Neural Networks

Superpolynomial statistical query lower bounds are given for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model, and a lifting procedure due to Daniely and Vardi is shown to reduce Boolean PAC learning problems to Gaussian ones.

Efficient Algorithms for Learning Depth-2 Neural Networks with General ReLU Activations

This work considers learning an unknown network of the form f(x) = aᵀσ(Wx + b), where x is drawn from the Gaussian distribution and σ(t) := max(t, 0) is the ReLU activation.
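A minimal sketch of this model class, with dimensions and parameters chosen arbitrarily here for illustration (none of these values come from the paper):

import numpy as np

rng = np.random.default_rng(1)

d, k = 20, 5                     # input dimension and number of hidden units (assumed)
W = rng.standard_normal((k, d))  # hidden-layer weights
b = rng.standard_normal(k)       # general (possibly nonzero) biases
a = rng.standard_normal(k)       # output weights

def f(x):
    # f(x) = aᵀσ(Wx + b) with σ the ReLU.
    return a @ np.maximum(W @ x + b, 0.0)

x = rng.standard_normal(d)       # Gaussian input, as in the learning setup
print(f(x))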

A Convergence Analysis of Gradient Descent on Graph Neural Networks

It is proved that for the case of deep linear GNNs, gradient descent provably recovers solutions up to error ε in O(log(1/ε)) iterations, under natural assumptions on the data distribution.

Efficiently Learning Any One Hidden Layer ReLU Network From Queries

This work gives the first polynomial-time algorithm for learning one-hidden-layer neural networks given black-box access to the network: if F is an arbitrary one-hidden-layer neural network with ReLU activations, there is an algorithm with polynomially bounded query complexity and running time that outputs a network achieving low square loss relative to F with respect to the Gaussian measure.
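A minimal sketch of the query-access model described here; the oracle, sizes, and example queries below are illustrative assumptions and not the paper's algorithm:

import numpy as np

rng = np.random.default_rng(2)

d, k = 30, 8                     # illustrative dimensions
W = rng.standard_normal((k, d))  # unknown hidden-layer weights
a = rng.standard_normal(k)       # unknown output weights

def query(x):
    # Black-box oracle: returns F(x) = aᵀ ReLU(Wx) for a learner-chosen query x.
    return a @ np.maximum(W @ x, 0.0)

# A learner in this model picks its own query points (e.g. along a line) and
# must output a network with low square loss relative to F under the Gaussian
# measure, using polynomially many such queries.
x0, v = rng.standard_normal(d), rng.standard_normal(d)
responses = [query(x0 + t * v) for t in np.linspace(-1.0, 1.0, 5)]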

References

SHOWING 1-10 OF 55 REFERENCES

SGD Learns the Conjugate Kernel Class of the Network

We show that the standard stochastic gradient descent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space of the network.

On the Complexity of Learning Neural Networks

A comprehensive lower bound is demonstrated, ruling out the possibility that data generated by neural networks with a single hidden layer, smooth activation functions, and benign input distributions can be learned efficiently; the lower bound is robust to small perturbations of the true weights.

L1-regularized Neural Networks are Improperly Learnable in Polynomial Time

A kernel-based method is given such that, with probability at least 1 − δ, it learns a predictor whose generalization error is at most ε worse than that of the neural network; this implies that any sufficiently sparse neural network is learnable in polynomial time.

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

It is proved that overparameterized neural networks trained with SGD (stochastic gradient descent) or its variants can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, in polynomial time using polynomially many samples.

Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds

An agnostic learning guarantee is given for GD: starting from a randomly initialized network, it converges in mean squared loss to the error of the best approximation of the target function by a polynomial of degree at most $k$.

Learning One-hidden-layer Neural Networks under General Input Distributions

A novel unified framework is proposed to design loss functions with desirable landscape properties for a wide range of general input distributions; it brings in statistical methods of local likelihood to design a novel estimator of score functions that provably adapts to the local geometry of the unknown density.

Learning Two-layer Neural Networks with Symmetric Inputs

A new algorithm is given for learning a two-layer neural network under a general class of input distributions; it is based on the method-of-moments framework and extends several results in tensor decompositions to avoid the complicated non-convex optimization in learning neural networks.

Learning One-hidden-layer ReLU Networks via Gradient Descent

It is proved that tensor initialization followed by gradient descent can converge to the ground-truth parameters at a linear rate up to some statistical error.
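Here "linear rate up to some statistical error" has its usual optimization meaning; in notation assumed here for illustration (not taken from the paper), there is a contraction factor $\rho \in (0, 1)$ such that the iterates satisfy $\|\theta_t - \theta^*\| \le \rho^t \|\theta_0 - \theta^*\| + \varepsilon_{\text{stat}}$, where $\theta^*$ denotes the ground-truth parameters and $\varepsilon_{\text{stat}}$ the statistical error.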

The Computational Complexity of Training ReLU(s)

It is shown that, when the weights and samples belong to the unit ball, one can (agnostically) properly and reliably learn depth-2 ReLUs with $k$ units and error at most $\epsilon$ in time, which extends a previous work of Goel, Kanade, Klivans and Thaler (2017).

Superpolynomial Lower Bounds for Learning One-Layer Neural Networks using Gradient Descent

It is proved that any classifier trained using gradient descent with respect to square-loss will fail to achieve small test error in polynomial time given access to samples labeled by a one-layer neural network.
...