• Corpus ID: 56350682

Provable limitations of deep learning

@article{Abbe2018ProvableLO,
  title={Provable limitations of deep learning},
  author={Emmanuel Abbe and Colin Sandon},
  journal={ArXiv},
  year={2018},
  volume={abs/1812.06369}
}
As the success of deep learning reaches more grounds, one would like to also envision the potential limits of deep learning. This paper gives a first set of results proving that certain deep learning algorithms fail at learning certain efficiently learnable functions. The results put forward a notion of cross-predictability that characterizes when such failures take place. Parity functions provide an extreme example with a cross-predictability that decays exponentially, while a mere super… 
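
As a toy illustration of why parities are an extreme case (a sketch of the underlying orthogonality only, not the paper's formal definition of cross-predictability; the helpers parity() and correlation() are made up for the example), distinct parity functions are exactly uncorrelated under the uniform distribution, so a predictor that does well on one parity gives essentially no edge on any other:

# Toy sketch: parity functions chi_S(x) = (-1)^{sum_{i in S} x_i} over {0,1}^n
# are pairwise orthogonal under the uniform distribution, which is the source
# of the exponentially small cross-predictability of the parity class.
import itertools

def parity(S, x):
    """+/-1 parity of the bits of x indexed by S."""
    return 1 - 2 * (sum(x[i] for i in S) % 2)

def correlation(S, T, n):
    """Exact E_x[parity_S(x) * parity_T(x)] over the uniform cube {0,1}^n."""
    return sum(parity(S, x) * parity(T, x)
               for x in itertools.product((0, 1), repeat=n)) / 2 ** n

n = 8
subsets = [(0, 1), (1, 2, 3), (0, 2, 4, 6), (5,)]
for S, T in itertools.combinations(subsets, 2):
    print(S, T, correlation(S, T, n))                  # exactly 0.0 whenever S != T
print((0, 1), (0, 1), correlation((0, 1), (0, 1), n))  # 1.0 when S == T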

Citations

Poly-time universality and limitations of deep learning

It is shown that SGD-based deep learning is poly-time universal even with some polynomial noise, while full GD or SQ algorithms are not (e.g., on parities); this also gives a separation between SGD-based deep learning and statistical query algorithms.

When Hardness of Approximation Meets Hardness of Learning

This work shows a single hardness property that implies both hardness of approximation using linear classes and shallow networks, and hardness of learning using correlation queries and gradient descent, which yields new results on the hardness of approximation and learnability of parity functions, DNF formulas, and $AC^0$ circuits.

A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends

This review presents a refresher on the many different stacked, connectionist networks that make up deep learning architectures, followed by automatic architecture optimization protocols using multi-agent approaches, and aims to provide a handy reference for researchers seeking to embrace deep learning in their work.

Learning Boolean Circuits with Neural Networks

This work focuses on learning deep neural networks with a variant of gradient descent when the target function is a tree-structured Boolean circuit, and shows that in this case the existence of correlation between the gates of the circuit and the target label determines whether the optimization succeeds or fails.

Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions

We study the dynamics of optimization and the generalization properties of one-hidden-layer neural networks with quadratic activation function in the over-parametrized regime, where the layer width exceeds the input dimension.

Deep Learning Trends Driven by Temes: A Philosophical Perspective

This paper distills the essence behind deep learning's sudden emergence from a technical panorama, offers a philosophical perspective on its importance, and addresses possible teme-driven directions for deep learning.

Can Shallow Neural Networks Beat the Curse of Dimensionality? A Mean Field Training Perspective

It is proved that gradient descent training of a two-layer neural network on the empirical or population risk may not decrease the population risk at an order faster than $t^{-4/(d-2)}$ under mean field scaling, so gradient descent training for fitting reasonably smooth but truly high-dimensional data may be subject to the curse of dimensionality.
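
For a rough sense of scale (a back-of-the-envelope reading of the stated exponent only, writing $\mathcal{R}(t)$ for the population risk after training time $t$; the precise assumptions and constants are those of the cited paper):

\[
  \mathcal{R}(t) \gtrsim t^{-4/(d-2)}
  \quad\Longrightarrow\quad
  t \gtrsim \varepsilon^{-(d-2)/4} \ \text{ to reach } \ \mathcal{R}(t) \le \varepsilon,
\]

so with $\varepsilon = 0.1$ and $d = 42$ one already needs on the order of $10^{10}$ training steps.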

Deep Learning, Grammar Transfer, and Transportation Theory

Grammar transfer is used to demonstrate a paradigm that connects artificial intelligence and human intelligence, and it is shown that this learning model can learn a grammar intelligently in general but fails to follow the optimal way of learning.

Adversarial Attacks on Deep-learning Models in Natural Language Processing

A systematic survey is presented that covers preliminary knowledge of NLP and related seminal works in computer vision, collects the related academic works since such attacks first appeared in 2017, and analyzes 40 representative works in a comprehensive way.

Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals

This work proves that finding a ReLU with square loss $\mathrm{opt} + \epsilon$ is as hard as the problem of learning sparse parities with noise, implying that gradient descent cannot converge to the global minimum in polynomial time.

References

Showing 1-10 of 43 references

Distribution-Specific Hardness of Learning Neural Networks

  • O. Shamir
  • Computer Science, Mathematics
    J. Mach. Learn. Res.
  • 2018
This paper identifies a family of simple target functions which are difficult to learn even if the input distribution is "nice", providing evidence that neither assumptions on the input distribution nor assumptions on the target function alone are sufficient.

Understanding deep learning requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

Fast Learning Requires Good Memory: A Time-Space Lower Bound for Parity Learning

  • R. Raz
  • Computer Science, Mathematics
    2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS)
  • 2016
It is proved that any algorithm for learning parities requires either a memory of quadratic size or an exponential number of samples; based on this, an encryption scheme that requires a private key of length $n$, and time complexity of $n$ per encryption/decryption of each bit, is proven unconditionally secure.
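
To make the time-space tradeoff concrete (a hedged illustration with the asymptotic constants suppressed; the specific numbers below are not from the cited paper):

\[
  \text{memory} = o(n^{2}) \ \text{bits} \quad\Longrightarrow\quad \text{samples} = 2^{\Omega(n)},
\]

so for parities on $n = 100$ bits, any algorithm using much less than roughly $n^{2} = 10^{4}$ bits of memory must, up to the hidden constants, see on the order of $2^{100}$ samples.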

Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds

An agnostic learning guarantee is given for GD: starting from a randomly initialized network, it converges in mean squared loss to the minimum error of the best approximation of the target function using a polynomial of degree at most $k$.

Time-Space Tradeoffs for Learning from Small Test Spaces: Learning Low Degree Polynomial Functions

Any algorithm that learns $m$-variate homogeneous polynomial functions of degree at most $d$ over $\mathbb{F}_2$ from evaluations on randomly chosen inputs requires either space $\Omega(mn)$ or time $2^{\Omega(m)}$, where $n = m^{\Theta(d)}$ is the dimension of the space of such functions.

Failures of Gradient-Based Deep Learning

This work describes four types of simple problems, for which the gradient-based algorithms commonly used in deep learning either fail or suffer from significant difficulties.

Extractor-based time-space lower bounds for learning

This work shows that for a large class of learning problems, any learning algorithm requires either a memory of size at least $\Omega(k \cdot \ell)$ or at least $2^{\Omega(r)}$ samples, achieving a tight $\Omega((\log|X|) \cdot (\log|A|))$ lower bound on the size of the memory.

A Time-Space Lower Bound for a Large Class of Learning Problems

  • R. Raz
  • Mathematics, Computer Science
    2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS)
  • 2017
We prove a general time-space lower bound that applies to a large class of learning problems and shows that, for every problem in that class, any learning algorithm requires either a memory of quadratic size or an exponential number of samples.

Efficient noise-tolerant learning from statistical queries

This paper formalizes a new but related model of learning from statistical queries and demonstrates the generality of the statistical query model, showing that practically every class learnable in Valiant's model and its variants can also be learned in the new model (and thus can be learned in the presence of noise).
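
A minimal sketch of the statistical query abstraction may help here (an illustration of the model only, not Kearns' formal definition; the names sq_oracle, target and query are made up for the example): the learner never sees individual labelled examples, only estimates of expectations E_x[q(x, f(x))] up to a tolerance tau, which is exactly why SQ algorithms tolerate classification noise.

# Minimal sketch of an SQ oracle: it returns E_x[query(x, target(x))] up to an
# additive tolerance tau, simulated here by averaging over a fixed sample and
# adding bounded noise.
import random

def sq_oracle(query, target, sample_inputs, tau):
    estimate = sum(query(x, target(x)) for x in sample_inputs) / len(sample_inputs)
    return estimate + random.uniform(-tau, tau)

# Example query: how often does the first input bit agree with a majority label?
n = 15
target = lambda x: 1 if sum(x) > n // 2 else 0
query = lambda x, y: 1.0 if x[0] == y else 0.0
inputs = [[random.randint(0, 1) for _ in range(n)] for _ in range(20000)]
print(sq_oracle(query, target, inputs, tau=0.01))  # a bit above 0.5: x[0] weakly predicts the majority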

Weakly learning DNF and characterizing statistical query learning using Fourier analysis

It is proved that an algorithm due to Kushilevitz and Mansour can be used to weakly learn DNF using membership queries in polynomial time with respect to the uniform distribution on the inputs, and it is shown that DNF expressions and decision trees are not even weakly learnable in the statistical query model, without any unproven assumptions.
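
The correlation statistics underlying both the Fourier-based algorithm and the statistical query characterization can be sketched in a few lines (illustrative only; fourier_coefficient and the tiny DNF below are invented for the example, and this is not the Kushilevitz-Mansour algorithm itself): every Fourier coefficient of a Boolean function is its correlation with a parity, exactly the kind of quantity a correlation or statistical query can estimate.

# Each Fourier coefficient \hat{f}(S) of f : {0,1}^n -> {-1,+1} is the
# correlation E_x[f(x) * chi_S(x)] with the parity on S; here it is computed
# exactly by brute force for a tiny DNF.
import itertools

def chi(S, x):
    return 1 - 2 * (sum(x[i] for i in S) % 2)

def fourier_coefficient(f, S, n):
    return sum(f(x) * chi(S, x)
               for x in itertools.product((0, 1), repeat=n)) / 2 ** n

n = 4
f = lambda x: 1 if (x[0] and x[1]) or x[2] else -1   # tiny DNF: (x0 AND x1) OR x2
for S in [(), (0,), (2,), (0, 1), (0, 1, 2)]:
    print(S, fourier_coefficient(f, S, n))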