Corpus ID: 9943348

SGD Learns the Conjugate Kernel Class of the Network

@inproceedings{Daniely2017SGDLT,
  title={SGD Learns the Conjugate Kernel Class of the Network},
  author={Amit Daniely},
  booktitle={NIPS},
  year={2017}
}
  • Amit Daniely
  • Published in NIPS 2017
  • Computer Science, Mathematics
We show that the standard stochastic gradient descent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space of the network, as defined in Daniely, Frostig and Singer. The result holds for log-depth networks from a rich family of architectures. To the best of our knowledge, it is the first polynomial-time guarantee for the standard neural network learning algorithm for networks of depth more than two. As…
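For orientation, a schematic of the guarantee (a sketch in assumed notation, not the paper's exact statement): if $\Psi(x)$ denotes the representation computed by the network's last hidden layer at random initialization, the conjugate kernel of Daniely, Frostig and Singer is, up to normalization,
$k(x, x') = \mathbb{E}\left[\langle \Psi(x), \Psi(x') \rangle\right]$,
with the expectation taken over the random initialization. The result then has the form: after polynomially many SGD steps on polynomially many samples, the learned predictor $f_{\mathrm{SGD}}$ satisfies
$L_{\mathcal{D}}(f_{\mathrm{SGD}}) \le \min_{\|h\|_{k} \le M} L_{\mathcal{D}}(h) + \varepsilon$,
where $L_{\mathcal{D}}$ is the expected loss over the data distribution $\mathcal{D}$, $\|\cdot\|_{k}$ is the norm of the reproducing kernel Hilbert space of $k$, and $M$ and $\varepsilon$ are the norm bound and accuracy parameters.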

Citations

Learning Deep ReLU Networks Is Fixed-Parameter Tractable
An algorithm is given whose running time is a fixed polynomial in the ambient dimension and some (exponentially large) function of only the network's parameters, where these bounds depend on the number of hidden units, depth, spectral norm of the weight matrices, and Lipschitz constant of the overall network.
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
It is proved that overparameterized neural networks trained with SGD (stochastic gradient descent) or its variants can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, in polynomial time using polynomially many samples.
Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel
It is proved that for infinite-width two-layer nets, noisy gradient descent optimizes the regularized neural net loss to a global minimum in polynomially many iterations.
Disentangling trainability and generalization in deep learning
Challenging issues in the context of wide neural networks at large depths are discussed, and it is found that there are large regions of hyperparameter space where networks can only memorize the training set, in the sense that they reach perfect training accuracy but completely fail to generalize outside the training set.
Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks
This work shows that a natural distributional assumption corresponding to eigenvalue decay of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g., feed-forward networks of ReLUs).
A Convergence Theory for Deep Learning via Over-Parameterization
This work proves why stochastic gradient descent can find global minima of the training objective of DNNs in polynomial time, which implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting (see the note after this list).
A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks
It is proved that, under an assumption on the data distribution milder than linear separability, gradient descent with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error, leading to an algorithm-dependent generalization error bound for deep learning.
Learning Neural Networks with Two Nonlinear Layers in Polynomial Time
This work gives a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU), and suggests a new approach to Boolean learning problems via real-valued conditional-mean functions, sidestepping traditional hardness results from computational learning theory.
On the Global Convergence of Training Deep Linear ResNets
It is proved that for training deep residual networks with certain linear transformations at input and output layers, both GD and SGD with zero initialization on all hidden weights can converge to the global minimum of the training loss.
The Implications of Local Correlation on Learning Some Deep Functions
It is proved that, for some classes of deep functions, weak learning implies efficient strong learning under the “local correlation” assumption, and it is empirically demonstrated that this property holds for the CIFAR and ImageNet data sets.
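A clarifying note on the NTK mentioned in the over-parameterization entry above (the standard definition, in assumed notation; the cited paper's exact finite-width statement may differ): for a network $f(x; \theta)$ with parameters $\theta$ initialized at $\theta_0$, the neural tangent kernel is
$\Theta(x, x') = \langle \nabla_{\theta} f(x; \theta_0), \nabla_{\theta} f(x'; \theta_0) \rangle$,
i.e. it is built from gradients with respect to all parameters, whereas the conjugate kernel above uses only the last-layer representation; the cited work relates SGD on sufficiently (polynomially) wide over-parameterized networks to the corresponding kernel method.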

References

Showing 1-10 of 39 references
L1-regularized Neural Networks are Improperly Learnable in Polynomial Time
A kernel-based method is given such that, with probability at least 1 - δ, it learns a predictor whose generalization error is at most ε worse than that of the neural network; this implies that any sufficiently sparse neural network is learnable in polynomial time.
Provable Bounds for Learning Some Deep Representations
This work gives algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others, based upon a novel idea of observing correlations among features and using these to infer the underlying edge structure via a global graph recovery procedure.
Learning Polynomials with Neural Networks
This paper shows that, for a randomly initialized neural network with sufficiently many hidden units, the generic gradient descent algorithm learns any low-degree polynomial, and that if complex-valued weights are used, there are no "robust local minima".
Convexified Convolutional Neural Networks
For learning two-layer convolutional neural networks, it is proved that the generalization error obtained by a convexified CNN converges to that of the best possible CNN.
Breaking the Curse of Dimensionality with Convex Neural Networks
  • F. Bach
  • Computer Science, Mathematics
  • J. Mach. Learn. Res.
  • 2017
This work considers neural networks with a single hidden layer and non-decreasing homogeneous activation functions, such as rectified linear units, and shows that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace.
Learning Halfspaces and Neural Networks with Random Initialization
It is shown that if the data is separable by some neural network with constant margin $\gamma>0$, then there is a polynomial-time algorithm for learning a neural network that separates the training data with margin $\Omega(\gamma)$.
On the Computational Efficiency of Training Neural Networks
This paper revisits the computational complexity of training neural networks from a modern perspective and provides both positive and negative results, some of which yield new provably efficient and practical algorithms for training certain types of neural networks.
Convolutional Kernel Networks
This paper proposes a new type of convolutional neural network (CNN) whose invariance is encoded by a reproducing kernel, and bridges a gap between the neural network literature and kernels, which are natural tools to model invariance.
Understanding the difficulty of training deep feedforward neural networks
The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
Weakly learning DNF and characterizing statistical query learning using Fourier analysis
It is proved that an algorithm due to Kushilevitz and Mansour can be used to weakly learn DNF using membership queries in polynomial time, with respect to the uniform distribution on the inputs, and that DNF expressions and decision trees are not even weakly learnable in the statistical query model, without relying on any unproven assumptions.