Corpus ID: 7174183

Globally Optimal Training of Generalized Polynomial Neural Networks with Nonlinear Spectral Methods

@inproceedings{Gautier2016GloballyOT,
  title={Globally Optimal Training of Generalized Polynomial Neural Networks with Nonlinear Spectral Methods},
  author={Antoine Gautier and Quynh N. Nguyen and Matthias Hein},
  booktitle={NIPS},
  year={2016}
}
The optimization problem behind neural networks is highly non-convex. Training with stochastic gradient descent and its variants requires careful parameter tuning and provides no guarantee of reaching the global optimum. In contrast, we show under quite weak assumptions on the data that a particular class of feedforward neural networks can be trained to global optimality, with a linear convergence rate, using our nonlinear spectral method. To our knowledge this is the first practically feasible method…
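The abstract describes recasting training as a nonlinear eigenproblem solved by a power-method-style fixed-point iteration on a product of spheres, with linear convergence. As a loose illustration of why such iterations converge linearly, the sketch below runs the classical linear special case, power iteration for a symmetric matrix, where the rate is the eigenvalue ratio λ₂/λ₁; the matrix `A`, starting point, and tolerance are illustrative, and this is not the paper's algorithm.

```python
import numpy as np

def power_iteration(A, x0, tol=1e-10, max_iter=1000):
    """Fixed-point iteration x <- F(x) / ||F(x)|| with F(x) = A x.

    For a symmetric matrix with a dominant eigenvalue this converges
    linearly to the leading eigenvector, at rate |lambda_2 / lambda_1|.
    """
    x = x0 / np.linalg.norm(x0)
    for _ in range(max_iter):
        y = A @ x
        x_new = y / np.linalg.norm(y)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Illustrative 2x2 example; eigenvalues are (7 +/- sqrt(5)) / 2.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
v = power_iteration(A, np.array([1.0, 0.0]))
lam = v @ A @ v  # Rayleigh quotient approximates the dominant eigenvalue
```

The nonlinear spectral method generalizes this fixed-point scheme: the linear map `A @ x` is replaced by a nonlinear map derived from the network's objective, and normalization happens on each factor of a product of spheres.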


The Loss Surface of Deep and Wide Neural Networks

It is shown that in fact almost all local minima are globally optimal for a fully connected network with squared loss and analytic activation function, given that the number of hidden units of one layer of the network is larger than the number of training points and the network structure from this layer on is pyramidal.

Over-Parameterized Deep Neural Networks Have No Strict Local Minima For Any Continuous Activations

It is proved that for any continuous activation functions, the loss function has no bad strict local minimum, both in the regular sense and in the sense of sets, and this result holds for any convex and differentiable loss function.

When is a Convolutional Filter Easy To Learn?

It is shown that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time and the convergence rate depends on the smoothness of the input distribution and the closeness of patches.

Nonlinear Spectral Duality

This work shows that one can move from the primal to the dual nonlinear eigenvalue formulation while keeping the spectrum, the variational spectrum, and the corresponding multiplicities unchanged, which can be used to transform the original optimization problem into various alternative, possibly more tractable, dual problems.

On the loss landscape of a class of deep neural networks with no bad local valleys

We identify a class of over-parameterized deep neural networks with standard activation functions and cross-entropy loss which provably have no bad local valley, in the sense that from any point in…

Nonlinear Spectral Methods for Nonconvex Optimization with Global Optimality

We present a method for solving a class of nonconvex optimization problems over the product of nonnegative ℓp-spheres with global optimality guarantees and linear convergence rate. We apply our…

Understanding the Loss Surface of Neural Networks for Binary Classification

This work focuses on the training performance of single-layered neural networks for binary classification, and provides conditions under which the training error is zero at all local minima of a smooth hinge loss function.

Learning Graph Neural Networks with Approximate Gradient Descent

The first provably efficient algorithm for learning graph neural networks (GNNs) with one hidden layer for node information convolution is provided, and it is shown that the proposed algorithm guarantees a linear convergence rate to the underlying true parameters of GNNs.

When Can Neural Networks Learn Connected Decision Regions?

The sufficient and necessary conditions under which the decision regions of a neural network are connected are developed, and the capacity of neural networks to learn connected regions is studied for a much wider class of activation functions, including those widely used, namely ReLU, sigmoid, tanh, softplus, and the exponential linear function.

The Loss Surface and Expressivity of Deep Convolutional Neural Networks

We analyze the expressiveness and loss surface of practical deep convolutional neural networks (CNNs) with shared weights and max pooling layers. We show that such CNNs produce linearly independent…

Train faster, generalize better: Stability of stochastic gradient descent

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically…

Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods

This work proposes a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions.

Global Optimality in Tensor Factorization, Deep Learning, and Beyond

This framework derives sufficient conditions to guarantee that a local minimum of the non-convex optimization problem is a global minimum and shows that if the size of the factorized variables is large enough then from any initialization it is possible to find a global minimizer using a purely local descent algorithm.

On the Computational Efficiency of Training Neural Networks

This paper revisits the computational complexity of training neural networks from a modern perspective and provides both positive and negative results, some of which yield new provably efficient and practical algorithms for training certain types of neural networks.

Provable Bounds for Learning Some Deep Representations

This work gives algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others, based upon a novel idea of observing correlations among features and using these to infer the underlying edge structure via a global graph recovery procedure.

The Loss Surfaces of Multilayer Networks

It is proved that recovering the global minimum becomes harder as the network size increases, and that this is in practice irrelevant, as the global minimum often leads to overfitting.

Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity

It is shown that initial representations generated by common random initializations are sufficiently rich to express all functions in the dual kernel space, and though the training objective is hard to optimize in the worst case, the initial weights form a good starting point for optimization.

Training a Single Sigmoidal Neuron Is Hard

It is proved that the simplest architecture containing only a single neuron that applies a sigmoidal activation function sigma, satisfying certain natural axioms, to the weighted sum of n inputs is hard to train.

Neural Network Learning - Theoretical Foundations

The authors explain the role of scale-sensitive versions of the Vapnik-Chervonenkis dimension in large margin classification and in real-valued prediction, and discuss the computational complexity of neural network learning.

Deep learning in neural networks: An overview