Corpus ID: 3286674

The Loss Surface of Deep and Wide Neural Networks

Quynh N. Nguyen, Matthias Hein
While the optimization problem behind deep neural networks is highly non-convex, it is frequently observed in practice that training deep networks seems possible without getting stuck in suboptimal points. It has been argued that this is the case because all local minima are close to being globally optimal. We show that this is (almost) true: in fact, almost all local minima are globally optimal for a fully connected network with squared loss and analytic activation function, given that the number of…
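The regime the abstract describes can be illustrated with a minimal NumPy sketch (not the paper's construction; the architecture, seed, and hyperparameters below are illustrative assumptions): a fully connected network with an analytic activation (tanh), squared loss, and a hidden layer wider than the number of training points, trained by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression problem: N samples, hidden layer wider than N,
# i.e. the over-parameterized regime discussed in the abstract.
N, d, width = 8, 3, 32
X = rng.standard_normal((N, d))
y = rng.standard_normal((N, 1))

W1 = rng.standard_normal((d, width)) * 0.5
W2 = rng.standard_normal((width, 1)) * 0.5

def loss(W1, W2):
    H = np.tanh(X @ W1)               # analytic activation
    return 0.5 * np.mean((H @ W2 - y) ** 2)

lr = 0.1
initial = loss(W1, W2)
for _ in range(2000):
    H = np.tanh(X @ W1)
    err = (H @ W2 - y) / N            # gradient of the mean squared loss w.r.t. predictions
    gW2 = H.T @ err
    gW1 = X.T @ ((err @ W2.T) * (1 - H ** 2))   # tanh' = 1 - tanh^2
    W1 -= lr * gW1
    W2 -= lr * gW2

final = loss(W1, W2)
```

On this toy instance, plain gradient descent drives the squared loss close to zero without any stuck suboptimal point, consistent with the "almost all local minima are globally optimal" picture; this is a sanity-check experiment, not a proof.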


Over-Parameterized Deep Neural Networks Have No Strict Local Minima For Any Continuous Activations

It is proved that for any continuous activation functions, the loss function has no bad strict local minimum, both in the regular sense and in the sense of sets, and this result holds for any convex and differentiable loss function.

Avoiding Spurious Local Minima in Deep Quadratic Networks.

The training landscape of the mean squared error loss for neural networks with quadratic activation functions is characterized, the existence of spurious local minima and saddle points is proved, and conditions enabling convergence to a global minimum for these problems are established.

No Spurious Local Minima in Deep Quadratic Networks

It is proved that deep overparameterized neural networks with quadratic activations benefit from similar nice landscape properties and convergence to a global minimum for these problems is empirically demonstrated.
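A quadratic-activation network of the kind these two entries study can be sketched in a few lines (a toy illustration under assumed sizes and hyperparameters, not either paper's setup): one hidden layer of units computing (w·x)², over-parameterized so that the width exceeds the input dimension, trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = sum_j a_j * (w_j . x)^2 — one hidden layer of quadratic units.
N, d, width = 10, 3, 20               # width > d: over-parameterized
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

W = rng.standard_normal((width, d)) * 0.3
a = rng.standard_normal(width) * 0.3

def loss(W, a):
    Z = X @ W.T                       # (N, width) pre-activations
    return 0.5 * np.mean((Z ** 2 @ a - y) ** 2)

lr = 0.05
initial = loss(W, a)
for _ in range(3000):
    Z = X @ W.T
    err = (Z ** 2 @ a - y) / N        # gradient of the mean squared loss w.r.t. predictions
    ga = (Z ** 2).T @ err             # dL/da
    gW = (2 * err[:, None] * a[None, :] * Z).T @ X   # chain rule through (w_j . x)^2
    W -= lr * gW
    a -= lr * ga

final = loss(W, a)
```

The empirical convergence claimed for the over-parameterized quadratic case shows up even here: the loss decreases steadily on this toy problem, though a single run is of course no substitute for the papers' landscape analysis.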



Loss Surface Modality of Feed-Forward Neural Network Architectures

Fitness landscape analysis is employed to study the modality of neural network loss surfaces under various feed-forward architecture settings and an increase in the problem dimensionality is shown to yield a more searchable and more exploitable loss surface.

Global optimality conditions for deep neural networks

Surprisingly, necessary and sufficient conditions for a critical point of the risk function to be a global minimum provide an efficiently checkable test for global optimality, while such tests are typically intractable in nonconvex optimization.

Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology

It is proved that, for deep networks, a single sufficiently wide layer following the input layer suffices to ensure a similar guarantee of a global minimum for over-parameterized neural networks.

How degenerate is the parametrization of neural networks with the ReLU activation function?

The pathologies which prevent inverse stability in general are presented, and it is shown that by optimizing over such restricted sets, it is still possible to learn any function which can be learned by optimization over unrestricted sets.

Gradient Descent Finds Global Minima of Deep Neural Networks

The current paper proves that gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet), and extends the analysis to deep residual convolutional neural networks, obtaining a similar convergence result.

The loss surface and expressivity of deep convolutional neural networks

We analyze the expressiveness and loss surface of practical deep convolutional neural networks (CNNs) with shared weights and max pooling layers. We show that such CNNs produce linearly independent…



Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs

This work provides the first global optimality guarantee of gradient descent on a convolutional neural network with ReLU activations, and shows that learning is NP-complete in the general case, but that when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time.

On the Quality of the Initial Basin in Overspecified Neural Networks

This work studies the geometric structure of the associated non-convex objective function, in the context of ReLU networks and starting from a random initialization of the network parameters, and identifies some conditions under which the objective becomes more favorable to optimization.

Open Problem: The landscape of the loss surfaces of multilayer networks

The question is whether it is possible to drop some of these assumptions to establish a stronger connection between both models.

Globally Optimal Training of Generalized Polynomial Neural Networks with Nonlinear Spectral Methods

This work shows, under quite weak assumptions on the data, that a particular class of feedforward neural networks can be trained to global optimality with a linear convergence rate using a nonlinear spectral method, the first practically feasible method achieving such a guarantee.

Qualitatively characterizing neural network optimization problems

A simple analysis technique is introduced to look for evidence that state-of-the-art neural networks are overcoming local optima; it finds that, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.
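The straight-path experiment this entry describes can be sketched on a toy problem. As an assumption for the sake of a deterministic illustration, a convex linear-regression loss stands in for a deep network's loss, and `w_init`/`w_sol` are illustrative names; the technique itself is just evaluating the loss along the line segment between initialization and solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear-path analysis: interpolate between an initial parameter
# vector and a solution, and evaluate the loss along the segment.
N, d = 16, 4
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = X @ w_true

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

w_init = rng.standard_normal(d)
w_sol = w_true                        # exact global minimum of this convex toy loss

alphas = np.linspace(0.0, 1.0, 11)
path_losses = [loss((1 - a) * w_init + a * w_sol) for a in alphas]
```

On this convex stand-in the path loss is provably monotone decreasing; the paper's observation is that, empirically, the same obstacle-free profile appears even for genuinely non-convex deep-network losses.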

Theory II: Landscape of the Empirical Risk in Deep Learning

This work proves in the regression framework the existence of a large number of degenerate global minimizers with zero empirical error and proposes an intuitive model of the landscape of DCNN's empirical loss surface, which might not be as complicated as people commonly believe.

Deep Learning without Poor Local Minima

In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumption, we first…

The Loss Surfaces of Multilayer Networks

It is proved that recovering the global minimum becomes harder as the network size increases, and that this is in practice irrelevant, as the global minimum often leads to overfitting.

How far can we go without convolution: Improving fully-connected networks

It is shown that a fully connected network can yield approximately 70% classification accuracy on the permutation-invariant CIFAR-10 task, which is much higher than the current state-of-the-art and 10% short of a decent convolutional network.