# The Loss Surfaces of Multilayer Networks

@inproceedings{Choromaska2015TheLS, title={The Loss Surfaces of Multilayer Networks}, author={Anna Choromańska and Mikael Henaff and Micha{\"e}l Mathieu and G{\'e}rard Ben Arous and Yann LeCun}, booktitle={AISTATS}, year={2015} }

We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled…
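For orientation, the Hamiltonian of the spherical $p$-spin glass model mentioned above is usually written as $H_{N,p}(\sigma) = N^{-(p-1)/2} \sum_{i_1,\dots,i_p=1}^{N} J_{i_1 \dots i_p}\, \sigma_{i_1} \cdots \sigma_{i_p}$, with i.i.d. standard Gaussian couplings $J$ and $\sigma$ constrained to the sphere $\frac{1}{N}\sum_i \sigma_i^2 = 1$. The sketch below evaluates this standard form by brute force for a small $N$. The function names (`sample_couplings`, `sample_sphere`, `hamiltonian`) and the choices $N = 6$, $p = 3$ are illustrative assumptions, and the paper's mapping from network weights to $\sigma$ is not reproduced here.

```python
import itertools
import math
import random


def sample_couplings(n, p, rng):
    """Draw i.i.d. standard Gaussian couplings J[i1, ..., ip] (assumed form)."""
    return {idx: rng.gauss(0.0, 1.0)
            for idx in itertools.product(range(n), repeat=p)}


def sample_sphere(n, rng):
    """Draw a point uniformly on the sphere of radius sqrt(n)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    return [math.sqrt(n) * x / norm for x in g]


def hamiltonian(couplings, sigma, p):
    """Evaluate N^{-(p-1)/2} * sum_J J[idx] * prod(sigma[i] for i in idx)."""
    n = len(sigma)
    total = 0.0
    for idx, j in couplings.items():
        total += j * math.prod(sigma[i] for i in idx)
    return total / n ** ((p - 1) / 2)


# Illustrative sizes only; the sum has n**p terms, so keep n small.
rng = random.Random(0)
n, p = 6, 3
couplings = sample_couplings(n, p, rng)
sigma = sample_sphere(n, rng)
energy = hamiltonian(couplings, sigma, p)
print(f"H(sigma) = {energy:.4f}")
```

The brute-force sum is only meant to make the definition concrete; it does not attempt the large-$N$ analysis of critical points that the abstract refers to.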

## 909 Citations

Topology and Geometry of Half-Rectified Network Optimization

- Computer Science, Mathematics
- ICLR
- 2017

The main theoretical contribution is to prove that half-rectified single layer networks are asymptotically connected, and an algorithm is introduced to efficiently estimate the regularity of the loss level sets on large-scale networks.

The loss surface of deep linear networks viewed through the algebraic geometry lens

- Computer Science, Medicine
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2021

It is shown that in the presence of non-zero regularization, deep linear networks do possess local minima that are not global minima, and that although the number of stationary points increases as the number of neurons increases (or as the regularization parameter decreases), higher-index saddles are surprisingly rare.

Topology and Geometry of Deep Rectified Network Optimization Landscapes

- Computer Science
- 2016

The theoretical work quantifies and formalizes two important folklore facts, introduces an algorithm to efficiently estimate the regularity of the loss level sets on large-scale networks, and shows that these level sets remain connected throughout the learning phase, suggesting near-convex behavior, although they become exponentially more curvy as the energy level decays.

Pure and Spurious Critical Points: a Geometric Study of Linear Networks

- Computer Science, Mathematics
- ICLR
- 2020

This analysis clearly illustrates that the absence of "bad" local minima in the loss landscape of linear networks is due to two distinct phenomena that apply in different settings.

Open Problem: The landscape of the loss surfaces of multilayer networks

- Computer Science, Mathematics
- COLT
- 2015

The question is whether it is possible to drop some of these assumptions to establish a stronger connection between both models.

A Critical View of Global Optimality in Deep Learning

- Mathematics, Computer Science
- ArXiv
- 2018

It is shown that for deep linear networks with differentiable losses, critical points after the multilinear parameterization inherit the structure of critical points of the underlying loss with linear parameterization, and it is proved that for almost all practical datasets there exist infinitely many local minima that are not global.

Piecewise Strong Convexity of Neural Networks

- Computer Science, Mathematics
- NeurIPS
- 2019

The loss surface of a feed-forward neural network with ReLU non-linearities, regularized with weight decay, is studied to prove that local minima of the regularized loss function in this set are isolated, and that every differentiable critical point in this set is a local minimum.

Nonlinearities in activations substantially shape the loss surfaces of neural networks

- Computer Science
- ICLR 2020
- 2020

It is proved that the loss surface of every neural network has infinitely many spurious local minima, which are defined as the local minima with higher empirical risks than the global minima.

Characterization of Gradient Dominance and Regularity Conditions for Neural Networks

- Mathematics, Computer Science
- ArXiv
- 2017

This paper enriches the current understanding of the landscape of the square loss functions for three types of neural networks by providing an explicit characterization of the global minimizers for linear networks, linear residual networks, and nonlinear networks with one hidden layer.

Piecewise linear activations substantially shape the loss surfaces of neural networks

- Computer Science, Mathematics
- ICLR
- 2020

It is proved that the loss surfaces of many neural networks have infinitely many spurious local minima, which are defined as the local minima with higher empirical risks than the global minima, and it is proven that all local minima in a cell constitute an equivalence class.

## References

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

- Computer Science, Physics
- ICLR
- 2014

It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

- Computer Science, Mathematics
- NIPS
- 2014

This paper proposes a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods, and applies this algorithm to deep or recurrent neural network training, and provides numerical evidence for its superior optimization performance.

Neural networks and principal component analysis: Learning from examples without local minima

- Mathematics, Computer Science
- Neural Networks
- 1989

The main result is a complete description of the landscape of the error function E in terms of principal component analysis, showing that E has a unique minimum corresponding to the projection onto the subspace generated by the first principal vectors of a covariance matrix associated with the training patterns.

Complexity of random smooth functions on the high-dimensional sphere

- Mathematics, Physics
- 2013

We analyze the landscape of general smooth Gaussian functions on the sphere in dimension N, when N is large. We give an explicit formula for the asymptotic complexity of the mean number of critical…

Random Matrices and complexity of Spin Glasses

- Mathematics, Physics
- 2010

We give an asymptotic evaluation of the complexity of spherical p-spin spin-glass models via random matrix theory. This study enables us to obtain detailed information about the bottom of the energy…

Replica Symmetry Breaking Condition Exposed by Random Matrix Calculation of Landscape Complexity

- Physics, Mathematics
- 2007

We start with a rather detailed, general discussion of recent results of the replica approach to statistical mechanics of a single classical particle placed in a random N(≫1)-dimensional…

Spin-glass models of neural networks.

- Physics, Medicine
- Physical review. A, General physics
- 1985

Two dynamical models, proposed by Hopfield and Little to account for the collective behavior of neural networks, are analyzed, and it is shown that the long-time behavior of the two models is identical for all temperatures below a transition temperature $T_c$.

Mean-field theory for a spin-glass model of neural networks: TAP free energy and the paramagnetic to spin-glass transition

- Physics
- 1997

An approach is proposed to the Hopfield model where the mean-field treatment is made for a given set of stored patterns (sample) and then the statistical average over samples is taken. This…

ImageNet classification with deep convolutional neural networks

- Computer Science
- Commun. ACM
- 2012

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

On the Distribution of the Roots of Certain Symmetric Matrices

- Mathematics
- 1958

The present article is concerned with the distribution of the latent roots (characteristic values) of certain sets of real symmetric matrices of very high dimensionality. Its purpose is to point out…