Corpus ID: 235196048

Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances

@article{Simsek2021GeometryOT,
  title={Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances},
  author={Berfin {\c{S}}im{\c{s}}ek and Fran{\c{c}}ois Gaston Ged and Arthur Jacot and Francesco Spadaro and Cl{\'e}ment Hongler and Wulfram Gerstner and Johanni Brea},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.12221}
}
We study how permutation symmetries in overparameterized multi-layer neural networks generate ‘symmetry-induced’ critical points. Assuming a network with $L$ layers of minimal widths $r_1^*, \ldots, r_{L-1}^*$ reaches a zero-loss minimum at $r_1^*! \cdots r_{L-1}^*!$ isolated points that are permutations of one another, we show that adding one extra neuron to each layer is sufficient to connect all these previously discrete minima into a single manifold. For a two-layer overparameterized network of width r…
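
As a concrete illustration of this symmetry, the following minimal numpy sketch (a toy example, not the authors' code) checks that permuting the hidden neurons of a two-layer network, with the second-layer weights permuted consistently, leaves the network function and hence the loss unchanged; this is exactly why a zero-loss minimum reappears at every permutation of itself.

import numpy as np

rng = np.random.default_rng(0)
d, r, k = 5, 4, 3                      # input dim, hidden width, output dim
W1 = rng.normal(size=(r, d))           # first-layer weights
W2 = rng.normal(size=(k, r))           # second-layer weights
x = rng.normal(size=d)

def f(W1, W2, x):
    # two-layer network; any elementwise activation works for this argument
    return W2 @ np.tanh(W1 @ x)

P = np.eye(r)[rng.permutation(r)]      # random r x r permutation matrix
# Permute hidden units: rows of W1 and, consistently, columns of W2.
print(np.allclose(f(W1, W2, x), f(P @ W1, W2 @ P.T, x)))  # True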

Embedding Principle: a hierarchical structure of loss landscape of deep neural networks

TLDR
It is shown that the loss landscape of an NN contains all critical points of all narrower NNs, and that any critical embedding has an irreversibility property: the number of negative/zero/positive eigenvalues of the Hessian matrix at a critical point may increase but never decrease as an NN becomes wider through the embedding.

Embedding Principle of Loss Landscape of Deep Neural Networks

TLDR
This work proves an embedding principle: the loss landscape of a DNN “contains” all the critical points of all narrower DNNs. It proposes a critical embedding such that any critical point of a narrower DNN can be embedded into a critical point/affine subspace of the target DNN with higher degeneracy while preserving the DNN output function.
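
To make the kind of construction behind these embedding results concrete, here is a minimal neuron-splitting sketch (an illustration assuming a plain two-layer network, not the paper's general DNN setting): duplicating a hidden unit and splitting its outgoing weight between the two copies widens the network while preserving the output function.

import numpy as np

rng = np.random.default_rng(1)
d, r, k = 5, 3, 2
W1 = rng.normal(size=(r, d))
W2 = rng.normal(size=(k, r))
x = rng.normal(size=d)

def f(W1, W2, x):
    return W2 @ np.tanh(W1 @ x)

j, alpha = 0, 0.3                            # unit to split; any alpha in [0, 1]
W1_wide = np.vstack([W1, W1[j:j+1]])         # duplicate incoming weights of unit j
W2_wide = np.hstack([W2, (1 - alpha) * W2[:, j:j+1]])
W2_wide[:, j] *= alpha                       # original copy keeps a fraction alpha

print(np.allclose(f(W1, W2, x), f(W1_wide, W2_wide, x)))  # True: same function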

Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity

TLDR
A Saddle-to-Saddle dynamics is conjectured: throughout training, gradient descent visits the neighborhoods of a sequence of saddles, each corresponding to linear maps of increasing rank, until it reaches a sparse global minimum.

The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks

TLDR
If the permutation invariance of neural networks is taken into account, SGD solutions will likely have no barrier along the linear interpolation between them, which has implications for the lottery ticket hypothesis, distributed training, and ensemble methods.
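
A hedged sketch of this idea (illustrative names and a cosine-similarity matching criterion chosen for simplicity, not the paper's exact algorithm): permute the hidden units of model B to best match model A, then interpolate the parameters linearly; after a good alignment, the loss along the path is expected to show little or no barrier.

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_hidden_units(W1_a, W1_b, W2_b):
    """Permute B's hidden units so they best match A's (cosine similarity of rows)."""
    normalize = lambda W: W / np.linalg.norm(W, axis=1, keepdims=True)
    similarity = normalize(W1_a) @ normalize(W1_b).T   # (r, r) score matrix
    _, cols = linear_sum_assignment(-similarity)       # maximize total similarity
    return W1_b[cols], W2_b[:, cols]

def interpolate(params_a, params_b, t):
    """Pointwise linear interpolation of matched parameter tensors."""
    return [(1 - t) * a + t * b for a, b in zip(params_a, params_b)]

# Toy usage with random stand-ins for two trained two-layer models:
rng = np.random.default_rng(3)
W1_a, W2_a = rng.normal(size=(8, 5)), rng.normal(size=(2, 8))
W1_b, W2_b = rng.normal(size=(8, 5)), rng.normal(size=(2, 8))
W1_b_al, W2_b_al = align_hidden_units(W1_a, W1_b, W2_b)
midpoint = interpolate([W1_a, W2_a], [W1_b_al, W2_b_al], t=0.5)

Comparing the training loss at several interpolation points for the naive pair versus the aligned pair is the kind of barrier measurement reported in this line of work.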

Noether's Learning Dynamics: Role of Symmetry Breaking in Neural Networks

TLDR
A theoretical framework is developed to study the geometry of learning dynamics in neural networks, and explicit symmetry breaking is revealed as a key mechanism behind the efficiency and stability of modern neural networks.

Embedding Principle in Depth for the Loss Landscape Analysis of Deep Neural Networks

TLDR
An embedding principle in depth is discovered: the loss landscape of an NN “contains” all critical points of the loss landscapes of shallower NNs, which serves as a solid foundation for further study of the role of depth in DNNs.

Random initialisations performing above chance and how to find them

TLDR
A simple but powerful algorithm is used to obtain direct empirical evidence that any two solutions found by SGD can be permuted such that the linear interpolation between their parameters forms a path without significant increases in loss.

Symmetry Teleportation for Accelerated Optimization

TLDR
This work derives loss-invariant group actions for test functions and multi-layer neural networks, proves a necessary condition for when teleportation improves the convergence rate, and shows that the algorithm is closely related to second-order methods.
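
For intuition, here is a minimal sketch of one such loss-invariant action (a toy example assuming a two-layer linear model, not the paper's general setting): rescaling the two layers by an invertible matrix and its inverse keeps the loss fixed but generally changes the gradient norm, which is the degree of freedom that teleportation exploits.

import numpy as np

rng = np.random.default_rng(2)
d, r, k, n = 4, 3, 2, 20
X = rng.normal(size=(d, n))
Y = rng.normal(size=(k, n))
W1, W2 = rng.normal(size=(r, d)), rng.normal(size=(k, r))

def loss(W1, W2):
    return 0.5 * np.sum((W2 @ W1 @ X - Y) ** 2) / n

def grad_norm(W1, W2):
    R = (W2 @ W1 @ X - Y) / n                    # scaled residual
    g1, g2 = W2.T @ R @ X.T, R @ (W1 @ X).T      # dL/dW1, dL/dW2
    return np.sqrt(np.sum(g1 ** 2) + np.sum(g2 ** 2))

g = np.exp(rng.normal(size=r))                   # random positive diagonal G
W1_t, W2_t = g[:, None] * W1, W2 / g[None, :]    # G @ W1 and W2 @ inv(G)
print(np.isclose(loss(W1, W2), loss(W1_t, W2_t)))     # True: same loss
print(grad_norm(W1, W2), grad_norm(W1_t, W2_t))       # generally different norms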

Neural networks embrace learned diversity

Diversity conveys advantages in nature, yet homogeneous neurons typically comprise the layers of artificial neural networks. Here we construct neural networks from neurons that learn their own…

A Topological Centrality Measure for Directed Networks

TLDR
A new metric for computing centrality in directed weighted networks, the quasi-centrality measure, is introduced, along with a method that gives a hierarchical representation of the topological influences of nodes in a directed network.

References

SHOWING 1-10 OF 38 REFERENCES

Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape

TLDR
The geometric approach yields a lower bound on the number of critical points generated by weight-space symmetries and provides a simple intuitive link between previous mathematical results and numerical observations.

Topology and Geometry of Half-Rectified Network Optimization

TLDR
The main theoretical contribution is to prove that half-rectified single layer networks are asymptotically connected, and an algorithm is introduced to efficiently estimate the regularity of such sets on large-scale networks.

The critical locus of overparameterized neural networks

TLDR
The results in this paper provide a starting point to a more quantitative understanding of the properties of various components of the critical locus of the loss function $L$ of overparameterized feedforward neural networks.

Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

TLDR
A case is made that links the two observations: small- and large-batch gradient descent appear to converge to different basins of attraction but are in fact connected through their flat region and so belong to the same basin.

Large Scale Structure of Neural Network Loss Landscapes

TLDR
This work proposes and experimentally verifies a unified phenomenological model of the loss landscape as a set of high-dimensional wedges that together form a large-scale, interconnected structure towards which optimization is drawn.

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

TLDR
Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to be used to show that gradient descent converges to the global optimum at a global linear rate.

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

TLDR
This work shows that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
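
Written out (notation here is ours), the linear model in question is the first-order Taylor expansion of the network output around its initial parameters $\theta_0$, with the empirical neural tangent kernel at initialization governing the resulting dynamics:

$$ f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0), \qquad \Theta_0(x,x') = \nabla_\theta f(x;\theta_0)^{\top}\,\nabla_\theta f(x';\theta_0). $$

In the infinite-width limit studied in the paper, gradient-descent training of the network tracks gradient-descent training of $f_{\mathrm{lin}}$, so the function-space evolution is governed by $\Theta_0$, which stays constant during training.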

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

TLDR
This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.

Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel

TLDR
A large-scale phenomenological analysis of training reveals a striking correlation among a diverse set of metrics over training time, governed by a rapid chaotic-to-stable transition in the first few epochs, which together poses challenges and opportunities for the development of more accurate theories of deep learning.

Semi-flat minima and saddle points by embedding neural networks to overparameterization

TLDR
The results show that networks with smooth and ReLU activations have different partially flat landscapes around the embedded point, and these results are related to a difference in their generalization abilities in the overparameterized realization.