Corpus ID: 235239754

Loss landscapes and optimization in over-parameterized non-linear systems and neural networks

Chaoyue Liu, Libin Zhu, Mikhail Belkin
The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that includes over-parameterized deep neural networks. Our starting observation is that optimization… 
Subquadratic Overparameterization for Shallow Neural Networks
This work provides an analytical framework that allows us to adopt standard initialization strategies, possibly avoid lazy training, and train all layers simultaneously in basic shallow neural networks while attaining a desirable subquadratic scaling on the network width.
A framework for overparameterized learning
This work proposes a framework consisting of a prototype learning problem, which is general enough to cover many popular problems and even the cases of infinitely wide neural networks and in-situ data, and demonstrates that supervised learning, variational autoencoders and training with gradient penalty can be translated to the prototype problem.
Training Multi-Layer Over-Parametrized Neural Network in Subquadratic Time
This work proposes a framework that uses m^2 cost only in the initialization phase and achieves a truly subquadratic cost per iteration in terms of m, i.e., m^(2-Ω(1)) per iteration. It makes use of various techniques, including a shifted ReLU-based sparsifier, a lazy low-rank maintenance data structure, fast rectangular matrix multiplication, tensor-based sketching techniques, and preconditioning.
Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent
A non-convex optimization framework for the analysis of neural network training is proposed, and it is shown that stochastic gradient descent on objectives satisfying proxy convexity or the proxy Polyak-Łojasiewicz inequality leads to efficient guarantees for proxy objective functions.
On generalization bounds for deep networks based on loss surface implicit regularization
This work argues that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces implicit regularization and results in tighter bounds on the generalization error for deep neural networks.
On Feature Learning in Neural Networks with Global Convergence Guarantees
A model of wide multi-layer neural networks whose second-to-last layer is trained via gradient flow (GF) is studied. It is proved that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
Stability & Generalisation of Gradient Descent for Shallow Neural Networks without the Neural Tangent Kernel
We revisit on-average algorithmic stability of Gradient Descent (GD) for training overparameterised shallow neural networks and prove new generalisation and excess risk bounds without the Neural…
Momentum Diminishes the Effect of Spectral Bias in Physics-Informed Neural Networks
This work uses neural tangent kernels (NTKs) to investigate the training dynamics of PINNs evolving under stochastic gradient descent with momentum (SGDM), and demonstrates that SGDM significantly reduces the effect of spectral bias.
Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs
A precise description of the gradient dynamics of training one-hidden-layer ReLU neural networks for the mean squared error at small initialisation is presented, together with a proof that the process follows a specific saddle-to-saddle dynamics.
Convergence of gradient descent for deep neural networks
A new criterion for convergence of gradient descent to a global minimum is presented, which is provably more powerful than the best available criteria from the literature, namely, the Łojasiewicz inequality and its generalizations.