Corpus ID: 235239754

Loss landscapes and optimization in over-parameterized non-linear systems and neural networks

Chaoyue Liu, Libin Zhu, Mikhail Belkin
The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that includes over-parameterized deep neural networks. Our starting observation is that optimization… 
Subquadratic Overparameterization for Shallow Neural Networks
This work provides an analytical framework that allows us to adopt standard initialization strategies, possibly avoid lazy training, and train all layers simultaneously in basic shallow neural networks while attaining a desirable subquadratic scaling on the network width.
Training Multi-Layer Over-Parametrized Neural Network in Subquadratic Time
This work proposes a framework that incurs $m^2$ cost only in the initialization phase and achieves a truly subquadratic cost per iteration in terms of $m$, i.e., $m^{2-\Omega(1)}$ per iteration. It makes use of various techniques, including a shifted ReLU-based sparsifier, a lazy low-rank maintenance data structure, fast rectangular matrix multiplication, tensor-based sketching techniques, and preconditioning.
Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent
A unified non-convex optimization framework for the analysis of neural network training is proposed and it is shown that stochastic gradient descent on objectives satisfying proxy convexity or the proxy Polyak-Lojasiewicz inequality leads to efficient guarantees for proxy objective functions.
On generalization bounds for deep networks based on loss surface implicit regularization
This work argues that, under reasonable assumptions, the local geometry forces SGD to stay close to a low-dimensional subspace, and that this induces implicit regularization and yields tighter bounds on the generalization error for deep neural networks.
On Feature Learning in Neural Networks with Global Convergence Guarantees
A model of wide multi-layer neural networks whose second-to-last layer is trained via gradient flow (GF) is studied. It is proved that, when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
Stability & Generalisation of Gradient Descent for Shallow Neural Networks without the Neural Tangent Kernel
We revisit on-average algorithmic stability of Gradient Descent (GD) for training overparameterised shallow neural networks and prove new generalisation and excess risk bounds without the Neural
Convergence of gradient descent for deep neural networks
A new criterion for convergence of gradient descent to a global minimum is presented, which is provably more powerful than the best available criteria from the literature, namely, the Łojasiewicz inequality and its generalizations.
Faster Single-loop Algorithms for Minimax Optimization without Strong Concavity
New convergence results are established for two alternative single-loop algorithms – alternating GDA and smoothed GDA – under the mild assumption that the objective satisfies the Polyak-Lojasiewicz (PL) condition about one variable.
Improved Overparametrization Bounds for Global Convergence of Stochastic Gradient Descent for Shallow Neural Networks
We study the overparametrization bounds required for the global convergence of stochastic gradient descent algorithm for a class of one hidden layer feed-forward neural networks, considering most of
Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation
Just as a physical prism separates colours mixed within a ray of light, the figurative prism of interpolation helps to disentangle generalization and optimization properties within the complex picture of modern machine learning.