Corpus ID: 238743919

On Convergence of Training Loss Without Reaching Stationary Points

Jingzhao Zhang, Haochuan Li, Suvrit Sra, Ali Jadbabaie
It is a well-known fact that nonconvex optimization is computationally intractable in the worst case. As a result, theoretical analyses of optimization algorithms such as gradient descent often focus on local convergence to stationary points, where the gradient norm is zero or negligible. In this work, we examine the disconnect between the existing theoretical analysis of gradient-based algorithms and actual practice. Specifically, we provide numerical evidence that in large-scale neural…

References
On the convergence of single-call stochastic extra-gradient methods
A synthetic view of Extra-Gradient algorithms is developed, and it is shown that they retain a $\mathcal{O}(1/t)$ ergodic convergence rate in smooth, deterministic problems.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
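As a rough illustration of the "adaptive estimates of lower-order moments" that the summary above refers to, here is a minimal sketch of a single Adam-style update. The function name `adam_step` and the toy quadratic objective are illustrative choices, not part of the paper; default hyperparameters follow the commonly cited values.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using adaptive estimates of the first two moments."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (running mean of squared gradients)
    m_hat = m / (1 - beta1**t)               # bias correction for initialization at zero
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# toy usage: minimize f(x) = x^2 starting from x = 5
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2 * theta                         # gradient of x^2
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Dividing the bias-corrected first moment by the square root of the second moment gives each coordinate an effective step size that adapts to its gradient scale, which is the core mechanism behind the regret bound mentioned above.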
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
It is empirically demonstrated that full-batch gradient descent on neural network training objectives typically operates in a regime the authors call the Edge of Stability, which is inconsistent with several widespread presumptions in the field of optimization.
Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion
This work finds empirically that long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent.
Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate
This paper highlights other ways in which the behavior of normalized nets departs from traditional viewpoints, and initiates a formal framework for studying their mathematics via a suitable adaptation of the conventional framework, namely modeling the SGD-induced training trajectory via a stochastic differential equation with a noise term that captures gradient noise.
How SGD Selects the Global Minima in Over-parameterized Learning : A Dynamical Stability Perspective
This analysis shows that learning rate and batch size play different roles in minima selection, which correlates well with the theoretical findings and provides further support for these claims.
Vortices Instead of Equilibria in MinMax Optimization: Chaos and Butterfly Effects of Online Learning in Zero-Sum Games
It is proved that no meaningful prediction can be made about the day-to-day behavior of online learning dynamics in zero-sum games, and that chaos is robust to all affine variants of zero-sum games, to network variants with an arbitrarily large number of agents, and even to competitive settings beyond these.
ResNet strikes back: An improved training procedure in timm
This paper re-evaluates the performance of the vanilla ResNet-50 when trained with a procedure that integrates such advances, and shares competitive training settings and pre-trained models in the timm open-source library, with the hope that they will serve as better baselines for future work.
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator
This paper proposes a new technique named SPIDER, which can be used to track many deterministic quantities of interest with significantly reduced computational cost and proves that SPIDER-SFO nearly matches the algorithmic lower bound for finding approximate first-order stationary points under the gradient Lipschitz assumption in the finite-sum setting.
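The path-integrated estimator that SPIDER tracks can be sketched on a toy finite-sum problem. The setup below (the function `grad_batch`, the refresh period, and the quadratic objective) is an illustrative assumption, not the paper's experiment; the key line is the recursive update, which corrects the previous estimate with a minibatch gradient difference instead of recomputing the full gradient. For this particular quadratic the difference term happens to be exact, so the estimator equals the full gradient throughout.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy finite-sum objective: f(x) = (1/n) * sum_i 0.5 * (x - a_i)^2
a = rng.normal(size=100)

def grad_batch(x, idx):
    """Minibatch gradient of the toy objective over component indices idx."""
    return np.mean(x - a[idx])

x = 3.0
x_prev = x
v = np.mean(x - a)          # full gradient at the start of the epoch
lr = 0.1
for t in range(1, 200):
    if t % 20 == 0:
        v = np.mean(x - a)  # periodic full-gradient refresh
    else:
        idx = rng.integers(0, 100, size=5)
        # path-integrated update: cheap correction of the running estimate
        v = grad_batch(x, idx) - grad_batch(x_prev, idx) + v
    x_prev, x = x, x - lr * v
```

Because successive iterates are close, the minibatch gradient difference has low variance, which is what lets SPIDER-SFO approach the lower bound for finding approximate first-order stationary points.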