Corpus ID: 238743919

# On Convergence of Training Loss Without Reaching Stationary Points

@article{Zhang2021OnCO,
title={On Convergence of Training Loss Without Reaching Stationary Points},
author={Jingzhao Zhang and Haochuan Li and Suvrit Sra and Ali Jadbabaie},
journal={ArXiv},
year={2021},
volume={abs/2110.06256}
}
It is a well-known fact that nonconvex optimization is computationally intractable in the worst case. As a result, theoretical analysis of optimization algorithms such as gradient descent often focuses on local convergence to stationary points where the gradient norm is zero or negligible. In this work, we examine the disconnect between the existing theoretical analysis of gradient-based algorithms and actual practice. Specifically, we provide numerical evidence that in large-scale neural…

## References

Showing 1–10 of 24 references
• Computer Science, Mathematics
ICLR
• 2020
It is shown that gradient smoothness, a concept central to the analysis of first-order optimization algorithms and often assumed to be a constant, in fact varies significantly along the training trajectory of deep neural networks and, contrary to standard assumptions in the literature, correlates positively with the gradient norm.
On the convergence of single-call stochastic extra-gradient methods
• Computer Science, Mathematics
NeurIPS
• 2019
A synthetic view of Extra-Gradient algorithms is developed, and it is shown that they retain a $\mathcal{O}(1/t)$ ergodic convergence rate in smooth, deterministic problems.
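To make the extra-gradient scheme concrete, here is a minimal sketch (not taken from the paper) of the basic two-call extra-gradient update on the bilinear zero-sum game min_x max_y xy, whose unique saddle point is (0, 0); the step size `eta = 0.1` is an illustrative choice.

```python
def extragradient_step(x, y, eta=0.1):
    """One two-call extra-gradient step on min_x max_y x*y.

    The game's vector field is F(x, y) = (y, -x). First an
    extrapolation (leading) step is taken, then the actual update
    uses the vector field evaluated at the extrapolated point.
    """
    # Extrapolation step
    x_half = x - eta * y
    y_half = y + eta * x
    # Update step, using gradients at the extrapolated point
    x_new = x - eta * y_half
    y_new = y + eta * x_half
    return x_new, y_new


# Usage: plain simultaneous gradient descent-ascent spirals outward on
# this game, while extra-gradient contracts toward the saddle (0, 0).
x, y = 1.0, 1.0
for _ in range(1000):
    x, y = extragradient_step(x, y)
```

The single-call variants studied in the paper reuse a past gradient in place of the extrapolation gradient to halve the per-iteration cost; the sketch above uses the simpler two-call form.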
Adam: A Method for Stochastic Optimization
• Computer Science, Mathematics
ICLR
• 2015
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
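The adaptive moment estimates described above can be sketched in a few lines; this is a minimal illustration with the paper's stated default hyperparameters (lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8), not a production implementation.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on a scalar parameter: update the exponential
    moving averages of the gradient (first moment) and squared
    gradient (second moment), bias-correct them, then take an
    adaptively scaled gradient step. t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Usage: minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```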
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
• Computer Science, Mathematics
ICLR
• 2021
It is empirically demonstrated that full-batch gradient descent on neural network training objectives typically operates in a regime the authors call the Edge of Stability, which is inconsistent with several widespread presumptions in the field of optimization.
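The stability threshold underlying the Edge of Stability regime can be seen on a toy quadratic; the sketch below (an illustration, not the paper's experiment) shows that gradient descent on f(x) = (lam/2) x^2 contracts when the sharpness lam is below 2/eta and diverges when it exceeds that threshold.

```python
def gd_on_quadratic(lam, eta=0.1, x0=1.0, steps=50):
    """Run gradient descent on f(x) = lam/2 * x^2 and return the
    final iterate. Each step multiplies x by (1 - eta * lam), so the
    iterates contract iff |1 - eta * lam| < 1, i.e. lam < 2 / eta."""
    x = x0
    for _ in range(steps):
        x -= eta * lam * x  # gradient of f is lam * x
    return x


# With eta = 0.1 the stability threshold is 2 / eta = 20.
stable = gd_on_quadratic(lam=19.0)    # below threshold: converges
unstable = gd_on_quadratic(lam=21.0)  # above threshold: oscillates and grows
```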
Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion
This work finds empirically that long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent.
On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay
• Computer Science, Mathematics
ArXiv
• 2021
This work rigorously investigates the mechanism underlying the discovered periodic behavior of optimization dynamics and demonstrates that it can be regarded as a generalization of two previously opposing perspectives on training with batch normalization and weight decay: the equilibrium presumption and the instability presumption.
Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate
• Computer Science
NeurIPS
• 2020
This paper highlights other ways in which the behavior of normalized nets departs from traditional viewpoints, and initiates a formal framework for studying their mathematics via a suitable adaptation of the conventional framework, namely modeling the SGD-induced training trajectory via a stochastic differential equation with a noise term that captures gradient noise.
How SGD Selects the Global Minima in Over-parameterized Learning : A Dynamical Stability Perspective
The question of which global minima are accessible to a stochastic gradient descent (SGD) algorithm with a specific learning rate and batch size is studied from the perspective of dynamical stability.
Vortices Instead of Equilibria in MinMax Optimization: Chaos and Butterfly Effects of Online Learning in Zero-Sum Games
• Computer Science, Mathematics
COLT
• 2019
It is proved that no meaningful prediction can be made about the day-to-day behavior of online learning dynamics in zero-sum games, and that chaos is robust to all affine variants of zero-sum games, to network variants with an arbitrarily large number of agents, and even to competitive settings beyond these.
ResNet strikes back: An improved training procedure in timm
• Computer Science
ArXiv
• 2021
This paper re-evaluates the performance of the vanilla ResNet-50 when trained with a procedure that integrates such advances, and shares competitive training settings and pre-trained models in the timm open-source library, in the hope that they will serve as better baselines for future work.