# Adam: A Method for Stochastic Optimization

@article{Kingma2015AdamAM, title={Adam: A Method for Stochastic Optimization}, author={Diederik P. Kingma and Jimmy Ba}, journal={CoRR}, year={2015}, volume={abs/1412.6980} }

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. […] The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence…
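As a rough illustration of "adaptive estimates of lower-order moments", the following minimal sketch applies the commonly cited form of the Adam update to a toy quadratic. The function names (`adam_minimize`, `quad_grad`), the toy objective, and the step count are assumptions made for this example; the decay rates `beta1=0.9`, `beta2=0.999` and `eps=1e-8` are the widely used defaults and are not stated in the abstract above.

```python
import math

def adam_minimize(grad_fn, theta, steps=2000, alpha=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam-style updates on a list of floats (illustrative sketch)."""
    theta = list(theta)
    m = [0.0] * len(theta)  # running estimate of the first moment (mean of gradients)
    v = [0.0] * len(theta)  # running estimate of the second raw moment (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            # Bias correction compensates for initializing m and v at zero.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Hypothetical toy objective: f(x, y) = (x - 3)^2 + 10 * (y + 1)^2
def quad_grad(p):
    return [2 * (p[0] - 3), 20 * (p[1] + 1)]

result = adam_minimize(quad_grad, [0.0, 0.0], steps=5000, alpha=0.01)
print(result)  # expected to end up close to the minimizer (3, -1)
```

In this sketch the very first update has magnitude close to `alpha` in every coordinate, regardless of how large the raw gradient is, which is one way to read the claim that the hyper-parameters have intuitive interpretations.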

## 103,552 Citations

### On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

- Computer Science · ICLR
- 2019

A set of mild sufficient conditions is provided that guarantees the convergence of Adam-type methods, and it is proved that under these conditions the methods achieve a convergence rate of order $O(\log{T}/\sqrt{T})$ for nonconvex stochastic optimization.

### On Adam Trained Models and a Parallel Method to Improve the Generalization Performance

- Computer Science · 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
- 2018

This work analyzes Adam and SGD trained models for 7 popular neural network architectures for image classification tasks using the CIFAR-10 dataset and adopts a K-step model averaging parallel algorithm with the Adam optimizer to bridge the generalization gap.

### A Sufficient Condition for Convergences of Adam and RMSProp

- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019

An alternative easy-to-check sufficient condition is introduced, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization.

### Convergence Guarantees for RMSProp and ADAM in Non-Convex Optimization and an Empirical Comparison to Nesterov Acceleration

- Computer Science
- 2018

This work provides proofs that these adaptive gradient algorithms are guaranteed to reach criticality for smooth non-convex objectives, and gives bounds on the running time of these algorithms.

### Understanding AdamW through Proximal Methods and Scale-Freeness

- Computer Science · ArXiv
- 2022

This paper shows how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-$\ell_2$.
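The distinction between the two variants can be sketched on a single scalar parameter. The code below is an illustrative assumption, not the paper's implementation: `adam_l2_step` folds the $\ell_2$ penalty into the gradient before the moment estimates, while `adamw_step` applies the decay directly to the parameter (the usual decoupled form). The helper names, the toy loss, and the choice `eps=0.0` (which makes the invariance exact up to floating-point rounding in this toy setting) are all hypothetical.

```python
import math

def new_state():
    return {"m": 0.0, "v": 0.0, "t": 0}

def adam_direction(g, state, beta1, beta2, eps):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps) for a scalar."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return m_hat / (math.sqrt(v_hat) + eps)

def adam_l2_step(theta, grad, state, alpha, wd, beta1=0.9, beta2=0.999, eps=0.0):
    # The L2 penalty enters only through its gradient wd * theta.
    g = grad + wd * theta
    return theta - alpha * adam_direction(g, state, beta1, beta2, eps)

def adamw_step(theta, grad, state, alpha, wd, beta1=0.9, beta2=0.999, eps=0.0):
    # The decay is applied to the parameter itself, outside the adaptive scaling.
    direction = adam_direction(grad, state, beta1, beta2, eps)
    return theta - alpha * direction - alpha * wd * theta

def run(step_fn, scale, steps=300, alpha=0.01, wd=1.0):
    """Optimize the toy loss scale * (theta - 1)^2 starting from theta = 0."""
    theta, state = 0.0, new_state()
    for _ in range(steps):
        grad = scale * 2.0 * (theta - 1.0)
        theta = step_fn(theta, grad, state, alpha, wd)
    return theta

# How much does the final parameter change when the loss is rescaled by 100?
adamw_gap = abs(run(adamw_step, 1.0) - run(adamw_step, 100.0))
adam_l2_gap = abs(run(adam_l2_step, 1.0) - run(adam_l2_step, 100.0))
print(adamw_gap, adam_l2_gap)
```

Under these assumptions, rescaling the loss leaves the decoupled-decay trajectory essentially unchanged, whereas the Adam-$\ell_2$ trajectory shifts because the penalty gradient is mixed with the rescaled loss gradient. With a nonzero `eps` the invariance would hold only approximately.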

### Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

- Computer Science · ArXiv
- 2021

This work introduces an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam for solving large-scale non-convex stochastic optimization.

### Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks

- Computer Science · ArXiv
- 2021

Adam-based variants are studied in which the step size is adjusted for each parameter based on the difference between the present and past gradients; the proposed ensemble obtains accuracy comparable to or better than the current state of the art.

### An Adaptive Gradient Method with Energy and Momentum

- Computer Science · Annals of Applied Mathematics
- 2022

A novel algorithm for gradient-based optimization of stochastic objective functions is proposed that converges quickly while generalizing as well as or better than SGD with momentum in training deep neural networks, and that also compares favorably with Adam.

### Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization

- Computer Science · ArXiv
- 2021

It is shown that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired by image data), Adam and gradient descent can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization.

### Adam revisited: a weighted past gradients perspective

- Computer Science · Frontiers of Computer Science
- 2020

It is proved that WADA can achieve a weighted data-dependent regret bound, which could be better than the original regret bound of ADAGRAD when the gradients decrease rapidly, which may partially explain the good performance of ADAM in practice.

## References

### Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

- Computer Science · J. Mach. Learn. Res.
- 2011

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal functions that can be chosen in hindsight.
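A minimal sketch of the diagonal per-coordinate update commonly associated with this line of work (AdaGrad) is given below. It is an assumption-laden illustration, not the paper's general proximal formulation: each coordinate's step is divided by the square root of its accumulated squared gradients. The names `adagrad_minimize` and `toy_grad`, the toy objective, and the values of `alpha` and `eps` are hypothetical.

```python
import math

def adagrad_minimize(grad_fn, theta, steps=500, alpha=0.5, eps=1e-8):
    """Diagonal AdaGrad-style updates on a list of floats (illustrative sketch)."""
    theta = list(theta)
    accum = [0.0] * len(theta)  # per-coordinate sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        for i, gi in enumerate(g):
            accum[i] += gi * gi
            # Coordinates with a history of large gradients take smaller steps.
            theta[i] -= alpha * gi / (math.sqrt(accum[i]) + eps)
    return theta

# Hypothetical toy objective: f(x, y) = (x - 3)^2 + 10 * (y + 1)^2
def toy_grad(p):
    return [2 * (p[0] - 3), 20 * (p[1] + 1)]

adagrad_result = adagrad_minimize(toy_grad, [0.0, 0.0])
print(adagrad_result)  # expected to end up close to the minimizer (3, -1)
```

Because the accumulator only grows, the effective step size in each coordinate shrinks over time, which is the sense in which the learning rate needs less manual tuning in this sketch.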

### Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods

- Computer Science · ICML
- 2014

We present an algorithm for minimizing a sum of functions that combines the computational efficiency of stochastic gradient descent (SGD) with the second order curvature information leveraged by…

### On the importance of initialization and momentum in deep learning

- Computer Science · ICML
- 2013

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.

### Revisiting Natural Gradient for Deep Networks

- Computer Science · ICLR
- 2014

This work describes how unlabeled data can be used to improve the generalization error obtained by natural gradient, and empirically evaluates the robustness of the algorithm to the ordering of the training set compared to stochastic gradient descent.

### Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

- Computer Science, Mathematics · NIPS
- 2011

This work provides a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent as well as a simple modification where iterates are averaged, suggesting that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate, is not robust to the lack of strong convexity or the setting of the proportionality constant.

### Auto-Encoding Variational Bayes

- Computer Science · ICLR
- 2014

A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.

### Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

- Computer Science · NIPS
- 2014

This paper proposes a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods, and applies this algorithm to deep or recurrent neural network training, and provides numerical evidence for its superior optimization performance.

### No more pesky learning rates

- Computer Science · ICML
- 2013

The proposed method to automatically adjust multiple learning rates so as to minimize the expected error at any one time relies on local gradient variations across samples, making it suitable for non-stationary problems.

### A fast natural Newton method

- Computer Science · ICML
- 2010

This paper investigates a natural way of combining the two directions of learning and optimization to yield fast and robust learning algorithms.

### Natural Gradient Works Efficiently in Learning

- Computer Science · Neural Computation
- 1998

The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters.