# Adam: A Method for Stochastic Optimization

@article{Kingma2015AdamAM, title={Adam: A Method for Stochastic Optimization}, author={Diederik P. Kingma and Jimmy Ba}, journal={CoRR}, year={2015}, volume={abs/1412.6980} }

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. [...] The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence…
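The abstract describes the method only in words, so a minimal sketch of the update rule may help. It keeps exponential moving averages of the gradient and of the squared gradient, corrects both for their zero initialization, and scales each parameter's step by the ratio of the two. The function names `adam_minimize` and `quadratic_grad`, the toy quadratic objective, and the plain-list representation are assumptions made for illustration; the default values for `lr`, `beta1`, `beta2`, and `eps` follow the commonly cited settings and are not stated in the text above.

```python
import math


def adam_minimize(grad, theta, steps=1000, lr=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative Adam loop over a list of float parameters."""
    theta = list(theta)
    m = [0.0] * len(theta)  # moving average of gradients (first moment)
    v = [0.0] * len(theta)  # moving average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(theta)
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1 - beta1) * gi
            v[i] = beta2 * v[i] + (1 - beta2) * gi * gi
            # Bias correction: m and v start at zero, so early averages are too small.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def quadratic_grad(theta):
    # Hypothetical test objective: f(x, y) = (x - 3)^2 + (y + 1)^2
    return [2 * (theta[0] - 3.0), 2 * (theta[1] + 1.0)]


first_step = adam_minimize(quadratic_grad, [0.0, 0.0], steps=1, lr=0.01)
result = adam_minimize(quadratic_grad, [0.0, 0.0], steps=5000, lr=0.01)
```

On the very first step the bias-corrected ratio reduces to the sign of the gradient, so every parameter moves by roughly `lr` regardless of gradient scale. This is one way to read the claim that the hyper-parameters have intuitive interpretations.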

#### 72,778 Citations

On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

- Computer Science, Mathematics
- ICLR
- 2019

A set of mild sufficient conditions is provided that guarantees convergence of Adam-type methods, and it is proved that under these conditions the methods achieve a convergence rate of order $O(\log{T}/\sqrt{T})$ for nonconvex stochastic optimization.

A Dynamic Sampling Adaptive-SGD Method for Machine Learning

- Computer Science, Mathematics
- ArXiv
- 2019

A stochastic optimization method is proposed that adaptively controls both the batch size used to compute gradient approximations and the step size used to move along those directions, eliminating the need for the user to tune the learning rate.

On Adam Trained Models and a Parallel Method to Improve the Generalization Performance

- Computer Science
- 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
- 2018

This work analyzes Adam-trained and SGD-trained models for 7 popular neural network architectures on image classification tasks using the CIFAR-10 dataset, and adopts a K-step model averaging parallel algorithm with the Adam optimizer to bridge the generalization gap.

A Sufficient Condition for Convergences of Adam and RMSProp

- Computer Science, Mathematics
- 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019

An alternative easy-to-check sufficient condition is introduced, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization.

Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

- Computer Science, Mathematics
- ArXiv
- 2021

An alternative easy-to-check sufficient condition is introduced, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam for solving large-scale non-convex stochastic optimization.

ADAPTIVE LEARNING RATE METHODS

- 2018

Adam has been shown to fail to converge to the optimal solution in certain cases. Researchers have recently proposed several algorithms to avoid the nonconvergence of Adam, but their efficiency…

Convergence Guarantees for RMSProp and ADAM in Non-Convex Optimization and an Empirical Comparison to Nesterov Acceleration

- Computer Science, Mathematics
- 2018

This work provides proofs that these adaptive gradient algorithms are guaranteed to reach criticality for smooth non-convex objectives, and gives bounds on the running time of these algorithms.

Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks

- Computer Science
- ArXiv
- 2021

In the Adam-based variants considered, the step size is adjusted for each parameter based on the difference between the present and past gradients. The proposed ensemble obtains very high performance, with accuracy comparable to or better than the current state of the art.

Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization

- Computer Science, Mathematics
- ArXiv
- 2021

Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that compared with (stochastic) gradient descent, Adam can…

Adam revisited: a weighted past gradients perspective

- Computer Science, Mathematics
- Frontiers of Computer Science
- 2020

It is proved that WADA can achieve a weighted data-dependent regret bound, which could be better than the original regret bound of ADAGRAD when the gradients decrease rapidly. This may partially explain the good performance of ADAM in practice.

#### References

Showing 1-10 of 28 references

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2011

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
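As a rough illustration of what adapting the step per coordinate looks like in practice, here is a minimal sketch of the widely used diagonal form of this family of methods (commonly called AdaGrad). Each coordinate's step is divided by the square root of its accumulated squared gradients. The names `adagrad_minimize` and `toy_grad`, the toy objective, and the chosen `lr` are assumptions for illustration, not details taken from the summary above.

```python
import math


def adagrad_minimize(grad, theta, steps=500, lr=0.5, eps=1e-8):
    """Illustrative diagonal AdaGrad loop over a list of float parameters."""
    theta = list(theta)
    g2_sum = [0.0] * len(theta)  # running sum of squared gradients per coordinate
    for _ in range(steps):
        g = grad(theta)
        for i, gi in enumerate(g):
            g2_sum[i] += gi * gi
            # Coordinates with a large gradient history take smaller steps.
            theta[i] -= lr * gi / (math.sqrt(g2_sum[i]) + eps)
    return theta


def toy_grad(theta):
    # Hypothetical test objective: f(x, y) = (x - 3)^2 + (y + 1)^2
    return [2 * (theta[0] - 3.0), 2 * (theta[1] + 1.0)]


adagrad_first = adagrad_minimize(toy_grad, [0.0, 0.0], steps=1)
adagrad_result = adagrad_minimize(toy_grad, [0.0, 0.0])
```

Unlike Adam, the accumulator here never decays, so the effective step size can only shrink over time.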

Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods

- Mathematics, Computer Science
- ICML
- 2014

We present an algorithm for minimizing a sum of functions that combines the computational efficiency of stochastic gradient descent (SGD) with the second order curvature information leveraged by…

On the importance of initialization and momentum in deep learning

- Computer Science
- ICML
- 2013

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.

Revisiting Natural Gradient for Deep Networks

- Computer Science, Mathematics
- ICLR
- 2014

This work describes how unlabeled data can be used to improve the generalization error obtained by natural gradient, and empirically evaluates the robustness of the algorithm to the ordering of the training set compared to stochastic gradient descent.

Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

- Computer Science, Mathematics
- NIPS
- 2011

This work provides a non-asymptotic analysis of the convergence of two well-known algorithms: stochastic gradient descent and a simple modification in which iterates are averaged. The analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate, is not robust to the lack of strong convexity or to the setting of the proportionality constant.

Auto-Encoding Variational Bayes

- Mathematics, Computer Science
- ICLR
- 2014

A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

- Computer Science, Mathematics
- NIPS
- 2014

This paper proposes a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high-dimensional saddle points, unlike gradient descent and quasi-Newton methods. It applies this algorithm to deep and recurrent neural network training and provides numerical evidence for its superior optimization performance.

No more pesky learning rates

- Mathematics, Computer Science
- ICML
- 2013

The proposed method automatically adjusts multiple learning rates so as to minimize the expected error at any one time. It relies on local gradient variations across samples, making it suitable for non-stationary problems.

A fast natural Newton method

- Computer Science
- ICML
- 2010

This paper investigates a natural way of combining the two directions of learning and optimization to yield fast and robust learning algorithms.

Natural Gradient Works Efficiently in Learning

- Computer Science, Mathematics
- Neural Computation
- 1998

The dynamical behavior of natural gradient online learning is analyzed and proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters.