• Corpus ID: 6628106

# Adam: A Method for Stochastic Optimization

@article{Kingma2015AdamAM,
  title={Adam: A Method for Stochastic Optimization},
  author={Diederik P. Kingma and Jimmy Ba},
  journal={CoRR},
  year={2015},
  volume={abs/1412.6980}
}
• Published 22 December 2014
• Computer Science
• CoRR
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence…
103,552 Citations
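The update the abstract summarizes maintains exponential moving averages of the gradient and of its square, with a bias correction for the zero initialization. A minimal NumPy sketch (the function name and scalar driver are illustrative; the hyper-parameter defaults follow the paper's suggested settings):

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias-correct the estimates,
    v_hat = v / (1 - beta2 ** t)              # which start at zero
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

On a toy quadratic f(θ) = θ², repeatedly calling `adam_step` with `grad = 2 * theta` drives θ toward 0 at a per-step size of roughly `lr`, since the ratio m̂/√v̂ is approximately ±1 when successive gradients agree in sign.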

## Citations

### On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

• Computer Science
ICLR
• 2019
A set of mild sufficient conditions is provided that guarantees convergence of Adam-type methods, and it is proved that under these conditions the methods achieve a convergence rate of order $O(\log{T}/\sqrt{T})$ for non-convex stochastic optimization.

### On Adam Trained Models and a Parallel Method to Improve the Generalization Performance

• Computer Science
2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
• 2018
This work analyzes Adam and SGD trained models for 7 popular neural network architectures for image classification tasks using the CIFAR-10 dataset and adopts a K-step model averaging parallel algorithm with the Adam optimizer to bridge the generalization gap.

### A Sufficient Condition for Convergences of Adam and RMSProp

• Fangyu Zou, Wei Liu
• Computer Science
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
• 2019
An alternative easy-to-check sufficient condition is introduced, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization.

### Convergence Guarantees for RMSProp and ADAM in Non-Convex Optimization and an Empirical Comparison to Nesterov Acceleration

• Computer Science
• 2018
This work provides proofs that these adaptive gradient algorithms are guaranteed to reach criticality for smooth non-convex objectives, and gives bounds on the running time of these algorithms.

### Understanding AdamW through Proximal Methods and Scale-Freeness

• Computer Science
ArXiv
• 2022
This paper shows how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-ℓ2.
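The distinction this snippet draws can be sketched side by side: AdamW applies weight decay directly to the weights (a proximal-style step on the ℓ2 regularizer), whereas Adam-ℓ2 folds it into the gradient before the moment estimates. A hedged sketch, with an illustrative function name:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    """One AdamW update: the decay term acts directly on the weights.

    Adam-l2 would instead set grad = grad + wd * theta *before* the
    moment updates, so the decay would get rescaled by 1/sqrt(v_hat)
    along with the gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: wd * theta bypasses the adaptive rescaling.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

Because the decay term bypasses the √v̂ rescaling, its effective strength does not depend on gradient magnitudes, which is the scale-freeness property the paper analyzes.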

### Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

• Computer Science
ArXiv
• 2021
This work introduces an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam for solving large-scale non-convex stochastic optimization.

### Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks

• Computer Science
ArXiv
• 2021
This work studies Adam-based variants in which the step size for each parameter is adjusted using the difference between the present and past gradients; the proposed ensemble achieves very high performance, with accuracy comparable to or better than the current state of the art.

### An Adaptive Gradient Method with Energy and Momentum

• Computer Science
Annals of Applied Mathematics
• 2022
A novel algorithm for gradient-based optimization of stochastic objective functions that converges fast while generalizing better than or as well as SGD with momentum in training deep neural networks, and also compares favorably to Adam.

### Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization

• Computer Science
ArXiv
• 2021
It is shown that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired from image data), Adam and gradient descent can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization.

### Adam revisited: a weighted past gradients perspective

• Computer Science
Frontiers of Computer Science
• 2020
It is proved that WADA can achieve a weighted data-dependent regret bound, which can be better than the original regret bound of ADAGRAD when the gradients decrease rapidly; this may partially explain the good performance of ADAM in practice.

## References

Showing 1-10 of 29 references

### Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

• Computer Science
J. Mach. Learn. Res.
• 2011
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.

### Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods

• Computer Science
ICML
• 2014
We present an algorithm for minimizing a sum of functions that combines the computational efficiency of stochastic gradient descent (SGD) with the second-order curvature information leveraged by quasi-Newton methods…

### On the importance of initialization and momentum in deep learning

• Computer Science
ICML
• 2013
It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
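The momentum method this snippet refers to keeps a velocity that accumulates a decaying sum of past gradients. A minimal sketch of classical (heavy-ball) momentum, with illustrative names; the cited work additionally uses a slowly increasing schedule for the momentum coefficient rather than the fixed value shown here:

```python
def momentum_step(theta, grad, velocity, lr=0.01, mu=0.9):
    """One step of SGD with classical momentum.

    mu is the momentum coefficient; the cited work slowly increases
    it during training instead of holding it fixed."""
    velocity = mu * velocity - lr * grad   # decaying accumulation of gradients
    return theta + velocity, velocity
```

On a quadratic f(θ) = θ², iterating with `grad = 2 * theta` converges faster than plain gradient descent at the same learning rate, since the velocity keeps pushing in directions where successive gradients agree.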

### Revisiting Natural Gradient for Deep Networks

• Computer Science
ICLR
• 2014
It is described how one can use unlabeled data to improve the generalization error obtained by natural gradient and empirically evaluate the robustness of the algorithm to the ordering of the training set compared to stochastic gradient descent.

### Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

• Computer Science, Mathematics
NIPS
• 2011
This work provides a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent as well as a simple modification where iterates are averaged, suggesting that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate, is not robust to the lack of strong convexity or the setting of the proportionality constant.

### Auto-Encoding Variational Bayes

• Computer Science
ICLR
• 2014
A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.

### Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

• Computer Science
NIPS
• 2014
This paper proposes a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods, and applies this algorithm to deep or recurrent neural network training, and provides numerical evidence for its superior optimization performance.

### No more pesky learning rates

• Computer Science
ICML
• 2013
The proposed method automatically adjusts multiple learning rates so as to minimize the expected error at any one time; it relies on local gradient variations across samples, making it suitable for non-stationary problems.

### A fast natural Newton method

• Computer Science
ICML
• 2010
This paper investigates a natural way of combining the two directions of learning and optimization to yield fast and robust learning algorithms.

### Natural Gradient Works Efficiently in Learning

• S. Amari
• Computer Science
Neural Computation
• 1998
The dynamical behavior of natural gradient online learning is analyzed and proved to be Fisher efficient, implying that it asymptotically matches the performance of the optimal batch estimation of parameters.