Corpus ID: 6628106

# Adam: A Method for Stochastic Optimization

@article{Kingma2015AdamAM,
title={Adam: A Method for Stochastic Optimization},
author={Diederik P. Kingma and Jimmy Ba},
journal={CoRR},
year={2015},
volume={abs/1412.6980}
}
• Published 2015
• Computer Science, Mathematics
• CoRR
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. [...] The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence…
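As an illustration of what "adaptive estimates of lower-order moments" means in practice, the following is a minimal NumPy sketch of the commonly cited form of the Adam update, not the authors' reference implementation. The function name `adam_minimize`, its signature, and the toy quadratic objective are assumptions made for this example; the default hyper-parameter values are the ones usually quoted for Adam.

```python
import numpy as np


def adam_minimize(grad_fn, theta0, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Hypothetical helper: run `steps` Adam updates starting from theta0.

    grad_fn(theta) must return the (possibly stochastic) gradient at theta.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # running estimate of the first moment (mean)
    v = np.zeros_like(theta)  # running estimate of the second raw moment
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction: m and v start at zero, so early estimates
        # are scaled up to compensate.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        # Per-parameter step: each coordinate is scaled by its own
        # second-moment estimate.
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta


if __name__ == "__main__":
    # Toy objective (an assumption for illustration): f(x) = (x - 3)^2,
    # whose gradient is 2 * (x - 3).
    result = adam_minimize(lambda th: 2.0 * (th - 3.0), np.array([0.0]),
                           lr=0.01, steps=5000)
    print(result)
```

Because the step is the ratio of the two moment estimates, its magnitude is roughly bounded by `lr` regardless of the gradient scale, which is one reading of the abstract's claim that the hyper-parameters have intuitive interpretations.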
72,778 Citations
On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization
• Computer Science, Mathematics
• ICLR
• 2019
A set of mild sufficient conditions is provided that guarantees the convergence of Adam-type methods, and it is proved that under these conditions the methods can achieve a convergence rate of order $O(\log{T}/\sqrt{T})$ for nonconvex stochastic optimization.
A Dynamic Sampling Adaptive-SGD Method for Machine Learning
• Computer Science, Mathematics
• ArXiv
• 2019
A stochastic optimization method is proposed that adaptively controls the batch size used to compute gradient approximations and the step size used to move along such directions, eliminating the need for the user to tune the learning rate.
On Adam Trained Models and a Parallel Method to Improve the Generalization Performance
• Computer Science
• 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
• 2018
This work analyzes Adam- and SGD-trained models for 7 popular neural network architectures on image classification tasks using the CIFAR-10 dataset, and adopts a K-step model averaging parallel algorithm with the Adam optimizer to bridge the generalization gap.
A Sufficient Condition for Convergences of Adam and RMSProp
• Fangyu Zou, Wei Liu
• Computer Science, Mathematics
• 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
• 2019
An alternative easy-to-check sufficient condition is introduced, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization.
Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration
An alternative easy-to-check sufficient condition is introduced, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam for solving large-scale non-convex stochastic optimization.
• Zhiming Zhou
• 2018
Adam has been shown to fail to converge to the optimal solution in certain cases. Researchers have recently proposed several algorithms to avoid the non-convergence of Adam, but their efficiency…
Convergence Guarantees for RMSProp and ADAM in Non-Convex Optimization and an Empirical Comparison to Nesterov Acceleration
• Computer Science, Mathematics
• 2018
This work provides proofs that these adaptive gradient algorithms are guaranteed to reach criticality for smooth non-convex objectives, and gives bounds on the running time of these algorithms.
Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks
• Computer Science
• ArXiv
• 2021
This work studies Adam-based variants in which the step size is adjusted for each parameter based on the difference between the present and past gradients; the proposed ensemble obtains accuracy comparable to or better than the current state of the art.
Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization
• Computer Science, Mathematics
• ArXiv
• 2021
It is proved that WADA can achieve a weighted data-dependent regret bound, which could be better than the original regret bound of ADAGRAD when the gradients decrease rapidly, which may partially explain the good performance of ADAM in practice.

#### References

SHOWING 1-10 OF 28 REFERENCES
• Computer Science, Mathematics
• J. Mach. Learn. Res.
• 2011
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods
• Mathematics, Computer Science
• ICML
• 2014
We present an algorithm for minimizing a sum of functions that combines the computational efficiency of stochastic gradient descent (SGD) with the second order curvature information leveraged by…
On the importance of initialization and momentum in deep learning
• Computer Science
• ICML
• 2013
It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
Revisiting Natural Gradient for Deep Networks
• Computer Science, Mathematics
• ICLR
• 2014
This work describes how unlabeled data can be used to improve the generalization error obtained by natural gradient, and empirically evaluates the robustness of the algorithm to the ordering of the training set compared to stochastic gradient descent.
Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
• Computer Science, Mathematics
• NIPS
• 2011
This work provides a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent as well as a simple modification where iterates are averaged, suggesting that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate, is not robust to the lack of strong convexity or the setting of the proportionality constant.
Auto-Encoding Variational Bayes
• Mathematics, Computer Science
• ICLR
• 2014
A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
• Computer Science, Mathematics
• NIPS
• 2014
This paper proposes a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods, and applies this algorithm to deep or recurrent neural network training, and provides numerical evidence for its superior optimization performance.
No more pesky learning rates
• Mathematics, Computer Science
• ICML
• 2013
The proposed method to automatically adjust multiple learning rates so as to minimize the expected error at any one time relies on local gradient variations across samples, making it suitable for non-stationary problems.
A fast natural Newton method
• Computer Science
• ICML
• 2010
This paper investigates a natural way of combining the two directions of learning and optimization to yield fast and robust learning algorithms.
Natural Gradient Works Efficiently in Learning
• S. Amari
• Computer Science, Mathematics
• Neural Computation
• 1998
The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters.