Adam revisited: a weighted past gradients perspective

@article{Zhong2020AdamRA,
  title={Adam revisited: a weighted past gradients perspective},
  author={Hui Zhong and Zaiyi Chen and Chuan Qin and Zai Huang and Vincent Wenchen Zheng and Tong Xu and Enhong Chen},
  journal={Frontiers of Computer Science},
  year={2020},
  volume={14},
  pages={1-16}
}
Adaptive learning rate methods have been successfully applied in many fields, especially in training deep neural networks. Recent results have shown that adaptive methods with exponentially increasing weights on squared past gradients (i.e., ADAM, RMSPROP) may fail to converge to the optimal solution. Though many algorithms, such as AMSGRAD and ADAMNC, have been proposed to fix the non-convergence issue, achieving a data-dependent regret bound similar to or better than ADAGRAD is still a…
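To make the contrast drawn in the abstract concrete, the snippet below sketches the two weighting schemes on squared past gradients: the exponential moving average used by Adam/RMSProp, where recent gradients dominate, versus AdaGrad's equal-weighted accumulation. This is a minimal NumPy illustration only; the function names and hyperparameter values (`beta2`, `lr`, `eps`) are illustrative choices, not taken from the paper.

```python
import numpy as np

def ema_second_moment(grads, beta2=0.999):
    """Adam/RMSProp-style accumulator: an exponential moving average of
    squared past gradients, so recent gradients carry far more weight."""
    v = np.zeros_like(grads[0])
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g ** 2
    return v

def adagrad_second_moment(grads):
    """AdaGrad-style accumulator: every squared past gradient gets equal weight."""
    return sum(g ** 2 for g in grads)

# The per-coordinate step size is lr / (sqrt(accumulator) + eps); the two
# accumulators can therefore scale the same gradient very differently.
grads = [np.array([0.5, -1.0]), np.array([0.1, 0.2]), np.array([2.0, 0.0])]
lr, eps = 1e-3, 1e-8
print(lr / (np.sqrt(ema_second_moment(grads)) + eps))
print(lr / (np.sqrt(adagrad_second_moment(grads)) + eps))
```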
Citations

A Decreasing Scaling Transition Scheme from Adam to SGD
TLDR
A decreasing scaling transition scheme, called DSTAdam, is proposed to achieve a smooth and stable transition from Adam to SGD, and is verified on the CIFAR-10/100 datasets.
Comparative study of optimization techniques in deep learning: Application in the ophthalmology field
TLDR
A comparative study of stochastic, momentum, Nesterov, AdaGrad, RMSProp, AdaDelta, Adam, AdaMax and Nadam gradient descent algorithms is presented, based on the speed of convergence of these algorithms as well as the mean absolute error of each algorithm in generating an optimization solution.
Gradient-based Learning Methods Extended to Smooth Manifolds Applied to Automated Clustering
Grassmann manifold based sparse spectral clustering is a classification technique that consists in learning a latent representation of data, formed by a subspace basis, which is sparse. In order to …
A Multivariate and Multistage Streamflow Prediction Model Based on Signal Decomposition Techniques with Deep Learning
TLDR
The results show that the proposed model has good prediction skill, and the prediction results of the multistage models are better than those of the single-stage models; however, the most complex models do not give the best results.
Escaping the Big Data Paradigm with Compact Transformers
TLDR
This paper shows for the first time that, with the right size and convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets, and presents an approach for small-scale learning by introducing Compact Transformers.
Flexible Transmitter Network
TLDR
This study provides an alternative basic building block in neural networks and exhibits the feasibility of developing artificial neural networks with neuronal plasticity.
Image Based Malware Classification with Multimodal Deep Learning
TLDR
A novel multimodal convolutional neural network-based deep learning architecture and singular value decomposition-based image feature extraction method are proposed to classify malware files using intermediate-level feature fusion.
Predicting epileptic seizures with a stacked long short-term memory network
TLDR
A sophisticated detection method is presented that correlates a wearer's movement against 12 seizure-related activities prior to formulating a prediction, and successfully differentiates the types of movement.
A neoteric ensemble deep learning network for musculoskeletal disorder classification
TLDR
A comparison has been carried out over different learning rates, drop-out rates, and optimizers to gauge the performance of the proposed ensemble deep learning architecture.
Enhanced DSSM (deep semantic structure modelling) technique for job recommendation
  • Ravita Mishra, S. Rathi · Computer Science · Journal of King Saud University - Computer and Information Sciences · 2021
...

References

Showing 1-10 of 35 references
Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks
TLDR
This work designs a new algorithm, called the partially adaptive momentum estimation method, which unifies Adam/AMSGrad with SGD by introducing a partial adaptive parameter $p$, achieving the best of both worlds.
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
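For reference, the update rule introduced in this paper can be sketched as follows: exponential moving averages of the gradient and of its square, followed by bias correction. This is a minimal NumPy rendition using the default hyperparameters suggested in the paper (lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8); the toy quadratic example is purely illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and
    of the squared gradient (v), followed by bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
theta = np.array([0.5])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
print(theta)  # should end up close to 0
```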
Nostalgic Adam: Weighing more of the past gradients when designing the adaptive learning rate
TLDR
NosAdam can be regarded as a fix to the non-convergence issue of Adam, as an alternative to the recent work of Reddi et al., 2018, and preliminary numerical experiments show that NosAdam is a promising alternative to Adam.
SADAGRAD: Strongly Adaptive Stochastic Gradient Methods
TLDR
This work proposes a simple yet novel variant of ADAGRAD for stochastic (weakly) strongly convex optimization and develops a variant that is adaptive to the (implicit) strong convexity from the data, which together make the proposed algorithm strongly adaptive.
Variants of RMSProp and Adagrad with Logarithmic Regret Bounds
TLDR
This paper analyzes RMSProp, originally proposed for the training of deep neural networks, in the context of online convex optimization, shows $\sqrt{T}$-type regret bounds, and proposes two variants, SC-Adagrad and SC-RMSProp, for which logarithmic regret bounds for strongly convex functions are shown.
Why Does Stagewise Training Accelerate Convergence of Testing Error Over SGD?
TLDR
The proposed algorithm has additional favorable features that come with theoretical guarantees for the considered non-convex optimization problems, including using explicit algorithmic regularization at each stage, using the stagewise averaged solution for restarting, and returning the last stagewise averaged solution as the final solution.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
TLDR
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as those of the best proximal function that can be chosen in hindsight.
signSGD: compressed optimisation for non-convex problems
TLDR
SignSGD can get the best of both worlds: compressed gradients and SGD-level convergence rate, and the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep ImageNet models.
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
TLDR
This paper investigates the optimality of SGD in a stochastic setting, and shows that for smooth problems, the algorithm attains the optimal O(1/T) rate, however, for non-smooth problems the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.
Understanding the difficulty of training deep feedforward neural networks
TLDR
The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
...