• Corpus ID: 211678320

Conjugate-gradient-based Adam for stochastic optimization and its application to deep learning

  • Yu Kobayashi, Hideaki Iiduka
This paper proposes a conjugate-gradient-based Adam algorithm that blends Adam with nonlinear conjugate gradient methods and provides a convergence analysis. Numerical experiments on text classification and image classification show that the proposed algorithm can train deep neural network models in fewer epochs than existing adaptive stochastic optimization algorithms.
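The idea can be sketched in a few lines. This is a minimal illustration only, not the paper's exact algorithm: the fixed mixing weight `gamma` and the way the conjugate-gradient-like direction feeds Adam's moment estimates are assumptions made for the sketch, and the paper's convergence conditions differ in detail.

```python
import numpy as np

def cg_adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, gamma=0.9):
    """One step of an illustrative conjugate-gradient-flavored Adam.

    `gamma` is a hypothetical fixed CG mixing weight; the actual method
    uses nonlinear conjugate gradient formulas to choose it.
    """
    t = state.get("t", 0) + 1
    d_prev = state.get("d", np.zeros_like(grad))
    # Conjugate-gradient-like search direction: mix the current gradient
    # with the previous direction instead of using the raw gradient.
    d = grad + gamma * d_prev
    # Standard Adam moment estimates, driven by d rather than grad.
    m = beta1 * state.get("m", np.zeros_like(grad)) + (1 - beta1) * d
    v = beta2 * state.get("v", np.zeros_like(grad)) + (1 - beta2) * d ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction, as in Adam
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    state.update(t=t, d=d, m=m, v=v)
    return theta, state
```

On a toy quadratic objective, repeated calls drive the parameter toward the minimizer, which is all the sketch is meant to show.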


Adaptive Learning Rate and Momentum for Training Deep Neural Networks

A fast training method motivated by the nonlinear Conjugate Gradient with Quadratic line-search (CGQ) framework is proposed; it yields faster convergence than other local solvers and better generalization capability (test-set accuracy).

Training Deep Neural Networks Using Conjugate Gradient-like Methods

An iterative algorithm is devised that combines existing adaptive learning rate optimization algorithms with conjugate gradient-like methods, which are useful for constrained optimization; the results show that the proposed algorithm with a constant learning rate is superior for training neural networks.



Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
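For reference, the moment estimates summarized above are the standard Adam update:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,
```
```latex
\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.
```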

A Nonlinear Conjugate Gradient Method with a Strong Global Convergence Property

This paper presents a new version of the conjugate gradient method, which converges globally, provided the line search satisfies the standard Wolfe conditions.
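This abstract matches the Dai–Yuan method; assuming that attribution, the search direction is $d_k = -g_k + \beta_k d_{k-1}$ with

```latex
\beta_k = \frac{\lVert g_k \rVert^2}{d_{k-1}^{\top}(g_k - g_{k-1})},
```

and global convergence holds whenever the step size $\alpha_k$ satisfies the standard Wolfe conditions:

```latex
f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k\, g_k^{\top} d_k, \qquad
g(x_k + \alpha_k d_k)^{\top} d_k \ge c_2\, g_k^{\top} d_k, \qquad
0 < c_1 < c_2 < 1.
```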

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
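The core of residual learning is small enough to show directly; this is a toy sketch (the single linear-plus-ReLU "layer" is an assumption for illustration, not the ResNet block structure):

```python
import numpy as np

def residual_block(x, weight):
    """Toy residual block: the layer learns a residual f(x), and the
    identity shortcut adds the input back, so the output is x + f(x)."""
    f_x = np.maximum(0.0, x @ weight)  # stand-in layer: linear + ReLU
    return x + f_x                     # identity shortcut connection
```

Because the shortcut carries the input unchanged, a block that learns $f(x) \approx 0$ behaves as the identity, which is what makes very deep stacks easier to optimize.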

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.

On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions

A generic decoupling technique is presented that enables Rademacher-complexity-based generalization error bounds, and a novel memory-efficient online learning algorithm with bounded regret guarantees is proposed for higher-order learning problems.

Stochastic Fixed Point Optimization Algorithm for Classifier Ensemble

  • H. Iiduka
  • Computer Science, Mathematics
    IEEE Transactions on Cybernetics
  • 2020
It is shown that the classifier ensemble problem can be formulated as a convex stochastic optimization problem over the fixed point set of a quasi-nonexpansive mapping and high classification accuracies of the proposed algorithms are demonstrated through numerical comparisons with the conventional algorithm.


This paper reviews the development of different versions of nonlinear conjugate gradient methods, with special attention given to global convergence properties.

On the momentum term in gradient descent learning algorithms

  • N. Qian
  • Physics, Computer Science
    Neural Networks
  • 1999
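The momentum update analyzed in this line of work is compact enough to state as code (a minimal sketch of classical heavy-ball momentum, with illustrative hyperparameter values):

```python
def sgd_momentum_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    """Classical momentum: accumulate an exponentially decaying velocity
    of past gradients, then move the parameter along that velocity."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity
```

On a smooth quadratic, the velocity term damps oscillations across steep directions and accelerates progress along shallow ones.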

Mask R-CNN

This work presents a conceptually simple, flexible, and general framework for object instance segmentation that outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners.

Methods of conjugate gradients for solving linear systems

An iterative algorithm is given for solving a system Ax = k of n linear equations in n unknowns, and it is shown that this method is a special case of a very general method which also includes Gaussian elimination.
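The textbook form of this algorithm for a symmetric positive-definite system fits in a few lines (a standard sketch, not the paper's original notation):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive-definite A by the
    conjugate gradient method."""
    n = len(b)
    max_iter = max_iter or n  # exact in at most n steps (exact arithmetic)
    x = np.zeros(n)
    r = b - A @ x             # residual
    d = r.copy()              # first search direction
    rs = r @ r
    for _ in range(max_iter):
        Ad = A @ d
        alpha = rs / (d @ Ad)        # exact line search along d
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d    # new A-conjugate direction
        rs = rs_new
    return x
```

Each new direction is A-conjugate to the previous ones, which is why the method terminates in at most n iterations in exact arithmetic.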