• Corpus ID: 239024759

# On the Global Convergence of Momentum-based Policy Gradient

@article{Ding2021OnTG,
title={On the Global Convergence of Momentum-based Policy Gradient},
author={Yuhao Ding and Junzi Zhang and Javad Lavaei},
journal={ArXiv},
year={2021},
volume={abs/2110.10116}
}
• Published 2021
• Computer Science
• ArXiv
Policy gradient (PG) methods are popular and efficient for large-scale reinforcement learning due to their relative stability and incremental nature. In recent years, the empirical success of PG methods has led to the development of a theoretical foundation for these methods. In this work, we generalize this line of research by establishing the first set of global convergence results of stochastic PG methods with momentum terms, which have been demonstrated to be efficient recipes for improving…
## Citations

5 Citations

A general sample complexity analysis of vanilla policy gradient
• Computer Science
AISTATS
• 2022
It is shown that the ABC assumption is more general than the commonly used assumptions on the policy space to prove convergence to a stationary point, and a novel global optimum convergence theory of PG is established with $\widetilde{O}(\epsilon^{-3})$ sample complexity.
Adaptive Momentum-Based Policy Gradient with Second-Order Information
• Computer Science
ArXiv
• 2022
This work proposes a variance reduced policy gradient method, called SGDHess-PG, which incorporates second-order information into stochastic gradient descent (SGD) using momentum with an adaptive learning rate.
On the Global Optimum Convergence of Momentum-based Policy Gradient
• Yuhao Ding, Junzi Zhang
• Computer Science, Mathematics
AISTATS
• 2022
Policy gradient (PG) methods are popular and efficient for large-scale reinforcement learning due to their relative stability and incremental nature. In recent years, the empirical success of PG
Understanding Curriculum Learning in Policy Optimization for Solving Combinatorial Optimization Problems
• Computer Science
ArXiv
• 2022
It is shown that CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and convergence bounds are proved for natural policy gradient (NPG) on LMDPs. The theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in the theorem.
Bregman Gradient Policy Optimization
• Computer Science
ArXiv
• 2021
It is proved that BGPO achieves a sample complexity of $\widetilde{O}(\epsilon^{-4})$ for finding an $\epsilon$-stationary point while requiring only one trajectory at each iteration, and that VR-BGPO reaches the best known sample complexity for finding an $\epsilon$-stationary point.

## References

SHOWING 1-10 OF 82 REFERENCES
On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method
• Computer Science
NeurIPS
• 2021
A Stochastic Incremental Variance-Reduced Policy Gradient (SIVR-PG) approach that improves a sequence of policies to provably converge to the global optimal solution and finds an $\epsilon$-optimal policy using $\widetilde{O}(\epsilon^{-2})$ samples.
An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods
• Mathematics, Computer Science
NeurIPS
• 2020
This paper revisits and improves the convergence of policy gradient, natural PG (NPG) methods, and their variance-reduced variants, under general smooth policy parametrizations, and proposes SRVR-NPG, which incorporates variance reduction into the NPG update.
Momentum-Based Policy Gradient Methods
• Computer Science
ICML
• 2020
A class of efficient momentum-based policy gradient methods for model-free reinforcement learning, which use adaptive learning rates and reach the best known sample complexity of $O(\epsilon^{-3})$ without requiring large batches.
Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator
• Computer Science
ICML
• 2018
This work bridges the gap, showing that (model-free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem-dependent quantities) with regard to their sample and computational complexities.
Infinite-Horizon Policy-Gradient Estimation
• Computer Science
J. Artif. Intell. Res.
• 2001
GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies, is introduced.
Variational Policy Gradient Method for Reinforcement Learning with General Utilities
• Computer Science
NeurIPS
• 2020
A new Variational Policy Gradient Theorem for RL with general utilities is derived, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function.
Hessian Aided Policy Gradient
• Computer Science
ICML
• 2019
This paper presents a Hessian aided policy gradient method with the first improved sample complexity of $O(1/\epsilon^{3})$, which can be implemented in linear time with respect to the parameter dimension and is hence applicable to sophisticated DNN parameterization.
Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines
• Computer Science
ICLR
• 2018
The experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks, and the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.
Global Optimality Guarantees For Policy Gradient Methods
• Computer Science
ArXiv
• 2019
This work identifies structural properties, shared by finite MDPs and several classic control problems, which guarantee that the policy gradient objective function has no suboptimal local minima despite being non-convex.
An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient
• Computer Science
UAI
• 2019
An improved convergence analysis of SVRPG is provided, showing that it can find an $\epsilon$-approximate stationary point of the performance function within $O(1/\epsilon^{5/3})$ trajectories, a sample complexity that improves upon the best known result.