# On the Global Convergence of Momentum-based Policy Gradient

@article{Ding2021OnTG, title={On the Global Convergence of Momentum-based Policy Gradient}, author={Yuhao Ding and Junzi Zhang and Javad Lavaei}, journal={ArXiv}, year={2021}, volume={abs/2110.10116} }

Policy gradient (PG) methods are popular and efficient for large-scale reinforcement learning due to their relative stability and incremental nature. In recent years, the empirical success of PG methods has led to the development of a theoretical foundation for these methods. In this work, we generalize this line of research by establishing the first set of global convergence results of stochastic PG methods with momentum terms, which have been demonstrated to be efficient recipes for improving…

## 5 Citations

A general sample complexity analysis of vanilla policy gradient

- Computer Science
- AISTATS
- 2022

It is shown that the ABC assumption is more general than the assumptions on the policy space commonly used to prove convergence to a stationary point, and a novel global optimum convergence theory of PG is established with $\widetilde{O}(\epsilon^{-3})$ sample complexity.

Adaptive Momentum-Based Policy Gradient with Second-Order Information

- Computer Science
- ArXiv
- 2022

This work proposes a variance reduced policy gradient method, called SGDHess-PG, which incorporates second-order information into stochastic gradient descent (SGD) using momentum with an adaptive learning rate.

On the Global Optimum Convergence of Momentum-based Policy Gradient

- Computer Science, Mathematics
- AISTATS
- 2022

Policy gradient (PG) methods are popular and efficient for large-scale reinforcement learning due to their relative stability and incremental nature. In recent years, the empirical success of PG…

Understanding Curriculum Learning in Policy Optimization for Solving Combinatorial Optimization Problems

- Computer Science
- ArXiv
- 2022

It is shown that CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and convergence bounds are proved for natural policy gradient (NPG) applied to LMDPs. The theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in the theorem.

Bregman Gradient Policy Optimization

- Computer Science
- ArXiv
- 2021

It is proved that BGPO achieves a sample complexity of $\widetilde{O}(\epsilon^{-4})$ for finding an $\epsilon$-stationary point while requiring only one trajectory at each iteration, and that VR-BGPO reaches the best known sample complexity for finding an $\epsilon$-stationary point.

## References

SHOWING 1-10 OF 82 REFERENCES

On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method

- Computer Science
- NeurIPS
- 2021

A Stochastic Incremental Variance-Reduced Policy Gradient (SIVR-PG) approach that improves a sequence of policies to provably converge to the global optimal solution and finds an $\epsilon$-optimal policy using $\widetilde{O}(\epsilon^{-2})$ samples.

An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods

- Mathematics, Computer Science
- NeurIPS
- 2020

This paper revisits and improves the convergence analysis of policy gradient, natural PG (NPG) methods, and their variance-reduced variants under general smooth policy parametrizations, and proposes SRVR-NPG, which incorporates variance reduction into the NPG update.

Momentum-Based Policy Gradient Methods

- Computer Science
- ICML
- 2020

A class of efficient momentum-based policy gradient methods for model-free reinforcement learning, which use adaptive learning rates, do not require large batches, and reach the best known sample complexity of $O(\epsilon^{-3})$.

Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator

- Computer Science
- ICML
- 2018

This work bridges the gap, showing that (model-free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem-dependent quantities) with regard to their sample and computational complexities.

Infinite-Horizon Policy-Gradient Estimation

- Computer Science
- J. Artif. Intell. Res.
- 2001

GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies, is introduced.

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

- Computer Science
- NeurIPS
- 2020

A new Variational Policy Gradient Theorem for RL with general utilities is derived, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function.

Hessian Aided Policy Gradient

- Computer Science
- ICML
- 2019

This paper presents a Hessian aided policy gradient method with the first improved sample complexity of $O(1/\epsilon^{3})$, which can be implemented in linear time with respect to the parameter dimension and is hence applicable to sophisticated DNN parameterization.

Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines

- Computer Science
- ICLR
- 2018

The experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks, and the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.

Global Optimality Guarantees For Policy Gradient Methods

- Computer Science
- ArXiv
- 2019

This work identifies structural properties, shared by finite MDPs and several classic control problems, which guarantee that the policy gradient objective function has no suboptimal local minima despite being non-convex.

An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient

- Computer Science
- UAI
- 2019

An improved convergence analysis of SVRPG is provided, and it is shown that SVRPG can find an $\epsilon$-approximate stationary point of the performance function within $O(1/\epsilon^{5/3})$ trajectories, a sample complexity that improves upon the best known result.