Proximal Policy Gradient: PPO with Policy Gradient
@article{Byun2020ProximalPG,
  title   = {Proximal Policy Gradient: PPO with Policy Gradient},
  author  = {Ju-Seung Byun and Byungmoon Kim and Huamin Wang},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2010.09933}
}
In this paper, we propose a new algorithm, PPG (Proximal Policy Gradient), which is close to both VPG (vanilla policy gradient) and PPO (proximal policy optimization). The PPG objective is a partial variation of the VPG objective, and the gradient of the PPG objective is exactly the same as the gradient of the VPG objective. To increase the number of policy update iterations, we introduce the advantage-policy plane and design a new clipping strategy. We perform experiments in OpenAI Gym and Bullet…
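For reference, the VPG surrogate that PPG is described as modifying is the standard advantage-weighted log-likelihood objective (notation assumed here: $\pi_\theta$ is the policy, $\hat{A}_t$ an advantage estimate); its gradient is the vanilla policy gradient that, per the abstract, PPG preserves exactly. The PPG objective itself and the advantage-policy-plane clipping are defined in the paper and are not reproduced here.

$$ L^{\mathrm{VPG}}(\theta) = \hat{\mathbb{E}}_t\big[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\big], \qquad \nabla_\theta L^{\mathrm{VPG}}(\theta) = \hat{\mathbb{E}}_t\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\big]. $$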
4 Citations
Neural PPO-Clip Attains Global Optimality: A Hinge Loss Perspective
- Computer Science
- 2021
This paper establishes the first global convergence rate of PPO-Clip under neural function approximation and proposes a two-step policy improvement scheme, which facilitates the convergence analysis by decoupling policy search from the complex neural policy parameterization with the help of entropic mirror descent and a regression-based policy update scheme.
Recursive Least Squares Advantage Actor-Critic Algorithms
- Computer Science, ArXiv
- 2022
Two novel RLS-based A2C algorithms, RLSSA2C and RLSNA2C, are proposed; they use the recursive least squares (RLS) method to train the critic network and the hidden layers of the actor network, and show better sample efficiency and higher computational efficiency than two other state-of-the-art algorithms.
DQNAS: Neural Architecture Search using Reinforcement Learning
- Computer Science, ArXiv
- 2023
An automated Neural Architecture Search framework, DQNAS, is proposed, guided by the principles of Reinforcement Learning together with one-shot training, and aims to generate neural network architectures that show superior performance and have minimal scalability problems.
Hinge Policy Optimization: Reinterpreting PPO-Clip and Attaining Global Optimality
- Computer Science
- 2022
This paper proposes to rethink policy optimization through the principle of hinge policy optimization (HPO), which achieves policy improvement by solving a large-margin classification problem with a hinge loss, and thereby reinterprets PPO-Clip as an instance of HPO.
References
SHOWING 1-10 OF 24 REFERENCES
Proximal Policy Optimization Algorithms
- Computer Science, ArXiv
- 2017
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective…
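Concretely, the clipped surrogate at the core of PPO is the following (with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$, advantage estimate $\hat{A}_t$, and clipping parameter $\epsilon$, notation assumed here):

$$ L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big]. $$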
Policy Gradient Methods for Reinforcement Learning with Function Approximation
- Computer Science, NIPS
- 1999
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
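The policy gradient theorem at the heart of this result, in standard notation ($d^\pi$ the discounted state distribution, $Q^\pi$ the action-value function):

$$ \nabla_\theta J(\theta) = \sum_s d^{\pi}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\big]. $$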
Trust Region-Guided Proximal Policy Optimization
- Computer Science, NeurIPS
- 2019
This paper proposes a novel policy optimization method, named Trust Region-Guided PPO (TRGPPO), which adaptively adjusts the clipping range within the trust region, and formally shows that this method not only improves the exploration ability within the trust region but also enjoys a better performance bound compared to the original PPO.
Implementation Matters in Deep RL: A Case Study on PPO and TRPO
- Computer Science, ICLR
- 2020
The results show that algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm are responsible for most of PPO's gain in cumulative reward over TRPO, and fundamentally change how RL methods function.
Trust Region Policy Optimization
- Computer Science, ICML
- 2015
A method for optimizing control policies with guaranteed monotonic improvement is developed; making several approximations to the theoretically justified scheme yields a practical algorithm, called Trust Region Policy Optimization (TRPO).
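The practical TRPO update maximizes an importance-sampled surrogate subject to a KL-divergence trust-region constraint (radius $\delta$; other notation as above):

$$ \max_\theta\ \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\big] \le \delta. $$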
Continuous control with deep reinforcement learning
- Computer Science, ICLR
- 2016
This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
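The deterministic policy gradient that this algorithm (DDPG) ascends, for a deterministic actor $\mu_\theta$ and critic $Q_\phi$ (symbols assumed here):

$$ \nabla_\theta J(\theta) = \mathbb{E}_{s}\big[\nabla_a Q_\phi(s, a)\big|_{a=\mu_\theta(s)}\,\nabla_\theta \mu_\theta(s)\big]. $$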
Understanding the impact of entropy on policy optimization
- Computer Science, ICML
- 2019
New tools for understanding the optimization landscape are presented, it is shown that policy entropy serves as a regularizer, and the challenge of designing general-purpose policy optimization algorithms is highlighted.
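One common form of the entropy-regularized objective studied in this line of work (the paper's exact formulation may differ; $\tau$ is an assumed temperature weighting the policy entropy $\mathcal{H}$):

$$ J_\tau(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_t \gamma^t\big(r(s_t, a_t) + \tau\,\mathcal{H}(\pi_\theta(\cdot \mid s_t))\big)\right]. $$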
High-Dimensional Continuous Control Using Generalized Advantage Estimation
- Computer Science, ICLR
- 2016
This work addresses the large number of samples typically required and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias.
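The generalized advantage estimator introduced here, built from TD residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, with $\lambda$ trading off bias against variance:

$$ \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l}. $$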
Regularization Matters in Policy Optimization - An Empirical Study on Continuous Control
- Computer Science, ICLR
- 2021
This work presents the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks, and discusses and analyzes why regularization may help generalization in RL from four perspectives: sample complexity, reward distribution, weight norm, and noise robustness.
Addressing Function Approximation Error in Actor-Critic Methods
- Computer Science, ICML
- 2018
This paper builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
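The clipped double-Q target used to limit overestimation takes the minimum over a pair of target critics $Q_{\phi'_1}, Q_{\phi'_2}$ at the target policy's (noise-smoothed) action $\tilde{a}$ (notation assumed here):

$$ y = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', \tilde{a}). $$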