Corpus ID: 224803437

Proximal Policy Gradient: PPO with Policy Gradient

Ju-Seung Byun, Byungmoon Kim, Huamin Wang
In this paper, we propose a new algorithm, PPG (Proximal Policy Gradient), which is close to both VPG (vanilla policy gradient) and PPO (proximal policy optimization). The PPG objective is a partial variation of the VPG objective, and the gradient of the PPG objective is exactly the same as the gradient of the VPG objective. To increase the number of policy update iterations, we introduce the advantage-policy plane and design a new clipping strategy. We perform experiments in OpenAI Gym and Bullet… 
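The abstract names the two objectives PPG sits between but does not state them. As background, here is a minimal sketch of those two baselines, the vanilla policy gradient surrogate and PPO's clipped surrogate; this is standard material, not the paper's own PPG objective (which is not reproduced here).

```python
import numpy as np

def vpg_loss(logp, adv):
    """Vanilla policy gradient surrogate loss: -E[log pi(a|s) * A]."""
    return -np.mean(logp * adv)

def ppo_clip_loss(logp, logp_old, adv, eps=0.2):
    """PPO clipped surrogate loss.

    The probability ratio r = pi(a|s) / pi_old(a|s) is clipped to
    [1 - eps, 1 + eps], and the pessimistic (minimum) surrogate is used.
    """
    ratio = np.exp(logp - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return -np.mean(np.minimum(ratio * adv, clipped * adv))
```

When the sampling and current policies coincide (ratio = 1), the clipped surrogate reduces to the mean advantage, which is the sense in which PPO constrains how far each update can move from the data-collecting policy.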

Figures and Tables from this paper

Neural PPO-Clip Attains Global Optimality: A Hinge Loss Perspective

This paper establishes the first global convergence rate of PPO-Clip under neural function approximation and proposes a two-step policy improvement scheme, which facilitates the convergence analysis by decoupling policy search from the complex neural policy parameterization with the help of entropic mirror descent and a regression-based policy update scheme.

Recursive Least Squares Advantage Actor-Critic Algorithms

Two novel RLS-based A2C algorithms are proposed, called RLSSA2C and RLSNA2C, which use the RLS method to train the critic network and the hidden layers of the actor network, and which have better sample efficiency and higher computational efficiency than two other state-of-the-art algorithms.

DQNAS: Neural Architecture Search using Reinforcement Learning

An automated Neural Architecture Search framework, DQNAS, is proposed, guided by the principles of Reinforcement Learning along with One-Shot Training, which aims to generate neural network architectures that show superior performance and have minimal scalability problems.

Hinge Policy Optimization: Reinterpreting PPO-Clip and Attaining Global Optimality

This paper proposes to rethink policy optimization via the principle of hinge policy optimization (HPO), which achieves policy improvement by solving a large-margin classification problem with hinge loss, and thereby reinterprets PPO-Clip as an instance of HPO.



Proximal Policy Optimization Algorithms

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.

Policy Gradient Methods for Reinforcement Learning with Function Approximation

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
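The convergence result summarized above rests on the policy gradient theorem, which in its standard form (notation assumed here, not taken from this page) expresses the gradient of the expected return $J(\theta)$ without differentiating the state distribution:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta}
    \big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a) \big]
```

Here $d^{\pi}$ is the discounted state-visitation distribution and $Q^{\pi}$ the action-value function; compatible function approximation replaces $Q^{\pi}$ with a learned estimate without biasing the gradient.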

Trust Region-Guided Proximal Policy Optimization

This paper proposes a novel policy optimization method, named Trust Region-Guided PPO (TRGPPO), which adaptively adjusts the clipping range within the trust region, and formally shows that this method not only improves the exploration ability within the trust region but also enjoys a better performance bound compared to the original PPO.

Implementation Matters in Deep RL: A Case Study on PPO and TRPO

The results show that algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm are responsible for most of PPO's gain in cumulative reward over TRPO, and fundamentally change how RL methods function.

Trust Region Policy Optimization

This paper describes a method for optimizing control policies with guaranteed monotonic improvement, and, by making several approximations to the theoretically justified scheme, derives a practical algorithm called Trust Region Policy Optimization (TRPO).
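The trust-region constraint that TRPO imposes, and that PPO's clipping approximates, can be written as the standard constrained surrogate problem (notation assumed, not taken from this page):

```latex
\max_{\theta} \;
  \mathbb{E}\!\left[
    \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
    \, A^{\pi_{\theta_{\mathrm{old}}}}(s, a)
  \right]
\quad \text{s.t.} \quad
  \mathbb{E}\big[ D_{\mathrm{KL}}\big( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s)
    \,\|\, \pi_\theta(\cdot \mid s) \big) \big] \le \delta
```

The KL bound $\delta$ limits how far each update may move the policy, which is the guarantee PPO trades for a simpler, unconstrained clipped objective.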

Continuous control with deep reinforcement learning

This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.

Understanding the impact of entropy on policy optimization

New tools for understanding the optimization landscape are presented; it is shown that policy entropy serves as a regularizer, and the challenge of designing general-purpose policy optimization algorithms is highlighted.

High-Dimensional Continuous Control Using Generalized Advantage Estimation

This work addresses two challenges of policy gradient methods, the large number of samples typically required and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data, by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias.
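The variance-reduction scheme summarized above, Generalized Advantage Estimation, is a backward recursion over TD residuals. A minimal single-episode sketch (function name and array layout are my own, not from this page):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` must have length len(rewards) + 1, with the bootstrap value
    of the final state appended. lam trades bias (lam=0, one-step TD)
    against variance (lam=1, full Monte Carlo returns).
    """
    adv = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Discounted, lambda-weighted sum of future residuals
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

With `gamma = lam = 1` and a zero value function, the advantages reduce to plain reward-to-go sums, which is a quick sanity check on the recursion.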

Regularization Matters in Policy Optimization - An Empirical Study on Continuous Control

This work presents the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks, and discusses and analyzes why regularization may help generalization in RL from four perspectives: sample complexity, reward distribution, weight norm, and noise robustness.

Addressing Function Approximation Error in Actor-Critic Methods

This paper builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
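The overestimation fix summarized above, taking the minimum of two critic estimates when forming the bootstrap target, is a one-liner. A hedged sketch of that target computation (function name and signature are illustrative, not the paper's code):

```python
def clipped_double_q_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    """Bootstrap target using the minimum of two critic estimates.

    Using min(Q1, Q2) rather than a single critic's value curbs the
    overestimation bias that a maximizing actor would otherwise exploit.
    Terminal transitions (done=True) drop the bootstrap term entirely.
    """
    q_next = min(q1_next, q2_next)
    return reward + (0.0 if done else gamma * q_next)
```

In a full actor-critic loop this target would be computed from slowly updated target networks, which is the other half of the paper's connection between target networks and overestimation bias.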