Corpus ID: 28695052

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments…
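The clipped surrogate objective at the heart of PPO can be sketched in a few lines of NumPy; the function name and batch handling below are illustrative, but the formula and the default clipping parameter of 0.2 come from the paper:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective L^CLIP.

    ratio:     probability ratios pi_new(a|s) / pi_old(a|s)
    advantage: advantage estimates A_t
    eps:       clipping parameter (0.2 is the paper's default)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Elementwise minimum removes the incentive to move the ratio
    # outside [1 - eps, 1 + eps]; average over the sampled batch.
    return np.mean(np.minimum(unclipped, clipped))
```

In practice this objective is maximized with several epochs of minibatch stochastic gradient ascent on each batch of sampled trajectories, which is what distinguishes PPO from vanilla policy gradient methods that take one update per sample.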


Memory-Constrained Policy Optimization

A new constrained optimization method for policy gradient reinforcement learning, which uses two trust regions to regulate each policy update, and a mechanism to automatically build the virtual policy from a memory of past policies, providing a new capability for dynamically selecting appropriate trust regions during the optimization process.

Proximal Policy Optimization Smoothed Algorithm

This work presents a PPO variant, named Proximal Policy Optimization Smoothed Algorithm (PPOS), whose critical improvement is the use of a functional clipping method instead of a flat clipping method; this method is proved to conduct more accurate updates at each time step than other PPO methods.

Supervised Policy Update for Deep Reinforcement Learning

This work proposes a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning, which formulates and solves a constrained optimization problem in the non-parameterized proximal policy space, and converts the ideal policy to a parameterized policy, from which it draws new samples.

Supervised Policy Update

A methodology for finding an optimal policy in the non-parameterized policy space, and it is shown how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology.

Evolved Policy Gradients

Empirical results show that the evolved policy gradient algorithm (EPG) achieves faster learning on several randomized environments compared to an off-the-shelf policy gradient method, and its learned loss can generalize to out-of-distribution test time tasks, and exhibits qualitatively different behavior from other popular metalearning algorithms.

Batch Reinforcement Learning Through Continuation Method

This work proposes a simple yet effective policy iteration approach to batch RL using global optimization techniques known as continuation, constraining the difference between the learned policy and the behavior policy that generates the fixed trajectories, and continuously relaxing the constraint.

A Nonparametric Off-Policy Policy Gradient

This work constructs a nonparametric Bellman equation in a principled manner, which allows for closed-form estimates of the value function and an analytic expression for the full policy gradient, and demonstrates better sample efficiency than state-of-the-art policy gradient methods.

Pareto Policy Adaptation

This work introduces Pareto Policy Adaptation (PPA), a loss function that adapts the policy to be optimal with respect to any distribution over preferences, and uses implicit differentiation to back-propagate the loss gradient bypassing the operations of the projected gradient descent solver.

Trajectory-Based Off-Policy Deep Reinforcement Learning

The results show that the proposed algorithm is able to successfully and reliably learn solutions using fewer system interactions than standard policy gradient methods, and is amenable to standard neural network optimization strategies.

Multi-Objective Exploration for Proximal Policy Optimization

This work proposes a model that learns the designated reward under numerous conditions, alleviating the dependence on reward design by executing the Preferent Surrogate Objective (PSO), and makes full use of Curiosity Driven Exploration to increase exploration ability.

References

Trust Region Policy Optimization

A method for optimizing control policies, with guaranteed monotonic improvement, by making several approximations to the theoretically-justified scheme, called Trust Region Policy Optimization (TRPO).
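TRPO's practical algorithm enforces its trust region with a backtracking line search: a proposed update is shrunk until the KL divergence to the old policy stays within a bound and the surrogate objective improves. The sketch below shows only that acceptance test for a softmax policy (the natural-gradient step that produces `full_step` is omitted, and all names are illustrative):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def backtracking_step(logits, full_step, surrogate, delta=0.01, max_backtracks=10):
    """Halve the proposed update until the KL constraint holds and the
    surrogate objective strictly improves; return the old logits if no
    acceptable step is found."""
    old_probs = softmax(logits)
    base = surrogate(old_probs)
    for i in range(max_backtracks):
        new_logits = logits + (0.5 ** i) * full_step
        new_probs = softmax(new_logits)
        if kl(old_probs, new_probs) <= delta and surrogate(new_probs) > base:
            return new_logits
    return logits
```

PPO's clipped objective was designed to get a similar effect with plain first-order optimization, avoiding this constrained search entirely.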

High-Dimensional Continuous Control Using Generalized Advantage Estimation

This work addresses the large number of samples typically required and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias.
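The GAE estimator itself is a short backward recursion over TD residuals, A_t = delta_t + (gamma * lambda) * A_{t+1} with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). A minimal sketch (function name and array layout are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), length T + 1; the last entry is the
             bootstrap value for the final state.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD residual at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

The parameter lambda trades bias against variance: lambda = 0 reduces to the one-step TD residual, lambda = 1 to the full Monte Carlo advantage.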

Asynchronous Methods for Deep Reinforcement Learning

A conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers and shows that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

Emergence of Locomotion Behaviours in Rich Environments

This paper explores how a rich environment can help to promote the learning of complex behaviours, and finds that this encourages the emergence of robust behaviours that perform well across a suite of tasks.

Benchmarking Deep Reinforcement Learning for Continuous Control

This work presents a benchmark suite of continuous control tasks, including classic tasks like cart-pole swing-up, tasks with very high state and action dimensionality such as 3D humanoid locomotion, tasks with partial observations, and tasks with hierarchical structure.

Simple statistical gradient-following algorithms for connectionist reinforcement learning

This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units that are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates.
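The REINFORCE estimator underlying these algorithms scales the score function, the gradient of log pi(a|s), by the observed return. For a softmax policy that gradient has the well-known closed form onehot(a) - pi; a minimal sketch (names are illustrative):

```python
import numpy as np

def reinforce_gradient(logits, action, ret):
    """Single-sample REINFORCE gradient, grad log pi(action) * return,
    for a softmax policy over discrete actions.

    For softmax, d log pi(a) / d logits = onehot(a) - pi.
    """
    z = np.exp(logits - logits.max())
    pi = z / z.sum()
    grad_logp = -pi
    grad_logp[action] += 1.0
    return ret * grad_logp
```

Ascending this gradient increases the log-probability of actions in proportion to the reinforcement they received, without ever differentiating through the environment.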

Human-level control through deep reinforcement learning

This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

MuJoCo: A physics engine for model-based control

A new physics engine tailored to model-based control, based on the modern velocity-stepping approach which avoids the difficulties with spring-dampers, which can compute both forward and inverse dynamics.

Learning Tetris Using the Noisy Cross-Entropy Method

Noise is applied to prevent early convergence of the cross-entropy method, using Tetris, a computer game, for demonstration, and the resulting policy outperforms previous RL algorithms by almost two orders of magnitude.
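The modification amounts to adding an extra noise term to the elite-sample variance at each iteration of the cross-entropy method, so the search distribution cannot collapse prematurely. A minimal sketch on a toy objective (the constants and decay schedule here are illustrative, not the paper's):

```python
import numpy as np

def noisy_cem(objective, dim, iters=50, pop=100, elite_frac=0.1, noise=0.5, seed=0):
    """Cross-entropy method with decaying noise added to the elite
    standard deviation, to prevent premature convergence."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    std = np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        samples = mean + std * rng.standard_normal((pop, dim))
        scores = np.array([objective(s) for s in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        # The added noise term keeps the distribution from collapsing.
        std = elite.std(axis=0) + noise
        noise *= 0.9  # decay the noise over iterations
    return mean
```

Without the noise term, the fitted standard deviation can shrink to near zero before the mean reaches a good region, which is exactly the failure mode observed on Tetris.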

OpenAI Gym

This whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software.
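The central design decision is a uniform environment interface: every environment exposes reset and step, and step returns observation, reward, done flag, and an info dict. The toy environment below mirrors that classic four-tuple interface (newer Gym/Gymnasium releases changed the signatures; this class is illustrative and not part of Gym itself):

```python
class BanditEnv:
    """A toy environment following the Gym-style reset/step interface."""

    def __init__(self):
        self._t = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self._t = 0
        return 0  # single dummy observation

    def step(self, action):
        """Advance one timestep; reward action 1, end after 10 steps."""
        self._t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self._t >= 10
        return 0, reward, done, {}  # obs, reward, done, info

# The agent-environment loop this interface encourages:
env = BanditEnv()
obs = env.reset()
total, done = 0.0, False
while not done:
    obs, reward, done, info = env.step(1)
    total += reward
```

Because every benchmark task shares this loop, an agent written once can be evaluated across the whole suite unchanged.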