# High-Dimensional Continuous Control Using Generalized Advantage Estimation

```bibtex
@article{Schulman2016HighDimensionalCC,
  title   = {High-Dimensional Continuous Control Using Generalized Advantage Estimation},
  author  = {John Schulman and Philipp Moritz and Sergey Levine and Michael I. Jordan and P. Abbeel},
  journal = {CoRR},
  year    = {2016},
  volume  = {abs/1506.02438}
}
```

Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the…
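The variance reduction the abstract refers to is the paper's generalized advantage estimator, which discounts TD residuals with a factor λ. A minimal NumPy sketch (not the authors' code; the function name and array layout are illustrative) of that computation:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation:
        A_t = sum_l (gamma * lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` carries one extra entry for the bootstrap value V(s_T)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    # One-step TD residuals, computed for all timesteps at once.
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Accumulate the discounted sum of residuals from the end of the trajectory.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

With λ = 0 this reduces to the one-step TD residual (low variance, more bias); with λ = 1 it is the discounted return minus the value baseline (unbiased given the true value function, high variance).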

## 1,510 Citations

Continuous Deep Q-Learning with Model-based Acceleration

- Computer Science · ICML
- 2016

This paper derives a continuous variant of the Q-learning algorithm, called normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods, and substantially improves performance on a set of simulated robotic control tasks.
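The NAF idea is to parameterize Q so its maximizer over continuous actions is available in closed form. A small illustrative sketch (assumed parameterization, not the authors' code): Q(s, a) = V(s) − ½ (a − μ)ᵀ P (a − μ), with P = L Lᵀ positive semi-definite, so argmaxₐ Q(s, a) = μ.

```python
import numpy as np

def naf_q_value(state_value, mu, L, action):
    """Quadratic-advantage Q-value in the NAF style.
    P = L @ L.T is positive semi-definite, so the advantage term is
    always <= 0 and Q is maximized exactly at action == mu."""
    P = L @ L.T
    diff = action - mu
    advantage = -0.5 * diff @ P @ diff
    return state_value + advantage
```

In the actual method, V, μ, and L are all outputs of a neural network conditioned on the state; the sketch only shows the algebraic decomposition.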

The Beta Policy for Continuous Control Reinforcement Learning

- 2017

Recently, reinforcement learning with deep neural networks has achieved great success in challenging continuous control problems such as 3D locomotion and robotic manipulation. However, in real-world…

Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution

- Computer Science · ICML
- 2017

It is shown that the Beta policy is bias-free and provides significantly faster convergence and higher scores than the Gaussian policy when both are used with trust region policy optimization and actor-critic with experience replay, the state-of-the-art on- and off-policy stochastic methods respectively, on OpenAI Gym's and MuJoCo's continuous control environments.
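The bias-free claim rests on the Beta distribution having bounded support: unlike a Gaussian that must be clipped at the action limits, all of its probability mass already lies inside the bounds. A minimal sketch of such a policy's sampling step (illustrative names and parameters, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_beta_action(alpha, beta, low, high):
    """Sample from Beta(alpha, beta) on [0, 1], then rescale linearly to the
    bounded action interval [low, high]. No clipping is ever needed, which is
    why the resulting policy gradient avoids the boundary bias of a clipped
    Gaussian."""
    x = rng.beta(alpha, beta)
    return low + (high - low) * x
```

In a full agent, alpha and beta would be (positive) outputs of the policy network, e.g. via a softplus activation.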

An Empirical Analysis of Measure-Valued Derivatives for Policy Gradients

- Computer Science · 2021 International Joint Conference on Neural Networks (IJCNN)
- 2021

This work empirically evaluates this estimator in the actor-critic policy gradient setting and shows that it can reach comparable performance with methods based on the likelihood-ratio or reparametrization tricks, both in low and high-dimensional action spaces.

Proximal Policy Optimization with Continuous Bounded Action Space via the Beta Distribution

- Computer Science · ArXiv
- 2021

This work investigates how the Beta policy performs when trained by the Proximal Policy Optimization (PPO) algorithm on two continuous control tasks from OpenAI Gym, and finds that the Beta policy is superior to the Gaussian policy in terms of the agent's final expected reward.

Marginal Policy Gradients for Complex Control

- Economics, Computer Science · ArXiv
- 2018

The marginal policy gradient framework is introduced, a powerful technique to obtain variance-reduced policy gradients for arbitrary T, and it is shown that marginal policy gradients are guaranteed to reduce variance, quantifying that reduction exactly.

Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator

- Computer Science, Mathematics · ICML
- 2018

This work gives the first finite-time analysis of the number of samples needed to estimate the value function for a fixed static state-feedback policy to within $\varepsilon$-relative error.

PODS: Policy Optimization via Differentiable Simulation

- Computer Science · ICML
- 2021

This paper explores a systematic way of leveraging the additional information provided by an emerging class of differentiable simulators to directly compute the analytic gradient of a policy's value function with respect to the actions it outputs. This approach consistently leads to better asymptotic behavior across a set of payload manipulation tasks that demand a high degree of accuracy and precision.

Reinforcement learning for control: Performance, stability, and deep approximators

- Computer Science · Annu. Rev. Control.
- 2018

This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer, and explains how approximate representations of the solution make RL feasible for problems with continuous states and control actions.

MBVI: Model-Based Value Initialization for Reinforcement Learning

- 2020

Model-free reinforcement learning (RL) is capable of learning control policies for high-dimensional, complex robotic tasks, but tends to be data inefficient. Model-based RL and optimal control have…

## References

Showing 1–10 of 35 references

Real-time reinforcement learning by sequential Actor-Critics and experience replay

- Computer Science, Medicine · Neural Networks
- 2009

It is formally shown that the resulting estimation bias is bounded and asymptotically vanishes, which allows the experience replay-augmented algorithm to preserve the convergence properties of the original algorithm.

Policy Gradient Methods for Reinforcement Learning with Function Approximation

- Mathematics, Computer Science · NIPS
- 1999

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

Stochastic policy gradient reinforcement learning on a simple 3D biped

- Engineering, Computer Science · 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566)
- 2004

We present a learning system which is able to quickly and reliably acquire a robust feedback control policy for 3D dynamic walking from a blank slate using only trials implemented on our physical…

Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

- Mathematics, Computer Science · J. Mach. Learn. Res.
- 2004

This paper considers variance reduction methods that were developed for Monte Carlo estimates of integrals, and gives bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system.

Continuous control with deep reinforcement learning

- Computer Science, Mathematics · ICLR
- 2016

This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.

Reinforcement Learning in POMDP's via Direct Gradient Ascent

- Mathematics, Computer Science · ICML
- 2000

GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy, is introduced, and its convergence is proved.

Approximate Gradient Methods in Policy-Space Optimization of Markov Reward Processes

- Computer Science · Discret. Event Dyn. Syst.
- 2003

This work considers a discrete-time, finite-state Markov reward process that depends on a set of parameters and proposes two approaches to reducing the variance; bounds are derived for the resulting bias terms, and the asymptotic behavior of the resulting algorithms is characterized.

Natural Actor-Critic

- Sociology, Computer Science · ECML
- 2005

This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic. The actor updates are based on stochastic policy gradients employing Amari's natural gradient…

Reinforcement learning in feedback control

- Computer Science · Machine Learning
- 2011

This article focuses on the presentation of four typical benchmark problems whilst highlighting important and challenging aspects of technical process control: nonlinear dynamics; varying set-points; long-term dynamic effects; influence of external variables; and the primacy of precision.

Learning Continuous Control Policies by Stochastic Value Gradients

- Computer Science, Mathematics · NIPS
- 2015

A unified framework is presented for learning continuous control policies using backpropagation, enabled by treating stochasticity in the Bellman equation as a deterministic function of exogenous noise.
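The "deterministic function of exogenous noise" device is the reparameterization trick: a stochastic action is rewritten as a deterministic transform of independent noise, so gradients can flow through the policy parameters via the chain rule. A minimal sketch under that assumption (illustrative names, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterized_action(mu, sigma):
    """Gaussian policy sample expressed as a deterministic function of
    exogenous noise: a = mu + sigma * eps, with eps ~ N(0, 1).
    Because eps is sampled independently of the parameters, d a / d mu = 1
    and d a / d sigma = eps, which is what lets value gradients
    backpropagate through the sampled action."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps
```

The same construction underlies later methods such as SAC and the reparameterized gradients in variational inference.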