• Corpus ID: 3075448

High-Dimensional Continuous Control Using Generalized Advantage Estimation

@article{Schulman2016HighDimensionalCC,
  title={High-Dimensional Continuous Control Using Generalized Advantage Estimation},
  author={John Schulman and Philipp Moritz and Sergey Levine and Michael I. Jordan and P. Abbeel},
  journal={CoRR},
  year={2016},
  volume={abs/1506.02438}
}
Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the… 
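As a concrete illustration of the estimator the abstract refers to, here is a minimal Python sketch of generalized advantage estimation (function name and default parameters are illustrative, not taken from the authors' code): advantages are exponentially weighted sums of one-step TD residuals, with lambda trading bias against variance.

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) advantages for a single trajectory.

    rewards: length-T array of rewards r_0 .. r_{T-1}
    values:  length-(T+1) array of value estimates V(s_0) .. V(s_T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # one-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of future residuals
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

Setting lam=0 recovers the one-step TD residual (low variance, more bias); lam=1 recovers the Monte Carlo return minus the value baseline (low bias, high variance).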
Continuous Deep Q-Learning with Model-based Acceleration
TLDR
This paper derives a continuous variant of the Q-learning algorithm, called normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods, and substantially improves performance on a set of simulated robotic control tasks.
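As a rough sketch of the quadratic advantage parameterization commonly associated with NAF (an illustrative Python reconstruction, not the paper's implementation): the Q-function splits into a state value plus an advantage that is a negative quadratic in the action, so the greedy action is simply mu(s).

import numpy as np

def naf_q_value(v, mu, L, action):
    """Q(s, a) = V(s) - 0.5 * (a - mu(s))^T P(s) (a - mu(s)), with P = L L^T.

    v:      scalar state value V(s)
    mu:     greedy action mu(s), shape (d,)
    L:      lower-triangular matrix with positive diagonal entries, shape (d, d)
    action: action a to evaluate, shape (d,)
    """
    P = L @ L.T                         # positive semi-definite by construction
    diff = action - mu
    advantage = -0.5 * diff @ P @ diff  # <= 0, maximized (at 0) when a = mu(s)
    return v + advantage

# Example: the greedy action mu attains the state value exactly.
L = np.array([[1.0, 0.0], [0.2, 0.5]])
print(naf_q_value(3.0, np.array([0.1, -0.2]), L, np.array([0.1, -0.2])))  # -> 3.0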
The Beta Policy for Continuous Control Reinforcement Learning
Recently, reinforcement learning with deep neural networks has achieved great success in challenging continuous control problems such as 3D locomotion and robotic manipulation. However, in real-world…
Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution
TLDR
It is shown that the Beta policy is bias-free and provides significantly faster convergence and higher scores than the Gaussian policy when both are used with trust region policy optimization and actor-critic with experience replay, the state-of-the-art on- and off-policy stochastic methods, respectively, on OpenAI Gym's and MuJoCo's continuous control environments.
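The intuition is that the Beta distribution has bounded support, so a Beta policy never has to clip actions at the control limits, which is where a Gaussian policy picks up bias. A minimal Python sketch (the function names and the affine rescaling are illustrative assumptions):

import numpy as np
from math import lgamma, log

def sample_beta_action(alpha, beta, low, high, rng=None):
    """Sample a bounded action from a Beta policy and map it onto [low, high]."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.beta(alpha, beta)          # support is [0, 1], so no clipping is needed
    return low + (high - low) * x

def beta_log_prob(action, alpha, beta, low, high):
    """Log-density of the rescaled Beta policy at `action` (change of variables)."""
    x = (action - low) / (high - low)
    log_norm = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return (alpha - 1) * log(x) + (beta - 1) * log(1 - x) - log_norm - log(high - low)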
An Empirical Analysis of Measure-Valued Derivatives for Policy Gradients
TLDR
This work empirically evaluates this estimator in the actor-critic policy gradient setting and shows that it can reach performance comparable to methods based on the likelihood-ratio or reparametrization tricks, in both low- and high-dimensional action spaces.
Proximal Policy Optimization with Continuous Bounded Action Space via the Beta Distribution
TLDR
This work investigates how this Beta policy performs when it is trained by the Proximal Policy Optimization (PPO) algorithm on two continuous control tasks from OpenAI Gym, and finds that the Beta policy is superior to the Gaussian policy in terms of the agent's final expected reward.
Marginal Policy Gradients for Complex Control
TLDR
The marginal policy gradient framework is introduced, a powerful technique to obtain variance-reduced policy gradients for arbitrary T, and it is shown that marginal policy gradients are guaranteed to reduce variance, quantifying that reduction exactly.
Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator
TLDR
This work gives the first finite-time analysis of the number of samples needed to estimate the value function for a fixed static state-feedback policy to within ε-relative error.
PODS: Policy Optimization via Differentiable Simulation
TLDR
This paper explores a systematic way of leveraging the additional information provided by an emerging class of differentiable simulators to directly compute the analytic gradient of a policy’s value function with respect to the actions it outputs and shows that this approach consistently leads to better asymptotic behavior across a set of payload manipulation tasks that demand a high degree of accuracy and precision.
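To illustrate what "the analytic gradient of a policy's value function with respect to the actions it outputs" means, here is a toy Python sketch with a hand-written 1-D point-mass simulator and a manual reverse-mode pass (the dynamics, cost, and constants are invented for illustration; the paper itself relies on genuine differentiable simulators):

import numpy as np

dt, c, T = 0.1, 0.01, 20               # toy step size, action-cost weight, horizon

def rollout(actions, x0=1.0, v0=0.0):
    """Simulate a 1-D point mass and return (total reward, positions x_1..x_T)."""
    x, v, value, xs = x0, v0, 0.0, []
    for a in actions:
        v = v + a * dt                 # v_{t+1} = v_t + a_t * dt
        x = x + v * dt                 # x_{t+1} = x_t + v_{t+1} * dt
        value += -(x ** 2 + c * a ** 2)
        xs.append(x)
    return value, xs

def value_gradient(actions):
    """Analytic d(value)/d(actions), accumulated backwards through the dynamics."""
    _, xs = rollout(actions)
    grad = np.zeros_like(actions)
    gx = gv = 0.0                      # dV/dx and dV/dv carried back from later steps
    for t in reversed(range(len(actions))):
        gx = -2.0 * xs[t] + gx                      # reward term plus downstream effect
        gv = gx * dt + gv                           # v_{t+1} affects x_{t+1} and v_{t+2}
        grad[t] = gv * dt - 2.0 * c * actions[t]    # a_t enters v_{t+1} and the action cost
    return grad

actions = np.zeros(T)
g = value_gradient(actions)
eps = 1e-5
perturbed = actions.copy()
perturbed[0] += eps
fd = (rollout(perturbed)[0] - rollout(actions)[0]) / eps   # finite-difference check
print(g[0], fd)                                            # both are approximately -4.2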
Reinforcement learning for control: Performance, stability, and deep approximators
TLDR
This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer, and explains how approximate representations of the solution make RL feasible for problems with continuous states and control actions.
MBVI: Model-Based Value Initialization for Reinforcement Learning
Model-free reinforcement learning (RL) is capable of learning control policies for high-dimensional, complex robotic tasks, but tends to be data inefficient. Model-based RL and optimal control have…

References

Showing 1-10 of 35 references
Real-time reinforcement learning by sequential Actor-Critics and experience replay
TLDR
It is formally shown that the resulting estimation bias is bounded and asymptotically vanishes, which allows the experience replay-augmented algorithm to preserve the convergence properties of the original algorithm.
Policy Gradient Methods for Reinforcement Learning with Function Approximation
TLDR
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Stochastic policy gradient reinforcement learning on a simple 3D biped
  • Russ Tedrake, T. Zhang, H. Seung
  • Engineering, Computer Science
    2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566)
  • 2004
We present a learning system which is able to quickly and reliably acquire a robust feedback control policy for 3D dynamic walking from a blank-slate using only trials implemented on our physical…
Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
TLDR
This paper considers variance reduction methods that were developed for Monte Carlo estimates of integrals, and gives bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system.
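A toy Python illustration of the baseline idea these variance-reduction techniques build on (the policy, reward, and constants are made up; this is not the paper's estimator): subtracting a baseline from the return leaves the likelihood-ratio gradient unbiased while reducing its variance.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                     # one-step Gaussian "policy" over a scalar action

def reward(a):
    return 100.0 - (a - 2.0) ** 2        # toy reward with a large constant offset

def gradient_samples(baseline, n=100_000):
    a = rng.normal(mu, sigma, size=n)
    score = (a - mu) / sigma ** 2        # d log N(a; mu, sigma^2) / d mu
    return score * (reward(a) - baseline)  # likelihood-ratio gradient samples

for b in (0.0, 95.0):                    # 95.0 is E[reward] under this policy
    g = gradient_samples(b)
    print(f"baseline={b}: mean={g.mean():.2f}  variance={g.var():.1f}")
# Both means are ~4 (the true gradient); the baseline only shrinks the variance.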
Continuous control with deep reinforcement learning
TLDR
This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
Reinforcement Learning in POMDP's via Direct Gradient Ascent
TLDR
GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy, is introduced, and its convergence is proved.
Approximate Gradient Methods in Policy-Space Optimization of Markov Reward Processes
TLDR
This work considers a discrete-time, finite-state Markov reward process that depends on a set of parameters and proposes two approaches to reducing the variance, deriving bounds for the resulting bias terms and characterizing the asymptotic behavior of the resulting algorithms.
Natural Actor-Critic
This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic. The actor updates are based on stochastic policy gradients employing Amari's natural gradient…
Reinforcement learning in feedback control
TLDR
This article focuses on the presentation of four typical benchmark problems whilst highlighting important and challenging aspects of technical process control: nonlinear dynamics; varying set-points; long-term dynamic effects; influence of external variables; and the primacy of precision.
Learning Continuous Control Policies by Stochastic Value Gradients
TLDR
A unified framework for learning continuous control policies using backpropagation is presented, supported by treating stochasticity in the Bellman equation as a deterministic function of exogenous noise.
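The phrase "a deterministic function of exogenous noise" refers to the reparameterization of stochastic policies and transitions; a minimal Python sketch for a Gaussian policy (illustrative only):

import numpy as np

def reparameterized_action(mu, sigma, rng=None):
    """a = mu + sigma * eps, with eps drawn independently of the policy parameters.

    Because eps is exogenous, d a / d mu = 1 and d a / d sigma = eps, so gradients
    can be backpropagated through the sampled action (the reparameterization trick).
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal()
    return mu + sigma * eps, eps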