# Infinite-Horizon Policy-Gradient Estimation

@article{Baxter2001InfiniteHorizonPE, title={Infinite-Horizon Policy-Gradient Estimation}, author={Jonathan Baxter and Peter L. Bartlett}, journal={J. Artif. Intell. Res.}, year={2001}, volume={15}, pages={319-350} }

Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic…
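As a rough illustration of what a simulation-based gradient estimate of this kind looks like, here is a minimal sketch. The abstract above is truncated, so the update rule is taken from the commonly cited form of GPOMDP: an eligibility trace `z` discounted by a factor `beta` in [0, 1), and a running average `delta` of reward times trace. The tabular softmax policy, the two-state `toy_step` environment, and all function names are assumptions made for this example, not part of the paper.

```python
import math
import random


def softmax_probs(theta, obs):
    """Action probabilities of a tabular softmax policy (one logit per action)."""
    logits = theta[obs]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_policy(theta, obs, action):
    """Gradient of log mu(action | theta, obs), shaped like theta."""
    probs = softmax_probs(theta, obs)
    grad = [[0.0] * len(row) for row in theta]
    for a, p in enumerate(probs):
        grad[obs][a] = (1.0 if a == action else 0.0) - p
    return grad


def toy_step(state, action, rng):
    """Hypothetical toy dynamics: action a leads to state a with probability 0.9.
    The reward is 1 in state 1 and 0 in state 0."""
    next_state = action if rng.random() < 0.9 else 1 - action
    return next_state, float(next_state)


def gpomdp(theta, step, initial_state, beta, num_steps, rng):
    """Estimate the average-reward gradient from one simulated trajectory."""
    z = [[0.0] * len(row) for row in theta]      # eligibility trace
    delta = [[0.0] * len(row) for row in theta]  # running gradient estimate
    state = initial_state
    for t in range(num_steps):
        obs = state  # assumption: the toy problem is fully observed
        probs = softmax_probs(theta, obs)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        state, reward = step(state, action, rng)
        grad = grad_log_policy(theta, obs, action)
        for i, row in enumerate(grad):
            for j, g in enumerate(row):
                z[i][j] = beta * z[i][j] + g
                delta[i][j] += (reward * z[i][j] - delta[i][j]) / (t + 1)
    return delta


theta0 = [[0.0, 0.0], [0.0, 0.0]]  # uniform policy in both states
estimate = gpomdp(theta0, toy_step, 0, beta=0.9, num_steps=50000, rng=random.Random(0))
```

In this toy problem the entries of `estimate` for action 1 should come out positive in both states, since choosing action 1 more often raises the long-run fraction of time spent in the rewarding state. Larger values of `beta` generally reduce the bias of the estimate at the cost of higher variance.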

## 697 Citations

Experiments with Infinite-Horizon, Policy-Gradient Estimation

- Computer Science
- J. Artif. Intell. Res.
- 2001

Presents algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). The algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001) that computes biased estimates of the performance gradient in POMDPs.

Policy Gradient in Continuous Time

- Computer Science
- J. Mach. Learn. Res.
- 2005

This paper shows that, for deterministic state dynamics, the usual likelihood ratio methods used in discrete time fail to estimate the gradient because they are subject to variance explosion. It describes an alternative approach, based on approximating the pathwise derivative, that yields a policy gradient estimate converging almost surely to the true gradient as the time-step tends to 0.

Derivatives of Logarithmic Stationary Distributions for Policy Gradient Reinforcement Learning

- Computer Science
- Neural Computation
- 2010

A method for estimating the log stationary state distribution derivative (LSD) as a useful form of the derivative of the stationary state distributions through backward Markov chain formulation and a temporal difference learning framework is proposed.

A Study of Policy Gradient on a Class of Exactly Solvable Models

- Mathematics
- ArXiv
- 2020

This paper constructs a class of novel partially observable environments with controllable exploration difficulty, in which the value distribution, and hence the policy parameter evolution, can be derived analytically, for a special class of exactly solvable POMDPs.

A Markov chain Monte Carlo algorithm for Bayesian policy search

- Computer Science
- 2018

A Bayesian approach to policy search under the RL paradigm is taken for the problem of controlling a discrete-time Markov decision process with continuous state and action spaces and a multiplicative reward structure.

A Function Approximation Approach to Estimation of Policy Gradient for POMDP with Structured Policies

- Mathematics
- UAI
- 2005

It is shown that the critic can be implemented using temporal difference (TD) methods with linear function approximation, and that the analytical results on TD and Actor-Critic can be transferred to this case.

Policy-Gradient Algorithms for Partially Observable Markov Decision Processes

- Computer Science
- 2003

This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting including an application written for the Bunyip cluster that won the international Gordon-Bell prize for price/performance in 2001.

Internal-State Policy-Gradient Algorithms for Partially Observable Markov Decision Processes

- Computer Science
- 2002

A new policy-gradient algorithm is presented that uses an explicit model of the POMDP to estimate gradients, and its effectiveness on problems with tens of thousands of states is demonstrated.

Regularization in reinforcement learning

- Computer Science
- 2011

It is proved that the regularization-based Approximate Value/Policy Iteration algorithms introduced in this thesis enjoy an oracle-like property and may be used to achieve adaptivity: their performance is almost as good as the performance obtained with the unknown best parameters.

Smoothing Policies and Safe Policy Gradients

- Computer Science
- ArXiv
- 2019

This paper addresses a specific safety formulation, where danger is encoded in the reward signal and the learning agent is constrained to never worsen its performance, and establishes improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies.

## References


Experiments with Infinite-Horizon, Policy-Gradient Estimation

- Computer Science
- J. Artif. Intell. Res.
- 2001

Presents algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). The algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001) that computes biased estimates of the performance gradient in POMDPs.

Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems

- Computer Science
- NIPS
- 1994

This work proposes and analyzes a new learning algorithm for solving a certain class of non-Markov decision problems. The algorithm operates in the space of stochastic policies, a space which can yield a policy that performs considerably better than any deterministic policy.

Learning Without State-Estimation in Partially Observable Markovian Decision Processes

- Computer Science, Mathematics
- ICML
- 1994

Policy Gradient Methods for Reinforcement Learning with Function Approximation

- Computer Science
- NIPS
- 1999

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning

- Computer Science, Mathematics
- J. Comput. Syst. Sci.
- 2002

This paper provides a convergence rate for the estimates produced by GPOMDP and gives an improved bound on the approximation error of these estimates; both bounds are expressed in terms of mixing times of the POMDP.

The Optimal Control of Partially Observable Markov Processes over the Infinite Horizon: Discounted Costs

- Mathematics
- Oper. Res.
- 1978

The paper develops easily implemented approximations to stationary policies based on finitely transient policies and shows that the concave hull of an approximation can be included in the well-known Howard policy improvement algorithm with subsequent convergence.

Actor-Critic Algorithms

- Computer Science
- NIPS
- 1999

This thesis proposes and studies actor-critic algorithms which combine the above two approaches with simulation to find the best policy among a parameterized class of policies, and proves convergence of the algorithms for problems with general state and decision spaces.

Gradient Descent for General Reinforcement Learning

- Computer Science
- NIPS
- 1998

A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms, and allows policy-search and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search algorithm.

The Optimal Control of Partially Observable Markov Processes over a Finite Horizon

- Mathematics
- Oper. Res.
- 1973

If there are only a finite number of control intervals remaining, then the optimal payoff function is a piecewise-linear, convex function of the current state probabilities of the internal Markov process, and an algorithm for utilizing this property to calculate the optimal control policy and payoff function for any finite horizon is outlined.