Corpus ID: 1211821

Policy Gradient Methods for Reinforcement Learning with Function Approximation

@inproceedings{Sutton1999PolicyGM,
  title={Policy Gradient Methods for Reinforcement Learning with Function Approximation},
  author={Richard S. Sutton and David A. McAllester and Satinder Singh and Yishay Mansour},
  booktitle={NIPS},
  year={1999}
}
Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. [...] Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the …
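For context, the gradient form proved in the paper is its policy gradient theorem. Writing π(s, a) for the parametrized policy, d^π for the on-policy state distribution, Q^π for the action-value function, and ρ for the performance measure, the theorem states:

\frac{\partial \rho}{\partial \theta} \;=\; \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta}\, Q^{\pi}(s,a)

The key point is that the expression involves no gradient of the state distribution, so it can be estimated from on-policy experience.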
Policy Gradient Methods
A policy gradient method is a reinforcement learning approach that directly optimizes a parametrized control policy by gradient descent. It belongs to the class of policy search techniques that …
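To make that definition concrete, here is a minimal REINFORCE-style sketch in Python. It is an illustration only: the environment object env (with reset() returning a state and step(a) returning (next_state, reward, done)), the tabular softmax parametrization, and the step sizes are assumptions, not taken from the cited article.

import numpy as np

def softmax_policy(theta, s):
    # Action probabilities for a tabular softmax policy: theta[s] holds
    # one preference per action.
    prefs = theta[s]
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def reinforce_episode(theta, env, alpha=0.01, gamma=0.99):
    # One REINFORCE update: run an episode under the current policy, then
    # ascend the log-likelihood of each action taken, weighted by the
    # return that followed it.
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)
        states.append(s)
        actions.append(a)
        s, r, done = env.step(a)          # hypothetical env API
        rewards.append(r)
    G = 0.0
    for t in reversed(range(len(states))):
        G = rewards[t] + gamma * G        # return following step t
        p = softmax_policy(theta, states[t])
        grad_logp = -p                    # d log pi / d preferences
        grad_logp[actions[t]] += 1.0      # ... for the softmax case
        theta[states[t]] += alpha * G * grad_logp   # gradient ascent step
    return theta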
Approximating a Policy Can be Easier Than Approximating a Value Function
Q-learning and a policy-only algorithm are compared, both using a simple neural network as the function approximator and both showing oscillation between the optimal policy and a sub-optimal one.
The Local Optimality of Reinforcement Learning by Value Gradients, and its Relationship to Policy Gradient Learning
It is shown that learning the gradient of the value function at every point along a trajectory generated by a greedy policy is a sufficient condition for the trajectory to be locally extremal, and often locally optimal, and it is argued that this brings greater efficiency to value-function learning.
Policy Gradient using Weak Derivatives for Reinforcement Learning
This paper considers reinforcement learning for an infinite-horizon discounted-cost continuous-state Markov decision process, a form of implicit stochastic adaptive control in which the optimal control policy is estimated without directly estimating the underlying model parameters. It establishes an alternative policy gradient theorem using weak (measure-valued) derivatives instead of the score function, and shows that the resulting stochastic gradient estimates are unbiased and yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem.
Natural-Gradient Actor-Critic Algorithms
We prove the convergence of four new reinforcement learning algorithms based on the actor-critic architecture, on function approximation, and on natural gradients. Reinforcement learning is a class of …
Algorithmic Survey of Parametric Value Function Approximation
  • M. Geist, O. Pietquin
  • Mathematics, Computer Science
  • IEEE Transactions on Neural Networks and Learning Systems
  • 2013
This survey reviews state-of-the-art methods for (parametric) value function approximation by grouping them into three main categories: bootstrapping, residual, and projected fixed-point approaches.
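As one concrete instance of the survey's projected fixed-point category, a batch least-squares TD (LSTD) solver for a linear value function can be sketched as follows; the feature map phi and the (s, r, s2) transition format are illustrative assumptions, not from the survey.

import numpy as np

def lstd(transitions, phi, gamma=0.99, reg=1e-6):
    # Least-squares TD: solve A w = b, where A and b are empirical sums
    # over observed transitions (s, r, s2); w then satisfies the projected
    # Bellman fixed-point equation for V(s) = w . phi(s).
    k = len(phi(transitions[0][0]))
    A = reg * np.eye(k)               # small ridge term keeps A invertible
    b = np.zeros(k)
    for s, r, s2 in transitions:
        f, f2 = phi(s), phi(s2)
        A += np.outer(f, f - gamma * f2)
        b += r * f
    return np.linalg.solve(A, b)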
Natural actor-critic algorithms
Four new reinforcement learning algorithms based on actor-critic, natural-gradient, and function-approximation ideas are presented, together with the first convergence proofs and the first fully incremental algorithms of this kind.
Sample-Efficient Evolutionary Function Approximation for Reinforcement Learning
This work presents an enhancement to evolutionary function approximation that makes it much more sample-efficient by exploiting the off-policy nature of certain TD methods, and demonstrates that the enhanced method can learn better policies than evolution or TD methods alone and can do so in many fewer episodes than standard evolutionary function approximation.

References

Showing 1-10 of 32 references
Direct Gradient-Based Reinforcement Learning: I. Gradient Estimation Algorithms
Despite their many empirical successes, approximate value-function-based approaches to reinforcement learning suffer from a paucity of theoretical guarantees on the performance of the policy …
Stable Function Approximation in Dynamic Programming
A proof of convergence is provided for a wide class of temporal difference methods involving function approximators such as k-nearest-neighbor, and it is shown experimentally that these methods can be useful.
Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems
This work proposes and analyzes a new learning algorithm for solving a certain class of non-Markov decision problems; the algorithm operates in the space of stochastic policies, a space which can yield a policy that performs considerably better than any deterministic policy.
Direct gradient-based reinforcement learning
  • J. Baxter, P. Bartlett
  • Mathematics, Computer Science
  • 2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353)
  • 2000
An algorithm for computing approximations to the gradient of the average reward from a single sample path of a controlled partially observable Markov decision process is presented, and it is proved that the algorithm converges with probability 1.
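A minimal sketch of this style of single-sample-path gradient estimation, in the spirit of Baxter and Bartlett's algorithm; the env object and the policy_grad_logp helper (which samples an action and returns the score vector grad log pi(a|s; theta)) are hypothetical stand-ins.

import numpy as np

def estimate_gradient(theta, env, policy_grad_logp, beta=0.95, T=100000):
    # Follow one trajectory; z is a discounted eligibility trace of score
    # vectors, and g is the running average of r * z, which approximates
    # the gradient of the average reward as T grows. No terminal states
    # are assumed in this average-reward setting.
    z = np.zeros_like(theta)
    g = np.zeros_like(theta)
    s = env.reset()
    for t in range(T):
        a, score = policy_grad_logp(theta, s)   # hypothetical helper
        s, r = env.step(a)                      # hypothetical env API
        z = beta * z + score
        g += (r * z - g) / (t + 1)
    return g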
An Analysis of Actor/Critic Algorithms Using Eligibility Traces: Reinforcement Learning with Imperfect Value Function
The results show that the algorithm is an extension of Williams' REINFORCE algorithms to infinite-horizon reinforcement tasks, in which the critic provides an appropriate reinforcement baseline for the actor.
Gradient Descent for General Reinforcement Learning
A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms, and which allows policy-search and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search algorithm.
Residual Algorithms: Reinforcement Learning with Function Approximation
  • L. Baird
  • Mathematics, Computer Science
  • ICML
  • 1995
Both direct and residual gradient algorithms are shown to be special cases of residual algorithms, and it is shown that residual algorithms can combine the advantages of each approach.
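To make the contrast concrete, here is a minimal sketch of the direct (semi-gradient) TD update next to Baird's residual-gradient update for a linear value function; the feature map phi is an illustrative assumption, not Baird's original setup.

import numpy as np

def direct_and_residual(w, phi, s, r, s2, alpha=0.1, gamma=0.99):
    # delta is the one-step Bellman error for V(s) = w . phi(s).
    delta = r + gamma * (w @ phi(s2)) - (w @ phi(s))
    # Direct method: treat the bootstrap target as a constant.
    w_direct = w + alpha * delta * phi(s)
    # Residual gradient: descend the full gradient of delta^2 / 2.
    # (For stochastic transitions, Baird's method needs two independent
    # samples of s2 to stay unbiased.)
    w_residual = w + alpha * delta * (phi(s) - gamma * phi(s2))
    return w_direct, w_residual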
Learning Without State-Estimation in Partially Observable Markovian Decision Processes
A new framework for learning without state-estimation in POMDPs is developed by including stochastic policies in the search space, and by defining the value or utility of a distribution over states.
Neuronlike adaptive elements that can solve difficult learning control problems
It is shown how a system consisting of two neuronlike adaptive elements can solve a difficult learning control problem; the relation of this work to classical and instrumental conditioning in animal learning studies and its possible implications for research in the neurosciences are also discussed.
Introduction to Reinforcement Learning
In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning.