Corpus ID: 119115800

Real-Time Reinforcement Learning of Constrained Markov Decision Processes with Weak Derivatives.

@article{Krishnamurthy2011RealTimeRL,
  title={Real-Time Reinforcement Learning of Constrained Markov Decision Processes with Weak Derivatives.},
  author={Vikram Krishnamurthy and Felisa V{\'a}zquez Abad},
  journal={arXiv: Optimization and Control},
  year={2011}
}
We present on-line policy gradient algorithms for computing the locally optimal policy of a constrained, average cost, finite state Markov Decision Process. The stochastic approximation algorithms require estimation of the gradient of the cost function with respect to the parameter that characterizes the randomized policy. We propose a spherical coordinate parametrization and present a novel simulation based gradient estimation scheme involving weak derivatives (measure-valued differentiation… 
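The core idea of the weak-derivative (measure-valued differentiation) gradient estimate can be illustrated with a toy one-dimensional example. This is our own sketch, not code from the paper: for a Gaussian with mean parameter mu, the derivative of the density splits as d/dmu p_mu = c (p_plus − p_minus) with c = 1/(sigma·sqrt(2·pi)), where p_plus shifts mu up by a Rayleigh(sigma) variate and p_minus shifts it down by one.

```python
# Monte Carlo sketch of a weak-derivative (measure-valued) gradient
# estimate for a Gaussian "policy" N(mu, sigma^2).  Names here are
# illustrative, not from the paper.
import math
import random

def rayleigh(sigma, rng):
    # Inverse-CDF sampling: R = sigma * sqrt(-2 ln U).
    return sigma * math.sqrt(-2.0 * math.log(rng.random()))

def mvd_gradient(f, mu, sigma, n, seed=0):
    """Unbiased estimate of d/dmu E[f(X)], X ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for _ in range(n):
        r = rayleigh(sigma, rng)  # common random number for both terms
        total += c * (f(mu + r) - f(mu - r))
    return total / n

# Sanity check against a case with a known answer: f(x) = x^2 gives
# E[f(X)] = mu^2 + sigma^2, so the true gradient is 2*mu = 3.0 here.
grad_est = mvd_gradient(lambda x: x * x, mu=1.5, sigma=1.0, n=200_000)
print(grad_est)  # close to 3.0
```

Note that, unlike the score-function estimator, no likelihood ratio appears: the gradient is a difference of function evaluations under two shifted measures, which is why the estimate is unbiased even for non-differentiable f.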


Policy Gradient using Weak Derivatives for Reinforcement Learning

An alternative policy gradient theorem using weak (measure-valued) derivatives instead of score-function is established and the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem.

Langevin Dynamics for Inverse Reinforcement Learning of Stochastic Gradient Algorithms

A generalized Langevin dynamics algorithm to estimate the reward function of inverse reinforcement learning is presented; specifically, the resulting Langevin algorithm asymptotically generates samples from the distribution proportional to $\exp(R(\theta))$.

Langevin Dynamics for Adaptive Inverse Reinforcement Learning of Stochastic Gradient Algorithms

This paper considers IRL when noisy estimates of the gradient of a reward function generated by multiple stochastic gradient agents are observed, and presents a generalized Langevin dynamics algorithm to estimate the reward function R(θ); specifically, the resulting Langevin algorithm asymptotically generates samples from the distribution proportional to exp(R(θ)).
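The sampling mechanism described above can be sketched with unadjusted Langevin dynamics on a toy problem. This is our own illustration, not the papers' algorithm: the iterate θ is driven by the gradient of a "reward" plus injected Gaussian noise, and its stationary distribution is approximately proportional to exp(R(θ)). With R(θ) = −θ²/2 the target is the standard normal, so the output is easy to check.

```python
# Unadjusted Langevin dynamics:
#   theta_{k+1} = theta_k + (eps/2) * grad_R(theta_k) + sqrt(eps) * xi_k,
# with xi_k ~ N(0, 1).  Toy "reward" R(theta) = -theta^2 / 2, whose
# target density exp(R(theta)) is the standard normal.
import math
import random

def langevin_samples(grad_r, theta0, eps, n_steps, burn_in, seed=0):
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for k in range(n_steps):
        theta += 0.5 * eps * grad_r(theta) + math.sqrt(eps) * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            samples.append(theta)
    return samples

samples = langevin_samples(grad_r=lambda t: -t, theta0=3.0,
                           eps=0.05, n_steps=200_000, burn_in=2_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # mean near 0, variance near 1 (small O(eps) bias)
```

The unadjusted chain has an O(eps) discretization bias in its stationary distribution; shrinking the step size (or adding a Metropolis correction) trades bias for mixing speed.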

An Analysis of Measure-Valued Derivatives for Policy Gradients

It is shown that the Measure-Valued Derivative estimator can be a useful alternative to other policy gradient estimators: it is unbiased, has low variance, and can be used with differentiable and non-differentiable function approximators.

An Empirical Analysis of Measure-Valued Derivatives for Policy Gradients

This work empirically evaluates this estimator in the actor-critic policy gradient setting and shows that it can reach comparable performance with methods based on the likelihood-ratio or reparametrization tricks, both in low and high-dimensional action spaces.
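As a toy illustration of the comparison described above (our own example, not the paper's actor-critic experiments), one can compare per-sample spread of the likelihood-ratio (score-function) estimator and a measure-valued derivative estimator for the same one-dimensional Gaussian gradient:

```python
# Two unbiased estimators of d/dmu E[f(X)], X ~ N(mu, sigma^2),
# with f(x) = x^2 (true gradient: 2*mu = 3.0 here).
import math
import random

MU, SIGMA, N = 1.5, 1.0, 100_000
rng = random.Random(1)

# Likelihood-ratio (score-function) samples: f(x) * (x - mu) / sigma^2.
lr = []
for _ in range(N):
    x = rng.gauss(MU, SIGMA)
    lr.append((x * x) * (x - MU) / SIGMA ** 2)

# Measure-valued derivative samples: c * (f(mu + R) - f(mu - R)),
# with R ~ Rayleigh(sigma) and c = 1 / (sigma * sqrt(2 * pi)).
c = 1.0 / (SIGMA * math.sqrt(2.0 * math.pi))
mvd = []
for _ in range(N):
    r = SIGMA * math.sqrt(-2.0 * math.log(rng.random()))
    mvd.append(c * ((MU + r) ** 2 - (MU - r) ** 2))

def mean(xs): return sum(xs) / len(xs)
def std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

lr_mean, mvd_mean = mean(lr), mean(mvd)
lr_std, mvd_std = std(lr), std(mvd)
print(lr_mean, mvd_mean)  # both near the true gradient 3.0
print(lr_std, mvd_std)    # MVD shows the smaller per-sample spread here
```

The variance ordering is problem-dependent; in this particular toy setting the coupled MVD estimator happens to be markedly less noisy than the score-function one.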

How to Calibrate Your Adversary's Capabilities? Inverse Filtering for Counter-Autonomous Systems

We consider an adversarial Bayesian signal processing problem involving “us” and an “adversary”. The adversary observes our state in noise; updates its posterior distribution of our state and then…

References

SHOWING 1-10 OF 51 REFERENCES

Self Learning Control of Constrained Markov Decision Processes - A Gradient Approach

Stochastic approximation algorithms are presented for computing the locally optimal policy of a constrained, average cost, finite state Markov Decision Process; the algorithms can handle constraints and time-varying parameters.

Implementation of gradient estimation to a constrained Markov decision problem

This paper identifies the asymptotic bias of the stochastic approximation of the constrained optimization method for the constrained Markov Decision Process, and proposes several means to correct it.

Direct Gradient-Based Reinforcement Learning: I. Gradient Estimation Algorithms

This paper presents an algorithm for computing approximations to the gradient of the average reward from a single sample path of the underlying Markov chain and extends this algorithm to the case of partially observable Markov decision processes controlled by stochastic policies.

Strong points of weak convergence: a study using RPA gradient estimation for automatic learning

Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning

This paper provides a convergence rate for the estimates produced by GPOMDP and gives an improved bound on the approximation error of these estimates; both bounds are in terms of mixing times of the POMDP.

Regime Switching Stochastic Approximation Algorithms with Application to Adaptive Discrete Stochastic Optimization

By a combined use of the SA method and two-time-scale Markov chains, asymptotic properties of the algorithm are obtained, which are distinct from the usual SA techniques.

Structured Threshold Policies for Dynamic Sensor Scheduling—A Partially Observed Markov Decision Process Approach

This work gives sufficient conditions on the cost function, dynamics of the Markov chain and observation probabilities so that the optimal scheduling policy has a threshold structure with respect to a monotone likelihood ratio (MLR) ordering.

Direct gradient-based reinforcement learning

  • Jonathan Baxter, P. Bartlett
  • Computer Science
    2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353)
  • 2000
An algorithm for computing approximations to the gradient of the average reward from a single sample path of a controlled partially observable Markov decision process is presented and it is proved that the algorithm converges with probability 1.

Measure valued differentiation for stochastic processes : the finite horizon case

This paper addresses the problem of sensitivity analysis for finite horizon performance measures of general Markov chains. We derive closed form expressions and associated unbiased gradient…

Adaptive Search Algorithms for Discrete Stochastic Optimization: A Smooth Best-Response Approach

An adaptive random search algorithm is presented that uses a smooth best-response sampling strategy and tracks the set of global optima, while distributing the search so that most of the effort is spent simulating the system performance at the global optimum.
...