• Corpus ID: 119115800

# Real-Time Reinforcement Learning of Constrained Markov Decision Processes with Weak Derivatives.

@article{Krishnamurthy2011RealTimeRL,
title={Real-Time Reinforcement Learning of Constrained Markov Decision Processes with Weak Derivatives.},
author={Vikram Krishnamurthy and Felisa V{\'a}zquez Abad},
journal={arXiv: Optimization and Control},
year={2011}
}
• Published 22 October 2011
• Computer Science, Mathematics
• arXiv: Optimization and Control
We present on-line policy gradient algorithms for computing the locally optimal policy of a constrained, average cost, finite state Markov Decision Process. The stochastic approximation algorithms require estimation of the gradient of the cost function with respect to the parameter that characterizes the randomized policy. We propose a spherical coordinate parametrization and present a novel simulation-based gradient estimation scheme involving weak derivatives (measure-valued differentiation…
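The spherical coordinate parametrization mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact construction: the idea is that squared spherical coordinates map unconstrained angles to a valid probability vector, so simplex constraints hold automatically during gradient updates. The function name and the sample angles are ours.

```python
import numpy as np

def spherical_policy(theta):
    """Map unconstrained angles theta (length A-1) to a probability
    vector over A actions via squared spherical coordinates:
    p_1 = cos^2(t_1), p_2 = sin^2(t_1) cos^2(t_2), ...,
    p_A = sin^2(t_1) ... sin^2(t_{A-1}).
    The squares guarantee p_i >= 0, and the trigonometric identities
    force sum_i p_i = 1, so no projection onto the simplex is needed.
    """
    theta = np.asarray(theta, dtype=float)
    probs = []
    sin_prod = 1.0  # running product of sin^2 terms
    for t in theta:
        probs.append(sin_prod * np.cos(t) ** 2)
        sin_prod *= np.sin(t) ** 2
    probs.append(sin_prod)  # last coordinate absorbs the remainder
    return np.array(probs)

# two angles parametrize a randomized policy over three actions
p = spherical_policy([0.7, 1.1])
```

Because the parametrization is unconstrained in the angles, any stochastic approximation update on theta keeps the induced policy valid.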
## 6 Citations

• Computer Science
2019 IEEE 58th Conference on Decision and Control (CDC)
• 2019
An alternative policy gradient theorem using weak (measure-valued) derivatives instead of score-function is established and the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem.
• Computer Science
ArXiv
• 2020
A generalized Langevin dynamics algorithm to estimate the reward function of inverse reinforcement learning is presented; specifically, the resulting Langevin algorithm asymptotically generates samples from the distribution proportional to $\exp(R(\theta))$.
• Computer Science, Mathematics
J. Mach. Learn. Res.
• 2021
This paper considers IRL when noisy estimates of the gradient of a reward function generated by multiple stochastic gradient agents are observed, and presents a generalized Langevin dynamics algorithm to estimate the reward function R(θ); specifically, the resulting Langevin algorithm asymptotically generates samples from the distribution proportional to exp(R(θ)).
• Computer Science, Mathematics
ArXiv
• 2022
It is shown that the Measure-Valued Derivative estimator can be a useful alternative to other policy gradient estimators, which is unbiased, has low variance, and can be used with differentiable and non-differentiable function approximators.
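The unbiasedness and low variance of measure-valued derivative (MVD) estimators noted above can be illustrated on a toy case. This sketch is ours, not the cited paper's code: for a Bernoulli family the weak derivative decomposes as 1·(δ₁ − δ₀), so the MVD estimate is exact, while the score-function (REINFORCE) estimator of the same gradient needs many samples.

```python
import numpy as np

def mvd_grad(f):
    """Weak (measure-valued) derivative of E_{X~Bern(theta)}[f(X)]
    w.r.t. theta. The derivative measure is 1*(delta_1 - delta_0),
    so f(1) - f(0) is unbiased with zero variance in this toy case."""
    return f(1) - f(0)

def score_grad(f, theta, n, rng):
    """Score-function (REINFORCE) estimator of the same gradient:
    average of f(X) * d/dtheta log p_theta(X) over n samples."""
    x = rng.binomial(1, theta, size=n).astype(float)
    score = x / theta - (1 - x) / (1 - theta)
    return np.mean(f(x) * score)

rng = np.random.default_rng(0)
f = lambda x: (x - 0.3) ** 2
theta = 0.4
# true gradient: d/dtheta [theta*f(1) + (1-theta)*f(0)] = f(1) - f(0)
```

In general families the two component measures of the weak derivative must themselves be sampled, but the estimator remains unbiased without requiring a differentiable integrand.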
• Computer Science, Mathematics
2021 International Joint Conference on Neural Networks (IJCNN)
• 2021
This work empirically evaluates this estimator in the actor-critic policy gradient setting and shows that it can reach comparable performance with methods based on the likelihood-ratio or reparametrization tricks, both in low and high-dimensional action spaces.
• Mathematics, Computer Science
IEEE Transactions on Signal Processing
• 2019
We consider an adversarial Bayesian signal processing problem involving “us” and an “adversary”. The adversary observes our state in noise, updates its posterior distribution of our state, and then …

## References

Showing 1–10 of 51 references

• Computer Science
• 2003
Stochastic approximation algorithms are presented for computing the locally optimal policy of a constrained, average cost, finite state Markov Decision Process; the algorithms can handle constraints and time-varying parameters.
• Computer Science
42nd IEEE International Conference on Decision and Control (IEEE Cat. No.03CH37475)
• 2003
This paper identifies the asymptotic bias of the stochastic approximation of the constrained optimization method for the constrained Markov Decision Process, and proposes several means to correct it.
• Computer Science
• 1999
This paper presents an algorithm for computing approximations to the gradient of the average reward from a single sample path of the underlying Markov chain and extends this algorithm to the case of partially observable Markov decision processes controlled by stochastic policies.
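The single-sample-path gradient approximation described above can be sketched in the spirit of GPOMDP-style likelihood-ratio estimators. This is an illustrative sketch under our own naming, not the cited paper's code: a discounted eligibility trace accumulates the score ∇θ log μ(aₜ|θ), and a running average of reward-weighted traces approximates the gradient of the average reward.

```python
import numpy as np

def single_path_gradient(grad_log_probs, rewards, beta=0.99):
    """Likelihood-ratio gradient estimate from one sample path
    (GPOMDP-style sketch; names and interface are assumptions).

    grad_log_probs: per-step arrays grad_theta log mu(a_t | theta)
    rewards: per-step scalar rewards along the same path
    beta: discount factor trading bias against variance
    """
    z = np.zeros_like(grad_log_probs[0], dtype=float)  # eligibility trace
    g = np.zeros_like(z)                               # running estimate
    for t, (score, r) in enumerate(zip(grad_log_probs, rewards), start=1):
        z = beta * z + score      # accumulate discounted scores
        g += (r * z - g) / t      # running average of r_t * z_t
    return g
```

With beta = 0 the trace reduces to the instantaneous score, recovering a plain average of reward-weighted scores; larger beta lowers bias for slowly mixing chains at the cost of variance.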
• Computer Science, Mathematics
J. Comput. Syst. Sci.
• 2002
This paper provides a convergence rate for the estimates produced by GPOMDP and gives an improved bound on the approximation error of these estimates; both bounds are in terms of mixing times of the POMDP.
• Mathematics, Computer Science
SIAM J. Optim.
• 2004
By a combined use of the SA method and two-time-scale Markov chains, asymptotic properties of the algorithm are obtained, which are distinct from the usual SA techniques.
• Computer Science
IEEE Transactions on Signal Processing
• 2007
This work gives sufficient conditions on the cost function, dynamics of the Markov chain and observation probabilities so that the optimal scheduling policy has a threshold structure with respect to a monotone likelihood ratio (MLR) ordering.
• Computer Science
2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353)
• 2000
An algorithm for computing approximations to the gradient of the average reward from a single sample path of a controlled partially observable Markov decision process is presented and it is proved that the algorithm converges with probability 1.
• Mathematics
• 2000
This paper addresses the problem of sensitivity analysis for finite horizon performance measures of general Markov chains. We derive closed-form expressions and associated unbiased gradient …
• Computer Science
IEEE Transactions on Automatic Control
• 2017
An adaptive random search algorithm is presented that uses a smooth best-response sampling strategy and tracks the set of global optima, distributing the search so that most of the effort is spent simulating the system performance at the global optimum.