# Real-Time Reinforcement Learning of Constrained Markov Decision Processes with Weak Derivatives.

@article{Krishnamurthy2011RealTimeRL, title={Real-Time Reinforcement Learning of Constrained Markov Decision Processes with Weak Derivatives.}, author={Vikram Krishnamurthy and Felisa V{\'a}zquez Abad}, journal={arXiv: Optimization and Control}, year={2011} }

We present on-line policy gradient algorithms for computing the locally optimal policy of a constrained, average cost, finite state Markov Decision Process. The stochastic approximation algorithms require estimation of the gradient of the cost function with respect to the parameter that characterizes the randomized policy. We propose a spherical coordinate parametrization and present a novel simulation based gradient estimation scheme involving weak derivatives (measure-valued differentiation…
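The spherical coordinate parametrization mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common construction in which action probabilities are the squared Cartesian coordinates of a point on the unit sphere, so they are non-negative and sum to one for any choice of angles. The function name `spherical_to_probs` and the exact ordering of the angles are illustrative assumptions, not taken from the paper.

```python
import math

def spherical_to_probs(angles):
    """Map n-1 unconstrained angles to n probabilities.

    Each probability is the square of one coordinate of a unit vector
    written in spherical coordinates (an assumed, illustrative form).
    """
    probs = []
    sin_prod = 1.0  # running product sin^2(a_1) * ... * sin^2(a_{k-1})
    for a in angles:
        probs.append(sin_prod * math.cos(a) ** 2)
        sin_prod *= math.sin(a) ** 2
    probs.append(sin_prod)  # last coordinate uses only sine factors
    return probs

probs = spherical_to_probs([0.3, 1.1, 2.5])
print(probs, sum(probs))
```

Because the simplex constraints hold automatically, a gradient step on the angles never needs a projection back onto the set of valid probability vectors.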

## 6 Citations

### Policy Gradient using Weak Derivatives for Reinforcement Learning

- Computer Science
- 2019 IEEE 58th Conference on Decision and Control (CDC)
- 2019

An alternative policy gradient theorem using weak (measure-valued) derivatives instead of score-function is established and the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem.
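As a one-dimensional illustration of a weak (measure-valued) derivative estimate, the sketch below differentiates $E[f(X)]$ with respect to the mean of $X \sim N(\mu, \sigma^2)$. It relies on the standard decomposition of the derivative of the Gaussian density into a constant $c = 1/(\sigma\sqrt{2\pi})$ times the difference of two probability measures, sampled as $\mu \pm \sigma W$ with $W$ Rayleigh-distributed. This is not the Markov chain estimator of the cited papers; the function name `mvd_grad_gaussian_mean` and its parameters are hypothetical.

```python
import math
import random

def mvd_grad_gaussian_mean(f, mu, sigma, n_samples=20000, seed=0):
    """Estimate d/dmu E[f(X)], X ~ N(mu, sigma^2), via a weak derivative.

    The derivative of the Gaussian density in mu equals
    c * (positive measure - negative measure), where both measures
    are sampled as mu +/- sigma * W with W ~ Rayleigh(1).
    """
    rng = random.Random(seed)
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for _ in range(n_samples):
        # Inverse-transform sample of Rayleigh(1): density w * exp(-w^2 / 2)
        w = math.sqrt(-2.0 * math.log(1.0 - rng.random()))
        # The same w drives both branches (coupling), which reduces variance
        total += f(mu + sigma * w) - f(mu - sigma * w)
    return c * total / n_samples

est = mvd_grad_gaussian_mean(lambda x: x * x, mu=1.5, sigma=0.7)
print(est)  # should be close to 2 * mu = 3.0, since E[X^2] = mu^2 + sigma^2
```

Unlike a score-function estimate, this form needs no derivative of a log-density; it only requires evaluating `f` at two perturbed sample points.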

### Langevin Dynamics for Inverse Reinforcement Learning of Stochastic Gradient Algorithms

- Computer Science
- ArXiv
- 2020

A generalized Langevin dynamics algorithm to estimate the reward function in inverse reinforcement learning is presented; specifically, the resulting Langevin algorithm asymptotically generates samples from the distribution proportional to $\exp(R(\theta))$.

### Langevin Dynamics for Adaptive Inverse Reinforcement Learning of Stochastic Gradient Algorithms

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2021

This paper considers IRL when noisy estimates of the gradient of a reward function generated by multiple stochastic gradient agents are observed, and presents a generalized Langevin dynamics algorithm to estimate the reward function $R(\theta)$; specifically, the resulting Langevin algorithm asymptotically generates samples from the distribution proportional to $\exp(R(\theta))$.

### An Analysis of Measure-Valued Derivatives for Policy Gradients

- Computer Science, Mathematics
- ArXiv
- 2022

The Measure-Valued Derivative estimator is shown to be a useful alternative to other policy gradient estimators: it is unbiased, has low variance, and can be used with differentiable and non-differentiable function approximators.

### An Empirical Analysis of Measure-Valued Derivatives for Policy Gradients

- Computer Science, Mathematics
- 2021 International Joint Conference on Neural Networks (IJCNN)
- 2021

This work empirically evaluates the measure-valued derivative estimator in the actor-critic policy gradient setting and shows that it can reach performance comparable to methods based on the likelihood-ratio or reparametrization tricks, in both low- and high-dimensional action spaces.

### How to Calibrate Your Adversary's Capabilities? Inverse Filtering for Counter-Autonomous Systems

- Mathematics, Computer Science
- IEEE Transactions on Signal Processing
- 2019

We consider an adversarial Bayesian signal processing problem involving “us” and an “adversary”. The adversary observes our state in noise; updates its posterior distribution of our state and then…

## References

SHOWING 1-10 OF 51 REFERENCES

### Self Learning Control of Constrained Markov Decision Processes - A Gradient Approach

- Computer Science
- 2003

Stochastic approximation algorithms are presented for computing the locally optimal policy of a constrained, average cost, finite state Markov Decision Process; the algorithms can handle constraints and time-varying parameters.

### Implementation of gradient estimation to a constrained Markov decision problem

- Computer Science
- 42nd IEEE International Conference on Decision and Control (IEEE Cat. No.03CH37475)
- 2003

This paper identifies the asymptotic bias of the stochastic approximation of the constrained optimization method for the constrained Markov Decision Process, and proposes several means to correct it.

### Direct Gradient-Based Reinforcement Learning: I. Gradient Estimation Algorithms

- Computer Science
- 1999

This paper presents an algorithm for computing approximations to the gradient of the average reward from a single sample path of the underlying Markov chain and extends this algorithm to the case of partially observable Markov decision processes controlled by stochastic policies.

### Strong points of weak convergence: a study using RPA gradient estimation for automatic learning

- Mathematics
- Autom.
- 1999

### Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning

- Computer Science, Mathematics
- J. Comput. Syst. Sci.
- 2002

This paper provides a convergence rate for the estimates produced by GPOMDP and gives an improved bound on the approximation error of these estimates; both bounds are in terms of mixing times of the POMDP.

### Regime Switching Stochastic Approximation Algorithms with Application to Adaptive Discrete Stochastic Optimization

- Mathematics, Computer Science
- SIAM J. Optim.
- 2004

By a combined use of the stochastic approximation (SA) method and two-time-scale Markov chains, asymptotic properties of the algorithm are obtained, which are distinct from the usual SA techniques.

### Structured Threshold Policies for Dynamic Sensor Scheduling—A Partially Observed Markov Decision Process Approach

- Computer Science
- IEEE Transactions on Signal Processing
- 2007

This work gives sufficient conditions on the cost function, dynamics of the Markov chain and observation probabilities so that the optimal scheduling policy has a threshold structure with respect to a monotone likelihood ratio (MLR) ordering.

### Direct gradient-based reinforcement learning

- Computer Science
- 2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353)
- 2000

An algorithm for computing approximations to the gradient of the average reward from a single sample path of a controlled partially observable Markov decision process is presented and it is proved that the algorithm converges with probability 1.

### Measure valued differentiation for stochastic processes: the finite horizon case

- Mathematics
- 2000

This paper addresses the problem of sensitivity analysis for finite horizon performance measures of general Markov chains. We derive closed form expressions and associated unbiased gradient…

### Adaptive Search Algorithms for Discrete Stochastic Optimization: A Smooth Best-Response Approach

- Computer Science
- IEEE Transactions on Automatic Control
- 2017

An adaptive random search algorithm is presented that uses a smooth best-response sampling strategy and tracks the set of global optima, yet distributes the search so that most of the effort is spent on simulating the system performance at the global optimum.