# A Gradient-Aware Search Algorithm for Constrained Markov Decision Processes

```bibtex
@article{Khairy2020AGS,
  title   = {A Gradient-Aware Search Algorithm for Constrained Markov Decision Processes},
  author  = {Sami Khairy and Prasanna Balaprakash and Lin X. Cai},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2005.03718}
}
```

The canonical solution methodology for finite constrained Markov decision processes (CMDPs), where the objective is to maximize the expected infinite-horizon discounted reward subject to constraints on the expected infinite-horizon discounted costs, is based on convex linear programming. In this brief, we first prove that the optimization objective in the dual linear program of a finite CMDP is a piecewise linear convex (PWLC) function of the Lagrange penalty multipliers. Next, we…
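The PWLC structure has a short constructive explanation: the Lagrangian dual of a finite CMDP is a pointwise maximum of finitely many affine functions of the multiplier, one per deterministic stationary policy, and a maximum of affine functions is piecewise linear and convex. A minimal sketch, assuming a hypothetical two-state CMDP, cost budget, and discount factor that are illustrative only and not taken from the paper:

```python
# Sketch (toy CMDP, illustrative numbers): the Lagrangian dual
#   g(lam) = max_pi [ V_R^pi + lam * (d - V_C^pi) ]
# is a max of finitely many affine functions of lam, hence PWLC.
from itertools import product

GAMMA = 0.9   # discount factor (assumed)
D = 5.0       # cost budget d (assumed)

# Toy CMDP: 2 states, 2 actions, deterministic transitions.
NEXT = [[0, 1], [1, 0]]            # next_state[s][a]
REWARD = [[1.0, 0.0], [2.0, 3.0]]  # reward[s][a]
COST = [[0.0, 1.0], [2.0, 0.5]]    # cost[s][a]

def policy_values(pi, start=0, iters=2000):
    """Discounted reward and cost of deterministic policy pi from `start`,
    computed by iterating the policy-evaluation fixed point."""
    vr, vc = [0.0, 0.0], [0.0, 0.0]
    for _ in range(iters):
        vr = [REWARD[s][pi[s]] + GAMMA * vr[NEXT[s][pi[s]]] for s in (0, 1)]
        vc = [COST[s][pi[s]] + GAMMA * vc[NEXT[s][pi[s]]] for s in (0, 1)]
    return vr[start], vc[start]

# One affine piece per deterministic stationary policy (4 policies here).
PIECES = [policy_values(pi) for pi in product((0, 1), repeat=2)]

def dual(lam):
    """Lagrangian dual g(lam): pointwise max of affine pieces -> PWLC."""
    return max(vr + lam * (D - vc) for vr, vc in PIECES)
```

Evaluating `dual` on a grid of multipliers traces out the convex piecewise-linear curve whose structure the paper exploits.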

## References

Showing 1-10 of 36 references

### An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes

- Mathematics, Computer Science
- Syst. Control. Lett.
- 2010

### Provably Efficient Safe Exploration via Primal-Dual Policy Optimization

- Computer Science
- AISTATS
- 2021

An Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm is proposed, in which the value function is estimated by combining least-squares policy evaluation with an additional bonus term for safe exploration; it is the first provably efficient policy optimization algorithm for CMDPs with safe exploration.

### Self Learning Control of Constrained Markov Decision Processes - A Gradient Approach

- Computer Science
- 2003

Stochastic approximation algorithms are presented for computing a locally optimal policy of a constrained average-cost finite-state Markov decision process; the algorithms can handle constraints and time-varying parameters.

### Risk-Constrained Reinforcement Learning with Percentile Risk Criteria

- Computer Science
- J. Mach. Learn. Res.
- 2017

This paper derives a formula for the gradient of the Lagrangian function for percentile risk-constrained Markov decision processes and devises policy gradient and actor-critic algorithms that estimate this gradient, update the policy in the descent direction, and update the Lagrange multiplier in the ascent direction.
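The primal-descent/dual-ascent pattern described above can be seen on a deterministic toy problem rather than an MDP; the following sketch (a hypothetical one-dimensional constrained problem, not the cited paper's setting) descends on the primal variable and ascends on the projected multiplier of the Lagrangian:

```python
# Toy analogue of primal descent / dual ascent: minimize x^2 subject to
# x >= 1, via the Lagrangian L(x, lam) = x^2 + lam * (1 - x).
# The saddle point is x* = 1 with multiplier lam* = 2.
def primal_dual(steps=5000, a=0.05, b=0.05):
    x, lam = 0.0, 0.0
    for _ in range(steps):
        x -= a * (2 * x - lam)             # gradient descent in the primal variable
        lam = max(0.0, lam + b * (1 - x))  # gradient ascent in lam, projected to lam >= 0
    return x, lam
```

With these (illustrative) step sizes the iterates settle at the saddle point, mirroring how the multiplier is driven up while the constraint is violated and relaxed once it is satisfied.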

### An Online Actor–Critic Algorithm with Function Approximation for Constrained Markov Decision Processes

- Computer Science, Mathematics
- J. Optim. Theory Appl.
- 2012

An online actor–critic reinforcement learning algorithm with function approximation is presented for a problem of control under inequality constraints, and asymptotic almost-sure convergence of the algorithm to a locally optimal solution is proved.

### Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning

- Computer Science
- ArXiv
- 2018

This paper proposes a policy search method for CMDPs called Accelerated Primal-Dual Optimization (APDO), which incorporates an off-policy-trained dual variable in the dual update procedure while updating the policy in the primal space with an on-policy likelihood-ratio gradient.

### Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint

- Economics
- ArXiv
- 2018

This article focuses on the combination of risk criteria and reinforcement learning in a constrained optimization framework, i.e., a setting where the goal is to find a policy that optimizes the usual objective of infinite-horizon discounted/average cost while ensuring that an explicit risk constraint is satisfied.

### Variance-constrained actor-critic algorithms for discounted and average reward MDPs

- Computer Science
- Machine Learning
- 2016

This paper considers both discounted and average reward Markov decision processes and devises actor-critic algorithms that operate on three timescales: a TD critic on the fastest timescale, a policy gradient (actor) on the intermediate timescale, and dual ascent for the Lagrange multipliers on the slowest timescale.
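Such timescale separation is typically arranged through step-size schedules whose ratios vanish, so each slower iterate sees the faster ones as quasi-converged. A minimal sketch, with exponents that are hypothetical choices satisfying the standard multi-timescale conditions rather than the cited paper's actual schedules:

```python
# Illustrative three-timescale step-size schedules (hypothetical exponents):
# critic fastest, actor intermediate, Lagrange multiplier slowest, with
# actor_step(n)/critic_step(n) -> 0 and multiplier_step(n)/actor_step(n) -> 0.
def critic_step(n: int) -> float:       # fastest timescale
    return 1.0 / n ** 0.55

def actor_step(n: int) -> float:        # intermediate timescale
    return 1.0 / n ** 0.8

def multiplier_step(n: int) -> float:   # slowest timescale
    return 1.0 / n
```

All three schedules are square-summable but not summable, the usual stochastic-approximation requirement, while their pairwise ratios tend to zero.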

### Dynamic programming equations for discounted constrained stochastic control

- Mathematics
- IEEE Transactions on Automatic Control
- 2004

The application of the dynamic programming approach to constrained stochastic control problems with expected-value constraints is demonstrated, and optimality equations are obtained for these problems.

### Stationary Deterministic Policies for Constrained MDPs with Multiple Rewards, Costs, and Discount Factors

- Computer Science
- IJCAI
- 2005

This work shows that limiting the search to stationary deterministic policies, coupled with a novel problem reduction to mixed integer programming, yields a computationally feasible algorithm for finding such policies, where no such algorithm had previously been identified.