• Corpus ID: 218571444

A Gradient-Aware Search Algorithm for Constrained Markov Decision Processes

Sami Khairy, Prasanna Balaprakash, Lin X. Cai
The canonical solution methodology for finite constrained Markov decision processes (CMDPs), where the objective is to maximize the expected infinite-horizon discounted reward subject to constraints on the expected infinite-horizon discounted costs, is based on convex linear programming. In this brief, we first prove that the optimization objective in the dual linear program of a finite CMDP is a piecewise linear convex (PWLC) function of the Lagrange penalty multipliers. Next, we…
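
The PWLC claim in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's algorithm: it assumes a hypothetical 2-state, 2-action CMDP with a single cost constraint (all transition probabilities, rewards, costs, the budget `d`, and the function names are made up for illustration). For a fixed multiplier the Lagrangian is an ordinary MDP, so its maximum is attained by a deterministic policy, and the dual objective is the maximum of finitely many functions that are affine in the multiplier.

```python
import itertools
import numpy as np

# Hypothetical toy CMDP; every number below is illustrative only.
gamma = 0.9                                  # discount factor
P = np.array([                               # P[a, s, s']: transition probabilities
    [[0.9, 0.1], [0.2, 0.8]],                # action 0
    [[0.5, 0.5], [0.6, 0.4]],                # action 1
])
r = np.array([[1.0, 0.0], [2.0, 0.5]])       # r[s, a]: rewards
c = np.array([[0.0, 0.2], [1.0, 0.1]])       # c[s, a]: costs
mu0 = np.array([0.5, 0.5])                   # initial state distribution
d = 3.0                                      # assumed budget on discounted cost


def discounted_value(policy, signal):
    """Expected discounted sum of `signal` under a deterministic policy."""
    n = len(policy)
    P_pi = np.array([P[policy[s], s] for s in range(n)])
    sig_pi = np.array([signal[s, policy[s]] for s in range(n)])
    # Solve (I - gamma * P_pi) v = sig_pi for the state values.
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, sig_pi)
    return float(mu0 @ v)


# Enumerate all deterministic stationary policies (one action per state).
policies = list(itertools.product(range(2), repeat=2))
V_r = np.array([discounted_value(p, r) for p in policies])
V_c = np.array([discounted_value(p, c) for p in policies])


def dual(lam):
    """Dual objective at multiplier lam >= 0: max over policies of
    V_r - lam * (V_c - d), i.e. a maximum of affine functions of lam."""
    return float(np.max(V_r - lam * (V_c - d)))
```

Evaluating `dual` on a grid of non-negative multipliers should trace out a convex curve made of straight segments, with kinks where the maximizing policy changes. Minimizing that curve over the multiplier is the dual problem that a search over multipliers would target.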

An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes

  • S. Bhatnagar
  • Mathematics, Computer Science
    Syst. Control. Lett.
  • 2010

Provably Efficient Safe Exploration via Primal-Dual Policy Optimization

Proposes an Optimistic Primal-Dual Proximal Policy Optimization (OPDOP) algorithm in which the value function is estimated by combining least-squares policy evaluation with an additional bonus term for safe exploration; it is presented as the first provably efficient policy optimization algorithm for CMDPs with safe exploration.

Self Learning Control of Constrained Markov Decision Processes - A Gradient Approach

Presents stochastic approximation algorithms that compute a locally optimal policy of a constrained, average-cost, finite-state Markov decision process and that can handle constraints and time-varying parameters.

Risk-Constrained Reinforcement Learning with Percentile Risk Criteria

This paper derives a formula for computing the gradient of the Lagrangian function for percentile risk-constrained Markov decision processes and devises policy gradient and actor-critic algorithms that estimate this gradient, update the policy in the descent direction, and update the Lagrange multiplier in the ascent direction.

An Online Actor–Critic Algorithm with Function Approximation for Constrained Markov Decision Processes

Presents an online actor–critic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints, and proves the asymptotic almost sure convergence of the algorithm to a locally optimal solution.

Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning

This paper proposes a policy search method for CMDPs called Accelerated Primal-Dual Optimization (APDO), which incorporates an off-policy trained dual variable in the dual update procedure while updating the policy in primal space with an on-policy likelihood-ratio gradient.

Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint

This article focuses on the combination of risk criteria and reinforcement learning in a constrained optimization framework, i.e., a setting where the goal is to find a policy that optimizes the usual objective of infinite-horizon discounted/average cost while ensuring that an explicit risk constraint is satisfied.

Variance-constrained actor-critic algorithms for discounted and average reward MDPs

This paper considers both discounted and average reward Markov decision processes and devises actor-critic algorithms that operate on three timescales: a TD critic on the fastest timescale, a policy gradient (actor) on the intermediate timescale, and a dual ascent for Lagrange multipliers on the slowest timescale.

Dynamic programming equations for discounted constrained stochastic control

The application of the dynamic programming approach to constrained stochastic control problems with expected value constraints is demonstrated and optimality equations are obtained for these problems.

Stationary Deterministic Policies for Constrained MDPs with Multiple Rewards, Costs, and Discount Factors

This work shows that limiting search to stationary deterministic policies, coupled with a novel problem reduction to mixed integer programming, yields an algorithm for finding such policies that is computationally feasible, where no such algorithm has heretofore been identified.