Towards Painless Policy Optimization for Constrained MDPs

Arushi Jain, Sharan Vaswani, Reza Babanezhad, Csaba Szepesvari, Doina Precup
Conference on Uncertainty in Artificial Intelligence
We study policy optimization in an infinite-horizon, γ-discounted constrained Markov decision process (CMDP). Our objective is to return a policy that achieves large expected reward with a small constraint violation. We consider the online setting with linear function approximation and assume global access to the corresponding features. We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms in terms of…
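As a rough illustration of the primal-dual idea mentioned in the abstract, the sketch below shows a Lagrangian for a single constraint and a projected-gradient update of the dual variable. It is not the paper's algorithm: the constraint convention (constraint value at least `threshold`), the function names, the step size, and the clipping bound `upper` are all assumptions made for this example.

```python
def lagrangian(reward_value, constraint_value, threshold, lam):
    """Lagrangian of: maximize reward_value s.t. constraint_value >= threshold.

    The sign convention is an assumption made for this sketch.
    """
    return reward_value + lam * (constraint_value - threshold)


def dual_update(lam, constraint_value, threshold, step, upper):
    """One projected gradient-descent step on the dual variable.

    The multiplier grows when the constraint is violated
    (constraint_value < threshold) and shrinks otherwise; it is then
    clipped to the interval [0, upper].
    """
    lam = lam - step * (constraint_value - threshold)
    return min(max(lam, 0.0), upper)


# Hypothetical numbers: the current policy violates the constraint,
# so the multiplier increases.
lam = dual_update(lam=1.0, constraint_value=0.2, threshold=0.5,
                  step=0.1, upper=10.0)
```

A primal step would then improve the policy against the reward plus `lam` times the constraint signal; that part depends on the policy class and is left out here.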

Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning

An Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework is proposed, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems.

Provable Reset-free Reinforcement Learning by No-Regret Reduction

This work proposes a generic no-regret reduction to systematically design reset-free RL algorithms, and designs an instantiation for linear Markov decision processes, which is the first provably correct reset-free RL algorithm to the authors' knowledge.



Provably Efficient Safe Exploration via Primal-Dual Policy Optimization

An Optimistic Primal-Dual Proximal Policy Optimization (OPDOP) algorithm is proposed, in which the value function is estimated by combining least-squares policy evaluation with an additional bonus term for safe exploration; it is the first provably efficient policy optimization algorithm for CMDPs with safe exploration.

Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

This work is the first to establish non-asymptotic convergence guarantees of policy-based primal-dual methods for solving infinite-horizon discounted CMDPs, and it is shown that two sample-based NPG-PD algorithms inherit such non-asymptotic convergence properties and provide finite-sample complexity guarantees.

CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee

CRPO is shown to achieve an O(1/√T) convergence rate to the global optimal policy in the constrained policy set, together with an error bound on constraint satisfaction, and the empirical results demonstrate that it can significantly outperform the existing primal-dual baseline algorithms.

IPO: Interior-point Policy Optimization under Constraints

A novel first-order policy optimization method, Interior-point Policy Optimization (IPO), is proposed. Inspired by the interior-point method, it augments the objective with logarithmic barrier functions and can handle general types of cumulative multi-constraint settings.
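To make the barrier idea concrete, here is a minimal sketch of an objective augmented with logarithmic barrier terms, one per cumulative constraint. The function names, the `cost <= limit` constraint form, and the barrier sharpness parameter `t` are assumptions for illustration and may differ from the exact formulation used in IPO.

```python
import math


def log_barrier(slack, t):
    """Logarithmic barrier log(slack) / t for a constraint with the given slack.

    slack = limit - cost must be positive for a feasible point; an
    infeasible point gets -inf so that it is never preferred.
    """
    if slack <= 0.0:
        return -math.inf
    return math.log(slack) / t


def barrier_objective(reward_objective, costs, limits, t):
    """Reward objective plus one barrier term per constraint cost <= limit."""
    penalty = sum(log_barrier(limit - cost, t)
                  for cost, limit in zip(costs, limits))
    return reward_objective + penalty


# Hypothetical values: two constraints, both satisfied with some slack.
value = barrier_objective(reward_objective=1.0,
                          costs=[0.2, 0.5], limits=[1.0, 1.0], t=20.0)
```

A larger `t` makes the barrier flatter inside the feasible region and steeper near its boundary, so the augmented objective stays closer to the original one while still repelling infeasible policies.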

Constrained Policy Optimization

Constrained Policy Optimization (CPO) is proposed, the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration. It makes it possible to train neural network policies for high-dimensional control while making guarantees about policy behavior throughout training.

Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning

This paper proposes a policy search method for CMDPs called Accelerated Primal-Dual Optimization (APDO), which incorporates an off-policy trained dual variable in the dual update procedure while updating the policy in the primal space with an on-policy likelihood-ratio gradient.

POLITEX: Regret Bounds for Policy Iteration using Expert Prediction

POLicy ITeration with EXpert advice is presented, a variant of policy iteration where each policy is a Boltzmann distribution over the sum of action-value function estimates of the previous policies, and the viability of POLITEX beyond linear function approximation is confirmed.
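The Boltzmann policy described above can be sketched for a single state as follows. This is only an illustration: the function name, the list-of-lists input format, and the temperature parameter `eta` are assumptions, and estimating the action-value functions themselves is not shown.

```python
import math


def boltzmann_policy(q_estimates, eta):
    """Action probabilities proportional to exp(eta * sum of past Q-values).

    q_estimates holds one list of per-action value estimates for each
    previous policy, all evaluated at the same state.
    """
    n_actions = len(q_estimates[0])
    summed = [sum(q[a] for q in q_estimates) for a in range(n_actions)]
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the normalized probabilities.
    top = max(summed)
    weights = [math.exp(eta * (s - top)) for s in summed]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical estimates from two previous policies over two actions.
probs = boltzmann_policy([[1.0, 0.0], [1.0, 0.0]], eta=1.0)
```

Because the exponent uses the running sum of estimates, an action that looked good under many earlier policies keeps a high probability even if the latest estimate is noisy.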

Reward Constrained Policy Optimization

This work presents a novel multi-timescale approach for constrained policy optimization, called 'Reward Constrained Policy Optimization' (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint-satisfying one.

Chance-constrained dynamic programming with application to risk-aware robotic space exploration

This paper presents a novel algorithmic approach to reformulate a joint chance constraint as a constraint on the expectation of a summation of indicator random variables, which can be incorporated into the cost function by considering a dual formulation of the optimization problem.
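One standard way to arrive at a reformulation of this kind is Boole's inequality, sketched below. The symbols are assumptions introduced for illustration: \(x_t\) is the state at step \(t\), \(\mathcal{S}\) is the safe set, \(T\) is the horizon, and \(\Delta\) is the risk bound; the paper's own notation and derivation may differ.

```latex
\Pr\Big(\bigcup_{t=1}^{T} \{x_t \notin \mathcal{S}\}\Big)
\;\le\; \sum_{t=1}^{T} \Pr\big(x_t \notin \mathcal{S}\big)
\;=\; \mathbb{E}\Big[\sum_{t=1}^{T} \mathbb{1}\{x_t \notin \mathcal{S}\}\Big]
```

Under these assumptions, requiring the expectation on the right to be at most \(\Delta\) is sufficient for the joint chance constraint on the left to hold with the same bound, and an expectation of a sum is additive over time steps, which is what makes it compatible with a dual formulation.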

A Lyapunov-based Approach to Safe Reinforcement Learning

This work defines and presents a method for constructing Lyapunov functions, which provide an effective way to guarantee the global safety of a behavior policy during training via a set of local, linear constraints.