# A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization

```bibtex
@article{Ying2021ADA,
  title   = {A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization},
  author  = {Donghao Ying and Yuhao Ding and Javad Lavaei},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2110.08923}
}
```

We study entropy-regularized constrained Markov decision processes (CMDPs) under the soft-max parameterization, in which an agent aims to maximize the entropy-regularized value function while satisfying constraints on the expected total utility. By leveraging the entropy regularization, our theoretical analysis shows that its Lagrangian dual function is smooth and the Lagrangian duality gap can be decomposed into the primal optimality gap and the constraint violation. Furthermore, we propose an…
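The dual approach described above can be illustrated on a hypothetical one-step (bandit-style) toy problem: with entropy regularization at temperature τ, the inner maximization over the soft-max policy has a closed form, so the Lagrangian dual L(λ) = τ·logsumexp((r + λu)/τ) − λb is smooth and can be minimized over λ ≥ 0 by projected gradient descent. This is a minimal sketch under assumed toy values of r, u, b, and τ, not the paper's algorithm for full CMDPs:

```python
import numpy as np

tau = 0.5                        # entropy-regularization temperature (assumed)
r = np.array([1.0, 0.2, 0.0])    # toy rewards (hypothetical)
u = np.array([0.0, 0.5, 1.0])    # toy utilities (hypothetical)
b = 0.4                          # constraint: E_pi[u] >= b

def soft_policy(lam):
    # Entropy-regularized best response: softmax((r + lam*u) / tau).
    z = (r + lam * u) / tau
    z -= z.max()                 # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def dual(lam):
    # Smooth dual: L(lam) = tau * logsumexp((r + lam*u)/tau) - lam*b.
    z = (r + lam * u) / tau
    m = z.max()
    return tau * (m + np.log(np.exp(z - m).sum())) - lam * b

lam, lr = 0.0, 0.5
for _ in range(500):
    pi = soft_policy(lam)
    grad = pi @ u - b            # gradient of the dual w.r.t. lam
    lam = max(0.0, lam - lr * grad)  # projected dual descent on lam >= 0

pi = soft_policy(lam)
print(float(pi @ u))             # constraint approximately active at optimum
```

Because the constraint is violated by the unconstrained soft-max policy here, the multiplier settles at a positive value and the constraint becomes (approximately) active, mirroring the smoothness and duality-gap decomposition the paper establishes.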

## 12 Citations

### Policy-based Primal-Dual Methods for Convex Constrained Markov Decision Processes

- Computer Science, Mathematics, ArXiv
- 2022

This work establishes non-asymptotic convergence guarantees for policy-based primal-dual methods for solving infinite-horizon discounted convex CMDPs by introducing a pessimistic term to the constraint.

### Algorithm for Constrained Markov Decision Process with Linear Convergence

- Computer Science, ArXiv
- 2022

A new dual approach is proposed that integrates two ingredients: an entropy-regularized policy optimizer and Vaidya’s dual optimizer, both of which are critical to achieving faster convergence in constrained Markov decision processes.

### Policy gradient primal-dual mirror descent for constrained MDPs with large state spaces

- Mathematics, 2022 IEEE 61st Conference on Decision and Control (CDC)
- 2022

We study constrained sequential decision-making problems modeled by constrained Markov decision processes with potentially infinite state spaces. We propose a Bregman distance-based direct policy…

### Policy Optimization for Constrained MDPs with Provable Fast Global Convergence

- Computer Science
- 2021

This work proposes a new algorithm called the policy mirror descent-primal dual (PMD-PD) algorithm that can provably achieve a faster O(log(T)/T) convergence rate for both the optimality gap and the constraint violation.

### Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints

- Computer Science, ArXiv
- 2022

A dynamic regret bound and a constraint violation bound are established for the proposed algorithm in both the linear kernel CMDP function approximation setting and the tabular CMDP setting under two alternative assumptions.

### Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning

- Computer Science, ArXiv
- 2022

An Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework is proposed, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems.

### Fast Global Convergence of Policy Optimization for Constrained MDPs

- Computer Science, ArXiv
- 2021

This work exhibits a natural policy gradient-based algorithm that has a faster O(log(T)/T) convergence rate for both the optimality gap and the constraint violation.

### Robust Constrained Reinforcement Learning

- Computer Science, ArXiv
- 2022

This work designs a robust primal-dual approach and theoretically develops guarantees on its convergence, complexity, and robust feasibility; it further investigates the concrete example of a δ-contamination uncertainty set, designs an online, model-free algorithm, and theoretically characterizes its sample complexity.

### Regularized Gradient Descent Ascent for Two-Player Zero-Sum Markov Games

- Computer Science, Mathematics, ArXiv
- 2022

The main contribution is to show that under proper choices of the regularization parameter, the gradient descent ascent algorithm converges to the Nash equilibrium of the original unregularized problem.

### Provable Guarantees for Meta-Safe Reinforcement Learning

- Computer Science
- 2022

The proposed theoretical framework is the first to handle the nonconvex and stochastic nature of within-task CMDPs, while exploiting inter-task dependency and intra-task geometries for multi-task safe learning.

## References

Showing 1–10 of 36 references

### Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

- Computer Science, Mathematics, NeurIPS
- 2020

This work is the first to establish non-asymptotic convergence guarantees of policy-based primal-dual methods for solving infinite-horizon discounted CMDPs, and it is shown that two sample-based NPG-PD algorithms inherit such non-asymptotic convergence properties and provide finite-sample complexity guarantees.

### On the Global Convergence Rates of Softmax Policy Gradient Methods

- Computer Science, ICML
- 2020

It is shown that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization, which significantly expands the recent asymptotic convergence results.

### Provably Efficient Safe Exploration via Primal-Dual Policy Optimization

- Computer Science, AISTATS
- 2021

An Optimistic Primal-Dual Proximal Policy Optimization (OPDOP) algorithm is proposed, in which the value function is estimated by combining least-squares policy evaluation with an additional bonus term for safe exploration; it is the first provably efficient policy optimization algorithm for CMDPs with safe exploration.

### An Online Actor–Critic Algorithm with Function Approximation for Constrained Markov Decision Processes

- Computer Science, Mathematics, J. Optim. Theory Appl.
- 2012

An online actor–critic reinforcement learning algorithm with function approximation is presented for a problem of control under inequality constraints, and the asymptotic almost-sure convergence of the algorithm to a locally optimal solution is proved.

### Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

- Computer Science, Oper. Res.
- 2022

This work develops non-asymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on tabular discounted Markov decision processes, and demonstrates that the algorithm converges linearly at a rate that is independent of the dimension of the state-action space.

### Risk-Constrained Reinforcement Learning with Percentile Risk Criteria

- Computer Science, J. Mach. Learn. Res.
- 2017

This paper derives a formula for computing the gradient of the Lagrangian function for percentile risk-constrained Markov decision processes and devise policy gradient and actor-critic algorithms that estimate such gradient, update the policy in the descent direction, and update the Lagrange multiplier in the ascent direction.

### Modeling purposeful adaptive behavior with the principle of maximum causal entropy

- Computer Science
- 2010

The principle of maximum causal entropy is introduced, a general technique for applying information theory to decision-theoretic, game-theoretic, and control settings where relevant information is sequentially revealed over time.

### CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee

- Computer Science, ICML
- 2021

The empirical results demonstrate that CRPO can significantly outperform the existing primal-dual baseline algorithms, achieving an O(1/√T) convergence rate to the globally optimal policy in the constrained policy set and an error bound on constraint satisfaction.

### Safe Policies for Reinforcement Learning via Primal-Dual Methods

- Computer Science, ArXiv
- 2019

It is established that primal-dual algorithms are able to find policies that are safe and optimal, and an ergodic relaxation of the safe-learning problem is proposed.

### On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

- Computer Science, J. Mach. Learn. Res.
- 2021

This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs), and shows an important interplay between estimation error, approximation error, and exploration.