Corpus ID: 239016671

A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization

Donghao Ying, Yuhao Ding, Javad Lavaei
We study entropy-regularized constrained Markov decision processes (CMDPs) under the softmax parameterization, in which an agent aims to maximize the entropy-regularized value function while satisfying constraints on the expected total utility. By leveraging entropy regularization, our theoretical analysis shows that the Lagrangian dual function is smooth and that the Lagrangian duality gap can be decomposed into the primal optimality gap and the constraint violation. Furthermore, we propose an… 
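In standard notation (assumed here for illustration: τ > 0 for the entropy weight, H for the discounted entropy, V_r^π and V_g^π for the reward and utility value functions, and b for the utility threshold), the Lagrangian dual function described in the abstract can be sketched as

```latex
D(\lambda) \;=\; \max_{\pi}\ \Big\{\, V_r^{\pi} + \tau\,\mathcal{H}(\pi) + \lambda\,\big(V_g^{\pi} - b\big) \,\Big\}, \qquad \lambda \ge 0.
```

By a Danskin-type argument, entropy regularization makes the inner maximizer π_λ unique, so D is differentiable with ∇D(λ) = V_g^{π_λ} − b; this is the smoothness property that the dual analysis exploits.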

Policy-based Primal-Dual Methods for Convex Constrained Markov Decision Processes

This work establishes non-asymptotic convergence guarantees for policy-based primal-dual methods for solving infinite-horizon discounted convex CMDPs by introducing a pessimistic term to the constraint.

Algorithm for Constrained Markov Decision Process with Linear Convergence

A new dual approach is proposed with the integration of two ingredients: an entropy-regularized policy optimizer and Vaidya’s dual optimizer, both of which are critical to achieving faster convergence in constrained Markov decision processes.

Policy gradient primal-dual mirror descent for constrained MDPs with large state spaces

We study constrained sequential decision-making problems modeled by constrained Markov decision processes with potentially infinite state spaces. We propose a Bregman distance-based direct policy… 

Policy Optimization for Constrained MDPs with Provable Fast Global Convergence

This work proposes a new algorithm, policy mirror descent-primal dual (PMD-PD), that can provably achieve a faster O(log(T)/T) convergence rate for both the optimality gap and the constraint violation.

Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints

A dynamic regret bound and a constraint violation bound are established for the proposed algorithm in both the linear kernel CMDP function approximation setting and the tabular CMDP setting under two alternative assumptions.

Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning

An Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework is proposed, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems.

Fast Global Convergence of Policy Optimization for Constrained MDPs

This work exhibits a natural policy gradient-based algorithm that has a faster convergence rate O(log(T)/T) for both the optimality gap and the constraint violation.

Robust Constrained Reinforcement Learning

This work designs a robust primal-dual approach and theoretically establishes guarantees on its convergence, complexity, and robust feasibility; it further investigates a concrete example of a δ-contamination uncertainty set, designs an online, model-free algorithm, and theoretically characterizes its sample complexity.

Regularized Gradient Descent Ascent for Two-Player Zero-Sum Markov Games

The main contribution is to show that under proper choices of the regularization parameter, the gradient descent ascent algorithm converges to the Nash equilibrium of the original unregularized problem.
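As a toy illustration of this kind of result, the sketch below runs entropy-regularized gradient descent ascent in multiplicative (mirror-descent) form on matching pennies, whose unregularized Nash equilibrium is uniform play; the game, step size, and regularization weight are illustrative choices, not taken from the paper.

```python
import numpy as np

# Matching pennies: the row player maximizes x^T A y, the column player minimizes it.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
tau, eta = 1.0, 0.1             # entropy weight and step size (illustrative)
x = np.array([0.8, 0.2])        # start both players away from equilibrium
y = np.array([0.3, 0.7])

for _ in range(500):
    # Simultaneous multiplicative updates; the pi**(1 - eta*tau) factor is the
    # entropy regularization pulling each strategy toward uniform.
    x_new = x ** (1 - eta * tau) * np.exp(eta * (A @ y))
    y_new = y ** (1 - eta * tau) * np.exp(-eta * (A.T @ x))
    x = x_new / x_new.sum()
    y = y_new / y_new.sum()

print(x, y)  # both strategies approach uniform play, the Nash equilibrium
```

Here the regularized equilibrium coincides with the unregularized one (uniform play), so the iterates recover the Nash equilibrium of the original game, matching the flavor of the stated result.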

Provable Guarantees for Meta-Safe Reinforcement Learning

The proposed theoretical framework is the first to handle the nonconvexity and stochasticity nature of within-task CMDPs, while exploiting inter-task dependency and intra-task geometries for multi-task safe learning.

Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

This work is the first to establish non-asymptotic convergence guarantees of policy-based primal-dual methods for solving infinite-horizon discounted CMDPs, and it shows that two sample-based NPG-PD algorithms inherit such non-asymptotic convergence properties and provide finite-sample complexity guarantees.

On the Global Convergence Rates of Softmax Policy Gradient Methods

It is shown that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization, which significantly expands the recent asymptotic convergence results.
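A minimal sketch of this setting, assuming a hypothetical 3-armed bandit with known rewards and access to the true gradient of J(θ) = π_θ · r (rewards, step size, and iteration count are illustrative):

```python
import numpy as np

# Exact-gradient softmax policy gradient on a 3-armed bandit.
r = np.array([1.0, 0.8, 0.2])   # deterministic arm rewards (toy instance)
theta = np.zeros(3)             # softmax logits, uniform initialization
eta = 0.4                       # step size

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(5000):
    pi = softmax(theta)
    J = pi @ r
    grad = pi * (r - J)         # true gradient of J(theta) = pi . r under softmax
    theta += eta * grad

print(softmax(theta))           # probability mass concentrates on the best arm
```

The suboptimality J* − J(θ_t) shrinks roughly like 1/t here, consistent with the O(1/t) rate the summary describes.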

Provably Efficient Safe Exploration via Primal-Dual Policy Optimization

This work proposes an Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm, in which the value function is estimated by combining least-squares policy evaluation with an additional bonus term for safe exploration; it is the first provably efficient policy optimization algorithm for CMDPs with safe exploration.

An Online Actor–Critic Algorithm with Function Approximation for Constrained Markov Decision Processes

An online actor–critic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints and it is proved the asymptotic almost sure convergence of the algorithm to a locally optimal solution.

Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

This work develops non-asymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on tabular discounted Markov decision processes, and demonstrates that the algorithm converges linearly at a rate that is independent of the dimension of the state-action space.
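For tabular softmax policies, the entropy-regularized NPG update has a well-known multiplicative form; on a one-state (bandit) instance with assumed constants, it contracts linearly toward the regularized optimum softmax(r/τ):

```python
import numpy as np

# Entropy-regularized NPG on a 3-armed bandit (toy sketch; rewards, tau,
# and eta are illustrative choices, not values from the paper).
r = np.array([1.0, 0.8, 0.2])
tau, eta = 0.1, 0.5             # entropy weight and step size
pi = np.ones(3) / 3             # uniform initial policy

for _ in range(200):
    # Multiplicative update: pi_{t+1}(a) ∝ pi_t(a)^(1 - eta*tau) * exp(eta * r(a)).
    # In log-space this is a (1 - eta*tau)-contraction toward log softmax(r/tau).
    pi = pi ** (1 - eta * tau) * np.exp(eta * r)
    pi /= pi.sum()

target = np.exp(r / tau)
target /= target.sum()          # the entropy-regularized optimal policy
print(np.abs(pi - target).max())
```

The per-iteration contraction factor is (1 − ητ) regardless of the number of arms, which mirrors the dimension-independent linear rate highlighted in the summary.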

Risk-Constrained Reinforcement Learning with Percentile Risk Criteria

This paper derives a formula for computing the gradient of the Lagrangian function for percentile risk-constrained Markov decision processes and devise policy gradient and actor-critic algorithms that estimate such gradient, update the policy in the descent direction, and update the Lagrange multiplier in the ascent direction.
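The policy/multiplier update pattern described here can be sketched on a toy constrained bandit (a hypothetical reward-maximization instance, so the policy takes gradient-ascent steps and the multiplier takes projected descent steps; a small entropy term, in the spirit of the regularized methods surveyed above, keeps the iterates stable, and all constants are illustrative):

```python
import numpy as np

# Toy constrained bandit: maximize pi . r subject to pi . g >= b.
r = np.array([1.0, 0.5, 0.2])   # expected rewards per arm
g = np.array([0.0, 1.0, 0.5])   # expected utilities; arm 0 is rewarding but unsafe
b, tau = 0.5, 0.05              # constraint threshold and entropy weight
eta_p, eta_d = 1.0, 0.005       # primal and dual step sizes
theta, lam = np.zeros(3), 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(5000):
    pi = softmax(theta)
    q = r + lam * g - tau * np.log(pi)          # regularized Lagrangian payoff
    theta += eta_p * pi * (q - pi @ q)          # policy gradient ascent step
    lam = max(0.0, lam - eta_d * (pi @ g - b))  # projected multiplier step

pi = softmax(theta)
print(pi, lam)
```

At the fixed point the multiplier enforces pi · g = b, so the policy mixes the rewarding arm with the safe arm instead of committing to either, which is the qualitative behavior these primal-dual schemes are designed to produce.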

Modeling purposeful adaptive behavior with the principle of maximum causal entropy

The principle of maximum causal entropy is introduced, a general technique for applying information theory to decision-theoretic, game-theoretic, and control settings where relevant information is sequentially revealed over time.

CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee

The empirical results demonstrate that CRPO can significantly outperform existing primal-dual baseline algorithms, achieving an O(1/√T) convergence rate to the global optimal policy in the constrained policy set and an error bound on constraint satisfaction.

Safe Policies for Reinforcement Learning via Primal-Dual Methods

It is established that primal-dual algorithms are able to find policies that are safe and optimal, and an ergodic relaxation of the safe-learning problem is proposed.

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs), and shows an important interplay between estimation error, approximation error, and exploration.