# Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss

    @article{Qiu2020UpperCP,
      title={Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss},
      author={Shuang Qiu and Xiaohan Wei and Zhuoran Yang and Jieping Ye and Zhaoran Wang},
      journal={arXiv: Learning},
      year={2020}
    }

We consider online learning for episodic stochastically constrained Markov decision processes (CMDPs), which plays a central role in ensuring the safety of reinforcement learning. Here the loss function can vary arbitrarily across the episodes, whereas both the loss received and the budget consumption are revealed at the end of each episode. Previous works solve this problem under the restrictive assumption that the transition model of the Markov decision process (MDP) is known a priori and…
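In this setting, performance is typically measured by the regret against the best fixed feasible policy and by the cumulative constraint violation. A standard formalization (the notation here is illustrative; the paper's exact definitions may differ) is:

$$
\mathrm{Regret}(K)=\sum_{k=1}^{K}\ell_k(\pi_k)-\min_{\pi\in\Pi}\sum_{k=1}^{K}\ell_k(\pi),
\qquad
\mathrm{Violation}(K)=\Big[\sum_{k=1}^{K}\big(c(\pi_k)-b\big)\Big]_{+},
$$

where $\ell_k(\pi)$ is the expected loss of policy $\pi$ under the episode-$k$ loss function, $c(\pi)$ is its expected budget consumption, $b$ is the budget, and the comparator class $\Pi=\{\pi: c(\pi)\le b\}$ contains the policies satisfying the constraint.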

## 11 Citations

A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints

- Computer Science, Engineering · AAAI
- 2021

An online algorithm is proposed which leverages the linear programming formulation of finite-horizon CMDP for repeated optimistic planning to provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an $\epsilon$-optimal policy.
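The linear programming formulation referred to here can be sketched in terms of occupancy measures $q_h(s,a)$ (a standard construction, not taken verbatim from the cited paper):

$$
\begin{aligned}
\min_{q\ge 0}\;& \sum_{h,s,a} q_h(s,a)\,\ell_h(s,a)\\
\text{s.t.}\;& \sum_{a'} q_{h+1}(s',a') = \sum_{s,a} P_h(s'\mid s,a)\,q_h(s,a) \quad \forall\, s', h,\\
& \sum_{a} q_1(s,a) = \mathbb{1}\{s = s_1\} \quad \forall\, s,\\
& \sum_{h,s,a} q_h(s,a)\,c_h(s,a) \le b,
\end{aligned}
$$

where any feasible $q$ induces a policy via $\pi_h(a\mid s)\propto q_h(s,a)$, so optimistic planning reduces to solving an LP over the confidence set of transition models.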

Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret

- Computer Science, Mathematics · NeurIPS
- 2020

The results demonstrate that incorporating risk awareness into reinforcement learning necessitates an exponential cost in the risk parameter β and the horizon H, which quantifies the fundamental tradeoff between risk sensitivity and sample efficiency.

Provably Efficient Algorithms for Multi-Objective Competitive RL

- Computer Science · ICML
- 2021

This work provides the first provably efficient algorithms for vector-valued Markov games and the theoretical guarantees are near-optimal.

A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes

- Computer Science · ArXiv
- 2021

This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation, named Triple-Q, which is similar to SARSA for unconstrained MDPs, and is computationally efficient.

Triple-Q: A Model-Free Algorithm for Constrained Reinforcement Learning with Sublinear Regret and Zero Constraint Violation

- 2021

This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The…

A Lyapunov-Based Methodology for Constrained Optimization with Bandit Feedback

- Computer Science, Mathematics · ArXiv
- 2021

A novel low-complexity algorithm based on Lyapunov optimization methodology, named LyOn, is proposed and it is proved that it achieves $O(\sqrt{B\log B})$ regret and $O(\log B/B)$ constraint violation.
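The drift-plus-penalty idea underlying such Lyapunov-based methods can be illustrated with a minimal sketch: a toy two-action problem with a per-round budget, where a virtual queue tracks accumulated constraint violation. All names and constants below are illustrative, not taken from the cited paper.

```python
# Minimal drift-plus-penalty sketch: maximize reward subject to an
# average-cost constraint, using a virtual queue for the violation.
def drift_plus_penalty(actions, reward, cost, budget, rounds, V=10.0):
    """actions: list of arms; reward/cost: dicts arm -> per-round value;
    budget: per-round cost allowance; V: reward-vs-queue trade-off weight."""
    Q = 0.0            # virtual queue tracking cumulative constraint violation
    total_reward = 0.0
    for _ in range(rounds):
        # Pick the action maximizing V*reward - Q*cost (penalty minus drift).
        a = max(actions, key=lambda x: V * reward[x] - Q * cost[x])
        total_reward += reward[a]
        Q = max(Q + cost[a] - budget, 0.0)   # queue grows when over budget
    return total_reward / rounds, Q / rounds

avg_r, avg_q = drift_plus_penalty(
    ["safe", "risky"],
    reward={"safe": 0.3, "risky": 1.0},
    cost={"safe": 0.1, "risky": 0.9},
    budget=0.5, rounds=1000)
```

With these toy numbers the policy alternates between the two arms once the queue builds up, so the long-run average cost hovers around the budget while the average reward lands between the two arms' payoffs.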

Concave Utility Reinforcement Learning with Zero-Constraint Violations

- Computer Science · ArXiv
- 2021

A model-based learning algorithm is proposed that achieves zero constraint violations, and a regret guarantee for the objective is obtained which grows as $\tilde{O}(1/\sqrt{T})$, excluding other factors.

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

- Computer Science · ArXiv
- 2021

It is shown that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{O}(\sqrt{K})$.

Markov Decision Processes with Long-Term Average Constraints

- Computer Science, Engineering · ArXiv
- 2021

It is proved that, by following the CMDP-PSRL algorithm, the agent can bound the regret of not accumulating rewards from the optimal policy by $\tilde{O}(\mathrm{poly}(DSA)\sqrt{T})$; to the best of the authors' knowledge, this is the first work that obtains an $\tilde{O}(\sqrt{T})$ regret bound for ergodic MDPs with long-term average constraints.

Offline Constrained Multi-Objective Reinforcement Learning via Pessimistic Dual Value Iteration

- 2021

In constrained multi-objective RL, the goal is to learn a policy that achieves the best performance specified by a multi-objective preference function under a constraint. We focus on the offline…

## References

Showing 1-10 of 69 references.

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

- Computer Science, Mathematics · ArXiv
- 2019

The algorithm is the first to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting and achieves the same regret bound as (Rosenberg & Mansour, 2019a) that considers an easier setting with full-information feedback.

Online Convex Optimization in Adversarial Markov Decision Processes

- Computer Science, Mathematics · ICML
- 2019

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes, and the transition function is not known to the…

Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes

- Computer Science, Mathematics · ICML
- 2020

This result significantly improves over the $\mathcal{O}(T^{3/4})$ regret achieved by the only existing model-free algorithm by Abbasi-Yadkori et al. (2019a) for ergodic MDPs in the infinite-horizon average-reward setting.

Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

- Computer Science, Mathematics · ICLR
- 2020

It is shown that the sample complexity of exploration of the proposed Q-learning algorithm is bounded by $\tilde{O}({\frac{SA}{\epsilon^2(1-\gamma)^7}})$, which improves the previously best known result.
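The mechanism behind such algorithms can be sketched in a few lines: tabular Q-learning with optimistic initialization, a count-based UCB exploration bonus, and a learning rate of the $\frac{H+1}{H+k}$ form. The environment and constants below are illustrative toy choices, not taken from the cited paper.

```python
import math

def q_learning_ucb(step, reward, n_states, n_actions, T, gamma=0.9, c=1.0):
    """Tabular Q-learning with a count-based UCB exploration bonus.
    step(s, a) -> next state; reward(s, a) -> reward in [0, 1]."""
    H = 1.0 / (1.0 - gamma)                          # effective horizon
    Q = [[H] * n_actions for _ in range(n_states)]   # optimistic init at 1/(1-gamma)
    N = [[0] * n_actions for _ in range(n_states)]   # visit counts
    s = 0
    for t in range(T):
        a = max(range(n_actions), key=lambda x: Q[s][x])  # greedy on optimistic Q
        N[s][a] += 1
        k = N[s][a]
        alpha = (H + 1.0) / (H + k)                  # (H+1)/(H+k)-style rate
        bonus = c * math.sqrt(math.log(t + 2) / k)   # UCB exploration bonus
        s2 = step(s, a)
        target = reward(s, a) + bonus + gamma * max(Q[s2])
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * min(target, H)  # keep Q <= 1/(1-gamma)
        s = s2
    return Q

# Toy single-state MDP: action 1 pays 1 per step, action 0 pays 0.
Q = q_learning_ucb(step=lambda s, a: 0,
                   reward=lambda s, a: float(a == 1),
                   n_states=1, n_actions=2, T=2000)
```

On this toy instance the optimistic value for the rewarding action stays pinned near $1/(1-\gamma)$, while the unrewarding action's estimate drops after its first visit and is never selected again.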

Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

- Computer Science, Mathematics · NeurIPS
- 2019

An algorithm based on the OFU principle is presented which efficiently learns reinforcement learning (RL) problems modeled by Markov decision processes (MDPs) with finite state-action spaces, and it outperforms the best previous regret bounds.

Optimistic Policy Optimization with Bandit Feedback

- Computer Science, Mathematics · ICML
- 2020

This paper considers model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback, and proposes an optimistic trust region policy optimization (TRPO) algorithm, which establishes sublinear regret bounds for both stochastic and adversarial rewards.

Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

- Computer Science, Mathematics · NeurIPS
- 2019

This work develops no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes.

Constrained Upper Confidence Reinforcement Learning

- Computer Science, Mathematics · L4DC
- 2020

An algorithm, C-UCRL, is presented, and it is shown to achieve sublinear regret with respect to the reward while satisfying the constraints with probability $1-\delta$, even during learning.

The adversarial stochastic shortest path problem with unknown transition probabilities

- Mathematics, Computer Science · AISTATS
- 2012

This paper proposes "follow the perturbed optimistic policy", an algorithm that learns and controls the stochastic and adversarial components in an online fashion at the same time, and it is proved that the expected cumulative regret of the algorithm is of order $L|X||A|\sqrt{T}$ up to logarithmic factors.

Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

- Computer Science, Mathematics · ICML
- 2018

The optimization problem at the core of REGAL.C is relaxed, and the first computationally efficient algorithm to solve it, named SCAL, is provided; numerical simulations are reported that support the theoretical findings and show how SCAL significantly outperforms UCRL in MDPs with large diameter and small span.