Corpus ID: 219792541

Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss

@article{Qiu2020UpperCP,
  title={Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss},
  author={Shuang Qiu and Xiaohan Wei and Zhuoran Yang and Jieping Ye and Zhaoran Wang},
  journal={arXiv: Learning},
  year={2020}
}
We consider online learning for episodic stochastically constrained Markov decision processes (CMDPs), a setting that plays a central role in ensuring the safety of reinforcement learning. Here the loss function can vary arbitrarily across episodes, while both the loss received and the budget consumption are revealed only at the end of each episode. Previous works solve this problem under the restrictive assumption that the transition model of the Markov decision process (MDP) is known a priori and…
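As a toy illustration of the primal-dual idea the abstract alludes to (a dual variable pricing budget consumption, with losses and costs revealed only after each round), the sketch below runs a generic Lagrangian primal-dual loop on a constrained multi-armed bandit. This is not the paper's algorithm; the arm means, step size `eta`, and budget value are invented for the example.

```python
import numpy as np

# Illustrative sketch, NOT the paper's algorithm: a generic primal-dual loop on a
# constrained multi-armed bandit. Each round we pick an arm, observe its loss and
# budget consumption only after the pull (bandit feedback), and update a dual
# variable lam that prices constraint violation. All constants are invented.

rng = np.random.default_rng(0)
n_arms, T = 3, 2000
budget = 0.5   # allowed average consumption per round (assumed)
eta = 0.05     # dual step size (assumed)

# Unknown to the learner: per-arm mean loss and mean consumption.
mean_loss = np.array([0.9, 0.4, 0.2])
mean_cost = np.array([0.1, 0.3, 0.9])  # the low-loss arm 2 overspends the budget

lam = 0.0                     # dual variable (price of the budget constraint)
loss_est = np.zeros(n_arms)   # empirical mean estimates
cost_est = np.zeros(n_arms)
counts = np.zeros(n_arms)

for k in range(T):
    if k < n_arms:
        a = k  # play each arm once to initialize the estimates
    else:
        # Primal step: pick the arm minimizing the estimated Lagrangian
        #   loss + lam * (cost - budget).
        a = int(np.argmin(loss_est + lam * (cost_est - budget)))
    loss = rng.binomial(1, mean_loss[a])
    cost = rng.binomial(1, mean_cost[a])
    counts[a] += 1
    loss_est[a] += (loss - loss_est[a]) / counts[a]
    cost_est[a] += (cost - cost_est[a]) / counts[a]
    # Dual step: raise the price when the budget is exceeded, never below zero.
    lam = max(0.0, lam + eta * (cost - budget))
```

At the (hypothetical) equilibrium the dual price settles where the cheap-but-overspending arm and the budget-feasible arm have equal Lagrangian value, so the learner mixes between them while keeping average consumption near the budget.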
A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints
• Computer Science, Engineering
AAAI
• 2021
An online algorithm is proposed which leverages the linear programming formulation of finite-horizon CMDP for repeated optimistic planning to provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an $\epsilon$-optimal policy.
Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret
• Computer Science, Mathematics
NeurIPS
• 2020
The results demonstrate that incorporating risk awareness into reinforcement learning incurs an exponential cost in the risk parameter β and the horizon H, which quantifies the fundamental tradeoff between risk sensitivity and sample efficiency.
Provably Efficient Algorithms for Multi-Objective Competitive RL
• Computer Science
ICML
• 2021
This work provides the first provably efficient algorithms for vector-valued Markov games and the theoretical guarantees are near-optimal.
A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes
• Computer Science
ArXiv
• 2021
This paper presents Triple-Q, the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation; the algorithm is similar to SARSA for unconstrained MDPs and is computationally efficient.
Triple-Q: A Model-Free Algorithm for Constrained Reinforcement Learning with Sublinear Regret and Zero Constraint Violation
• 2021
This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation.
A Lyapunov-Based Methodology for Constrained Optimization with Bandit Feedback
• Semih Cayci, Yilin Zheng
• Computer Science, Mathematics
ArXiv
• 2021
A novel low-complexity algorithm based on the Lyapunov optimization methodology, named LyOn, is proposed, and it is proved to achieve $O(\sqrt{B\log B})$ regret and $O(\log B/B)$ constraint violation.
Concave Utility Reinforcement Learning with Zero-Constraint Violations
• Computer Science
ArXiv
• 2021
A model-based learning algorithm is proposed that achieves zero constraint violations, and a regret guarantee for the objective is obtained which grows as $\tilde{O}(1/\sqrt{T})$, excluding other factors.
Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs
• Tao Liu, P. R. Kumar
• Computer Science
ArXiv
• 2021
It is shown that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{O}(\sqrt{K})$.
Markov Decision Processes with Long-Term Average Constraints
• Computer Science, Engineering
ArXiv
• 2021
It is proved that following the CMDP-PSRL algorithm, the agent can bound the regret of not accumulating rewards from the optimal policy by $\tilde{O}(\mathrm{poly}(DSA)\sqrt{T})$; to the best of our knowledge, this is the first work to obtain $\tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints.
Offline Constrained Multi-Objective Reinforcement Learning via Pessimistic Dual Value Iteration
• Runzhe Wu
• 2021
In constrained multi-objective RL, the goal is to learn a policy that achieves the best performance specified by a multi-objective preference function under a constraint. We focus on the offline setting.

References

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition
• Computer Science, Mathematics
ArXiv
• 2019
The algorithm is the first to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting, matching the regret bound of Rosenberg & Mansour (2019a), who consider an easier setting with full-information feedback.
Online Convex Optimization in Adversarial Markov Decision Processes
• Computer Science, Mathematics
ICML
• 2019
We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes and the transition function is not known to the learner.
Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes
• Computer Science, Mathematics
ICML
• 2020
This result significantly improves over the $\mathcal{O}(T^{3/4})$ regret achieved by the only existing model-free algorithm by Abbasi-Yadkori et al. (2019a) for ergodic MDPs in the infinite-horizon average-reward setting.
Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
• Computer Science, Mathematics
ICLR
• 2020
It is shown that the sample complexity of exploration of the proposed Q-learning algorithm is bounded by $\tilde{O}({\frac{SA}{\epsilon^2(1-\gamma)^7}})$, which improves the previously best known result.
Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function
• Computer Science, Mathematics
NeurIPS
• 2019
An algorithm based on the OFU principle is presented that efficiently learns Markov decision processes (MDPs) with finite state-action spaces and improves upon the best previous regret bounds.
Optimistic Policy Optimization with Bandit Feedback
• Computer Science, Mathematics
ICML
• 2020
This paper considers model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback, and proposes an optimistic trust-region policy optimization (TRPO) algorithm, establishing regret bounds for both stochastic and adversarial rewards.
Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function
• Computer Science, Mathematics
NeurIPS
• 2019
This work develops no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes.
Constrained Upper Confidence Reinforcement Learning
• Computer Science, Mathematics
L4DC
• 2020
An algorithm, C-UCRL, is presented and shown to achieve sublinear regret with respect to the reward while satisfying the constraints with probability $1-\delta$, even while learning.
The adversarial stochastic shortest path problem with unknown transition probabilities
• Mathematics, Computer Science
AISTATS
• 2012
This paper proposes "follow the perturbed optimistic policy", an algorithm that learns and controls the stochastic and adversarial components in an online fashion at the same time, and it is proved that the expected cumulative regret of the algorithm is of order $L|X||A|\sqrt{T}$ up to logarithmic factors.
Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning
• Computer Science, Mathematics
ICML
• 2018
The optimization problem at the core of REGAL.C is relaxed and the first computationally efficient algorithm to solve it is provided; numerical simulations support the theoretical findings and show that SCAL significantly outperforms UCRL in MDPs with large diameter and small span.