Corpus ID: 219792541

Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss

@article{Qiu2020UpperCP,
  title={Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss},
  author={Shuang Qiu and Xiaohan Wei and Zhuoran Yang and Jieping Ye and Zhaoran Wang},
  journal={arXiv: Learning},
  year={2020}
}
We consider online learning for episodic stochastically constrained Markov decision processes (CMDP), which plays a central role in ensuring the safety of reinforcement learning. Here the loss function can vary arbitrarily across the episodes, whereas both the loss received and the budget consumption are revealed at the end of each episode. Previous works solve this problem under the restrictive assumption that the transition model of the Markov decision processes (MDP) is known a priori and… 
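To fix ideas, the two quantities such algorithms are judged on are regret against the best constraint-satisfying policy in hindsight and cumulative constraint violation. The display below is a standard way of writing these objectives (episode count $K$, occupancy measure $q^{\pi}$, adversarial loss $\ell_k$, constraint cost $c$ with budget $b$); the notation is illustrative rather than copied from the paper:

$\mathrm{Regret}(K) = \sum_{k=1}^{K} \langle q^{\pi_k}, \ell_k \rangle - \min_{\pi:\, \langle q^{\pi}, c \rangle \le b} \sum_{k=1}^{K} \langle q^{\pi}, \ell_k \rangle, \qquad \mathrm{Violation}(K) = \Big[ \sum_{k=1}^{K} \big( \langle q^{\pi_k}, c \rangle - b \big) \Big]_{+}.$

A primal-dual method in general controls the first term through an optimistic (upper-confidence) primal update and the second through a dual variable that prices constraint violation.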
A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints
TLDR
An online algorithm is proposed which leverages the linear programming formulation of finite-horizon CMDP for repeated optimistic planning to provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an $\epsilon$-optimal policy.
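The linear program referred to above is, in its standard occupancy-measure form, roughly the following (notation is illustrative and not taken from the paper):

$\max_{q \ge 0} \ \sum_{h,s,a} q_h(s,a)\, r_h(s,a) \quad \text{s.t.} \quad \sum_{a'} q_{h+1}(s',a') = \sum_{s,a} P_h(s' \mid s,a)\, q_h(s,a) \ \ \forall h, s', \qquad \sum_{h,s,a} q_h(s,a)\, c_h(s,a) \le b,$

with $\sum_a q_1(s,a) = \mu(s)$ for the initial distribution $\mu$. Repeated optimistic planning replaces the unknown transition model $P$ with one drawn from a confidence set, so the LP is re-solved as that set shrinks.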
Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret
TLDR
The results demonstrate that incorporating risk awareness into reinforcement learning necessitates an exponential cost in $\beta$ and $H$, which quantifies the fundamental tradeoff between risk sensitivity and sample efficiency.
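In this line of work the risk parameter $\beta$ typically enters through the entropic risk (exponential utility) objective; as background rather than a quotation from the paper:

$V^{\pi} = \frac{1}{\beta} \log \mathbb{E}^{\pi}\!\Big[ \exp\!\Big( \beta \sum_{h=1}^{H} r_h \Big) \Big],$

which recovers the risk-neutral expected return as $\beta \to 0$ and becomes increasingly risk-seeking (for $\beta > 0$) or risk-averse (for $\beta < 0$) as $|\beta|$ grows.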
Provably Efficient Algorithms for Multi-Objective Competitive RL
TLDR
This work provides the first provably efficient algorithms for vector-valued Markov games and the theoretical guarantees are near-optimal.
A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes
TLDR
This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation, named Triple-Q, which is similar to SARSA for unconstrained MDPs, and is computationally efficient.
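As a rough sketch only (not the Triple-Q algorithm itself), the following shows how a SARSA-style tabular method can be coupled with a virtual queue that prices constraint slack; every name, constant, and schedule below is an assumption for illustration:

import numpy as np

# Sketch: SARSA-style primal updates plus a virtual queue for the constraint.
# All sizes, step sizes, and the bonus term are illustrative placeholders.
S, A, H = 5, 3, 10                              # toy numbers of states, actions, horizon
alpha, eta, bonus, rho = 0.1, 1.0, 0.05, 0.2    # rho = per-episode utility budget

Q_r = np.zeros((H + 1, S, A))   # estimated reward-to-go
Q_c = np.zeros((H + 1, S, A))   # estimated utility-to-go (constraint signal)
Z = 0.0                         # virtual queue: accumulated constraint slack

def act(h, s):
    # Greedy in reward plus queue-weighted utility (a Lagrangian-style rule).
    return int(np.argmax(Q_r[h, s] + (Z / eta) * Q_c[h, s]))

def td_update(h, s, a, r, c, s_next, a_next):
    # On-policy one-step update with a small optimism bonus added to the target.
    Q_r[h, s, a] += alpha * (r + Q_r[h + 1, s_next, a_next] + bonus - Q_r[h, s, a])
    Q_c[h, s, a] += alpha * (c + Q_c[h + 1, s_next, a_next] + bonus - Q_c[h, s, a])

def end_of_episode(utility_collected):
    # The queue grows whenever the episode's utility falls short of the budget.
    global Z
    Z = max(Z + rho - utility_collected, 0.0)

Pairing two Q-style estimates with a virtual queue is the general pattern; the paper's actual step sizes, bonus terms, and queue dynamics differ from the placeholders above.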
Triple-Q: A Model-Free Algorithm for Constrained Reinforcement Learning with Sublinear Regret and Zero Constraint Violation
This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The…
A Lyapunov-Based Methodology for Constrained Optimization with Bandit Feedback
TLDR
A novel low-complexity algorithm based on the Lyapunov optimization methodology, named LyOn, is proposed, and it is proved to achieve $O(\sqrt{B \log B})$ regret and $O(\log B / B)$ constraint violation.
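The Lyapunov (drift-plus-penalty) machinery behind this style of guarantee typically has the following shape; this is the generic template rather than LyOn's exact construction:

$Q(t+1) = \big[\, Q(t) + g(x_t) \,\big]_{+}, \qquad x_{t+1} \in \arg\min_{x \in \mathcal{X}} \ V f(x) + Q(t)\, g(x),$

where $f$ is the cost being minimized, the constraint is $g(x) \le 0$, the virtual queue $Q(t)$ accumulates violation, and the parameter $V$ trades regret against constraint violation. With bandit feedback, $f$ and $g$ are replaced by estimates built from the observed samples.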
Concave Utility Reinforcement Learning with Zero-Constraint Violations
TLDR
A model-based learning algorithm is proposed that achieves zero constraint violation, with a regret guarantee for the objective that grows as $\tilde{O}(1/\sqrt{T})$, excluding other factors.
Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs
TLDR
It is shown that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{O}(\sqrt{K})$.
Markov Decision Processes with Long-Term Average Constraints
TLDR
It is proved that, following the CMDP-PSRL algorithm, the agent can bound the regret of not accumulating rewards from the optimal policy by $\tilde{O}(\mathrm{poly}(DSA)\sqrt{T})$, and, to the best of the authors' knowledge, this is the first work that obtains $\tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints.
Offline Constrained Multi-Objective Reinforcement Learning via Pessimistic Dual Value Iteration
  • Runzhe Wu
  • 2021
In constrained multi-objective RL, the goal is to learn a policy that achieves the best performance specified by a multi-objective preference function under a constraint. We focus on the offline…

References

SHOWING 1-10 OF 69 REFERENCES
Learning Adversarial MDPs with Bandit Feedback and Unknown Transition
TLDR
The algorithm is the first to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting and achieves the same regret bound as (Rosenberg & Mansour, 2019a), which considers an easier setting with full-information feedback.
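The core ingredients in this bandit-feedback, unknown-transition setting are an importance-weighted loss estimator and an online mirror descent step over occupancy measures confined to a transition confidence set; schematically (notation illustrative, not the paper's exact statement):

$\hat{\ell}_k(s,a) = \frac{\ell_k(s,a)\, \mathbb{1}\{(s,a) \text{ visited in episode } k\}}{u_k(s,a) + \gamma}, \qquad q_{k+1} = \arg\min_{q \in \Omega(\mathcal{P}_k)} \ \eta\, \langle q, \hat{\ell}_k \rangle + D\big(q \,\|\, q_k\big),$

where $u_k$ upper-bounds the probability of visiting $(s,a)$ under any transition model in the confidence set $\mathcal{P}_k$, $\gamma$ is an implicit-exploration bias, and $D$ is a KL-type Bregman divergence.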
Online Convex Optimization in Adversarial Markov Decision Processes
We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes, and the transition function is not known to the…
Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes
TLDR
This result significantly improves over the $\mathcal{O}(T^{3/4})$ regret achieved by the only existing model-free algorithm by Abbasi-Yadkori et al. (2019a) for ergodic MDPs in the infinite-horizon average-reward setting.
Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
TLDR
It is shown that the sample complexity of exploration of the proposed Q-learning algorithm is bounded by $\tilde{O}({\frac{SA}{\epsilon^2(1-\gamma)^7}})$, which improves the previously best known result.
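The update underlying this family of methods is ordinary Q-learning augmented with an exploration bonus that shrinks with the visit count; schematically (the step-size and bonus schedules here are placeholders, not the paper's exact choices):

$Q(s,a) \leftarrow (1-\alpha_t)\, Q(s,a) + \alpha_t \Big( r(s,a) + \gamma \max_{a'} Q(s',a') + b_t \Big), \qquad b_t \propto \mathrm{poly}\!\Big(\tfrac{1}{1-\gamma}\Big) \sqrt{\tfrac{\iota}{t}},$

where $t$ is the number of visits to $(s,a)$, $\iota$ is a logarithmic factor, and the effective horizon $\tfrac{1}{1-\gamma}$ plays the role of the finite horizon $H$ in the episodic analyses.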
Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function
TLDR
An algorithm based on the OFU principle is presented which efficiently learns reinforcement learning (RL) problems modeled by Markov decision processes (MDPs) with finite state-action spaces, and it improves on the best previously known regret bounds.
Optimistic Policy Optimization with Bandit Feedback
TLDR
This paper considers model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback, proposes an optimistic trust region policy optimization (TRPO) algorithm, and establishes regret guarantees for both stochastic and adversarial rewards.
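In tabular form, the KL-regularized (TRPO-style) improvement step that such methods build on reduces to a multiplicative-weights update on each state's action distribution, here written with an optimistic, bonus-augmented value estimate $\hat{Q}$ (a schematic, not the paper's exact algorithm):

$\pi_{k+1}(a \mid s) \ \propto\ \pi_k(a \mid s)\, \exp\!\big( \eta\, \hat{Q}^{\pi_k}(s,a) \big),$

so optimism enters through $\hat{Q}$ while the KL term keeps consecutive policies close, which is what lets the mirror-descent regret analysis go through under bandit feedback.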
Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function
TLDR
This work develops no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes.
Constrained Upper Confidence Reinforcement Learning
TLDR
An algorithm, C-UCRL, is presented and shown to achieve sub-linear regret with respect to the reward while satisfying the constraints, with probability $1-\delta$, even during learning.
The adversarial stochastic shortest path problem with unknown transition probabilities
TLDR
This paper proposes an algorithm called “follow the perturbed optimistic policy”, which learns and controls the stochastic and adversarial components in an online fashion at the same time, and it is proved that the expected cumulative regret of the algorithm is of order $L|X|\,|A|\sqrt{T}$ up to logarithmic factors.
Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning
TLDR
The optimization problem at the core of REGAL.C is relaxed, the first computationally efficient algorithm to solve it is provided, and numerical simulations are reported supporting the theoretical findings and showing how SCAL significantly outperforms UCRL in MDPs with large diameter and small span.