# Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints

```bibtex
@inproceedings{Kretinsky2018LearningBasedMO,
  title     = {Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints},
  author    = {Jan Křetínský and Guillermo A. Pérez and Jean-François Raskin},
  booktitle = {International Conference on Concurrency Theory},
  year      = {2018}
}
```
• Published in International Conference on Concurrency Theory, 24 April 2018 · Computer Science
We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) with unknown probabilistic transition function and unknown reward function. Assuming the support of the unknown transition function and a lower bound on the minimal transition probability are known in advance, we show that in MDPs consisting of a single end component, two combinations of guarantees on the parity and mean-payoff objectives can…
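The abstract's key assumptions — the support of the unknown transition function and a lower bound on the minimal transition probability are known in advance — can be illustrated with a minimal sketch. The two-state example, the `SUPPORT` map, and the clamp-and-renormalize projection below are hypothetical illustrations of that setting, not the paper's algorithm.

```python
from collections import defaultdict

# Hypothetical two-state example: the support of each (state, action) pair
# and a lower bound P_MIN on every positive transition probability are
# assumed known in advance, as in the paper's setting; the concrete
# values here are made up for illustration.
SUPPORT = {("s0", "a"): ["s0", "s1"], ("s1", "a"): ["s0", "s1"]}
P_MIN = 0.1

def estimate_transitions(samples):
    """Empirically estimate transition probabilities from (s, a, s') samples,
    clamping each estimate to at least P_MIN on the known support and
    renormalizing (an illustrative projection, not the paper's method)."""
    counts = defaultdict(int)
    for s, a, s_next in samples:
        counts[(s, a, s_next)] += 1
    est = {}
    for (s, a), succs in SUPPORT.items():
        total = sum(counts[(s, a, t)] for t in succs)
        if total == 0:
            # no observations yet: fall back to a uniform estimate
            est[(s, a)] = {t: 1.0 / len(succs) for t in succs}
            continue
        raw = {t: max(counts[(s, a, t)] / total, P_MIN) for t in succs}
        norm = sum(raw.values())
        est[(s, a)] = {t: p / norm for t, p in raw.items()}
    return est

# 10 observed transitions from ("s0", "a"): 7 to "s1", 3 to "s0"
samples = [("s0", "a", "s1")] * 7 + [("s0", "a", "s0")] * 3
est = estimate_transitions(samples)
```

The known support lets the learner avoid ever assigning probability to impossible transitions, while the lower bound keeps estimates for observed-support successors bounded away from zero.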
18 Citations

## Citations

• 2019 · Computer Science
We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) with unknown probabilistic transition
• CAV 2022 · Computer Science
This work provides the first algorithm to compute the mean payoff probably approximately correctly in unknown MDPs; it is further extended to unknown CTMDPs, and its practical nature is demonstrated by running experiments on standard benchmarks.
• NeurIPS 2019 · Computer Science, Mathematics
An algorithm is provided that achieves a state-of-the-art regret bound of $\tilde{O}(\sqrt{T})$ for large-scale MDPs with changing rewards, which to the best of the authors' knowledge is the first such result.
• ICAART 2021 · Computer Science
The results show that using Angluin's active learning algorithm to learn an MRM in a non-Markovian reward decision process is effective, and it is proved that the expected reward achieved will eventually be at least a given, reasonable value provided by a domain expert.
• UAI 2020 · Computer Science
It is shown that near optimality can be achieved almost surely, using an unintuitive gadget the authors call forgetfulness, and the approach is extended to a setting with partial knowledge of the system topology, introducing two optimality measures and providing near-optimal algorithms also for these cases.
• ArXiv 2020 · Computer Science
The approach is a careful combination of Angluin's L* active learning algorithm to learn finite automata, testing techniques for establishing conformance of the finite-model hypothesis, and optimisation techniques for computing optimal strategies in Markovian (immediate) reward MDPs.
• IEEE Transactions on Automatic Control 2022 · Computer Science
This article models the interaction between an RL agent and its potentially adversarial environment as a turn-based zero-sum stochastic game, and proposes a probably approximately correct (PAC) learning algorithm that learns such a strategy efficiently in an online manner with unknown reward functions and unknown transition distributions.
• FM 2021
We study the problem of finding optimal strategies in Markov decision processes with lexicographic ω-regular objectives, which are ordered collections of ordinary ω-regular objectives. The goal is
• 2018 · Computer Science
This paper introduces the concept of a probabilistic shield that enables decision-making to adhere to safety constraints with high probability and discusses tradeoffs between sufficient progress in exploration of the environment and ensuring safety.
• 2020 · Computer Science
The concept of a probabilistic shield that enables RL decision-making to adhere to safety constraints with high probability is introduced and used to realize a shield that restricts the agent from taking unsafe actions, while optimizing the performance objective.

## References

Showing 1–10 of 40 references

• AAAI 2017 · Economics
This work goes beyond both the “expectation” and “threshold” approaches and considers a “guaranteed payoff optimization (GPO)” problem for POMDPs, where the objective is to find a policy σ such that each possible outcome yields a discounted-sum payoff of at least t.
• MFCS 2011 · Economics
It is shown that the problem of deciding whether a state is almost-sure winning in energy parity MDPs is in NP ∩ coNP, while for mean-payoff parity MDPs the problem is solvable in polynomial time.
• Machine Learning 2004 · Computer Science
A convergence theorem is presented, proving that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
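The convergence condition quoted above (every action sampled repeatedly in every state) can be demonstrated with a minimal tabular Q-learning sketch. The two-state MDP, its rewards, and the hyperparameters below are made up for illustration; a uniformly random behavior policy guarantees that all state-action pairs keep being sampled.

```python
import random

STATES, ACTIONS = [0, 1], [0, 1]
GAMMA = 0.9  # discount factor

def step(state, action):
    # Toy deterministic dynamics: the action chooses the next state,
    # and landing in state 1 pays reward 1.
    next_state = action
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def q_learning(steps=20000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(steps):
        # Uniformly random behavior policy: every action keeps being
        # sampled in every state, matching the convergence condition.
        a = rng.choice(ACTIONS)
        s_next, r = step(s, a)
        # Standard Q-learning update toward the one-step bootstrap target.
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next
    return Q

Q = q_learning()
# With GAMMA = 0.9 the optimal action-values are Q(s, 1) = 10 and Q(s, 0) = 9.
```

Because the environment is deterministic and every pair is visited thousands of times, the learned table settles very close to the analytic fixed point, illustrating the theorem's almost-sure convergence.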
This paper introduces the class of prefix-independent and submixing payoff functions, and it is proved that any MDP equipped with such a payoff function admits pure stationary optimal strategies.
• 30th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS) 2015 · Computer Science
The multidimensional BAS threshold problem is solvable in P. This solves the infinite-memory threshold problem left open by Bruyère et al., and this complexity cannot be improved without improving the currently known complexity of classical mean-payoff games.
• ICALP 2017 · Computer Science
This work extends the framework of [BFRR14] and follow-up papers by addressing the case of $\omega$-regular conditions encoded as parity objectives, a natural way to represent functional requirements of systems, and establishes that, for all variants of this problem, deciding the existence of a strategy lies in ${\sf NP} \cap {\sf coNP}$.
• CONCUR 2016 · Computer Science
This work studies, for the first time, mean-payoff games in which the system aims at minimizing the expected cost against a probabilistic environment, while surely satisfying an $\omega$-regular condition against an adversarial environment.
• IJCAI 2016 · Computer Science
This work proposes a probably approximately correct (PAC) learning algorithm that solves a controller synthesis problem in turn-based stochastic games, with both a qualitative linear temporal logic constraint and a quantitative discounted-sum objective, in an online manner.
• ATVA 2016 · Computer Science
This work considers the problem of computing a safe strategy (i.e., a strategy that keeps the counter non-negative) which maximizes the expected mean payoff.
• ATVA 2014 · Computer Science
This paper provides efficient methods for computing reachability strategies that both ensure worst-case time bounds and provide (near-)minimal expected cost.