# Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints

@inproceedings{Ketnsk2018LearningBasedMO, title={Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints}, author={Jan Křet{\'i}nsk{\'y} and Guillermo A. P{\'e}rez and Jean-François Raskin}, booktitle={International Conference on Concurrency Theory}, year={2018} }

We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) with unknown probabilistic transition function and unknown reward function. Assuming the support of the unknown transition function and a lower bound on the minimal transition probability are known in advance, we show that in MDPs consisting of a single end component, two combinations of guarantees on the parity and mean-payoff objectives can…
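As a rough illustration of the central quantity (not the paper's learning algorithm), the mean-payoff value of a run is its long-run average reward per step; a minimal Python sketch over a hypothetical two-state toy MDP with known transition support:

```python
import random

# Hypothetical toy MDP (illustrative only, not from the paper):
# one action per state; transitions[s] lists (next_state, probability)
# pairs over the known support; rewards[s] is the payoff for leaving s.
transitions = {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.4), (1, 0.6)]}
rewards = {0: 1.0, 1: 4.0}

def mean_payoff(start, steps, rng):
    """Empirical mean payoff of a sampled run prefix: total reward / length."""
    s, total = start, 0.0
    for _ in range(steps):
        total += rewards[s]
        r, acc = rng.random(), 0.0
        for nxt, p in transitions[s]:
            acc += p
            if r < acc:
                s = nxt
                break
    return total / steps

print(mean_payoff(0, 100_000, random.Random(0)))
```

Since this chain has a single end component, the empirical average converges to the same value from every starting state; long sample prefixes land near the stationary mean payoff (16/7 ≈ 2.29 for the toy numbers above).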

## 18 Citations

### Learning-Based Mean-Payoff Optimization in Unknown Markov Decision Processes under Omega-Regular Constraints

- Computer Science
- 2019

We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) with unknown probabilistic transition…

### PAC Statistical Model Checking of Mean Payoff in Discrete- and Continuous-Time MDP

- Computer Science, CAV
- 2022

This work provides the first algorithm to compute the mean payoff probably approximately correctly in unknown MDPs; it is further extended to unknown CTMDPs, and its practical nature is demonstrated by running experiments on standard benchmarks.
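As a back-of-the-envelope illustration of the "probably approximately correct" guarantee (generic PAC arithmetic, not the paper's algorithm, which must also handle dependence along runs of the MDP), a Hoeffding-style sample bound for a quantity bounded in [0, 1]:

```python
import math

def pac_sample_size(eps, delta):
    """Hoeffding bound: with this many i.i.d. samples of a [0,1]-bounded
    quantity, the empirical mean is within eps of the true mean with
    probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# e.g. eps = 0.05, delta = 0.01 requires 1060 samples
print(pac_sample_size(0.05, 0.01))
```

Tightening either the accuracy `eps` or the confidence `delta` grows the required sample size, quadratically in `1/eps` and only logarithmically in `1/delta`.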

### Large Scale Markov Decision Processes with Changing Rewards

- Computer Science, Mathematics, NeurIPS
- 2019

An algorithm is provided that achieves a state-of-the-art regret bound of $\tilde{O}(\sqrt{T})$ for large-scale MDPs with changing rewards, which to the best of the authors' knowledge is the first such result.

### Online Learning of Non-Markovian Reward Models

- Computer Science, ICAART
- 2021

The results show that using Angluin's active learning algorithm to learn an MRM in a non-Markovian reward decision process is effective and it is proved that the expected reward achieved will eventually be at least as much as a given, reasonable value provided by a domain expert.

### Finite-Memory Near-Optimal Learning for Markov Decision Processes with Long-Run Average Reward

- Computer Science, UAI
- 2020

It is shown that near-optimality can be achieved almost surely, using an unintuitive gadget the authors call forgetfulness; the approach is extended to a setting with partial knowledge of the system topology, introducing two optimality measures and providing near-optimal algorithms for these cases as well.

### Learning Non-Markovian Reward Models in MDPs

- Computer Science, ArXiv
- 2020

The approach is a careful combination of Angluin's L* active learning algorithm to learn finite automata, testing techniques for establishing conformance of the finite-model hypothesis, and optimisation techniques for computing optimal strategies in Markovian (immediate-reward) MDPs.

### Probably Approximately Correct Learning in Adversarial Environments With Temporal Logic Specifications

- Computer Science, IEEE Transactions on Automatic Control
- 2022

This article models the interaction between an RL agent and its potentially adversarial environment as a turn-based zero-sum stochastic game, and proposes a probably approximately correct (PAC) learning algorithm that learns such a strategy efficiently in an online manner with unknown reward functions and unknown transition distributions.

### Model-Free Reinforcement Learning for Lexicographic Omega-Regular Objectives

- Business, FM
- 2021

We study the problem of finding optimal strategies in Markov decision processes with lexicographic ω-regular objectives, which are ordered collections of ordinary ω-regular objectives. The goal is…

### Safe Reinforcement Learning via Probabilistic Shields

- Computer Science
- 2018

This paper introduces the concept of a probabilistic shield that enables decision-making to adhere to safety constraints with high probability and discusses tradeoffs between sufficient progress in exploration of the environment and ensuring safety.

### Safe Reinforcement Learning Using Probabilistic Shields

- Computer Science
- 2020

The concept of a probabilistic shield that enables RL decision-making to adhere to safety constraints with high probability is introduced and used to realize a shield that restricts the agent from taking unsafe actions, while optimizing the performance objective.

## References

Showing 1-10 of 40 references

### Optimizing Expectation with Guarantees in POMDPs

- Economics, AAAI
- 2017

This work goes beyond both the “expectation” and “threshold” approaches and considers a “guaranteed payoff optimization (GPO)” problem for POMDPs, where the objective is to find a policy σ such that each possible outcome yields a discounted-sum payoff of at least t.

### Energy and Mean-Payoff Parity Markov Decision Processes

- Economics, MFCS
- 2011

It is shown that the problem of deciding whether a state is almost-sure winning in energy parity MDPs is in NP ∩ coNP, while for mean-payoff parity MDPs, the problem is solvable in polynomial time.

### Technical Note: Q-Learning

- Computer Science, Machine Learning
- 2004

A convergence theorem is presented, proving that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
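The classic tabular Q-learning update behind this convergence result can be sketched as follows (the two-state toy MDP is a hypothetical example, not from the paper; the epsilon-greedy policy keeps all state-action pairs repeatedly sampled, as the theorem requires):

```python
import random

def q_learning(transitions, rewards, n_states, n_actions, steps,
               alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning sketch. transitions[(s, a)] lists
    (next_state, probability) pairs; rewards[(s, a)] is the immediate reward."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        # epsilon-greedy: explore with probability eps, else act greedily
        if rng.random() < eps:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        nxt = rng.choices(*zip(*transitions[(s, a)]))[0]
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (rewards[(s, a)] + gamma * max(Q[nxt]) - Q[s][a])
        s = nxt
    return Q

# Hypothetical toy MDP: action 1 moves toward (and stays in) state 1,
# whose self-loop pays reward 1; everything else pays 0.
transitions = {(0, 0): [(0, 1.0)], (0, 1): [(1, 1.0)],
               (1, 0): [(0, 1.0)], (1, 1): [(1, 1.0)]}
rewards = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
Q = q_learning(transitions, rewards, n_states=2, n_actions=2, steps=20_000)
```

After enough steps the greedy policy read off from `Q` prefers action 1 in both states, i.e. reaching and staying in the rewarding state.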

### Pure Stationary Optimal Strategies in Markov Decision Processes

- Economics, STACS
- 2007

This paper introduces the class of prefix-independent and submixing payoff functions, and it is proved that any MDP equipped with such a payoff function admits pure stationary optimal strategies.

### Multidimensional beyond Worst-Case and Almost-Sure Problems for Mean-Payoff Objectives

- Computer Science, 2015 30th Annual ACM/IEEE Symposium on Logic in Computer Science
- 2015

The multidimensional BAS threshold problem is solvable in P. This solves the infinite-memory threshold problem left open by Bruyère et al., and this complexity cannot be improved without improving the currently known complexity of classical mean-payoff games.

### Threshold Constraints with Guarantees for Parity Objectives in Markov Decision Processes

- Computer Science, ICALP
- 2017

This work extends the framework of [BFRR14] and follow-up papers by addressing the case of $\omega$-regular conditions encoded as parity objectives, a natural way to represent functional requirements of systems, and establishes that, for all variants of this problem, deciding the existence of a strategy lies in ${\sf NP} \cap {\sf coNP}$.

### Minimizing Expected Cost Under Hard Boolean Constraints, with Applications to Quantitative Synthesis

- Computer Science, CONCUR
- 2016

This work studies, for the first time, mean-payoff games in which the system aims at minimizing the expected cost against a probabilistic environment, while surely satisfying an $\omega$-regular condition against an adversarial environment.

### Probably Approximately Correct Learning in Stochastic Games with Temporal Logic Specifications

- Computer Science, IJCAI
- 2016

This work proposes a probably approximately correct (PAC) learning algorithm that can learn a controller synthesis problem in turn-based stochastic games with both a qualitative linear temporal logic constraint and a quantitative discounted-sum objective in an online manner.

### Optimizing the Expected Mean Payoff in Energy Markov Decision Processes

- Computer Science, ATVA
- 2016

This work considers the problem of computing a safe strategy (i.e., a strategy that keeps the counter non-negative) which maximizes the expected mean payoff.

### On Time with Minimal Expected Cost!

- Computer Science, ATVA
- 2014

This paper provides efficient methods for computing reachability strategies that will both ensure worst case time-bounds as well as provide (near-) minimal expected cost.