# Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints

```bibtex
@inproceedings{Kretinsky2018LearningBasedMO,
  title     = {Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints},
  author    = {Jan Křetínský and Guillermo A. Pérez and Jean-François Raskin},
  booktitle = {International Conference on Concurrency Theory},
  year      = {2018}
}
```
• Published in International Conference on Concurrency Theory, 24 April 2018 · Computer Science
We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) with unknown probabilistic transition function and unknown reward function. Assuming the support of the unknown transition function and a lower bound on the minimal transition probability are known in advance, we show that in MDPs consisting of a single end component, two combinations of guarantees on the parity and mean-payoff objectives can…
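The abstract's key assumptions — the support of the unknown transition function and a lower bound on the minimal transition probability are known in advance — can be illustrated with a minimal sketch. The two-state example, the `SUPPORT` map, and the clamp-and-renormalize projection below are hypothetical illustrations of that setting, not the paper's algorithm.

```python
from collections import defaultdict

# Hypothetical two-state example: the support of each (state, action) pair
# and a lower bound P_MIN on every positive transition probability are
# assumed known in advance, as in the paper's setting; the concrete
# values here are made up for illustration.
SUPPORT = {("s0", "a"): ["s0", "s1"], ("s1", "a"): ["s0", "s1"]}
P_MIN = 0.1

def estimate_transitions(samples):
    """Empirically estimate transition probabilities from (s, a, s') samples,
    clamping each estimate to at least P_MIN on the known support and
    renormalizing (an illustrative projection, not the paper's method)."""
    counts = defaultdict(int)
    for s, a, s_next in samples:
        counts[(s, a, s_next)] += 1
    est = {}
    for (s, a), succs in SUPPORT.items():
        total = sum(counts[(s, a, t)] for t in succs)
        if total == 0:
            # no observations yet: fall back to a uniform estimate
            est[(s, a)] = {t: 1.0 / len(succs) for t in succs}
            continue
        raw = {t: max(counts[(s, a, t)] / total, P_MIN) for t in succs}
        norm = sum(raw.values())
        est[(s, a)] = {t: p / norm for t, p in raw.items()}
    return est

# 10 observed transitions from ("s0", "a"): 7 to "s1", 3 to "s0"
samples = [("s0", "a", "s1")] * 7 + [("s0", "a", "s0")] * 3
est = estimate_transitions(samples)
```

The known support lets the learner avoid ever assigning probability to impossible transitions, while the lower bound keeps estimates for observed-support successors bounded away from zero.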
18 Citations

## Citations

• 2019 · Computer Science
We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) with unknown probabilistic transition
• CAV 2022 · Computer Science
This work provides the first algorithm to compute the mean payoff probably approximately correctly in unknown MDPs; it is further extended to unknown CTMDPs, and its practical nature is demonstrated by running experiments on standard benchmarks.
• NeurIPS 2019 · Computer Science, Mathematics
An algorithm is provided that achieves a state-of-the-art regret bound of $\tilde{O}(\sqrt{T})$ for large-scale MDPs with changing rewards, which to the best of the authors' knowledge is the first such result.
• ICAART 2021 · Computer Science
The results show that using Angluin's active learning algorithm to learn an MRM in a non-Markovian reward decision process is effective, and it is proved that the expected reward achieved will eventually be at least a given, reasonable value provided by a domain expert.
• UAI 2020 · Computer Science
It is shown that near optimality can be achieved almost surely, using an unintuitive gadget the authors call forgetfulness, and the approach is extended to a setting with partial knowledge of the system topology, introducing two optimality measures and providing near-optimal algorithms also for these cases.
• ArXiv 2020 · Computer Science
The approach is a careful combination of Angluin's L* active learning algorithm to learn finite automata, testing techniques for establishing conformance of the finite-model hypothesis, and optimisation techniques for computing optimal strategies in Markovian (immediate) reward MDPs.
• IEEE Transactions on Automatic Control 2022 · Computer Science
This article models the interaction between an RL agent and its potentially adversarial environment as a turn-based zero-sum stochastic game, and proposes a probably approximately correct (PAC) learning algorithm that learns such a strategy efficiently in an online manner with unknown reward functions and unknown transition distributions.
• FM 2021
We study the problem of finding optimal strategies in Markov decision processes with lexicographic ω-regular objectives, which are ordered collections of ordinary ω-regular objectives. The goal is
• 2018 · Computer Science
This paper introduces the concept of a probabilistic shield that enables decision-making to adhere to safety constraints with high probability and discusses tradeoffs between sufficient progress in exploration of the environment and ensuring safety.
• 2020 · Computer Science
The concept of a probabilistic shield that enables RL decision-making to adhere to safety constraints with high probability is introduced and used to realize a shield that restricts the agent from taking unsafe actions, while optimizing the performance objective.

## References

Showing 1–10 of 40 references

• AAAI 2017 · Economics
This work goes beyond both the “expectation” and “threshold” approaches and considers a “guaranteed payoff optimization (GPO)” problem for POMDPs, where the objective is to find a policy σ such that each possible outcome yields a discounted-sum payoff of at least t.
• MFCS 2011 · Economics
It is shown that the problem of deciding whether a state is almost-sure winning in energy parity MDPs is in NP ∩ coNP, while for mean-payoff parity MDPs the problem is solvable in polynomial time.
• Machine Learning 2004 · Computer Science
A convergence theorem is presented, proving that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
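The convergence condition quoted above (every action sampled repeatedly in every state) can be demonstrated with a minimal tabular Q-learning sketch. The two-state MDP, its rewards, and the hyperparameters below are made up for illustration; a uniformly random behavior policy guarantees that all state-action pairs keep being sampled.

```python
import random

STATES, ACTIONS = [0, 1], [0, 1]
GAMMA = 0.9  # discount factor

def step(state, action):
    # Toy deterministic dynamics: the action chooses the next state,
    # and landing in state 1 pays reward 1.
    next_state = action
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def q_learning(steps=20000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(steps):
        # Uniformly random behavior policy: every action keeps being
        # sampled in every state, matching the convergence condition.
        a = rng.choice(ACTIONS)
        s_next, r = step(s, a)
        # Standard Q-learning update toward the one-step bootstrap target.
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next
    return Q

Q = q_learning()
# With GAMMA = 0.9 the optimal action-values are Q(s, 1) = 10 and Q(s, 0) = 9.
```

Because the environment is deterministic and every pair is visited thousands of times, the learned table settles very close to the analytic fixed point, illustrating the theorem's almost-sure convergence.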
This paper introduces the class of prefix-independent and submixing payoff functions, and it is proved that any MDP equipped with such a payoff function admits pure stationary optimal strategies.
• 30th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS) 2015 · Computer Science
The multidimensional BAS threshold problem is solvable in P. This solves the infinite-memory threshold problem left open by Bruyère et al., and this complexity cannot be improved without improving the currently known complexity of classical mean-payoff games.
• ICALP 2017 · Computer Science
This work extends the framework of [BFRR14] and follow-up papers by addressing the case of $\omega$-regular conditions encoded as parity objectives, a natural way to represent functional requirements of systems, and establishes that, for all variants of this problem, deciding the existence of a strategy lies in ${\sf NP} \cap {\sf coNP}$.
• CONCUR 2016 · Computer Science
This work studies, for the first time, mean-payoff games in which the system aims at minimizing the expected cost against a probabilistic environment, while surely satisfying an $\omega$-regular condition against an adversarial environment.
• IJCAI 2016 · Computer Science
This work proposes a probably approximately correct (PAC) learning algorithm that solves a controller synthesis problem in turn-based stochastic games, with both a qualitative linear temporal logic constraint and a quantitative discounted-sum objective, in an online manner.
• ATVA 2016 · Computer Science
This work considers the problem of computing a safe strategy (i.e., a strategy that keeps the counter non-negative) which maximizes the expected mean payoff.
• ATVA 2014 · Computer Science
This paper provides efficient methods for computing reachability strategies that both ensure worst-case time bounds and provide (near-)minimal expected cost.