Achieving Tractable Minimax Optimal Regret in Average Reward MDPs
- Victor Boone, Zihan Zhang
- 3 June 2024
Computer Science, Mathematics
This paper presents the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$. The algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently.
Logarithmic Regret of Exploration in Average Reward Markov Decision Processes
- Victor Boone, Bruno Gaujal
- 10 February 2025
Computer Science, Mathematics
Bad episodes are managed much better under (VM) than under (DT): the regret of exploration becomes logarithmic rather than linear. This improvement is made possible by a new in-depth understanding of the contrasting behaviors of confidence regions during good and bad episodes.
The regret lower bound for communicating Markov Decision Processes
- Victor Boone, Odalric-Ambrym Maillard
- 22 January 2025
Computer Science
This paper proves that the regret lower bound becomes significantly more complex in communicating MDPs, revisits the necessary explorative behavior of consistent learning agents, and explains that all optimal regions of the environment must be overvisited compared to sub-optimal ones.