# Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

@inproceedings{Dann2017UnifyingPA, title={Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning}, author={Christoph Dann and Tor Lattimore and Emma Brunskill}, booktitle={NIPS}, year={2017} }

Statistical performance bounds for reinforcement learning (RL) algorithms can be critical for high-stakes applications like healthcare. This paper introduces a new framework for theoretically measuring the performance of such algorithms called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework. In contrast to the PAC framework, the uniform version may be used to derive high probability regret guarantees and so forms a bridge between the two…
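To make the contrast between the two frameworks concrete, the definitions can be sketched roughly as follows. The notation here ($\Delta_k$, $N_\varepsilon$, $F_{\mathrm{PAC}}$, $F_{\mathrm{UPAC}}$) is assumed for illustration and does not appear in the excerpt above; treat this as a sketch of the idea rather than the paper's exact statement.

```latex
% Assumed notation: \pi_k is the policy followed in episode k,
% \Delta_k is its optimality gap, and N_\varepsilon counts the
% episodes in which the policy is not \varepsilon-optimal.
\Delta_k = V^{\star} - V^{\pi_k}, \qquad
N_\varepsilon = \sum_{k=1}^{\infty} \mathbb{1}\{\Delta_k > \varepsilon\}

% PAC: \varepsilon is fixed in advance; the bound holds for that \varepsilon only.
\Pr\bigl( N_\varepsilon > F_{\mathrm{PAC}}(1/\varepsilon, \log(1/\delta)) \bigr) \le \delta

% Uniform-PAC: one high-probability event covers every \varepsilon > 0 at once.
\Pr\bigl( \exists\, \varepsilon > 0 :\;
    N_\varepsilon > F_{\mathrm{UPAC}}(1/\varepsilon, \log(1/\delta)) \bigr) \le \delta
```

Under this reading, the only change is where the quantifier over $\varepsilon$ sits. Because the uniform statement controls the number of mistakes at every accuracy level on the same event, the gaps $\Delta_k$ can be summed over episodes, which is what allows a high probability regret guarantee to be derived.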

#### 149 Citations

Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation

- Computer Science, Mathematics
- ArXiv
- 2021

The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making the proposed FLUTE algorithm superior to all existing algorithms with linear function approximation.

Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning

- Computer Science
- ArXiv
- 2021

It is shown that optimistic algorithms cannot achieve the information-theoretic lower bounds even in deterministic MDPs unless there is a unique optimal policy, and tighter upper regret bounds for optimistic algorithms are proved.

PAC Guarantees for Concurrent Reinforcement Learning with Restricted Communication

- Computer Science, Mathematics
- ArXiv
- 2019

This work develops model-free PAC performance guarantees for multiple concurrent MDPs, extending recent works where a single learner interacts with multiple non-interacting agents in a noise-free environment, and develops novel PAC guarantees in this extended setting.

$\gamma$-Regret for Non-Episodic Reinforcement Learning.

- Computer Science
- 2020

It is argued that if the total time budget is relatively limited compared to the complexity of the environment, such comparison may fail to reflect the finite-time optimality of the learner.

Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs

- Computer Science, Mathematics
- ICML
- 2018

It is found that a very minor variant of a recently proposed reinforcement learning algorithm for MDPs already matches the best possible regret bound $\tilde{O}(\sqrt{SAT})$ in the dominant term if deployed on a tabular contextual bandit problem, despite the agent being agnostic to such a setting.

Beyond No Regret: Instance-Dependent PAC Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2021

This work shows that there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate, and proposes a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP.

Private Reinforcement Learning with PAC and Regret Guarantees

- Computer Science, Mathematics
- ICML
- 2020

A private optimism-based learning algorithm is developed that simultaneously achieves strong PAC and regret bounds and enjoys a joint differential privacy (JDP) guarantee; lower bounds on sample complexity and regret for reinforcement learning subject to JDP are also presented.

Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

- Computer Science, Mathematics
- ICML
- 2019

An algorithm for finite-horizon discrete MDPs, with associated analysis, that both yields state-of-the-art worst-case regret bounds in the dominant terms and yields substantially tighter bounds if the RL environment has small environmental norm, which is a function of the variance of the next-state value functions.

Policy Certificates: Towards Accountable Reinforcement Learning

- Computer Science, Mathematics
- ICML
- 2019

One of the algorithms introduced is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and this algorithm also matches (and in some settings slightly improves upon) existing minimax regret bounds.

On Oracle-Efficient PAC Reinforcement Learning with Rich Observations

- Computer Science
- 2018

It is proved that the only known sample-efficient algorithm, Olive, cannot be implemented in the oracle model, and new sample-efficient algorithms are presented for environments with deterministic hidden state dynamics and stochastic rich observations.

#### References

SHOWING 1-10 OF 35 REFERENCES

Reinforcement Learning in Finite MDPs: PAC Analysis

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2009

The current state-of-the-art for near-optimal behavior in finite Markov Decision Processes with a polynomial number of samples is summarized by presenting bounds for the problem in a unified theoretical framework.

Near-optimal PAC bounds for discounted MDPs

- Computer Science, Mathematics
- Theor. Comput. Sci.
- 2014

A new bound is proved for a modified version of Upper Confidence Reinforcement Learning with only cubic dependence on the horizon, which is unimprovable in all parameters except the size of the state/action space, where it depends linearly on the number of non-zero transition probabilities.

REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs

- Computer Science, Mathematics
- UAI
- 2009

An algorithm is provided that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP) where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector.

Near-optimal Regret Bounds for Reinforcement Learning

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2008

This work presents a reinforcement learning algorithm with total regret $O(DS\sqrt{AT})$ after $T$ steps for any unknown MDP with $S$ states, $A$ actions per state, and diameter $D$, and proposes a new parameter: an MDP has diameter $D$ if for any pair of states $s, s'$ there is a policy which moves from $s$ to $s'$ in at most $D$ steps.

Near-Optimal Reinforcement Learning in Polynomial Time

- Computer Science, Mathematics
- Machine Learning
- 2004

New algorithms for reinforcement learning are presented and it is proved that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes.

Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning

- Computer Science, Mathematics
- NIPS
- 2015

The upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs which have a time-horizon dependency of at least $H^3$.

Model-based reinforcement learning with nearly tight exploration complexity bounds

- Mathematics, Computer Science
- ICML
- 2010

Mormax, a modified version of the Rmax algorithm, is shown to need to make at most O(N log N) exploratory steps, which matches the lower bound up to logarithmic factors, as well as the upper bound of the state-of-the-art model-free algorithm, while the new bound improves the dependence on other problem parameters.

Contextual Decision Processes with low Bellman rank are PAC-Learnable

- Computer Science, Mathematics
- ICML
- 2017

A complexity measure, the Bellman rank, is presented that enables tractable learning of near-optimal behavior in contextual decision processes (CDPs), is naturally small for many well-studied RL models, and provides new insights into efficient exploration for RL with function approximation.

Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

- Mathematics, Computer Science
- ICML
- 2017

A Bayesian expected regret bound for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm.

Using upper confidence bounds for online learning

- Computer Science
- Proceedings 41st Annual Symposium on Foundations of Computer Science
- 2000

It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation/exploration trade-off, and the results for the adversarial bandit problem are extended to shifting bandits.