Corpus ID: 7789895

Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

@inproceedings{Dann2017UnifyingPA,
  title={Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning},
  author={Christoph Dann and Tor Lattimore and Emma Brunskill},
  booktitle={NIPS},
  year={2017}
}
Statistical performance bounds for reinforcement learning (RL) algorithms can be critical for high-stakes applications like healthcare. This paper introduces a new framework for theoretically measuring the performance of such algorithms called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework. In contrast to the PAC framework, the uniform version may be used to derive high probability regret guarantees and so forms a bridge between the two…
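As a rough aid to readers of this page, here is a minimal sketch of the Uniform-PAC condition consistent with the abstract above; the notation ($\Delta_k$, $F_{\mathrm{UPAC}}$) is illustrative rather than quoted from the paper. Writing $\Delta_k = V^*_1(s^k_1) - V^{\pi_k}_1(s^k_1)$ for the suboptimality of the policy $\pi_k$ played in episode $k$, an algorithm is Uniform-PAC at confidence level $\delta$ if, on a single event of probability at least $1-\delta$, the number of $\epsilon$-suboptimal episodes is bounded simultaneously for every accuracy $\epsilon > 0$:

$$\Pr\left(\forall \epsilon > 0:\; \sum_{k=1}^{\infty} \mathbb{1}\{\Delta_k > \epsilon\} \le F_{\mathrm{UPAC}}(\epsilon, \delta)\right) \ge 1 - \delta.$$

A classical $(\epsilon,\delta)$-PAC bound fixes a single $\epsilon$ in advance, whereas one Uniform-PAC event yields PAC guarantees for all $\epsilon$ at once and, roughly when $F_{\mathrm{UPAC}}$ scales as $1/\epsilon^2$, a high-probability sublinear regret bound as well.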
Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation
The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature: it directly implies both PAC and high-probability regret bounds, making the proposed FLUTE algorithm superior to all existing algorithms with linear function approximation.
Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning
It is shown that optimistic algorithms cannot achieve the information-theoretic lower bounds, even in deterministic MDPs, unless there is a unique optimal policy, and tighter regret upper bounds for optimistic algorithms are proved.
PAC Guarantees for Concurrent Reinforcement Learning with Restricted Communication
This work develops model-free PAC performance guarantees for multiple concurrent MDPs, extending recent work in which a single learner interacts with multiple non-interacting agents in a noise-free environment, and establishes novel PAC guarantees in this extended setting.
$\gamma$-Regret for Non-Episodic Reinforcement Learning.
It is argued that if the total time budget is limited relative to the complexity of the environment, such a comparison may fail to reflect the finite-time optimality of the learner.
Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs
It is found that a very minor variant of a recently proposed reinforcement learning algorithm for MDPs already matches the best possible regret bound $\tilde O(\sqrt{SAT})$ in the dominant term when deployed on a tabular contextual bandit problem, despite the agent being agnostic to this setting.
Beyond No Regret: Instance-Dependent PAC Reinforcement Learning
This work shows that there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate, and proposes a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP.
Private Reinforcement Learning with PAC and Regret Guarantees
A private optimism-based learning algorithm is developed that simultaneously achieves strong PAC and regret bounds and enjoys a joint differential privacy (JDP) guarantee; lower bounds on sample complexity and regret for reinforcement learning subject to JDP are also presented.
Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds
An algorithm for finite-horizon discrete MDPs and an associated analysis are presented that yield state-of-the-art worst-case regret bounds in the dominant terms, and substantially tighter bounds when the RL environment has small environmental norm, which is a function of the variance of the next-state value functions.
Policy Certificates: Towards Accountable Reinforcement Learning
One of the algorithms introduced is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and it also matches (and in some settings slightly improves upon) existing minimax regret bounds.
On Oracle-Efficient PAC Reinforcement Learning with Rich Observations
It is proved that the only known sample-efficient algorithm, Olive, cannot be implemented in the oracle model, and new sample-efficient algorithms are presented for environments with deterministic hidden-state dynamics and stochastic rich observations.

References

Showing 1–10 of 35 references
Reinforcement Learning in Finite MDPs: PAC Analysis
The current state of the art for near-optimal behavior in finite Markov decision processes with a polynomial number of samples is summarized by presenting bounds for the problem in a unified theoretical framework.
Near-optimal PAC bounds for discounted MDPs
A new bound is proved for a modified version of Upper Confidence Reinforcement Learning with only cubic dependence on the horizon; the bound is unimprovable in all parameters except the size of the state/action space, where it depends linearly on the number of non-zero transition probabilities.
REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs
An algorithm is provided that achieves the optimal regret rate in an unknown weakly communicating Markov decision process (MDP), where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector.
Near-optimal Regret Bounds for Reinforcement Learning
This work presents a reinforcement learning algorithm with total regret $\tilde{O}(DS\sqrt{AT})$ after $T$ steps for any unknown MDP with $S$ states, $A$ actions per state, and diameter $D$, and proposes a new parameter: an MDP has diameter $D$ if for any pair of states $s, s'$ there is a policy which moves from $s$ to $s'$ in at most $D$ steps.
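As an aside for readers unfamiliar with this parameter: the diameter is usually defined through expected travel times rather than a hard step count; a standard formulation (paraphrased here, not quoted from this entry) is

$$D = \max_{s \neq s'} \; \min_{\pi} \; \mathbb{E}\left[T(s' \mid \pi, s)\right],$$

where $T(s' \mid \pi, s)$ denotes the number of steps policy $\pi$ needs to reach state $s'$ when started in state $s$.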
Near-Optimal Reinforcement Learning in Polynomial Time
New algorithms for reinforcement learning are presented, and it is proved that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes.
Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning
The upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs, which have a time-horizon dependency of at least $H^3$.
Model-based reinforcement learning with nearly tight exploration complexity bounds
Mormax, a modified version of the Rmax algorithm, is shown to make at most $O(N \log N)$ exploratory steps, which matches the lower bound up to logarithmic factors as well as the upper bound of the state-of-the-art model-free algorithm, while the new bound improves the dependence on other problem parameters.
Contextual Decision Processes with low Bellman rank are PAC-Learnable
A complexity measure, the Bellman rank, is presented that enables tractable learning of near-optimal behavior in CDPs; it is naturally small for many well-studied RL models and provides new insights into efficient exploration for RL with function approximation.
Why is Posterior Sampling Better than Optimism for Reinforcement Learning?
A Bayesian expected regret bound for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(HS\sqrt{AT})$ for any reinforcement learning algorithm.
Using upper confidence bounds for online learning (P. Auer, Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000)
It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation/exploration trade-off, and the results for the adversarial bandit problem are extended to shifting bandits.
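For context, the confidence-bound idea summarized here is the one behind UCB-style index policies. Stated from memory of the closely related Auer, Cesa-Bianchi, and Fischer line of work rather than from this particular reference, the canonical UCB1 rule plays at time $t$ the arm $i$ maximizing

$$\bar{x}_i + \sqrt{\frac{2 \ln t}{n_i}},$$

where $\bar{x}_i$ is the empirical mean reward of arm $i$ and $n_i$ is the number of times it has been played so far; the bonus term is a high-probability upper confidence bound on the unknown mean reward.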