Corpus ID: 235732202

Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning

Christoph Dann, Teodor Vanislavov Marinov, Mehryar Mohri, Julian Zimmert
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes. Compared to prior work, our bounds depend on alternative definitions of gaps. These definitions are based on the insight that, in order to achieve a favorable regret, an algorithm does not need to learn how to behave optimally in states that are not reached by an optimal policy. We prove tighter upper regret bounds for optimistic algorithms and accompany them with new…
1 Citation

Can Q-Learning be Improved with Advice?
This paper addresses the question of whether worst-case lower bounds for regret in online learning of Markov decision processes (MDPs) can be circumvented when information about the MDP, in the form of predictions about its optimal Q-value function, is given to the algorithm.


Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs
This paper establishes that optimistic algorithms attain gap-dependent and non-asymptotic logarithmic regret for episodic MDPs. In contrast to prior work, our bounds do not suffer a dependence on…
Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds
An algorithm for finite-horizon discrete MDPs, with an associated analysis, that yields state-of-the-art worst-case regret bounds in the dominant terms and substantially tighter bounds if the RL environment has a small environmental norm, which is a function of the variance of the next-state value functions.
Exploration in Structured Reinforcement Learning
DEL (Directed Exploration Learning), an algorithm that matches problem-specific regret lower bounds satisfied by any learning algorithm, is devised and simplified for Lipschitz MDPs, and it is shown that the simplified version is still able to efficiently exploit the structure.
Corruption Robust Exploration in Episodic Reinforcement Learning
This work provides the first sublinear regret guarantee which accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning, and derives results for both tabular and linear-function-approximation settings.
Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap
A new model-free algorithm for episodic finite-horizon Markov decision processes (MDPs), Adaptive Multi-step Bootstrap (AMB), which enjoys a stronger gap-dependent regret bound; the upper bound is complemented by a lower bound showing that the dependency on $|Z_{\mathrm{mul}}|/\Delta_{\min}$ is unavoidable for any consistent algorithm.
Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective
A family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds for contextual bandits is introduced, along with new oracle-efficient algorithms that adapt to the gap whenever possible while also attaining the minimax rate in the worst case.
Logarithmic Regret for Reinforcement Learning with Linear Function Approximation
It is shown that logarithmic regret is attainable under two recently proposed linear MDP assumptions provided that there exists a positive sub-optimality gap for the optimal action-value function.
Q-learning with Logarithmic Regret
This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve a logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a…
Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity
The open problem on agnostic $Q$-learning proposed in [Wen and Van Roy, NIPS 2013] is settled and the upper bound suggests that the sample complexity of $\widetilde{\Theta}\left(\rho/\sqrt{\mathrm{dim}_E}\right)$ is tight even in the agnostic setting.
Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition
This work develops the first algorithm with a "best-of-both-worlds" guarantee: it achieves $\mathcal{O}(\log T)$ regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with $\tilde{\mathcal{O}}(\sqrt{T})$ regret even when the losses are adversarial, where $T$ is the number of episodes.