On Lower Bounds for Regret in Reinforcement Learning


We consider the problem of learning to optimize an unknown MDP M* = (S, A, R*, P*), where S = {1, ..., S} is the state space and A = {1, ..., A} is the action space. In each timestep t = 1, 2, ... the agent observes a state s_t ∈ S, selects an action a_t ∈ A, receives a reward r_t ∼ R*(s_t, a_t) ∈ [0, 1], and transitions to a new state s_{t+1} ∼ P*(s_t, a_t). We define all random variables with respect to a probability space (Ω, F, P).
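The interaction protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything taken from the paper: the helper names (make_random_mdp, step, run_uniform_agent) are hypothetical, rewards are assumed to be Bernoulli with mean R*(s, a) so that they lie in [0, 1], states and actions are indexed from 0 rather than 1, and the agent simply picks actions uniformly at random.

```python
import random


def make_random_mdp(num_states, num_actions, rng):
    """Draw a random tabular MDP (an assumption for illustration only).

    Returns mean_reward[s][a] in [0, 1] and transition[s][a], a probability
    vector over next states.
    """
    mean_reward = [[rng.random() for _ in range(num_actions)]
                   for _ in range(num_states)]
    transition = []
    for _ in range(num_states):
        row = []
        for _ in range(num_actions):
            weights = [rng.random() + 1e-9 for _ in range(num_states)]
            total = sum(weights)
            row.append([w / total for w in weights])
        transition.append(row)
    return mean_reward, transition


def step(mean_reward, transition, state, action, rng):
    """Sample r_t ~ R*(s_t, a_t) and s_{t+1} ~ P*(s_t, a_t)."""
    # Bernoulli reward: one simple choice of distribution supported on [0, 1].
    reward = 1.0 if rng.random() < mean_reward[state][action] else 0.0
    next_state = rng.choices(range(len(transition)),
                             weights=transition[state][action])[0]
    return reward, next_state


def run_uniform_agent(num_states=3, num_actions=2, horizon=10, seed=0):
    """Run the observe / act / reward / transition loop for `horizon` steps."""
    rng = random.Random(seed)
    mean_reward, transition = make_random_mdp(num_states, num_actions, rng)
    state = 0  # arbitrary initial state
    trajectory = []
    for t in range(1, horizon + 1):
        action = rng.randrange(num_actions)  # placeholder policy: uniform
        reward, next_state = step(mean_reward, transition, state, action, rng)
        trajectory.append((t, state, action, reward, next_state))
        state = next_state
    return trajectory
```

Calling run_uniform_agent(horizon=5) returns a list of (t, s_t, a_t, r_t, s_{t+1}) tuples. A learning agent would replace the uniform action choice with a policy that depends on the history observed so far.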

1 Figure or Table

Cite this paper

@article{Osband2016OnLB, title={On Lower Bounds for Regret in Reinforcement Learning}, author={Ian Osband and Benjamin Van Roy}, journal={CoRR}, year={2016}, volume={abs/1608.02732} }