Near-optimal Regret Bounds for Reinforcement Learning
TLDR
We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D.
  • 720
  • 191
UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem
  • P. Auer, R. Ortner
  • Mathematics, Computer Science
  • Period. Math. Hung.
  • 1 October 2010
TLDR
We show an improved bound on the regret with respect to the optimal reward in the stochastic multi-armed bandit problem.
  • 203
  • 47
Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
TLDR
We present a learning algorithm for undiscounted reinforcement learning that achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
  • 164
  • 28
Improved Rates for the Stochastic Continuum-Armed Bandit Problem
TLDR
We propose an improvement of an algorithm of Kleinberg and a new set of conditions which give rise to improved rates for one-dimensional continuum-armed bandit problems.
  • 169
  • 22
A Boosting Approach to Multiple Instance Learning
TLDR
We present a boosting approach to multiple instance learning that was inspired by problems stemming from generic object recognition.
  • 82
  • 11
Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
TLDR
We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space.
  • 62
  • 8
Variational Regret Bounds for Reinforcement Learning
TLDR
We consider undiscounted reinforcement learning in Markov decision processes (MDPs) where both the reward functions and the state-transition probabilities may vary (gradually or abruptly) over time.
  • 21
  • 7
A Sliding-Window Algorithm for Markov Decision Processes with Arbitrarily Changing Rewards and Transitions
TLDR
We consider reinforcement learning in changing Markov Decision Processes where both the state-transition probabilities and the reward functions may vary over time.
  • 19
  • 7
Regret bounds for restless Markov bandits
TLDR
We consider the restless bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions.
  • 64
  • 6
Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning
TLDR
We introduce SCAL, an algorithm designed to perform efficient exploration-exploitation in any weakly-communicating Markov decision process (MDP) for which an upper bound $c$ on the span of the optimal bias function is known.
  • 52
  • 5