Bandit Based Monte-Carlo Planning
TLDR
We introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning.
  • 2,315
  • 464
Improved Algorithms for Linear Stochastic Bandits
TLDR
We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem.
  • 664
  • 249
Fast gradient-descent methods for temporal-difference learning with linear function approximation
TLDR
We introduce two new temporal-difference learning algorithms with better convergence rates, which can be used to extend linear TD to off-policy learning.
  • 456
  • 94
Algorithms for Reinforcement Learning
  • Csaba Szepesvari
  • Computer Science
  • Algorithms for Reinforcement Learning
  • 25 June 2010
TLDR
Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective.
  • 843
  • 75
Regret Bounds for the Adaptive Control of Linear Quadratic Systems
TLDR
We study the average cost Linear Quadratic (LQ) control problem with unknown model parameters and prove that, apart from logarithmic factors, its regret up to time T is O(√T).
  • 179
  • 58
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits
TLDR
This paper considers a variant of the basic algorithm for the stochastic multi-armed bandit problem that takes into account the empirical variance of the different arms.
  • 409
  • 52
Parametric Bandits: The Generalized Linear Case
TLDR
We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics.
  • 230
  • 51
Finite-Time Bounds for Fitted Value Iteration
TLDR
In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes.
  • 273
  • 48
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
TLDR
We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman-residual.
  • 253
  • 41
A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
TLDR
We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, target policy, and exciting behavior policy, without quadratic computational complexity.
  • 182
  • 40