Bandit Based Monte-Carlo Planning
A new algorithm, UCT, is introduced that applies bandit ideas to guide Monte-Carlo planning; it is shown to be consistent, and finite-sample bounds are derived on the estimation error due to sampling.
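The bandit idea behind UCT is the UCB1 rule applied at each node of the search tree: prefer the child with the best mean payoff plus an exploration bonus that shrinks as the child is visited more. A minimal sketch of that selection step (the function name and exploration constant `c` are illustrative, not from the paper):

```python
import math

def ucb1_select(counts, values, c=math.sqrt(2)):
    """Pick the arm/child maximizing mean + c * sqrt(ln(total)/n).

    counts[i] -- number of times arm i has been sampled
    values[i] -- empirical mean payoff of arm i
    """
    total = sum(counts)
    best, best_score = None, float("-inf")
    for i, (n, v) in enumerate(zip(counts, values)):
        if n == 0:
            return i  # sample every arm at least once
        score = v + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best
```

With equal empirical means, the less-visited arm wins because its exploration bonus is larger; this is what drives UCT to keep probing undersampled subtrees.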
Improved Algorithms for Linear Stochastic Bandits
A simple modification of Auer's UCB algorithm achieves constant regret with high probability and improves the regret bound by a logarithmic factor; experiments show a vast improvement in practice.
Fast gradient-descent methods for temporal-difference learning with linear function approximation
Two new related algorithms with better convergence rates are introduced; the first, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (though still not as fast as conventional TD).
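As a rough illustration of the gradient-TD idea, GTD2 maintains a second weight vector alongside the value weights and uses it to correct the TD update direction. The sketch below is a commonly cited form of the GTD2 update for linear function approximation; all names and step sizes are illustrative, not taken from the paper:

```python
import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward,
              gamma=0.99, alpha=0.01, beta=0.05):
    """One GTD2 update for linear value-function approximation.

    theta -- value-function weights; w -- auxiliary weights;
    phi, phi_next -- feature vectors of current/next state.
    Step sizes alpha and beta are illustrative choices.
    """
    delta = reward + gamma * phi_next @ theta - phi @ theta  # TD error
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w
```

The auxiliary vector `w` tracks an estimate related to the expected TD error per feature, which is what lets the main update follow the gradient of the alternative objective rather than the plain TD direction.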
Algorithms for Reinforcement Learning
- Csaba Szepesvari
- Computer Science
- 25 June 2010
This book focuses on those reinforcement-learning algorithms that build on the powerful theory of dynamic programming; it gives a fairly comprehensive catalog of learning problems, describes the core ideas, and discusses their theoretical properties and limitations.
Regret Bounds for the Adaptive Control of Linear Quadratic Systems
The construction of the confidence set is based on recent results from online least-squares estimation and leads to an improved worst-case regret bound for the proposed algorithm; this is the first time a regret bound has been derived for the LQ control problem.
Finite-Time Bounds for Fitted Value Iteration
A theoretical analysis is given of the performance of sampling-based fitted value iteration (FVI) for solving infinite state-space, discounted-reward Markovian decision processes (MDPs), under the assumption that a generative model of the environment is available.
Parametric Bandits: The Generalized Linear Case
The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space, and provides a tuning method based on asymptotic arguments, which leads to significantly better practical performance.
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
A finite-sample, high-probability bound is found on the performance of the computed policy; the bound depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept, the approximation power of the function set, and the controllability properties of the MDP.
- Sébastien Bubeck, R. Munos, Gilles Stoltz, Csaba Szepesvari
- Computer Science, Mathematics, J. Mach. Learn. Res.
- 25 January 2010
We consider a generalization of stochastic bandits where the set of arms, X, is allowed to be a generic measurable space and the mean-payoff function is "locally Lipschitz" with respect to a…