We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem.

We introduce two new temporal-difference learning algorithms with better convergence rates, which can be used to extend linear TD to off-policy learning.

Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective.

We study the average cost Linear Quadratic (LQ) control problem with unknown model parameters and prove that, apart from logarithmic factors, the regret up to time T is O(√T).

This paper considers a variant of the basic algorithm for the stochastic multi-armed bandit problem that takes into account the empirical variance of the different arms.
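
To make the idea of a variance-aware index concrete, here is a minimal Python sketch. It is not taken from the paper: the function names, the constants `zeta` and `3.0`, the `reward_range` parameter, and the Bernoulli reward simulation are all assumptions chosen for illustration, following one commonly cited form of such an index (empirical mean plus a variance-dependent bonus plus a range-dependent correction).

```python
import math
import random


def variance_aware_index(rewards, t, reward_range=1.0, zeta=1.2):
    """Upper-confidence index for one arm that uses its empirical variance.

    rewards: rewards observed so far for this arm (hypothetical data layout).
    t: current round, starting at 1.
    reward_range and zeta are illustrative constants, not values from the paper.
    """
    s = len(rewards)
    if s == 0:
        # An unplayed arm gets an infinite index so it is tried first.
        return math.inf
    mean = sum(rewards) / s
    var = sum((r - mean) ** 2 for r in rewards) / s
    explore = zeta * math.log(t)
    # Variance-dependent bonus plus a range-dependent correction term.
    return mean + math.sqrt(2.0 * var * explore / s) + 3.0 * reward_range * explore / s


def run_bandit(arm_means, horizon, seed=0):
    """Play Bernoulli arms for `horizon` rounds; return the pull count per arm."""
    rng = random.Random(seed)
    history = [[] for _ in arm_means]
    for t in range(1, horizon + 1):
        indices = [variance_aware_index(h, t) for h in history]
        arm = max(range(len(arm_means)), key=lambda a: indices[a])
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        history[arm].append(reward)
    return [len(h) for h in history]
```

Calling `run_bandit([0.2, 0.8], 200)` returns how often each arm was pulled. Under these assumptions, an arm whose observed rewards vary little receives a smaller exploration bonus than a plain range-based confidence bound would give it.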

In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes.

We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman-residual.

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, target policy, and exciting behavior policy, while avoiding quadratic computational complexity.