Finite-time Analysis of the Multiarmed Bandit Problem

Peter Auer, Nicolò Cesa-Bianchi, Paul Fischer. Machine Learning.
Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem.
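The paper summarized above introduced the UCB1 index policy, which addresses this dilemma by adding a confidence bonus to each arm's empirical mean. A minimal sketch, assuming arms are given as reward callables (the function name and interface below are illustrative, not the paper's own code):

```python
import math

def ucb1(reward_fns, horizon):
    """Sketch of the UCB1 policy: pull each arm once, then at every step
    pull the arm maximizing empirical mean + sqrt(2 ln t / n_j)."""
    k = len(reward_fns)
    counts = [0] * k     # n_j: number of pulls of arm j
    sums = [0.0] * k     # cumulative reward of arm j
    total_reward = 0.0
    for t in range(horizon):
        if t < k:
            arm = t  # initialization round: pull each arm once
        else:
            arm = max(range(k),
                      key=lambda j: sums[j] / counts[j]
                      + math.sqrt(2 * math.log(t + 1) / counts[j]))
        r = reward_fns[arm]()
        counts[arm] += 1
        sums[arm] += r
        total_reward += r
    return total_reward, counts
```

The bonus term shrinks as an arm is sampled more often, so suboptimal arms are still tried occasionally (logarithmically often) while the empirically best arm dominates play.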
An asymptotically optimal policy for finite support models in the multiarmed bandit problem
The minimum empirical divergence (MED) policy is proposed and an upper bound on the finite-time regret is derived which meets the asymptotic bound for the case of finite support models.
Risk-Averse Explore-Then-Commit Algorithms for Finite-Time Bandits
Using a new notion of finite-time exploitation regret, an upper bound of order ln(1/ϵ) is derived for the minimum number of experiments before commitment needed to guarantee a regret of at most ϵ.
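The commit-after-exploration scheme that this line of work analyzes can be sketched generically (this is a plain explore-then-commit loop under the assumption of reward callables, not the paper's risk-averse variant; `m` stands in for the exploration budget the bound controls):

```python
def explore_then_commit(reward_fns, m, horizon):
    """Pull each of the k arms m times, then commit to the arm with the
    highest empirical mean for all remaining rounds."""
    k = len(reward_fns)
    means = []
    rewards = 0.0
    for j in range(k):
        samples = [reward_fns[j]() for _ in range(m)]  # exploration phase
        rewards += sum(samples)
        means.append(sum(samples) / m)
    best = max(range(k), key=lambda j: means[j])       # commitment
    for _ in range(horizon - k * m):
        rewards += reward_fns[best]()
    return best, rewards
```

The quantity bounded in the entry above is essentially how large `m` must be so that the committed arm is within ϵ of optimal.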
Pure Exploration in Multi-armed Bandits Problems
The main result is that the required exploration-exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.
A Structured Multiarmed Bandit Problem and the Greedy Policy
In the infinite horizon discounted reward setting, it is shown that the greedy and optimal policies eventually coincide and both settle on the best arm; this is in contrast with the Incomplete Learning Theorem for the case of independent arms.
On the evolution of the expected gain of a greedy action in the bandit problem
The K-armed bandit problem is a well-known formalization of the exploration versus exploitation dilemma. In a K-armed bandit problem, a player is confronted with a gambling machine with K arms, where…
Multi-armed bandit problems with heavy-tailed reward distributions
  • K. Liu, Qing Zhao
  • Computer Science, Mathematics
  • 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
  • 2011
An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies and it is shown that when the moment-generating functions of the arm reward distributions are properly bounded, the optimal logarithmic order of the regret can be achieved by DSEE.
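The idea behind such deterministic sequencing can be sketched as follows: exploration slots are scheduled by a fixed (reward-independent) rule growing logarithmically in time, and all other slots exploit the empirical best arm. This is a simplified illustration under assumed reward callables, not the paper's exact epoch construction; the constant `c` is a placeholder for the rate the paper tunes:

```python
import math

def dsee(reward_fns, horizon, c=3.0):
    """Simplified deterministic exploration/exploitation sequencing:
    explore arms round-robin until the logarithmic budget c*k*ln(t) is
    spent, otherwise exploit the arm with the highest empirical mean."""
    k = len(reward_fns)
    counts = [0] * k
    sums = [0.0] * k
    explored = 0
    for t in range(1, horizon + 1):
        if explored < k or explored < c * k * math.log(t + 1):
            arm = explored % k  # deterministic round-robin exploration
            explored += 1
        else:
            arm = max(range(k), key=lambda j: sums[j] / counts[j])
        r = reward_fns[arm]()
        counts[arm] += 1
        sums[arm] += r
    return counts
```

Because the exploration schedule is fixed in advance, only O(log t) rounds are spent exploring, which is what makes the logarithmic regret order attainable when the reward tails are well behaved.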
Pure Exploration for Multi-Armed Bandit Problems
It is able to prove that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.
Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-Stationary Rewards
This paper fully characterizes the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward "variation" and the minimal achievable regret, and draws some connections between two rather disparate strands of literature.
A structured multiarmed bandit problem and the greedy policy
In the infinite horizon discounted reward setting, it is shown that both the greedy and optimal policies eventually coincide and settle on the best arm, in contrast with the Incomplete Learning Theorem for the case of independent arms.
A dynamic programming strategy to balance exploration and exploitation in the bandit problem
The concept of "expected reward of greedy actions", which is based on the notion of probability of correct selection (PCS), is used in an original semi-uniform algorithm which relies on the dynamic programming framework and on estimation techniques to optimally balance exploration and exploitation.


Gambling in a rigged casino: The adversarial multi-armed bandit problem
A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs is given.
We consider a non-Bayesian infinite horizon version of the multi-armed bandit problem with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on…
Q-Learning for Bandit Problems
  • M. Duff
  • Mathematics, Computer Science
  • ICML
  • 1995
This paper suggests utilizing task-state-specific Q-learning agents to solve their respective restart-in-state-$i$ subproblems, and includes an example in which the online reinforcement learning approach is applied to a simple problem of stochastic scheduling.
Multi-Armed bandit problem revisited
In this paper, we revisit aspects of the multi-armed bandit problem in the earlier work (Ref. 1). An alternative proof of the optimality of the Gittins index rule is derived under the discounted…
Optimal Adaptive Policies for Sequential Allocation Problems
Consider the problem of sequential sampling from m statistical populations to maximize the expected sum of outcomes in the long run. Under suitable assumptions on the unknown parameters, it is…
Nonparametric bandit methods
Bandits are a finite collection of random variables. Bandit problems are Markov decision problems in which, at each decision time, the decision maker selects a random variable (referred to as a…
Reinforcement Learning: An Introduction
This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Reinforcement Learning: A Survey
Central issues of reinforcement learning are discussed, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state.
Learning in embedded systems
This dissertation addresses the problem of designing algorithms for learning in embedded systems; its algorithms use Sutton's techniques for linear association and reinforcement comparison, while the interval estimation algorithm uses the statistical notion of confidence intervals to guide its generation of actions.
Learning to Act Using Real-Time Dynamic Programming
An algorithm based on dynamic programming, called Real-Time DP, is introduced, by which an embedded system can improve its performance with experience; it also illuminates aspects of other DP-based reinforcement learning methods such as Watkins' Q-Learning algorithm.