# Finite-time Analysis of the Multiarmed Bandit Problem

@article{Auer2002FinitetimeAO, title={Finite-time Analysis of the Multiarmed Bandit Problem}, author={Peter Auer and Nicol{\`o} Cesa-Bianchi and Paul Fischer}, journal={Machine Learning}, year={2002}, volume={47}, pages={235-256} }

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e., the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss due to the fact that the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi…
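To make the notion of regret concrete, the sketch below plays a simple upper-confidence-bound style index policy on Bernoulli arms and reports its expected regret, computed as the sum over arms of (best mean minus arm mean) times the number of pulls. The truncated abstract above does not spell out an algorithm, so the index form, the function names `ucb_index` and `run_bandit`, and the arm means are all illustrative assumptions rather than a reproduction of the paper's method.

```python
import math
import random


def ucb_index(mean, pulls, t):
    """Empirical mean plus a confidence bonus (illustrative UCB-style index)."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def run_bandit(arm_means, horizon, seed=0):
    """Play the index policy on Bernoulli arms; return (pulls, regret).

    Regret is the expected (pseudo-)regret: the sum over arms of
    (best mean - arm mean) * number of pulls of that arm.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k      # how often each arm was played
    totals = [0.0] * k   # summed rewards per arm

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play each arm once to initialise the estimates
        else:
            arm = max(
                range(k),
                key=lambda j: ucb_index(totals[j] / pulls[j], pulls[j], t),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward

    best = max(arm_means)
    regret = sum((best - mu) * n for mu, n in zip(arm_means, pulls))
    return pulls, regret


if __name__ == "__main__":
    # Hypothetical arm means chosen only for illustration.
    pulls, regret = run_bandit([0.2, 0.5, 0.8], horizon=5000)
    print(pulls, round(regret, 2))
```

A policy that always played the best arm would have zero regret under this definition; the exploration bonus is what spends a small number of pulls on the other arms.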

## 5,533 Citations

### Lenient Regret for Multi-Armed Bandits

- Computer Science
- AAAI
- 2021

A new, more lenient regret criterion is suggested that ignores suboptimality gaps smaller than some ε. A variant of the Thompson Sampling algorithm, called ε-TS, is presented and proved asymptotically optimal in terms of the lenient regret.

### An asymptotically optimal policy for finite support models in the multiarmed bandit problem

- Computer Science
- Machine Learning
- 2011

The minimum empirical divergence (MED) policy is proposed and an upper bound on the finite-time regret is derived which meets the asymptotic bound for the case of finite support models.

### Risk-Averse Explore-Then-Commit Algorithms for Finite-Time Bandits

- Computer Science
- 2019 IEEE 58th Conference on Decision and Control (CDC)
- 2019

Using a new notion of finite-time exploitation regret, an upper bound of order ln(1/ϵ) is found for the minimum number of experiments before commitment needed to guarantee an upper bound of ϵ on the regret.

### Pure Exploration in Multi-armed Bandits Problems

- Computer Science
- ALT
- 2009

The main result is that the required exploration-exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.

### On the evolution of the expected gain of a greedy action in the bandit problem

- Computer Science
- 2008

This paper gives an analytical definition of the expected gain of a greedy action, μg, studies its evolution over time, and confirms analytically and experimentally that exploiting before acquiring enough knowledge about the arms is bad practice.

### Multi-armed bandit problems with heavy-tailed reward distributions

- Computer Science
- 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
- 2011

An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies and it is shown that when the moment-generating functions of the arm reward distributions are properly bounded, the optimal logarithmic order of the regret can be achieved by DSEE.

### Pure Exploration for Multi-Armed Bandit Problems

- Computer Science
- ArXiv
- 2008

It is proved that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.

### Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-Stationary Rewards

- Computer Science
- Stochastic Systems
- 2019

This paper fully characterizes the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward "variation" and the minimal achievable regret, and draws some connections between two rather disparate strands of literature.

### A structured multiarmed bandit problem and the greedy policy

- Computer Science
- 2008 47th IEEE Conference on Decision and Control
- 2008

In the infinite horizon discounted reward setting, it is shown that both the greedy and optimal policies eventually coincide and settle on the best arm, in contrast with the Incomplete Learning Theorem for the case of independent arms.

### A dynamic programming strategy to balance exploration and exploitation in the bandit problem

- Computer Science
- Annals of Mathematics and Artificial Intelligence
- 2010

The concept of “expected reward of greedy actions” which is based on the notion of probability of correct selection (PCS) is used in an original semi-uniform algorithm which relies on the dynamic programming framework and on estimation techniques to optimally balance exploration and exploitation.

## References


### Gambling in a rigged casino: The adversarial multi-armed bandit problem

- Computer Science
- Proceedings of IEEE 36th Annual Foundations of Computer Science
- 1995

A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs is given.

### Sample mean based index policies with O(log n) regret for the multi-armed bandit problem

- Computer Science, Mathematics
- Advances in Applied Probability
- 1995

This paper constructs index policies that depend on the rewards from each arm only through their sample mean, and that achieve an O(log n) regret with a constant based on the Kullback–Leibler number.

### Nonparametric bandit methods

- Computer Science
- 1991

The motivation is “machine learning” in which a game-playing or assembly-line adjusting computer is faced with a sequence of statistically-similar decision problems and, as resource, has access to an expanding data base relevant to these problems.

### Reinforcement Learning: An Introduction

- Computer Science
- IEEE Transactions on Neural Networks
- 2005

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

### Reinforcement Learning: A Survey

- Psychology
- J. Artif. Intell. Res.
- 1996

Central issues of reinforcement learning are discussed, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state.

### Learning in embedded systems

- Computer Science
- 1993

This dissertation addresses the problem of designing algorithms for learning in embedded systems using Sutton's techniques for linear association and reinforcement comparison, while the interval estimation algorithm uses the statistical notion of confidence intervals to guide its generation of actions.

### Thermodynamical Approach to the Traveling Salesman Problem : An Efficient Simulation Algorithm

- Computer Science
- 2004

It is conjectured that the analogy with thermodynamics can offer new insight into optimization problems and can suggest efficient algorithms for solving them.

### Cooling Schedules for Optimal Annealing

- Mathematics
- Math. Oper. Res.
- 1988

A Monte Carlo optimization technique called “simulated annealing” is a descent algorithm modified by random ascent moves in order to escape local minima which are not global minima. The level of…