Finite-time Analysis of the Multiarmed Bandit Problem

  • Peter Auer, Nicolò Cesa-Bianchi, Paul Fischer
  • Machine Learning
Reinforcement learning policies face the exploration-versus-exploitation dilemma: the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem.
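The UCB1 index policy analyzed in this paper can be sketched as a short Bernoulli-bandit simulation; the arm means, horizon, and seed below are illustrative choices, not values from the paper:

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run the UCB1 index policy on a Bernoulli bandit.

    Returns the total reward collected and the pull count per arm.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # play each arm once to initialize
        else:
            # UCB1 index: sample mean + sqrt(2 ln t / n_j)
            arm = max(range(k),
                      key=lambda j: sums[j] / counts[j]
                      + math.sqrt(2 * math.log(t) / counts[j]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts

total, counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
```

Over a long horizon the best arm (mean 0.8 here) accumulates almost all of the pulls, reflecting the logarithmic regret guarantee the paper proves.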

Lenient Regret for Multi-Armed Bandits

A new, more lenient regret criterion is suggested that ignores suboptimality gaps smaller than some ε; a variant of the Thompson Sampling algorithm, called ε-TS, is presented and proved asymptotically optimal in terms of this lenient regret.
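Standard Thompson Sampling for Bernoulli arms, the baseline that ε-TS modifies, can be sketched as follows; the Beta(1, 1) priors and arm means are illustrative assumptions, not details from the paper:

```python
import random

def thompson_bernoulli(means, horizon, seed=0):
    """Thompson Sampling with Beta(1, 1) priors on Bernoulli arms.

    Returns the number of pulls of each arm.
    """
    rng = random.Random(seed)
    k = len(means)
    alpha = [1] * k   # successes + 1 (Beta shape parameter)
    beta = [1] * k    # failures + 1 (Beta shape parameter)
    pulls = [0] * k
    for _ in range(horizon):
        # draw one posterior sample per arm, play the arm with the largest sample
        samples = [rng.betavariate(alpha[j], beta[j]) for j in range(k)]
        arm = samples.index(max(samples))
        if rng.random() < means[arm]:
            alpha[arm] += 1
        else:
            beta[arm] += 1
        pulls[arm] += 1
    return pulls
```

ε-TS changes the arm-selection step so that gaps smaller than ε are not penalized; the posterior-sampling machinery above stays the same.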

An asymptotically optimal policy for finite support models in the multiarmed bandit problem

The minimum empirical divergence (MED) policy is proposed and an upper bound on the finite-time regret is derived which meets the asymptotic bound for the case of finite support models.

Risk-Averse Explore-Then-Commit Algorithms for Finite-Time Bandits

Using a new notion of finite-time exploitation regret, an upper bound of order ln(1/ε) is derived for the minimum number of experiments before commitment needed to guarantee a regret of at most ε.
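A plain explore-then-commit policy, the family these algorithms belong to, can be sketched as follows; the per-arm exploration budget `m` (which the bound above says need only grow like ln(1/ε)), the arm means, and the horizon are illustrative assumptions:

```python
import random

def explore_then_commit(means, m, horizon, seed=0):
    """Pull each Bernoulli arm m times, then commit to the empirically best arm.

    Returns the committed arm and the total reward over the horizon.
    """
    rng = random.Random(seed)
    k = len(means)
    draw = lambda j: 1.0 if rng.random() < means[j] else 0.0
    sums = [0.0] * k
    for j in range(k):              # exploration phase: m pulls per arm
        for _ in range(m):
            sums[j] += draw(j)
    best = max(range(k), key=lambda j: sums[j] / m)
    total = sum(sums)
    for _ in range(horizon - m * k):  # commitment phase
        total += draw(best)
    return best, total
```

With gaps this large, m = 100 pulls per arm identifies the best arm with overwhelming probability before committing.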

Pure Exploration in Multi-armed Bandits Problems

The main result is that the required exploration-exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.

On the evolution of the expected gain of a greedy action in the bandit problem

This paper gives an analytical definition of the expected gain of a greedy action, μg, studies its evolution over time, and confirms both analytically and experimentally that exploiting before acquiring enough knowledge of the arms is bad practice.

Multi-armed bandit problems with heavy-tailed reward distributions

  • Keqin Liu, Qing Zhao
  • Computer Science
    2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
  • 2011
An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies and it is shown that when the moment-generating functions of the arm reward distributions are properly bounded, the optimal logarithmic order of the regret can be achieved by DSEE.

Pure Exploration for Multi-Armed Bandit Problems

The paper proves that separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.

Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-Stationary Rewards

This paper fully characterizes the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward "variation" and the minimal achievable regret, and draws connections between two rather disparate strands of literature.

A structured multiarmed bandit problem and the greedy policy

In the infinite horizon discounted reward setting, it is shown that both the greedy and optimal policies eventually coincide and settle on the best arm, in contrast with the Incomplete Learning Theorem for the case of independent arms.

A dynamic programming strategy to balance exploration and exploitation in the bandit problem

The concept of “expected reward of greedy actions” which is based on the notion of probability of correct selection (PCS) is used in an original semi-uniform algorithm which relies on the dynamic programming framework and on estimation techniques to optimally balance exploration and exploitation.

Gambling in a rigged casino: The adversarial multi-armed bandit problem

A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs is given.

Sample mean based index policies by O(log n) regret for the multi-armed bandit problem

  • R. Agrawal
  • Computer Science, Mathematics
    Advances in Applied Probability
  • 1995
This paper constructs index policies that depend on the rewards from each arm only through their sample means and achieve an O(log n) regret with a constant based on the Kullback–Leibler number.

Q-Learning for Bandit Problems

Nonparametric bandit methods

The motivation is "machine learning," in which a game-playing or assembly-line-adjusting computer is faced with a sequence of statistically similar decision problems and, as a resource, has access to an expanding database relevant to these problems.

Reinforcement Learning: An Introduction

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

Reinforcement Learning: A Survey

Central issues of reinforcement learning are discussed, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state.

Learning in embedded systems

This dissertation addresses the problem of designing algorithms for learning in embedded systems. It builds on Sutton's techniques for linear association and reinforcement comparison, while its interval estimation algorithm uses the statistical notion of confidence intervals to guide its generation of actions.

Learning to Act Using Real-Time Dynamic Programming

Thermodynamical Approach to the Traveling Salesman Problem: An Efficient Simulation Algorithm

It is conjectured that the analogy with thermodynamics can offer new insight into optimization problems and suggest efficient algorithms for solving them.

Cooling Schedules for Optimal Annealing

A Monte Carlo optimization technique called "simulated annealing" is a descent algorithm modified by random ascent moves in order to escape local minima which are not global minima.
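The descent-with-random-ascent idea can be sketched with a geometric cooling schedule; the toy one-dimensional objective, step size, and cooling parameters below are illustrative assumptions, not the schedules analyzed in the paper:

```python
import math
import random

def anneal(f, x0, steps=5000, t0=1.0, alpha=0.999, seed=0):
    """Simulated annealing: always accept downhill moves; accept an uphill
    move of size delta with probability exp(-delta / T), cooling T geometrically."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    T = t0
    for _ in range(steps):
        cand = x + rng.uniform(-0.5, 0.5)   # random neighbor
        fc = f(cand)
        delta = fc - fx
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            x, fx = cand, fc                # accept the move
            if fx < best_f:
                best_x, best_f = x, fx
        T *= alpha                          # geometric cooling schedule
    return best_x, best_f

# multimodal toy objective: local minima, with the global minimum near x = -0.3
obj = lambda x: x * x + 2.0 * math.sin(5.0 * x) + 2.0
```

Starting far from the global basin, the random ascent moves let the search escape the shallow local minima that a pure descent method would get trapped in.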