
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond

  Aurélien Garivier and Olivier Cappé · Annual Conference on Computational Learning Theory (COLT)
This paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free index policy for stochastic bandit problems. We prove two distinct results: first, for arbitrary bounded rewards, the KL-UCB algorithm satisfies a uniformly better regret bound than UCB or UCB2; second, in the special case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins. Furthermore, we show that simple adaptations of the KL-UCB algorithm are also optimal for specific classes of… 
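As a concrete illustration (not taken from the paper itself), the Bernoulli KL-UCB index described above can be sketched as follows. The index of an arm is the largest mean q whose Kullback-Leibler divergence from the empirical mean stays within a log(t)-sized budget, found here by bisection; the function names and the choice of 50 bisection steps are illustrative:

```python
import math

def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), with clamping
    to avoid log(0) at the boundary."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean: float, pulls: int, t: int, c: float = 0.0) -> float:
    """Upper confidence bound for one arm: the largest q >= mean with
    pulls * kl(mean, q) <= log(t) + c * log(log(t)), found by bisection."""
    if pulls == 0:
        return 1.0  # unplayed arms get the maximal index
    bound = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / pulls
    lo, hi = mean, 1.0
    for _ in range(50):  # bisection; 50 steps is ample for double precision
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) > bound:
            hi = mid
        else:
            lo = mid
    return lo
```

At each round the policy plays the arm maximizing this index; because the index shrinks toward the empirical mean as the pull count grows, exploration tapers off at the rate the KL geometry dictates.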


Lipschitz Bandits: Regret Lower Bound and Optimal Algorithms

This approach is shown, through numerical experiments, to significantly outperform existing algorithms that directly deal with the continuous set of arms, and OSLB is proved to be asymptotically optimal: its asymptotic regret matches the lower bound.

On Upper-Confidence Bound Policies for Switching Bandit Problems

An upper bound for the expected regret is established by upper-bounding the expected number of times suboptimal arms are played, and it is shown that both the discounted UCB and the sliding-window UCB match the lower bound up to a logarithmic factor.
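A minimal sketch of the sliding-window idea mentioned above: each arm's statistics are computed only over the last few global rounds, so the policy can track a reward distribution that switches over time. This is an illustration, not the paper's exact algorithm; the class name and the parameters B and xi are illustrative:

```python
import math
from collections import deque

class SlidingWindowUCB:
    """Sliding-window UCB sketch: the index of an arm is its empirical
    mean over rewards observed in the last `window` global rounds, plus
    an exploration bonus shrinking in the in-window pull count."""

    def __init__(self, n_arms: int, window: int, B: float = 1.0, xi: float = 0.6):
        self.n_arms = n_arms
        self.window = window
        self.B = B
        self.xi = xi
        self.history: deque = deque()  # (arm, reward) pairs, newest last
        self.t = 0

    def select(self) -> int:
        best_arm, best_val = 0, -math.inf
        for arm in range(self.n_arms):
            rewards = [r for a, r in self.history if a == arm]
            if not rewards:
                return arm  # play every arm once before comparing indices
            mean = sum(rewards) / len(rewards)
            bonus = self.B * math.sqrt(
                self.xi * math.log(min(self.t, self.window)) / len(rewards)
            )
            if mean + bonus > best_val:
                best_arm, best_val = arm, mean + bonus
        return best_arm

    def update(self, arm: int, reward: float) -> None:
        self.t += 1
        self.history.append((arm, reward))
        if len(self.history) > self.window:
            self.history.popleft()  # forget observations outside the window
```

Dropping old observations is what lets the index react after a switch: evidence for a formerly good arm literally expires once it leaves the window.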

A Note on KL-UCB+ Policy for the Stochastic Bandit

This note demonstrates that a simple proof of the asymptotic optimality of the KL-UCB+ policy can be given by the same technique as those used for analyses of other known policies.

Adaptive KL-UCB based Bandit Algorithms for Markovian and i.i.d. Settings

A novel algorithm is introduced that identifies whether the rewards from each arm are truly Markovian or i.i.d. using a total-variation-distance-based test, and switches from a standard KL-UCB to a specialized version of KL-UCB when it determines that the arm's rewards are Markovian.

Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards

This paper modifies this policy and derives a finite-time regret bound for the new policy, Indexed Minimum Empirical Divergence (IMED), by refining large-deviation probabilities to a simple non-asymptotic form, and shows that IMED substantially improves on DMED and performs competitively with other state-of-the-art policies.

KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints

This self-contained contribution simultaneously presents state-of-the-art techniques for regret minimization in bandit models, and an elementary construction of non-asymptotic confidence bounds based on the empirical likelihood method for bounded distributions.

Kullback–Leibler upper confidence bounds for optimal sequential allocation

The main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively.

Hellinger KL-UCB based Bandit Algorithms for Markovian and i.i.d. Settings

This paper considers the problem of obtaining regret guarantees for MAB problems in which the rewards of each arm form a Markov chain which may not belong to a single parameter exponential family, and introduces a novel algorithm that identifies whether the rewards from each arm are truly Markovian or i.i.d. rewards.

A Closer Look at the Worst-case Behavior of Multi-armed Bandit Algorithms

It is shown that arm-sampling rates under UCB are asymptotically deterministic, regardless of the problem complexity, and the first complete process-level characterization of the MAB problem under UCB in the conventional diffusion scaling is provided.

Stochastic Rising Bandits

This paper designs one algorithm for the rested case and one for the restless case, providing a regret bound that depends on the properties of the instance and is, under certain circumstances, of order Õ(T^{2/3}).

An Asymptotically Optimal Bandit Algorithm for Bounded Support Models

The Deterministic Minimum Empirical Divergence (DMED) policy is proposed, and it is proved that DMED achieves the asymptotic bound; the index used in DMED for choosing an arm can be computed easily by a convex optimization technique.

Sample mean based index policies by O(log n) regret for the multi-armed bandit problem

  • R. Agrawal · Advances in Applied Probability, 1995
This paper constructs index policies that depend on the rewards from each arm only through their sample mean, and achieves a O(log n) regret with a constant that is based on the Kullback–Leibler number.

Regret Bounds and Minimax Policies under Partial Monitoring

The stochastic bandit game is considered, and it is proved that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.

Exploration-exploitation tradeoff using variance estimates in multi-armed bandits

Optimism in reinforcement learning and Kullback-Leibler divergence

It is proved that KL-UCRL provides the same guarantees as UCRL2 in terms of regret, however, numerical experiments on classical benchmarks show a significantly improved behavior, particularly when the MDP has reduced connectivity.

Finite-time Analysis of the Multiarmed Bandit Problem

This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
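The UCB1 policy from this finite-time analysis is simple enough to state in a few lines. The sketch below is illustrative (function names are not from the paper): the index is the empirical mean plus a bonus of sqrt(2 ln t / pulls), and the policy plays the arm with the largest index:

```python
import math

def ucb1_index(mean: float, pulls: int, t: int) -> float:
    """UCB1 index (Auer et al., 2002): empirical mean plus an
    exploration bonus that shrinks as the arm is played more."""
    if pulls == 0:
        return math.inf  # force each arm to be tried once
    return mean + math.sqrt(2.0 * math.log(t) / pulls)

def select_arm(means, pulls, t):
    """Pick the arm with the largest UCB1 index."""
    return max(range(len(means)), key=lambda i: ucb1_index(means[i], pulls[i], t))
```

For bounded rewards this bonus follows from Hoeffding's inequality; the KL-UCB index of the main paper replaces it with a tighter, divergence-based confidence region, which is why its constant improves on UCB1's.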

A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences

A Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports is considered, whose asymptotic regret matches the lower bound of Burnetas and Katehakis (1996).

Optimal Adaptive Policies for Markov Decision Processes

This paper gives the explicit form for a class of adaptive policies that possess optimal increase rate properties for the total expected finite horizon reward, under sufficient assumptions of finite state-action spaces and irreducibility of the transition law.

Bandit processes and dynamic allocation indices

The paper aims to give a unified account of the central concepts in recent work on bandit processes and dynamic allocation indices; to show how these reduce some previously intractable problems to…