Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits

Tianyuan Jin, Pan Xu, X. Xiao, Anima Anandkumar
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, Exponential, etc. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS, which simultaneously yields both the… 
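For orientation, the classic Beta-Bernoulli form of Thompson sampling can be sketched as below. This is not ExpTS: the abstract only says that ExpTS uses a novel sampling distribution, which is not reproduced here. The function name, the uniform Beta(1, 1) prior, and the example arm means are illustrative assumptions.

```python
import random

def thompson_bernoulli(true_means, horizon, seed=0):
    """Standard Beta-Bernoulli Thompson sampling (an assumed baseline, not ExpTS)."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k  # observed rewards equal to 1, per arm
    failures = [0] * k   # observed rewards equal to 0, per arm
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior (Beta(1, 1) prior).
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# Hypothetical two-armed instance: the second arm has the higher mean.
pulls = thompson_bernoulli([0.2, 0.8], horizon=2000)
```

Because each arm is sampled from its own posterior, an arm whose posterior has underestimated its mean can be played too rarely; the abstract says ExpTS's sampling distribution is designed to avoid this under-estimation of the optimal arm.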

Further Optimal Regret Bounds for Thompson Sampling

A novel regret analysis for Thompson Sampling is provided that proves the first near-optimal problem-independent bound of $O(\sqrt{NT \ln T})$ on the expected regret of this algorithm, and simultaneously provides the optimal problem-dependent bound.

Thompson Sampling for 1-Dimensional Exponential Family Bandits

This work proves asymptotic optimality of the Thompson Sampling algorithm with the Jeffreys prior. The proof uses closed forms for the Kullback-Leibler divergence and Fisher information available in an exponential family to give a finite-time exponential concentration inequality for posterior distributions on exponential families, which may be of interest in its own right.

Prior-free and prior-dependent regret bounds for Thompson Sampling

It is shown that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by $14\sqrt{nK}$, and that in the case of priors for the setting of Bubeck et al.

Thompson Sampling for Combinatorial Semi-Bandits

The first distribution-dependent regret bound of $O(mK_{\max}\log T / \Delta_{\min})$ is obtained, and it is shown that one cannot directly replace the exact offline oracle with an approximation oracle in the TS algorithm, even for the classical MAB problem.

Linear Thompson Sampling Revisited

Thompson sampling can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach.

MOTS: Minimax Optimal Thompson Sampling

MOTS, a variant of Thompson sampling that adaptively clips the sampling result of the chosen arm at each time step, is the first Thompson sampling type algorithm that achieves minimax optimality for multi-armed bandit problems.

Doubly Robust Thompson Sampling for linear payoffs

A novel multi-armed contextual bandit algorithm is proposed that applies the doubly robust estimator from the missing-data literature to Thompson Sampling with contexts (LinTS), improving the bound of LinTS by a factor of $\sqrt{d}$.

KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints

This self-contained contribution simultaneously presents state-of-the-art techniques for regret minimization in bandit models, and an elementary construction of non-asymptotic confidence bounds based on the empirical likelihood method for bounded distributions.

Finite-time Analysis of the Multiarmed Bandit Problem

This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond

Two results are proved: first, for arbitrary bounded rewards, the KL-UCB algorithm satisfies a uniformly better regret bound than UCB or UCB2; second, in the special case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins.
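As a rough illustration of the comparison with UCB, the sketch below computes a KL-UCB index for Bernoulli rewards by bisection and sets it beside the UCB1 index. It assumes the usual form of the index, the largest q with n · kl(mean, q) ≤ log t + c · log log t. The function names, the default c = 0, and the clamping of log log t for small t are choices made here, not details taken from the summary above.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, n, t, c=0.0, iters=50):
    """Largest q in [mean, 1] with n * kl(mean, q) <= log t + c * log log t."""
    # Clamp log t at 1 before the second log so the term is never negative (assumption).
    budget = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / n
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        # kl(mean, q) is increasing in q on [mean, 1], so bisection applies.
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def ucb1_index(mean, n, t):
    """UCB1 index: empirical mean plus sqrt(2 ln t / n)."""
    return mean + math.sqrt(2 * math.log(t) / n)

# Hypothetical arm: empirical mean 0.5 after 10 pulls, at round 100.
kl_idx = kl_ucb_index(0.5, 10, 100)
ucb_idx = ucb1_index(0.5, 10, 100)
```

With c = 0, Pinsker's inequality implies the KL-UCB index never exceeds the UCB1 index for the same statistics, which matches the intuition behind the tighter bound.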