Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits
@article{Jin2022FiniteTimeRO,
  title   = {Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits},
  author  = {Tianyuan Jin and Pan Xu and X. Xiao and Anima Anandkumar},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2206.03520}
}
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution comes from a one-dimensional exponential family, a class that covers many common reward distributions, including Bernoulli, Gaussian, Gamma, and Exponential. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS, which simultaneously yields both the…
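The abstract does not spell out ExpTS's sampling distribution, but the skeleton it modifies is standard Thompson sampling. Below is a minimal sketch of vanilla TS for the Bernoulli member of the family, assuming Beta-Bernoulli conjugacy and illustrative arm means; ExpTS would replace the posterior-sampling line with its novel distribution.

```python
# Minimal sketch of vanilla Thompson sampling for Bernoulli bandits with
# Beta(1, 1) priors. ExpTS would replace the posterior-sampling line with its
# novel sampling distribution; the arm means below are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])        # true (unknown) arm means
K, T = len(mu), 10_000
alpha, beta = np.ones(K), np.ones(K)  # Beta posterior parameters

regret = 0.0
for t in range(T):
    theta = rng.beta(alpha, beta)     # one posterior sample per arm
    a = int(np.argmax(theta))         # play the arm with the largest sample
    r = float(rng.random() < mu[a])   # Bernoulli reward
    alpha[a] += r                     # conjugate posterior update
    beta[a] += 1.0 - r
    regret += mu.max() - mu[a]

print(f"cumulative pseudo-regret after {T} rounds: {regret:.1f}")
```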
References (showing 1-10 of 35)
Further Optimal Regret Bounds for Thompson Sampling
- AISTATS, 2013
A novel regret analysis for Thompson Sampling is provided that proves the first near-optimal problem-independent bound of $O(\sqrt{NT \ln T})$ on the expected regret of this algorithm, and simultaneously provides the optimal problem-dependent bound.
Thompson Sampling for 1-Dimensional Exponential Family Bandits
- NIPS, 2013
This work proves asymptotic optimality of the Thompson Sampling algorithm with the Jeffreys prior, using the closed forms for the Kullback-Leibler divergence and Fisher information available in an exponential family, and gives a finite-time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right.
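To make those closed forms concrete, here is a small sketch for the Bernoulli case, assuming the standard Jeffreys prior Beta(1/2, 1/2); the observation counts are illustrative, not taken from the paper.

```python
# Sketch: posterior sampling under the Jeffreys prior Beta(1/2, 1/2) for
# Bernoulli rewards, plus the closed-form Bernoulli KL divergence used in
# the analysis. The observation counts here are illustrative only.
import numpy as np

def kl_bernoulli(p: float, q: float, eps: float = 1e-12) -> float:
    """KL(Ber(p) || Ber(q)) in closed form."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

rng = np.random.default_rng(1)
successes, pulls = 7, 10
# Jeffreys posterior after observing `successes` ones in `pulls` draws:
theta = rng.beta(0.5 + successes, 0.5 + (pulls - successes))
print(theta, kl_bernoulli(0.7, 0.5))
```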
Prior-free and prior-dependent regret bounds for Thompson Sampling
- 48th Annual Conference on Information Sciences and Systems (CISS), 2014
It is shown that Thompson Sampling attains an optimal prior-free bound, in the sense that for any prior distribution its Bayesian regret is bounded from above by $14\sqrt{nK}$; prior-dependent bounds are also derived in the setting of Bubeck et al.
Thompson Sampling for Combinatorial Semi-Bandits
- ICML, 2018
The first distribution-dependent regret bound of $O(mK_{\max}\log T / \Delta_{\min})$ is obtained, and it is shown that one cannot directly replace the exact offline oracle with an approximation oracle in the TS algorithm, even for the classical MAB problem.
Linear Thompson Sampling Revisited
- AISTATS, 2017
Thompson sampling can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach.
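A minimal sketch of that view, assuming the common LinTS parameterization (design matrix `A`, response vector `b`, inflation factor `v` of order $\sqrt{d}$, all names mine rather than the paper's): inflating the posterior covariance buys a fixed probability of optimism, and is also where the extra $\sqrt{d}$ regret factor enters.

```python
# Sketch of one round of linear TS with an inflated posterior covariance.
# Names (A, b, v) follow common LinTS notation, not the paper's code; the
# inflation factor v ~ sqrt(d) gives a fixed probability of optimism.
import numpy as np

rng = np.random.default_rng(2)
d = 5
A = np.eye(d)                          # regularized design matrix X^T X + I
b = np.zeros(d)                        # accumulated feature-weighted rewards
v = np.sqrt(d)                         # inflation factor, order sqrt(d)

theta_hat = np.linalg.solve(A, b)      # ridge estimate of the parameter
cov = v ** 2 * np.linalg.inv(A)        # inflated posterior covariance
theta_tilde = rng.multivariate_normal(theta_hat, cov)  # perturbed sample

contexts = rng.standard_normal((10, d))        # candidate arm features
arm = int(np.argmax(contexts @ theta_tilde))   # greedy w.r.t. the sample
```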
MOTS: Minimax Optimal Thompson Sampling
- ICML, 2021
This work proposes MOTS, a variant of Thompson sampling that adaptively clips the sampling result of the chosen arm at each time step; MOTS is the first Thompson sampling type algorithm that achieves minimax optimality for multi-armed bandit problems.
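A rough sketch of the clipping idea for a single arm, assuming Gaussian Thompson samples with variance $1/n$; the exact threshold and constants in MOTS may differ from the illustrative ones used here.

```python
# Rough sketch of MOTS-style clipped sampling for one arm: draw a Gaussian
# Thompson sample, then clip it from above at a confidence-radius threshold.
# The threshold form and the constant alpha are illustrative approximations.
import numpy as np

def mots_sample(mu_hat, n, T, K, alpha=4.0, rng=np.random.default_rng()):
    theta = rng.normal(mu_hat, np.sqrt(1.0 / n))   # plain Gaussian TS sample
    log_plus = max(np.log(T / (K * n)), 0.0)       # truncated logarithm
    tau = mu_hat + np.sqrt(alpha / n * log_plus)   # clipping threshold
    return min(theta, tau)                         # clip the sample from above

print(mots_sample(mu_hat=0.5, n=10, T=10_000, K=3))
```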
Doubly Robust Thompson Sampling for linear payoffs
- NeurIPS, 2021
A novel contextual multi-armed bandit algorithm is proposed that applies the doubly robust estimator from the missing-data literature to Thompson Sampling with linear contextual payoffs (LinTS), improving the regret bound of LinTS by a factor of $\sqrt{d}$.
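A hedged sketch of the doubly robust pseudo-reward the summary alludes to, under a generic contextual-bandit interface; the function and argument names are hypothetical, not the paper's API.

```python
# Sketch of a doubly robust (DR) pseudo-reward for one contextual round: a
# model prediction for every arm, plus an importance-weighted correction on
# the arm actually played. The estimate is unbiased if either the reward
# model or the propensity is correct. All names here are hypothetical.
import numpy as np

def dr_pseudo_rewards(played_arm, reward, f_hat, propensity):
    """f_hat: predicted rewards for all arms; propensity: P(playing played_arm)."""
    pseudo = f_hat.copy()                          # model term for every arm
    pseudo[played_arm] += (reward - f_hat[played_arm]) / propensity
    return pseudo

# Example: arm 1 played with probability 0.4, observed reward 1.0
print(dr_pseudo_rewards(1, 1.0, np.array([0.2, 0.5, 0.3]), 0.4))
```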
KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints
- arXiv, 2018
This self-contained contribution simultaneously presents state-of-the-art techniques for regret minimization in bandit models, and an elementary construction of non-asymptotic confidence bounds based on the empirical likelihood method for bounded distributions.
Finite-time Analysis of the Multiarmed Bandit Problem
- Machine Learning, 2002
This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, for all reward distributions with bounded support.
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
- COLT, 2011
It is proved that, first, for arbitrary bounded rewards the KL-UCB algorithm satisfies a uniformly better regret bound than UCB or UCB2, and second, in the special case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins.
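For concreteness, a sketch of the Bernoulli KL-UCB index computed by bisection: the largest mean $q$ whose divergence from the empirical mean fits within the exploration budget $\log(t)/n$ (lower-order terms omitted); illustrative, not the authors' reference implementation.

```python
# Sketch of the Bernoulli KL-UCB index: the largest q >= mu_hat such that
# n * KL(mu_hat, q) <= log(t), found by bisection. Lower-order terms from
# the paper (e.g., log log t) are omitted for brevity.
import math

def kl(p, q, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(mu_hat, n, t, iters=50):
    budget = math.log(max(t, 2)) / n      # exploration budget per pull
    lo, hi = mu_hat, 1.0
    for _ in range(iters):                # q -> KL(mu_hat, q) is increasing
        mid = (lo + hi) / 2.0
        if kl(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(mu_hat=0.5, n=10, t=1000))  # upper confidence value
```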