# Thompson sampling with the online bootstrap

@article{Eckles2014ThompsonSW,
  title   = {Thompson sampling with the online bootstrap},
  author  = {Dean Eckles and Maurits Kaptein},
  journal = {ArXiv},
  year    = {2014},
  volume  = {abs/1410.4009}
}

Thompson sampling provides a solution to bandit problems in which new observations are allocated to arms with the posterior probability that an arm is optimal. While sometimes easy to implement and asymptotically optimal, Thompson sampling can be computationally demanding in large-scale bandit problems, and its performance is dependent on the model fit to the observed data. We introduce bootstrap Thompson sampling (BTS), a heuristic method for solving bandit problems which modifies Thompson…
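The abstract's core idea can be sketched in a few lines: keep several bootstrap replicates of each arm's reward estimate, update each replicate online with a random weight, and act on a uniformly drawn replicate in place of a posterior draw. A minimal sketch for Bernoulli rewards, assuming a double-or-nothing online bootstrap; the class name, replicate count, and pseudo-count prior are illustrative, not the paper's reference implementation:

```python
import random

class BTSArm:
    """One arm's online bootstrap: J replicate mean-estimates of the reward."""
    def __init__(self, n_replicates=100):
        self.counts = [1] * n_replicates   # pseudo-count prior to avoid 0/0
        self.sums = [0.5] * n_replicates   # neutral pseudo-reward

    def update(self, reward):
        # Double-or-nothing online bootstrap: each replicate counts the
        # observation twice (weight 2) with probability 1/2, else skips it.
        for j in range(len(self.counts)):
            if random.random() < 0.5:
                self.counts[j] += 2
                self.sums[j] += 2 * reward

    def sample(self):
        # Draw one replicate uniformly -- the bootstrap analogue of a
        # posterior draw in Thompson sampling.
        j = random.randrange(len(self.counts))
        return self.sums[j] / self.counts[j]

def bts_choose(arms):
    return max(range(len(arms)), key=lambda k: arms[k].sample())

random.seed(0)
arms = [BTSArm(), BTSArm()]
true_p = [0.3, 0.7]
picks = []
for t in range(2000):
    k = bts_choose(arms)
    arms[k].update(1.0 if random.random() < true_p[k] else 0.0)
    picks.append(k)
# Pulls concentrate on the better arm (index 1) in later rounds.
```

Only bounded per-replicate updates are needed per observation, which is what makes the scheme attractive in large-scale settings compared with refitting a full posterior.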

## 44 Citations

Bootstrap Thompson Sampling and Sequential Decision Problems in the Behavioral Sciences

- Economics, SAGE Open
- 2019

The utility of bootstrap Thompson sampling (BTS), which replaces the posterior distribution with the bootstrap distribution, is shown and its robustness to model misspecification is illustrated, which is a common concern in behavioral science applications.

New Insights into Bootstrapping for Bandits

- Computer Science, ArXiv
- 2018

This work shows that the commonly used non-parametric bootstrapping (NPB) procedure can be provably inefficient and establishes a near-linear lower bound on the regret incurred by it under the bandit model with Bernoulli rewards, and proposes a weighted bootstrapped (WB) procedure.
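The weighted-bootstrap idea in this snippet can be illustrated by replacing multinomial resampling with i.i.d. continuous weights on the observations; a sketch assuming Exponential(1) weights (the function name is hypothetical, and this is the general weighting device rather than that paper's exact procedure):

```python
import random

def weighted_bootstrap_mean(rewards, rng=random):
    """One weighted-bootstrap draw of the mean: each observation gets an
    i.i.d. Exponential(1) weight instead of a multinomial resample count."""
    weights = [rng.expovariate(1.0) for _ in rewards]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, rewards)) / total

random.seed(1)
rewards = [1, 0, 1, 1, 0, 1, 0, 1]  # Bernoulli history for one arm
draws = [weighted_bootstrap_mean(rewards) for _ in range(5000)]
mean_draw = sum(draws) / len(draws)
# The draws spread smoothly around the empirical mean 5/8 = 0.625.
```

Continuous weights avoid the discreteness of with-replacement resampling: every observation influences every draw, just by a random amount.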

Residual Bootstrap Exploration for Stochastic Linear Bandit

- Computer Science, ArXiv
- 2022

A theoretical framework is contributed to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems, and the significant computational efficiency of LinReBoot is shown.

Diffusion Approximations for Thompson Sampling

- Computer Science, Mathematics, ArXiv
- 2021

The weak convergence theory covers both the classical multi-armed and linear bandit settings, and can be used to obtain insight about the characteristics of the regret distribution when there is information sharing among arms, as well as the effects of variance estimation, model mis-specification and batched updates in bandit learning.

Bootstrapped Thompson Sampling and Deep Exploration

- Computer Science, ArXiv
- 2015

This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions. The…

Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning

- Computer Science, ArXiv
- 2020

A novel imitation-learning-based algorithm is proposed that distills a TS policy into an explicit policy representation by performing posterior inference and optimization offline, which enables fast online decision-making and easy deployment in mobile and server-based environments.

Robust Contextual Bandits via Bootstrapping

- Computer Science, AAAI
- 2021

An estimator (evaluated from historical rewards) for the contextual bandit UCB is developed based on the multiplier bootstrap technique; the resulting BootLinUCB is proved to have a sub-linear regret upper bound, and extensive experiments validate its superior performance.

A Tutorial on Thompson Sampling

- Computer Science, Found. Trends Mach. Learn.
- 2018

This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes.

Perturbed-History Exploration in Stochastic Linear Bandits

- Computer Science, UAI
- 2019

LinPHE, a perturbed-history exploration algorithm for linear bandits, estimates a linear model from its perturbed history and pulls the arm with the highest value under that model; a gap-free bound on the expected $n$-round regret of LinPHE is proved, where $d$ is the number of features.

Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits

- Computer Science, ICML
- 2019

A bandit algorithm that explores by randomizing its history of rewards: it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history augmented with pseudo-rewards, an approach that easily generalizes to structured problems.
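The pseudo-reward trick described here can be sketched as follows: pad an arm's history with extreme rewards before bootstrapping, so the resampled mean stays randomized even when every observed reward is identical. A hedged sketch with an illustrative `giro_style_index` helper, not that paper's exact algorithm:

```python
import random

def giro_style_index(history, a=1, rng=random):
    """One bootstrap index for an arm: pad its history with `a` extreme
    pseudo-rewards (a 0 and a 1) per real observation, then take the mean
    of a with-replacement resample of the padded history."""
    padded = list(history) + [0] * (a * len(history)) + [1] * (a * len(history))
    sample = [rng.choice(padded) for _ in padded]
    return sum(sample) / len(sample)

random.seed(2)
# Even a history of all-1 rewards yields a randomized index below 1,
# so competing arms are compared against a non-degenerate value.
idx = [giro_style_index([1, 1, 1, 1]) for _ in range(1000)]
```

Without the padding, a plain bootstrap of an all-1 history returns exactly 1 every time and exploration can stall; the extreme pseudo-rewards keep the index distribution spread out.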

## References

Showing 1-10 of 51 references

Approximate Bayesian-inference With the Weighted Likelihood Bootstrap

- Mathematics
- 1994

We introduce the weighted likelihood bootstrap (WLB) as a way to simulate approximately from a posterior distribution. This method is often easy to implement, requiring only an algorithm for…
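For a Gaussian linear model, the weighted likelihood bootstrap reduces to weighted least squares with random weights: each WLB draw maximizes an Exponential(1)-weighted log-likelihood, which has a closed form for a two-parameter line. A sketch under that assumption (helper name illustrative):

```python
import random

def wlb_draw_line(xs, ys, rng=random):
    """One weighted-likelihood-bootstrap draw for y = a + b*x with Gaussian
    errors: maximizing the weighted log-likelihood is just weighted least
    squares, solved in closed form for the intercept a and slope b."""
    w = [rng.expovariate(1.0) for _ in xs]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    my = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, xs))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, xs, ys))
    b = sxy / sxx
    return my - b * mx, b

random.seed(3)
xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]   # roughly y = 2x
slopes = [wlb_draw_line(xs, ys)[1] for _ in range(2000)]
# The draws form an approximate posterior over the slope, centered near 2.
```

Each draw needs only one (weighted) maximum-likelihood fit, which is what makes the WLB easy to implement whenever a fitting routine already exists.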

Bootstrapping Regression Models

- Mathematics
- 2002

Bootstrapping is a general approach to statistical inference based on building a sampling distribution for a statistic by resampling from the data at hand. The term ‘bootstrapping,’ due to Efron…

A modern Bayesian look at the multi-armed bandit

- Computer Science
- 2010

A heuristic for managing multi-armed bandits called randomized probability matching is described, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal.
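For Bernoulli arms with Beta priors, randomized probability matching takes a few lines: sample once from each arm's posterior and play the argmax, so an arm is chosen exactly with its posterior probability of being optimal. A sketch assuming a uniform Beta(1,1) prior:

```python
import random

def rpm_choose(successes, failures, rng=random):
    """Randomized probability matching with Beta(1,1) priors: draw one
    sample from each arm's Beta posterior and play the argmax."""
    draws = [rng.betavariate(1 + s, 1 + f)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda k: draws[k])

random.seed(4)
s, f = [0, 0], [0, 0]
true_p = [0.2, 0.8]
for t in range(1000):
    k = rpm_choose(s, f)
    if random.random() < true_p[k]:
        s[k] += 1
    else:
        f[k] += 1
# Pulls concentrate on arm 1 as its posterior separates from arm 0's.
```

This Beta-Bernoulli case is the conjugate setting where Thompson sampling is exact; BTS targets the settings where such posterior draws are costly or the model is misspecified.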

Exploration-exploitation tradeoff using variance estimates in multi-armed bandits

- Computer Science, Theor. Comput. Sci.
- 2009

Bootstrapping data arrays of arbitrary order

- Mathematics
- 2012

In this paper we study a bootstrap strategy for estimating the variance of a mean taken over large multifactor crossed random effects data sets. We apply bootstrap reweighting independently to the…

An Empirical Evaluation of Thompson Sampling

- Economics, NIPS
- 2011

Empirical results using Thompson sampling on simulated and real data are presented, and it is shown that it is highly competitive and should be part of the standard baselines to compare against.

Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis

- Computer Science, Mathematics, ALT
- 2012

The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem is answered positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret.

Bandit problems and the exploration/exploitation tradeoff

- Computer Science, IEEE Trans. Evol. Comput.
- 1998

An analytically simple bandit model is provided that is more directly applicable to optimization theory than the traditional bandit problem and a near-optimal strategy is determined for that model.

The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond

- Computer Science, COLT
- 2011

It is proved that for arbitrary bounded rewards, the KL-UCB algorithm satisfies a uniformly better regret bound than UCB or UCB2; second, in the special case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins.
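The Bernoulli KL-UCB index has no closed form but is easy to compute by bisection, since kl(p, q) is increasing in q for q >= p. A sketch using a plain log t exploration budget (omitting the log log t refinement in the full algorithm):

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, iters=50):
    """Bernoulli KL-UCB index: the largest q >= mean satisfying
    pulls * kl(mean, q) <= log(t), found by bisection on q."""
    budget = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# A rarely pulled arm keeps a wide index; a well-sampled one tightens
# toward its empirical mean.
wide = kl_ucb_index(0.5, 10, 1000)
tight = kl_ucb_index(0.5, 1000, 1000)
```

The arm with the largest index is pulled each round; using the Bernoulli KL divergence rather than a Hoeffding radius is what lets the bound match the Lai-Robbins rate.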