Corpus ID: 6207952

Thompson sampling with the online bootstrap

@article{Eckles2014ThompsonSW,
  title={Thompson sampling with the online bootstrap},
  author={Dean Eckles and Maurits Kaptein},
  journal={ArXiv},
  year={2014},
  volume={abs/1410.4009}
}
Thompson sampling provides a solution to bandit problems in which new observations are allocated to arms with the posterior probability that an arm is optimal. While sometimes easy to implement and asymptotically optimal, Thompson sampling can be computationally demanding in large scale bandit problems, and its performance is dependent on the model fit to the observed data. We introduce bootstrap Thompson sampling (BTS), a heuristic method for solving bandit problems which modifies Thompson… 
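
As a concrete illustration of the idea in the abstract, the following is a minimal sketch of bootstrap Thompson sampling for Bernoulli arms: each arm keeps several online-bootstrap replicates of its reward history, each new observation is absorbed by a replicate with probability 1/2 ("double-or-nothing" weighting), and a randomly chosen replicate per arm stands in for a posterior draw. The class name, the number of replicates, and the pseudo-observation initialization are illustrative assumptions, not details taken from the paper.

```python
import random

class BTSBernoulli:
    """Minimal bootstrap Thompson sampling sketch for Bernoulli arms.

    Each arm keeps J online-bootstrap replicates of its empirical mean;
    a replicate absorbs each new observation with probability 1/2
    (a "double-or-nothing" online bootstrap). To act, one replicate per
    arm is drawn at random and the arm whose replicate mean is highest
    is pulled.
    """

    def __init__(self, n_arms, n_replicates=100):
        self.J = n_replicates
        # successes / counts per (arm, replicate), seeded with one
        # pseudo-success and one pseudo-failure so empty replicates
        # have mean 0.5 (an illustrative choice)
        self.s = [[1.0] * n_replicates for _ in range(n_arms)]
        self.n = [[2.0] * n_replicates for _ in range(n_arms)]

    def select_arm(self):
        means = []
        for arm_s, arm_n in zip(self.s, self.n):
            j = random.randrange(self.J)          # one replicate per arm
            means.append(arm_s[j] / arm_n[j])
        return max(range(len(means)), key=means.__getitem__)

    def update(self, arm, reward):
        for j in range(self.J):
            if random.random() < 0.5:             # online bootstrap weight
                self.s[arm][j] += reward
                self.n[arm][j] += 1.0
```

A driver loop would repeatedly call select_arm(), observe a 0/1 reward for the pulled arm, and feed it back through update().
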


Bootstrap Thompson Sampling and Sequential Decision Problems in the Behavioral Sciences
TLDR
The utility of bootstrap Thompson sampling (BTS), which replaces the posterior distribution with the bootstrap distribution, is shown, and its robustness to model misspecification, a common concern in behavioral science applications, is illustrated.
New Insights into Bootstrapping for Bandits
TLDR
This work shows that the commonly used non-parametric bootstrapping (NPB) procedure can be provably inefficient, establishes a near-linear lower bound on the regret it incurs under a bandit model with Bernoulli rewards, and proposes a weighted bootstrapping (WB) procedure.
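
A minimal sketch of the weighted-bootstrap idea referenced above, for a single Bernoulli arm: rather than resampling observations with multinomial counts, each (real or pseudo) reward receives an independent random weight and the weighted mean serves as one exploratory draw. The Exponential(1) weights and the pair of pseudo rewards are assumptions for illustration, not the paper's exact construction.

```python
import random

def weighted_bootstrap_mean(rewards, pseudo=(0.0, 1.0)):
    """One weighted-bootstrap draw of an arm's mean reward.

    Instead of multinomial resampling, each reward (plus two pseudo
    rewards that keep the draw well defined and exploratory when the
    history is short) gets an independent Exponential(1) weight, and
    the weighted mean is returned.
    """
    values = list(rewards) + list(pseudo)
    weights = [random.expovariate(1.0) for _ in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```
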
Residual Bootstrap Exploration for Stochastic Linear Bandit
TLDR
A theoretical framework is contributed to demystify residual-bootstrap-based exploration mechanisms in stochastic linear bandit problems and to show the significant computational efficiency of LinReBoot.
Diffusion Approximations for Thompson Sampling
TLDR
The weak convergence theory covers both the classical multi-armed and linear bandit settings, and can be used to obtain insight about the characteristics of the regret distribution when there is information sharing among arms, as well as the effects of variance estimation, model mis-specification and batched updates in bandit learning.
Bootstrapped Thompson Sampling and Deep Exploration
This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions.
Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning
TLDR
A novel imitation-learning-based algorithm is proposed that distills a TS policy into an explicit policy representation by performing posterior inference and optimization offline, which enables fast online decision-making and easy deployment in mobile and server-based environments.
Robust Contextual Bandits via Bootstrapping
TLDR
An estimator (evaluated from historical rewards) is developed for the contextual bandit UCB based on the multiplier bootstrapping technique; it is proved that BootLinUCB has a sub-linear regret upper bound, and extensive experiments validate its superior performance.
A Tutorial on Thompson Sampling
TLDR
This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes.
Perturbed-History Exploration in Stochastic Linear Bandits
TLDR
A perturbed-history exploration algorithm for linear bandits (LinPHE) is proposed: it estimates a linear model from its perturbed history and pulls the arm with the highest value under that model, and a gap-free bound on the expected $n$-round regret of LinPHE is proved, where $d$ is the number of features.
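
A hedged sketch of the perturbed-history idea for linear bandits: the reward history is perturbed with i.i.d. Gaussian pseudo-noise, a ridge-regularized least-squares model is fit to the perturbed data, and the arm with the highest estimated value is pulled. The noise scale, the ridge parameter, and the function signature are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def linphe_choose(X_hist, y_hist, arm_features, noise_scale=1.0, ridge=1.0):
    """Pick an arm by fitting ridge regression to a perturbed history.

    X_hist: (t, d) features of past pulls; y_hist: (t,) observed rewards;
    arm_features: (K, d) feature vectors of the candidate arms. The history
    is perturbed with i.i.d. Gaussian pseudo-noise before the fit, which
    plays the role of a posterior sample.
    """
    t, d = X_hist.shape
    y_pert = y_hist + noise_scale * np.random.randn(t)
    A = X_hist.T @ X_hist + ridge * np.eye(d)
    theta = np.linalg.solve(A, X_hist.T @ y_pert)
    return int(np.argmax(arm_features @ theta))
```
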
Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits
TLDR
A bandit algorithm is proposed that explores by randomizing its history of rewards: it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history augmented with pseudo rewards, and the approach easily generalizes to structured problems.
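
A minimal sketch of the history-randomization index described above, for one Bernoulli arm: the observed rewards are padded with pseudo rewards, the padded history is resampled with replacement, and the mean of the resample is the arm's index; the bandit then pulls the arm with the largest index. The amount of pseudo-reward padding below is an assumption for illustration.

```python
import random

def giro_index(rewards, pseudo_per_obs=1):
    """One bootstrap index for an arm, in the spirit of history randomization.

    The observed 0/1 rewards are padded with pseudo rewards (here,
    `pseudo_per_obs` zeros and ones per real observation), the padded
    history is resampled with replacement, and its mean is the index.
    """
    history = list(rewards)
    history += [0.0, 1.0] * (pseudo_per_obs * max(len(rewards), 1))
    sample = [random.choice(history) for _ in history]
    return sum(sample) / len(sample)
```

At each round the algorithm would compute one such index per arm and pull the argmax.
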

References

SHOWING 1-10 OF 51 REFERENCES
Approximate Bayesian Inference with the Weighted Likelihood Bootstrap
We introduce the weighted likelihood bootstrap (WLB) as a way to simulate approximately from a posterior distribution. This method is often easy to implement, requiring only an algorithm for
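
A minimal sketch of the weighted likelihood bootstrap in its simplest case, a Bernoulli mean, where maximizing the weighted likelihood reduces to a weighted average of the 0/1 outcomes (assuming a non-empty sample); the Exponential(1) weights correspond, up to normalization, to uniform Dirichlet weighting.

```python
import random

def wlb_draws(observations, n_draws=1000):
    """Approximate posterior draws for a Bernoulli mean via the WLB idea.

    Each draw re-weights the observations with i.i.d. Exponential(1)
    weights and maximizes the weighted likelihood; for a Bernoulli mean
    this maximizer is just the weighted average of the 0/1 outcomes.
    """
    draws = []
    for _ in range(n_draws):
        w = [random.expovariate(1.0) for _ in observations]
        draws.append(sum(wi * yi for wi, yi in zip(w, observations)) / sum(w))
    return draws
```
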
Bootstrapping Regression Models
Bootstrapping is a general approach to statistical inference based on building a sampling distribution for a statistic by resampling from the data at hand. The term ‘bootstrapping,’ due to Efron
A modern Bayesian look at the multi-armed bandit
TLDR
A heuristic for managing multi-armed bandits called randomized probability matching is described, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal.
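
Randomized probability matching for Bernoulli arms can be written in a few lines: draw once from each arm's Beta posterior and pull the arm with the largest draw, so that arms are selected with roughly the posterior probability of being optimal. The uniform Beta(1, 1) prior below is an illustrative choice.

```python
import random

def rpm_select(successes, failures):
    """Randomized probability matching for Bernoulli arms.

    One draw is taken from each arm's Beta(1 + s, 1 + f) posterior and
    the arm with the largest draw is pulled, so each arm is chosen with
    (approximately) the posterior probability that it is optimal.
    """
    draws = [random.betavariate(1 + s, 1 + f)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)
```
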
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits
Bootstrapping data arrays of arbitrary order
In this paper we study a bootstrap strategy for estimating the variance of a mean taken over large multifactor crossed random effects data sets. We apply bootstrap reweighting independently to the
An Empirical Evaluation of Thompson Sampling
TLDR
Empirical results using Thompson sampling on simulated and real data are presented, and it is shown that it is highly competitive and should be part of the standard baselines to compare against.
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
TLDR
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem is answered positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret.
Bandit problems and the exploration/exploitation tradeoff
TLDR
An analytically simple bandit model is provided that is more directly applicable to optimization theory than the traditional bandit problem and a near-optimal strategy is determined for that model.
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
TLDR
It is proved that, for arbitrary bounded rewards, the KL-UCB algorithm satisfies a uniformly better regret bound than UCB or UCB2, and that, in the special case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins.