# A Tutorial on Thompson Sampling

@article{Russo2018ATO, title={A Tutorial on Thompson Sampling}, author={Daniel Russo and Benjamin Van Roy and Abbas Kazerouni and Ian Osband}, journal={ArXiv}, year={2018}, volume={abs/1707.02038} }

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of…
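The idea summarized above can be made concrete with the classic Beta-Bernoulli bandit: sample a plausible mean reward for each arm from its posterior, pull the arm whose sample is largest, and update. This is a minimal illustrative sketch (the function name and parameters are ours, not from the tutorial):

```python
import random

def thompson_sampling_bernoulli(arms, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling: draw a mean from each arm's
    Beta posterior, pull the arm with the largest draw, then update."""
    rng = random.Random(seed)
    k = len(arms)
    successes = [0] * k  # Beta(1, 1) uniform priors on each arm's mean
    failures = [0] * k
    total_reward = 0
    for _ in range(n_rounds):
        # Posterior draw per arm: this randomization is what balances
        # exploration (uncertain arms draw high sometimes) and exploitation.
        samples = [rng.betavariate(1 + successes[i], 1 + failures[i])
                   for i in range(k)]
        a = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < arms[a] else 0  # Bernoulli reward
        successes[a] += reward
        failures[a] += 1 - reward
        total_reward += reward
    return total_reward, successes, failures

# Over time, pulls concentrate on the best arm (mean 0.8 here).
total, s, f = thompson_sampling_bernoulli([0.2, 0.5, 0.8], n_rounds=2000)
```

Because the posterior for a clearly inferior arm rarely produces the largest draw once a few observations accumulate, the algorithm naturally stops pulling it without any explicit exploration schedule.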

## 444 Citations

An empirical evaluation of active inference in multi-armed bandits

- Computer Science, Neural Networks
- 2021

Thompson Sampling via Local Uncertainty

- Computer Science, ICML
- 2020

A new probabilistic modeling framework for Thompson sampling is proposed, where local latent variable uncertainty is used to sample the mean reward, and semi-implicit structure is further introduced to enhance its expressiveness.

TSEC: a framework for online experimentation under experimental constraints

- Computer Science, arXiv
- 2021

A new Thompson Sampling under Experimental Constraints (TSEC) method is proposed that addresses this so-called "arm budget constraint"; it uses a Bayesian interaction model with effect hierarchy priors to model correlations between rewards on different arms.

Satisficing in Time-Sensitive Bandit Learning

- Computer Science, Mathematics of Operations Research
- 2022

A general bound on expected discounted regret is established and the application of satisficing Thompson sampling to linear and infinite-armed bandits is studied, demonstrating arbitrarily large benefits over Thompson sampling.

Thompson Sampling for the MNL-Bandit

- Mathematics, Computer Science, COLT
- 2017

An approach to adapt Thompson Sampling to this problem is presented and it is shown that it achieves near-optimal regret as well as attractive numerical performance.

Neural Thompson Sampling

- Computer Science, ICLR
- 2021

This paper proposes a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation, with a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network.

(Sequential) Importance Sampling Bandits

- Computer Science, arXiv
- 2018

This work extends existing multi-armed bandit algorithms beyond their original settings by leveraging advances in sequential Monte Carlo (SMC) methods from the approximate inference community and the flexibility of (sequential) importance sampling to allow for accurate estimation of the statistics of interest within the MAB problem.

Collaborative Thompson Sampling

- Computer Science, CollaborateCom
- 2018

This work presents collaborative Thompson sampling to apply the exploration-exploitation strategy to highly dynamic settings and shows accelerated convergence and improved prediction performance in collaborative environments.

Sequential Decision Making with Combinatorial Actions and High-Dimensional Contexts

- Computer Science
- 2020

An efficient sparse contextual bandit algorithm is designed that does not require knowledge of the sparsity of the underlying parameter – information that essentially all existing sparse bandit algorithms to date require.

Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits

- Computer Science, IEEE Control Systems Letters
- 2021

This work proposes a Thompson Sampling algorithm for partially observable contextual multi-armed bandits, establishes theoretical performance guarantees, and derives rates of learning for the unknown parameters.

## References

SHOWING 1-10 OF 89 REFERENCES

Learning to Optimize via Information-Directed Sampling

- Computer Science, NIPS
- 2014

An expected regret bound for information-directed sampling is established that applies across a very general class of models and scales with the entropy of the optimal action distribution.

Learning to Optimize via Posterior Sampling

- Computer Science, Math. Oper. Res.
- 2014

A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes; it depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.

Time-Sensitive Bandit Learning and Satisficing Thompson Sampling

- Computer Science, arXiv
- 2017

This paper proposes satisficing Thompson sampling -- a variation of Thompson sampling -- and establishes a strong discounted regret bound for this new algorithm, where a discount factor encodes time preference.

Thompson Sampling for the MNL-Bandit

- Mathematics, Computer Science, COLT
- 2017

An approach to adapt Thompson Sampling to this problem is presented and it is shown that it achieves near-optimal regret as well as attractive numerical performance.

(More) Efficient Reinforcement Learning via Posterior Sampling

- Computer Science, NIPS
- 2013

An $\tilde{O}(\tau S \sqrt{AT})$ bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

Ensemble Sampling

- Computer Science, NIPS
- 2017

Ensemble sampling is developed, which aims to approximate Thompson sampling while maintaining tractability even in the face of complex models such as neural networks.
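The core trick can be sketched for a Bernoulli bandit: instead of an exact posterior, maintain a small ensemble of perturbed point estimates, pick one model uniformly at random each round (standing in for a posterior draw), and act greedily under it. This is our simplified sketch of the idea, not the paper's implementation (which targets models like neural networks):

```python
import random

def ensemble_sampling_bernoulli(arms, n_rounds, n_models=10, seed=0):
    """Ensemble-sampling sketch: M perturbed estimates approximate the
    posterior; random priors plus noisy updates keep models diverse."""
    rng = random.Random(seed)
    k = len(arms)
    # Each model holds a perturbed (reward sum, trial count) per arm;
    # the random prior makes models disagree, which drives exploration.
    models = [[[rng.random(), 1.0] for _ in range(k)] for _ in range(n_models)]
    pulls = [0] * k
    for _ in range(n_rounds):
        m = rng.randrange(n_models)  # uniform model pick ~ posterior sample
        est = [s / t for s, t in models[m]]
        a = max(range(k), key=lambda i: est[i])
        reward = 1 if rng.random() < arms[a] else 0
        pulls[a] += 1
        for model in models:  # shared data, independent noise per model
            model[a][0] += reward + rng.gauss(0, 0.5)
            model[a][1] += 1
    return pulls

pulls = ensemble_sampling_bernoulli([0.3, 0.7], n_rounds=2000)
```

As each arm accumulates observations, the injected noise averages out, the models agree, and play concentrates on the best arm, mimicking the behavior of exact Thompson sampling at a fraction of the inference cost.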

Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

- Computer Science, ICML
- 2017

A Bayesian expected regret bound for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm.

The Knowledge-Gradient Policy for Correlated Normal Beliefs

- Computer Science, Mathematics, INFORMS J. Comput.
- 2009

A fully sequential sampling policy called the knowledge-gradient policy is proposed; it is provably optimal in some special cases, has bounded suboptimality in all others, and is shown to efficiently maximize a continuous function on a continuous domain under a fixed budget of noisy measurements.

Thompson Sampling for Learning Parameterized Markov Decision Processes

- Computer Science, COLT
- 2015

It is shown that the number of instants where suboptimal actions are chosen scales logarithmically with time, with high probability, and a frequentist regret bound for priors over general parameter spaces is derived.

Learning Unknown Markov Decision Processes: A Thompson Sampling Approach

- Computer Science, NIPS
- 2017

A Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE) is proposed that generates a sample from the posterior distribution over the unknown model parameters at the beginning of each episode and follows the optimal stationary policy for the sampled model for the rest of the episode.
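The episode-level sampling pattern described above can be sketched on a toy two-state MDP. This is our illustrative simplification: episodes here have a fixed length, whereas TSDE's defining feature is determining episode lengths dynamically, and the toy reward structure makes the greedy policy optimal for the sampled model, so no planning step is needed:

```python
import random

def tsde_sketch(true_p, n_episodes=50, horizon=20, seed=0):
    """Toy 2-state, 2-action MDP: action a in state s moves to state 1
    with unknown probability true_p[s][a]; reward 1 per step in state 1.
    One posterior sample per episode, then act under the sampled model."""
    rng = random.Random(seed)
    counts = [[[1, 1] for _ in range(2)] for _ in range(2)]  # Beta(1,1) per (s, a)
    total_reward = 0
    state = 0
    for _ in range(n_episodes):
        # Sample one model from the posterior at the start of the episode...
        p = [[rng.betavariate(*counts[s][a]) for a in range(2)] for s in range(2)]
        # ...and follow its optimal policy for the whole episode (here,
        # maximizing P(next state = 1) is optimal since state 1 pays reward).
        policy = [max(range(2), key=lambda a: p[s][a]) for s in range(2)]
        for _ in range(horizon):
            a = policy[state]
            nxt = 1 if rng.random() < true_p[state][a] else 0
            counts[state][a][0 if nxt == 1 else 1] += 1
            state = nxt
            total_reward += state
    return total_reward

# Action 1 reaches the rewarding state far more often (0.9 vs. 0.2).
total = tsde_sketch([[0.2, 0.9], [0.2, 0.9]])
```

Sampling once per episode, rather than per step, is what lets the agent commit to a coherent policy long enough to evaluate it, which is the point of the episodic structure in PSRL-style algorithms.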