A Tutorial on Thompson Sampling

@article{Russo2018ATO,
  title={A Tutorial on Thompson Sampling},
  author={Daniel Russo and Benjamin Van Roy and Abbas Kazerouni and Ian Osband},
  journal={ArXiv},
  year={2018},
  volume={abs/1707.02038}
}
Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of… 
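To make the exploit-versus-explore tradeoff concrete, here is a minimal sketch of Thompson sampling for a Bernoulli bandit with independent Beta(1, 1) priors; the toy arm means, function name, and variable names are illustrative and not taken from the tutorial.

import numpy as np

def thompson_bernoulli(true_means, horizon, seed=0):
    # Each round: draw one sample of every arm's mean from its Beta posterior,
    # play the arm with the largest sample, then update that arm's posterior
    # with the observed 0/1 reward.
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)  # 1 + number of observed successes per arm
    beta = np.ones(k)   # 1 + number of observed failures per arm
    total_reward = 0.0
    for _ in range(horizon):
        sampled_means = rng.beta(alpha, beta)
        arm = int(np.argmax(sampled_means))
        reward = rng.binomial(1, true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Example run with three arms of unknown success probability.
print(thompson_bernoulli([0.3, 0.5, 0.7], horizon=1000))

Early on the Beta posteriors are wide, so suboptimal arms are still sampled (exploration); as evidence accumulates the posteriors concentrate and play shifts to the best arm (exploitation).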
Thompson Sampling via Local Uncertainty
TLDR
A new probabilistic modeling framework for Thompson sampling is proposed, where local latent variable uncertainty is used to sample the mean reward, and semi-implicit structure is further introduced to enhance its expressiveness.
TSEC: a framework for online experimentation under experimental constraints
TLDR
A new Thompson Sampling under Experimental Constraints (TSEC) method is proposed that addresses the so-called "arm budget constraint" and uses a Bayesian interaction model with effect-hierarchy priors to model correlations between rewards on different arms.
Satisficing in Time-Sensitive Bandit Learning
TLDR
A general bound on expected discounted regret is established and the application of satisficing Thompson sampling to linear and infinite-armed bandits is studied, demonstrating arbitrarily large benefits over Thompson sampling.
Thompson Sampling for the MNL-Bandit
TLDR
An approach to adapt Thompson Sampling to this problem is presented and it is shown that it achieves near-optimal regret as well as attractive numerical performance.
Neural Thompson Sampling
TLDR
This paper proposes a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation; the posterior distribution of the reward has the neural network approximator as its mean and a variance built upon the neural tangent features of the corresponding network.
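A rough sketch of the action-selection step this describes, assuming a scalar-output network f whose parameter gradient g(x) serves as the neural tangent feature and a design matrix A accumulated from past gradients; the function names and the toy linear stand-in for the network are assumptions for illustration, not the paper's implementation.

import numpy as np

def neural_ts_action(contexts, f, grad_f, A_inv, nu=1.0, rng=None):
    # Sample each candidate's reward from a Normal whose mean is the network
    # prediction f(x) and whose variance is nu^2 * g(x)^T A^{-1} g(x), then
    # play the candidate with the largest sampled reward.
    rng = rng or np.random.default_rng()
    samples = []
    for x in contexts:
        mean = f(x)                      # network's point prediction
        g = grad_f(x)                    # neural tangent feature of x
        var = nu**2 * float(g @ A_inv @ g)
        samples.append(rng.normal(mean, np.sqrt(max(var, 1e-12))))
    return int(np.argmax(samples))

# Toy usage with a linear model standing in for the neural network.
theta = np.array([0.2, -0.1, 0.4])
f = lambda x: float(theta @ x)           # prediction
grad_f = lambda x: x                     # gradient w.r.t. theta is just x
A_inv = np.eye(3)
contexts = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])]
print(neural_ts_action(contexts, f, grad_f, A_inv))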
(Sequential) Importance Sampling Bandits
TLDR
This work extends existing multi-armed bandit algorithms beyond their original settings by leveraging advances in sequential Monte Carlo (SMC) methods from the approximate inference community and the flexibility of (sequential) importance sampling to allow for accurate estimation of the statistics of interest within the MAB problem.
Collaborative Thompson Sampling
TLDR
This work presents collaborative Thompson sampling to apply the exploration-exploitation strategy to highly dynamic settings and shows accelerated convergence and improved prediction performance in collaborative environments.
Sequential Decision Making with Combinatorial Actions and High-Dimensional Contexts
TLDR
An efficient sparse contextual bandit algorithm is designed that does not require knowledge of the sparsity of the underlying parameter, information that essentially all existing sparse bandit algorithms require.
Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits
TLDR
This work proposes a Thompson Sampling algorithm for partially observable contextual multi-armed bandits, establishes theoretical performance guarantees, and derives rates for learning the unknown parameters.
...

References

Learning to Optimize via Information-Directed Sampling
TLDR
An expected regret bound for information-directed sampling is established that applies across a very general class of models and scales with the entropy of the optimal action distribution.
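As a rough sketch of the bound's form (notation paraphrased, not quoted from the paper): if the per-period information ratio of the sampling policy is bounded by $\overline{\Psi}$, then

$$\mathbb{E}\big[\mathrm{Regret}(T)\big] \;\le\; \sqrt{\overline{\Psi}\, H(\alpha^{*})\, T},$$

where $H(\alpha^{*})$ is the entropy of the distribution of the optimal action.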
Learning to Optimize via Posterior Sampling
TLDR
A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes; it depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.
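Paraphrasing the usual definition rather than quoting the paper: an action $a$ is $\epsilon$-dependent on actions $a_1,\dots,a_n$ with respect to a function class $\mathcal{F}$ if any two functions $f,\tilde f \in \mathcal{F}$ with $\sqrt{\sum_{i=1}^{n} (f(a_i)-\tilde f(a_i))^2} \le \epsilon$ also satisfy $|f(a)-\tilde f(a)| \le \epsilon$; the eluder dimension $\dim_E(\mathcal{F},\epsilon)$ is the length of the longest action sequence in which every action is $\epsilon'$-independent of its predecessors for some $\epsilon' \ge \epsilon$.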
Time-Sensitive Bandit Learning and Satisficing Thompson Sampling
TLDR
This paper proposes satisficing Thompson sampling -- a variation of Thompson sampling -- and establishes a strong discounted regret bound for this new algorithm, where a discount factor encodes time preference.
Thompson Sampling for the MNL-Bandit
TLDR
An approach to adapt Thompson Sampling to this problem is presented and it is shown that it achieves near-optimal regret as well as attractive numerical performance.
(More) Efficient Reinforcement Learning via Posterior Sampling
TLDR
An $\tilde{O}(\tau S \sqrt{AT})$ bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.
Ensemble Sampling
TLDR
Ensemble sampling is developed, which aims to approximate Thompson sampling while maintaining tractability even in the face of complex models such as neural networks.
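A minimal sketch of the ensemble-sampling idea for a linear-Gaussian bandit, where each ensemble member can be refit in closed form; the function name, priors, and perturbation scheme are illustrative assumptions, not the paper's code.

import numpy as np

def ensemble_sampling(actions, true_theta, horizon, M=10, noise=0.1,
                      prior_scale=1.0, seed=0):
    # Keep M regularized least-squares models, each seeded with its own random
    # prior draw and fit to independently perturbed rewards. Each round, act
    # greedily with respect to one model drawn uniformly at random.
    rng = np.random.default_rng(seed)
    d = len(true_theta)
    A = np.stack([np.eye(d) / prior_scale**2 for _ in range(M)])
    b = np.stack([A[m] @ rng.normal(0.0, prior_scale, d) for m in range(M)])
    total_reward = 0.0
    for _ in range(horizon):
        m = rng.integers(M)                          # sample an ensemble member
        theta_m = np.linalg.solve(A[m], b[m])
        x = actions[int(np.argmax(actions @ theta_m))]
        reward = float(x @ true_theta) + rng.normal(0.0, noise)
        total_reward += reward
        for j in range(M):                           # update every member with
            perturbed = reward + rng.normal(0.0, noise)  # a freshly perturbed reward
            A[j] += np.outer(x, x) / noise**2
            b[j] += x * perturbed / noise**2
    return total_reward

# Example: three orthogonal actions, so each action reveals one coordinate.
print(ensemble_sampling(np.eye(3), np.array([0.1, 0.5, 0.3]), horizon=500))

Drawing a member uniformly at random plays the same role as drawing a model from the posterior in exact Thompson sampling, which is what keeps the approach tractable for models without closed-form posteriors.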
Why is Posterior Sampling Better than Optimism for Reinforcement Learning?
TLDR
A Bayesian expected regret bound of $\tilde{O}(H \sqrt{SAT})$ for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm.
The Knowledge-Gradient Policy for Correlated Normal Beliefs
TLDR
A fully sequential sampling policy is proposed called the knowledge-gradient policy, which is provably optimal in some special cases and has bounded suboptimality in all others and it is demonstrated how this policy may be applied to efficiently maximize a continuous function on a continuous domain while constrained to a fixed number of noisy measurements.
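As a brief illustration of the idea (standard knowledge-gradient notation, not quoted from the paper): with posterior means $\mu^{n}_{x}$ after $n$ measurements, the policy measures the alternative

$$x^{n} \in \arg\max_{x}\; \Big( \mathbb{E}_n\big[ \max_{x'} \mu^{n+1}_{x'} \,\big|\, x^{n}=x \big] - \max_{x'} \mu^{n}_{x'} \Big),$$

i.e., the one whose next measurement yields the largest expected one-step improvement in the best posterior mean; with correlated normal beliefs, a single measurement also updates the means of related alternatives.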
Thompson Sampling for Learning Parameterized Markov Decision Processes
TLDR
It is shown that the number of instants where suboptimal actions are chosen scales logarithmically with time, with high probability, and a frequentist regret bound for priors over general parameter spaces is derived.
Learning Unknown Markov Decision Processes: A Thompson Sampling Approach
TLDR
A Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE) is proposed that generates a sample from the posterior distribution over the unknown model parameters at the beginning of each episode and follows the optimal stationary policy for the sampled model for the rest of the episode.
...