# Bootstrapped Thompson Sampling and Deep Exploration

```bibtex
@article{Osband2015BootstrappedTS,
  title   = {Bootstrapped Thompson Sampling and Deep Exploration},
  author  = {Ian Osband and Benjamin Van Roy},
  journal = {ArXiv},
  year    = {2015},
  volume  = {abs/1507.00300}
}
```

This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions. The approach is based on a bootstrap technique that uses a combination of observed and artificially generated data. The latter serves to induce a prior distribution which, as we will demonstrate, is critical to effective exploration. We explain how the approach can be applied to multi-armed bandit and…
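The core idea can be sketched for a Bernoulli bandit: seed each arm's reward history with artificial observations that play the role of a prior, then act greedily with respect to a bootstrap resample of the combined (observed + artificial) history. This is a minimal illustrative sketch, not the paper's exact algorithm; the function names and constants below are assumptions.

```python
import random

def bootstrap_mean(history, rng):
    """Mean of a with-replacement resample of an arm's reward history."""
    n = len(history)
    return sum(history[rng.randrange(n)] for _ in range(n)) / n

def choose_arm(histories, rng):
    """Pick the arm whose bootstrap-resampled mean estimate is largest."""
    estimates = [bootstrap_mean(h, rng) for h in histories]
    return max(range(len(estimates)), key=lambda a: estimates[a])

def run(horizon=500, seed=0):
    rng = random.Random(seed)
    true_means = [0.3, 0.7]  # illustrative two-armed Bernoulli bandit
    # Artificial prior data: one success and one failure per arm.
    # Without it, an arm whose first few real rewards are all zero
    # could be locked out forever, since every resample would be zero.
    histories = [[1.0, 0.0] for _ in true_means]
    total = 0.0
    for _ in range(horizon):
        a = choose_arm(histories, rng)
        r = 1.0 if rng.random() < true_means[a] else 0.0
        histories[a].append(r)
        total += r
    return total / horizon
```

The artificial observations are what give the resampled estimates enough variance to keep exploring, mirroring the note's point that the induced prior is critical.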

## 80 Citations

### Randomized Prior Functions for Deep Reinforcement Learning

- Computer Science, NeurIPS
- 2018

It is shown that this approach is efficient with linear representations, provides simple illustrations of its efficacy with nonlinear representations and scales to large-scale problems far better than previous attempts.

### State-Aware Variational Thompson Sampling for Deep Q-Networks

- Computer Science, AAMAS
- 2021

A variational Thompson sampling approximation for DQNs is proposed, which uses a deep network whose parameters are perturbed by a learned variational noise distribution; the authors hypothesize that such state-aware noisy exploration is particularly useful in problems where exploration in certain high-risk states may result in the agent failing badly.

### Neural Thompson Sampling

- Computer Science, ICLR
- 2021

This paper proposes a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation, with a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network.

### Improving the Diversity of Bootstrapped DQN by Replacing Priors With Noise

- Computer Science, IEEE Transactions on Games
- 2022

Q-learning is one of the most well-known Reinforcement Learning algorithms. There have been tremendous efforts to develop this algorithm using neural networks. Bootstrapped Deep Q-Learning Network…

### Improving the Diversity of Bootstrapped DQN via Noisy Priors

- Computer Science, ArXiv
- 2022

The possibility of treating priors as a special type of noise and sampling priors from a Gaussian distribution to introduce more diversity into Bootstrapped Deep Q-learning is explored.

### (Sequential) Importance Sampling Bandits

- Computer Science, ArXiv
- 2018

This work extends existing multi-armed bandit algorithms beyond their original settings by leveraging advances in sequential Monte Carlo (SMC) methods from the approximate inference community and the flexibility of (sequential) importance sampling to allow for accurate estimation of the statistics of interest within the MAB problem.

### Deep Exploration via Bootstrapped DQN

- Computer Science, NIPS
- 2016

Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and…

### Debiasing Samples from Online Learning Using Bootstrap

- Computer Science, AISTATS
- 2022

This paper provides a procedure to debias the samples using the bootstrap, which does not require knowledge of the reward distribution and can be applied to any adaptive policy.

### BooVI: Provably Efficient Bootstrapped Value Iteration

- Computer Science, NeurIPS
- 2021

A variant of bootstrapped LSVI, namely BooVI, is developed, which bridges such a gap between practice and theory, making it compatible with general function approximators.

### Practical Evaluation and Optimization of Contextual Bandit Algorithms

- Computer Science, ArXiv
- 2018

We study and empirically optimize contextual bandit learning, exploration, and problem encodings across 500+ datasets, creating a reference for practitioners and discovering or reinforcing a number…

## References

Showing 1–10 of 17 references.

### Thompson sampling with the online bootstrap

- Economics, Computer Science, ArXiv
- 2014

This work explains BTS and shows that the performance of BTS is competitive to Thompson sampling in the well-studied Bernoulli bandit case, and details why BTS using the online bootstrap is more scalable than regular Thompson sampling.
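BTS's scalability comes from replacing posterior updates with an online bootstrap: maintain several bootstrap replicates of the arm statistics, route each new observation to each replicate with probability 1/2 (a "double or nothing" online bootstrap), and act greedily within a uniformly sampled replicate. A hedged sketch under those assumptions (class name, replicate count, and pseudo-counts are illustrative, not from the paper):

```python
import random

class OnlineBootstrapTS:
    """Illustrative online bootstrap Thompson sampling for Bernoulli bandits."""

    def __init__(self, num_arms, num_replicates=10, seed=0):
        self.rng = random.Random(seed)
        # Each replicate keeps per-arm [successes, trials] counts,
        # seeded with one pseudo-success in two pseudo-trials so
        # untried arms still look promising.
        self.counts = [[[1, 2] for _ in range(num_arms)]
                       for _ in range(num_replicates)]

    def choose(self):
        # Sample one replicate uniformly, then act greedily within it.
        rep = self.rng.choice(self.counts)
        return max(range(len(rep)), key=lambda a: rep[a][0] / rep[a][1])

    def update(self, arm, reward):
        # Online bootstrap: each replicate sees the observation
        # independently with probability 1/2, so no per-step
        # resampling of the full history is ever needed.
        for rep in self.counts:
            if self.rng.random() < 0.5:
                rep[arm][0] += reward
                rep[arm][1] += 1
```

Because each update touches only counters, the per-step cost is constant in the number of observations, which is the scalability advantage over exact posterior sampling that the summary refers to.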

### Learning to Optimize via Posterior Sampling

- Computer Science, Math. Oper. Res.
- 2014

A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes; it depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.

### (More) Efficient Reinforcement Learning via Posterior Sampling

- Computer Science, NIPS
- 2013

An Õ(τS√(AT)) bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

### Generalization and Exploration via Randomized Value Functions

- Computer Science, ICML
- 2016

The results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.

### Sub-sampling for Multi-armed Bandits

- Computer Science, ECML/PKDD
- 2014

A novel algorithm based on sub-sampling is introduced that demonstrates excellent empirical performance against state-of-the-art algorithms, including Thompson sampling and KL-UCB.

### Further Optimal Regret Bounds for Thompson Sampling

- Computer Science, AISTATS
- 2013

A novel regret analysis for Thompson Sampling is provided that proves the first near-optimal problem-independent bound of O(√(NT ln T)) on the expected regret of this algorithm, and simultaneously provides the optimal problem-dependent bound.

### Model-based Reinforcement Learning and the Eluder Dimension

- Computer Science, NIPS
- 2014

This work shows that, if the MDP can be parameterized within some known function class, it can obtain regret bounds that scale with the dimensionality, rather than cardinality, of the system.

### Near-optimal Reinforcement Learning in Factored MDPs

- Computer Science, NIPS
- 2014

It is established that, if the system is known to be a factored MDP, it is possible to achieve regret that scales polynomially in the number of parameters encoding the factored MDP, which may be exponentially smaller than S or A.

### R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning

- Computer Science, J. Mach. Learn. Res.
- 2002

R-MAX is a very simple model-based reinforcement learning algorithm that can attain near-optimal average reward in polynomial time, and it formally justifies the "optimism under uncertainty" bias used in many RL algorithms.

### Human-level control through deep reinforcement learning

- Computer Science, Nature
- 2015

This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.