Corpus ID: 16385631

Bootstrapped Thompson Sampling and Deep Exploration

@article{Osband2015BootstrappedTS,
  title={Bootstrapped Thompson Sampling and Deep Exploration},
  author={Ian Osband and Benjamin Van Roy},
  journal={ArXiv},
  year={2015},
  volume={abs/1507.00300}
}
This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions. The approach is based on a bootstrap technique that uses a combination of observed and artificially generated data. The latter serves to induce a prior distribution which, as we will demonstrate, is critical to effective exploration. We explain how the approach can be applied to multi-armed bandit and… 
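
The core idea of the note lends itself to a short illustration. The sketch below runs bootstrapped Thompson sampling on a Bernoulli bandit: each arm's history is seeded with artificial observations (one success and one failure) that play the role of a prior, and at every round each arm is scored by the mean of a bootstrap resample of its combined artificial-plus-observed data. The function name, the choice of artificial data, and the resample-with-replacement scheme are illustrative assumptions, not the note's exact specification.

```python
import numpy as np

def bootstrapped_thompson_bernoulli(arm_probs, n_rounds=2000, rng=None):
    """Bootstrapped Thompson sampling on a Bernoulli bandit (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n_arms = len(arm_probs)
    # Artificial data per arm (one failure, one success) inducing a rough prior.
    histories = [[0.0, 1.0] for _ in range(n_arms)]
    total_reward = 0.0
    for _ in range(n_rounds):
        scores = []
        for hist in histories:
            # Bootstrap resample of the arm's combined artificial + observed data.
            sample = rng.choice(hist, size=len(hist), replace=True)
            scores.append(sample.mean())
        arm = int(np.argmax(scores))            # act greedily w.r.t. the resampled means
        reward = float(rng.random() < arm_probs[arm])
        histories[arm].append(reward)           # only the pulled arm observes a reward
        total_reward += reward
    return total_reward

if __name__ == "__main__":
    print(bootstrapped_thompson_bernoulli([0.1, 0.5, 0.6]))
```

Without the artificial observations, an arm whose first real reward happens to be 0 can be starved forever; this is the sense in which the induced prior is critical to effective exploration.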


Randomized Prior Functions for Deep Reinforcement Learning

It is shown that this approach is efficient with linear representations, provides simple illustrations of its efficacy with nonlinear representations and scales to large-scale problems far better than previous attempts.

State-Aware Variational Thompson Sampling for Deep Q-Networks

A variational Thompson sampling approximation for DQNs is proposed, which uses a deep network whose parameters are perturbed by a learned variational noise distribution; the authors hypothesize that such state-aware noisy exploration is particularly useful in problems where exploration in certain high-risk states may result in the agent failing badly.

Neural Thompson Sampling

This paper proposes a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation, with a novel posterior distribution of the reward whose mean is the neural network approximator and whose variance is built upon the neural tangent features of the corresponding neural network.
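
A minimal sketch of the scoring step that summary describes, assuming an untrained toy network and omitting the training loop and width-scaling factors: the sampled reward for a context x is drawn from a Gaussian whose mean is the network output f(x) and whose variance is nu^2 * g(x)^T U^{-1} g(x), where g(x) is the gradient of f at x (the neural tangent feature) and U accumulates outer products of the features of pulled arms. All names here (W1, w2, U, nu, lam) are simplifications of mine, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer ReLU network standing in for the trained approximator.
d, m = 5, 16                                  # context dimension, hidden width
W1 = rng.normal(size=(m, d)) / np.sqrt(d)
w2 = rng.normal(size=m) / np.sqrt(m)

def f(x):
    """Network output, used as the posterior mean of the reward."""
    return w2 @ np.maximum(W1 @ x, 0.0)

def grad_f(x):
    """Flattened gradient of f at x: the neural tangent feature g(x)."""
    h = W1 @ x
    mask = (h > 0.0).astype(float)
    dW1 = np.outer(w2 * mask, x)              # gradient w.r.t. W1
    dw2 = np.maximum(h, 0.0)                  # gradient w.r.t. w2
    return np.concatenate([dW1.ravel(), dw2])

p = m * d + m                                 # number of network parameters
lam, nu = 1.0, 0.1
U = lam * np.eye(p)                           # design matrix U = lam*I + sum of g g^T

def neural_ts_score(x):
    """Sample a reward from N(f(x), nu^2 * g(x)^T U^{-1} g(x))."""
    g = grad_f(x)
    var = nu ** 2 * (g @ np.linalg.solve(U, g))
    return rng.normal(f(x), np.sqrt(var))

# One round: score candidate contexts, pull the argmax, update U with its feature.
contexts = rng.normal(size=(4, d))
arm = int(np.argmax([neural_ts_score(x) for x in contexts]))
g_arm = grad_f(contexts[arm])
U += np.outer(g_arm, g_arm)
print("pulled arm", arm)
```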

Improving the Diversity of Bootstrapped DQN by Replacing Priors With Noise

Q-learning is one of the most well-known Reinforcement Learning algorithms. There have been tremendous efforts to develop this algorithm using neural networks. Bootstrapped Deep Q-Learning Network…

Improving the Diversity of Bootstrapped DQN via Noisy Priors

The possibility of treating priors as a special type of noise and sampling them from a Gaussian distribution to introduce more diversity into Bootstrapped Deep Q-learning is explored.

(Sequential) Importance Sampling Bandits

This work extends existing multi-armed bandit algorithms beyond their original settings by leveraging advances in sequential Monte Carlo (SMC) methods from the approximate inference community and the flexibility of (sequential) importance sampling to allow for accurate estimation of the statistics of interest within the MAB problem.

Deep Exploration via Bootstrapped DQN

Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and statistically efficient manner through use of randomized value functions.

Debiasing Samples from Online Learning Using Bootstrap

This paper provides a procedure to debias the samples using bootstrap, which doesn’t require the knowledge of the reward distribution and can be applied to any adaptive policies.

BooVI: Provably Efficient Bootstrapped Value Iteration

A variant of bootstrapped LSVI, namely BooVI, is developed, which bridges the gap between practice and theory, making it compatible with general function approximators.

Practical Evaluation and Optimization of Contextual Bandit Algorithms

We study and empirically optimize contextual bandit learning, exploration, and problem encodings across 500+ datasets, creating a reference for practitioners and discovering or reinforcing a number
...

References

SHOWING 1-10 OF 17 REFERENCES

Thompson sampling with the online bootstrap

This work explains BTS and shows that the performance of BTS is competitive with Thompson sampling in the well-studied Bernoulli bandit case, and details why BTS using the online bootstrap is more scalable than regular Thompson sampling.
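
The scalability argument rests on replacing the full reward history with a fixed number of bootstrap replicates that are updated online. Below is a hedged sketch of one arm's state, assuming Poisson(1) resampling weights (one common online-bootstrap scheme; the cited paper's exact scheme may differ), with class and field names of my own choosing.

```python
import numpy as np

class OnlineBootstrapArm:
    """One arm's state for online-bootstrap Thompson sampling (illustrative sketch)."""

    def __init__(self, n_boot=100, rng=None):
        self.rng = np.random.default_rng() if rng is None else rng
        # Each replicate starts with one artificial success and one failure
        # so its mean is defined before any real data arrive.
        self.counts = np.full(n_boot, 2.0)
        self.sums = np.full(n_boot, 1.0)

    def update(self, reward):
        # Add the observation to every replicate with a random Poisson(1) weight,
        # approximating a bootstrap resample without storing the history.
        w = self.rng.poisson(1.0, size=self.counts.shape)
        self.counts += w
        self.sums += w * reward

    def sample_mean(self):
        # Drawing one replicate at random stands in for sampling from the posterior.
        j = self.rng.integers(len(self.counts))
        return self.sums[j] / self.counts[j]
```

At each round one would call sample_mean() on every arm, pull the argmax, and update() only the pulled arm; memory stays at O(n_boot) per arm regardless of how many rewards have been observed.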

Learning to Optimize via Posterior Sampling

A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes; the bound depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.

(More) Efficient Reinforcement Learning via Posterior Sampling

An Õ(τS√(AT)) bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

Generalization and Exploration via Randomized Value Functions

The results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.

Sub-sampling for Multi-armed Bandits

A novel algorithm based on sub-sampling is introduced that demonstrates excellent empirical performance against state-of-the-art algorithms, including Thompson sampling and KL-UCB.

Further Optimal Regret Bounds for Thompson Sampling

A novel regret analysis for Thompson Sampling is provided that proves the first near-optimal problem-independent bound of O(√(NT ln T)) on the expected regret of this algorithm, and simultaneously provides the optimal problem-dependent bound.

Model-based Reinforcement Learning and the Eluder Dimension

This work shows that, if the MDP can be parameterized within some known function class, it can obtain regret bounds that scale with the dimensionality, rather than cardinality, of the system.

Near-optimal Reinforcement Learning in Factored MDPs

It is established that, if the system is known to be a factored MDP, it is possible to achieve regret that scales polynomially in the number of parameters encoding the factored MDP, which may be exponentially smaller than S or A.

R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning

R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time and formally justifies the "optimism under uncertainty" bias used in many RL algorithms.

Human-level control through deep reinforcement learning

This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.