# Neural Thompson Sampling

@article{Zhang2020NeuralTS, title={Neural Thompson Sampling}, author={Weitong Zhang and Dongruo Zhou and Lihong Li and Quanquan Gu}, journal={ArXiv}, year={2020}, volume={abs/2010.00827} }

Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network. We prove…
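The arm-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two-layer ReLU network, the function names (`forward_and_grad`, `neural_ts_select`, `update_U_inv`), and the parameters `lam`, `nu`, and `width` are all assumptions made for the example. The variance form `lam * g^T U^{-1} g / width` is likewise an assumed concrete instance of a variance "built upon the neural tangent features" (the network gradient `g`). Network training between rounds is omitted.

```python
import numpy as np

def forward_and_grad(params, x):
    """Return f(x) and the flattened gradient of f w.r.t. all weights.

    Hypothetical network: f(x) = w2 . relu(W1 x).
    """
    W1, w2 = params
    pre = W1 @ x
    hidden = np.maximum(pre, 0.0)
    value = float(w2 @ hidden)
    grad_W1 = np.outer(w2 * (pre > 0), x)        # df/dW1
    grad = np.concatenate([grad_W1.ravel(), hidden])  # [df/dW1, df/dw2]
    return value, grad

def neural_ts_select(params, contexts, U_inv, lam, nu, width, rng):
    """Sample one reward per arm from N(f(x), nu^2 * sigma^2) and pick the best."""
    samples, grads = [], []
    for x in contexts:
        mean, g = forward_and_grad(params, x)
        # Assumed variance form based on the gradient (neural tangent) features.
        var = max(lam * float(g @ U_inv @ g) / width, 0.0)
        samples.append(rng.normal(mean, nu * np.sqrt(var)))
        grads.append(g)
    arm = int(np.argmax(samples))
    return arm, grads[arm]

def update_U_inv(U_inv, g, width):
    """Sherman-Morrison update of U^{-1} after U <- U + g g^T / width."""
    v = g / np.sqrt(width)
    Uv = U_inv @ v
    return U_inv - np.outer(Uv, Uv) / (1.0 + float(v @ Uv))
```

Starting from `U_inv = np.eye(n_params) / lam`, each round would call `neural_ts_select`, observe the reward of the chosen arm, call `update_U_inv` with the returned gradient, and then retrain the network on the data collected so far.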

## 46 Citations

### EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits

- Computer Science, ICLR
- 2022

This paper proposes EE-Net, a novel neural exploration strategy for contextual bandits that is distinct from the standard UCB-based and TS-based approaches, proves that EE-Net can achieve $\mathcal{O}(\sqrt{T\log T})$ regret, and shows that EE-Net outperforms existing linear and neural contextual bandit baselines on real-world datasets.

### Learning Neural Contextual Bandits through Perturbed Rewards

- Computer Science, ICLR
- 2022

It is proved that a $\widetilde{O}(\widetilde{d}\sqrt{T})$ regret upper bound is still achievable under standard regularity conditions, where $T$ is the number of rounds of interactions and $\widetilde{d}$ is the effective dimension of a neural tangent kernel matrix.

### Neural Exploitation and Exploration of Contextual Bandits

- Computer Science, ArXiv
- 2023

This paper proposes EE-Net, a novel neural exploitation and exploration strategy that uses a second neural network (the exploration network) to adaptively learn the potential gain relative to the currently estimated reward, and shows that EE-Net outperforms related linear and neural contextual bandit baselines on real-world datasets.

### Neural Contextual Bandits without Regret

- Computer Science, AISTATS
- 2022

This work analyzes NTK-UCB, a kernelized bandit optimization algorithm employing the Neural Tangent Kernel, and bounds its regret in terms of the NTK maximum information gain $\gamma_T$, a complexity parameter capturing the difficulty of learning.

### Neural Contextual Bandits via Reward-Biased Maximum Likelihood Estimation

- Computer Science, ArXiv
- 2022

This paper proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration and achieves comparable or better empirical regrets than the state-of-the-art methods on real-world datasets with non-linear reward functions.

### An Empirical Study of Neural Kernel Bandits

- Computer Science, ArXiv
- 2021

This work proposes directly applying NK-induced distributions to guide an upper confidence bound or Thompson sampling-based policy, and shows that NK bandits achieve state-of-the-art performance on highly non-linear structured data.

### Maximum Entropy Exploration in Contextual Bandits with Neural Networks and Energy Based Models

- Computer Science, Entropy
- 2023

Inspired by theories of human cognition, this work introduces novel techniques that use maximum entropy exploration, relying on neural networks to find optimal policies in settings with both continuous and discrete action spaces, and shows that both techniques outperform standard baseline algorithms.

### Langevin Monte Carlo for Contextual Bandits

- Computer Science, ICML
- 2022

This work proposes an efficient posterior sampling algorithm, viz., Langevin Monte Carlo Thompson Sampling (LMC-TS), that uses Markov Chain Monte Carlo (MCMC) methods to directly sample from the posterior distribution in contextual bandits.
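The idea of sampling from the posterior with Langevin dynamics can be sketched as below. This is an illustration under stated assumptions rather than the LMC-TS algorithm as published: it assumes a linear reward model with a ridge-regression loss, and the names `langevin_sample`, `lmc_ts_select`, `step`, `inv_temp`, and `lam` are hypothetical. Each step is a gradient step on the loss plus Gaussian noise scaled by `sqrt(2 * step / inv_temp)`, which is the standard unadjusted Langevin update.

```python
import numpy as np

def langevin_sample(theta, X, r, step, inv_temp, n_steps, lam, rng):
    """Draw an approximate posterior sample by noisy gradient descent.

    Assumed loss: 0.5 * ||X theta - r||^2 + 0.5 * lam * ||theta||^2.
    """
    for _ in range(n_steps):
        grad = X.T @ (X @ theta - r) + lam * theta
        noise = rng.standard_normal(theta.shape)
        theta = theta - step * grad + np.sqrt(2.0 * step / inv_temp) * noise
    return theta

def lmc_ts_select(theta, contexts):
    """Act greedily with respect to the sampled parameter vector."""
    return int(np.argmax(contexts @ theta))
```

In a bandit loop, `X` and `r` would hold the contexts and rewards of the arms pulled so far, and `theta` from the previous round can serve as a warm start for the next call to `langevin_sample`.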

### Reward-Biased Maximum Likelihood Estimation for Neural Contextual Bandits

- Computer Science
- 2022

This paper proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration and achieves comparable or better empirical regrets than the state-of-the-art methods on real-world datasets with non-linear reward functions.

### Neural Contextual Bandits with Deep Representation and Shallow Exploration

- Computer Science, ICLR
- 2022

A novel learning algorithm is proposed that transforms the raw feature vector using the last hidden layer of a deep ReLU neural network, and uses an upper confidence bound (UCB) approach to explore in the last linear layer (shallow exploration).

## 61 References

### Neural Contextual Bandits with UCB-based Exploration

- Computer Science, ICML
- 2020

A new algorithm, NeuralUCB, is proposed, which leverages the representation power of deep neural networks and uses a neural network-based random feature mapping to construct an upper confidence bound (UCB) of reward for efficient exploration.

### Thompson Sampling for Contextual Bandits with Linear Payoffs

- Computer Science, ICML
- 2013

A generalization of the Thompson Sampling algorithm is designed and analyzed for the stochastic contextual multi-armed bandit problem with linear payoff functions, where the contexts are provided by an adaptive adversary.
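Thompson Sampling with linear payoffs can be sketched as follows: keep a Gaussian posterior over the unknown parameter vector, sample from it each round, and play the arm whose context has the largest inner product with the sample. The class name `LinearThompsonSampling`, the scale parameter `v`, and the identity prior are assumptions made for this example, and the code is a simplified sketch rather than the paper's exact algorithm.

```python
import numpy as np

class LinearThompsonSampling:
    """Gaussian-posterior Thompson Sampling for linear rewards (sketch)."""

    def __init__(self, dim, v=1.0, seed=0):
        self.B = np.eye(dim)      # precision-like matrix: I + sum of x x^T
        self.f = np.zeros(dim)    # sum of reward-weighted contexts
        self.v = v                # assumed posterior scale parameter
        self.rng = np.random.default_rng(seed)

    def select(self, contexts):
        B_inv = np.linalg.inv(self.B)
        B_inv = 0.5 * (B_inv + B_inv.T)   # symmetrize against round-off
        mu_hat = B_inv @ self.f           # posterior mean
        mu_tilde = self.rng.multivariate_normal(mu_hat, self.v**2 * B_inv)
        return int(np.argmax(contexts @ mu_tilde))

    def update(self, context, reward):
        self.B += np.outer(context, context)
        self.f += reward * context
```

Each round would call `select` with an array of shape `(n_arms, dim)`, observe the reward of the chosen arm, and pass that arm's context and reward to `update`.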

### Learning to Optimize via Posterior Sampling

- Computer Science, Math. Oper. Res.
- 2014

A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes; it depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.

### Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling

- Computer Science, ICLR
- 2018

This work benchmarks well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems and finds that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario.

### On Kernelized Multi-armed Bandits

- Computer Science, ICML
- 2017

This work provides two new Gaussian process-based algorithms for continuous bandit optimization, Improved GP-UCB and GP-Thompson sampling (GP-TS), derives corresponding regret bounds, and derives a new self-normalized concentration inequality for vector-valued martingales of arbitrary, possibly infinite, dimension.

### Linear Thompson Sampling Revisited

- Computer Science, AISTATS
- 2017

Thompson sampling can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach.

### Bootstrapped Thompson Sampling and Deep Exploration

- Computer Science, ArXiv
- 2015

This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions. The…

### A Tutorial on Thompson Sampling

- Computer Science, Found. Trends Mach. Learn.
- 2018

This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes.

### Efficient Exploration Through Bayesian Deep Q-Networks

- Computer Science, 2018 Information Theory and Applications Workshop (ITA)
- 2018

Bayesian Deep Q-Network (BDQN), a practical Thompson sampling based Reinforcement Learning (RL) Algorithm, is proposed, which can be trained with fast closed-form updates and its samples can be drawn efficiently through the Gaussian distribution.

### Reinforcement Leaning in Feature Space: Matrix Bandit, Kernels, and Regret Bound

- Computer Science, ICML
- 2020

These results are the first regret bounds that are near-optimal in time $T$ and dimension $d$ (or $\widetilde{d}$) and polynomial in the planning horizon $H$.