# Langevin Monte Carlo for Contextual Bandits

@inproceedings{Xu2022LangevinMC,
title={Langevin Monte Carlo for Contextual Bandits},
author={Pan Xu and Hongkai Zheng and Eric V. Mazumdar and Kamyar Azizzadenesheli and Anima Anandkumar},
booktitle={International Conference on Machine Learning},
year={2022}
}
• Published in International Conference on Machine Learning, 22 June 2022 (Computer Science)
We study the efficiency of Thompson sampling for contextual bandits. Existing Thompson sampling-based algorithms need to construct a Laplace approximation (i.e., a Gaussian distribution) of the posterior distribution, which is inefficient to sample in high-dimensional applications for general covariance matrices. Moreover, the Gaussian approximation may not be a good surrogate for the posterior distribution for general reward generating functions. We propose an efficient posterior sampling…
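The abstract contrasts Laplace-approximation sampling with direct Langevin-based posterior sampling. The idea can be sketched for a linear-Gaussian contextual bandit; the function names, step sizes, and ridge prior below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def lmc_posterior_sample(X, y, theta0, step=1e-3, n_steps=100, prior_prec=1.0, rng=None):
    """Draw an approximate posterior sample with unadjusted Langevin dynamics.

    Assumes a Gaussian likelihood y ~ N(X theta, I) and a ridge prior, so the
    log-posterior gradient is X^T (y - X theta) - prior_prec * theta.
    No covariance matrix is formed or inverted, unlike Laplace-based sampling.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = theta0.copy()
    for _ in range(n_steps):
        grad = X.T @ (y - X @ theta) - prior_prec * theta
        theta += step * grad + np.sqrt(2.0 * step) * rng.standard_normal(theta.shape)
    return theta

def thompson_choose(arm_feats, X, y, theta0, rng=None):
    """Thompson sampling step: act greedily w.r.t. one Langevin posterior sample."""
    theta = lmc_posterior_sample(X, y, theta0, rng=rng)
    return int(np.argmax(arm_feats @ theta)), theta
```

In the sequential setting the chain can be warm-started from the previous round's sample, so only a few gradient steps per round are needed.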
## Citations (3)

• ArXiv, 2022 (Computer Science, Mathematics): It is shown that HMC can sample from a distribution that is $\varepsilon$-close in total variation distance using $\widetilde{O}(\sqrt{\kappa}\, d^{1/4} \log(1/\varepsilon))$ gradient queries, where $\kappa$ is the condition number of $\Sigma$.
• ArXiv, 2022 (Computer Science): It is shown that greedy algorithms consistently outperform algorithms with efficient exploration, such as Thompson sampling, given enough timesteps, where the number of timesteps required increases with the complexity of the underlying features.
• Journal of Systems Science and Systems Engineering, 2022 (Computer Science): This paper applies the Thompson Sampling algorithm to the disjoint model, and provides a comprehensive regret analysis for a variant of the proposed algorithm that holds with probability $1 - \delta$ under the mean-variance criterion with risk tolerance $\rho$.
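The first citation above bounds the number of gradient queries HMC needs to sample a Gaussian to within $\varepsilon$ in total variation. A minimal leapfrog HMC transition, in which the target is accessed only through its potential $U(x) = -\log \pi(x)$ and gradient (the "gradient queries" being counted), might look as follows; this is an illustrative sketch, not the variant analyzed in the cited paper:

```python
import numpy as np

def hmc_step(x, U, grad_U, step=0.2, n_leapfrog=10, rng=None):
    """One HMC transition: leapfrog integration plus Metropolis correction."""
    rng = np.random.default_rng() if rng is None else rng
    p0 = rng.standard_normal(x.shape)                # resample momentum
    xn, p = x.copy(), p0 - 0.5 * step * grad_U(x)    # opening half momentum step
    for i in range(n_leapfrog):
        xn = xn + step * p                           # full position step
        if i < n_leapfrog - 1:
            p = p - step * grad_U(xn)                # full momentum step
    p = p - 0.5 * step * grad_U(xn)                  # closing half momentum step
    # Metropolis accept/reject on the Hamiltonian H = U(x) + |p|^2 / 2
    h_old = U(x) + 0.5 * p0 @ p0
    h_new = U(xn) + 0.5 * p @ p
    return xn if np.log(rng.uniform()) < h_old - h_new else x
```

Each transition uses `n_leapfrog + 1` gradient evaluations, which is the cost the cited bound counts.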

## References

Showing 1–10 of 50 references.

• ICML, 2013 (Computer Science): A generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary, is designed and analyzed.
• ICML, 2020 (Computer Science): This work proposes two efficient Langevin MCMC algorithms tailored to Thompson sampling, and derives novel posterior concentration bounds and MCMC convergence rates for log-concave distributions which may be of independent interest.
• ICLR, 2021 (Computer Science): This paper proposes a new algorithm, Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation, with a novel posterior distribution of the reward whose mean is the neural network approximator and whose variance is built upon the neural tangent features of the corresponding network.
• ICLR, 2018 (Computer Science): This work benchmarks well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems, and finds that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario.
• ICML, 2020 (Computer Science): A new algorithm, NeuralUCB, is proposed, which leverages the representation power of deep neural networks and uses a neural network-based random feature mapping to construct an upper confidence bound (UCB) of the reward for efficient exploration.
• ICML, 2020 (Computer Science): A new probabilistic modeling framework for Thompson sampling is proposed, in which local latent variable uncertainty is used to sample the mean reward, and a semi-implicit structure is further introduced to enhance its expressiveness.
• Math. Oper. Res., 2014 (Computer Science): A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes; it depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.
• ICML, 2011 (Computer Science): This paper proposes a new framework for learning from large-scale datasets based on iterative learning from small mini-batches, showing that by adding the right amount of noise to a standard stochastic gradient optimization algorithm, the iterates converge to samples from the true posterior distribution as the step size is annealed.
• NeurIPS, 2019 (Computer Science): It is shown that even small constant inference error can lead to poor performance (linear regret) due to under-exploration (for $\alpha < 1$) or over-exploration (for $\alpha > 0$) by the approximation; while for $\alpha > 0$ this is unavoidable, the regret can otherwise be improved by adding a small amount of forced exploration.
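The ICML 2011 reference above describes what is widely known as stochastic gradient Langevin dynamics (SGLD), the sampler family that Langevin-based Thompson sampling builds on. Its update can be sketched as follows; the helper-function names are assumptions for illustration, not the authors' pseudocode:

```python
import numpy as np

def sgld_step(theta, grad_log_prior, minibatch_grad_log_lik, n_total, n_batch, step, rng):
    """One SGLD update: noisy SGD whose injected noise matches the step size.

    The minibatch likelihood gradient is rescaled by n_total / n_batch so it is
    an unbiased estimate of the full-data gradient; the injected Gaussian noise
    has variance equal to the step size, so as the step size is annealed the
    iterates transition from stochastic optimization to posterior sampling.
    """
    grad = grad_log_prior(theta) + (n_total / n_batch) * minibatch_grad_log_lik(theta)
    noise = np.sqrt(step) * rng.standard_normal(theta.shape)
    return theta + 0.5 * step * grad + noise
```

With the noise term removed this is exactly stochastic gradient ascent on the log-posterior, which is what makes the method cheap enough for large datasets.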