Langevin Monte Carlo for Contextual Bandits

Pan Xu, Hongkai Zheng, Eric V. Mazumdar, Kamyar Azizzadenesheli, and Anima Anandkumar. Langevin Monte Carlo for Contextual Bandits. In International Conference on Machine Learning.
We study the efficiency of Thompson sampling for contextual bandits. Existing Thompson sampling-based algorithms need to construct a Laplace approximation (i.e., a Gaussian distribution) of the posterior distribution, which is inefficient to sample from in high-dimensional applications with general covariance matrices. Moreover, the Gaussian approximation may not be a good surrogate for the posterior distribution for general reward-generating functions. We propose an efficient posterior sampling…
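The idea of replacing the Laplace approximation with direct Langevin sampling can be illustrated with a minimal sketch (not the authors' implementation): unadjusted Langevin dynamics on a linear-Gaussian posterior, warm-started across rounds inside a Thompson sampling loop. The toy two-armed bandit, step sizes, and all function names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_posterior_sample(theta, X, y, step=1e-3, n_steps=50, prior_prec=1.0):
    """Approximately sample theta from the posterior of a linear-Gaussian
    model via unadjusted Langevin dynamics on the negative log-posterior."""
    for _ in range(n_steps):
        grad = X.T @ (X @ theta - y) + prior_prec * theta  # neg. log-posterior gradient
        theta = theta - step * grad + np.sqrt(2 * step) * rng.standard_normal(theta.shape)
    return theta

# Thompson sampling loop on a toy two-armed linear bandit.
d, T = 3, 200
theta_star = rng.standard_normal(d)              # unknown true parameter
X_hist, y_hist = np.zeros((0, d)), np.zeros(0)   # observed contexts / rewards
theta = np.zeros(d)
for t in range(T):
    arms = rng.standard_normal((2, d))           # two random context vectors
    if len(y_hist):
        theta = langevin_posterior_sample(theta, X_hist, y_hist)
    a = int(np.argmax(arms @ theta))             # act greedily on the posterior sample
    r = arms[a] @ theta_star + 0.1 * rng.standard_normal()
    X_hist = np.vstack([X_hist, arms[a]])
    y_hist = np.append(y_hist, r)
```

Warm-starting the chain from the previous round's sample is what keeps the per-round cost to a few gradient steps rather than a full re-run of the sampler.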


Hamiltonian Monte Carlo for efficient Gaussian sampling: long and random steps

It is shown that HMC can sample from a distribution that is ε-close in total variation distance using Õ(√κ · d^{1/4} · log(1/ε)) gradient queries, where κ is the condition number of Σ.
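As a rough illustration of the sampler being analyzed (not the paper's long-and-random-steps schedule), a minimal leapfrog HMC targeting a 2-D Gaussian N(0, Σ) might look like the following; the step size, trajectory length, and sample count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def hmc_gaussian(Sigma_inv, x, step=0.1, n_leapfrog=20, n_samples=500):
    """Minimal HMC with leapfrog integration targeting N(0, Sigma),
    where the potential is U(x) = 0.5 * x^T Sigma_inv x."""
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(x.shape)            # resample momentum
        x_new, p_new = x.copy(), p.copy()
        p_new -= 0.5 * step * (Sigma_inv @ x_new)   # half momentum step
        for _ in range(n_leapfrog - 1):
            x_new += step * p_new                   # full position step
            p_new -= step * (Sigma_inv @ x_new)     # full momentum step
        x_new += step * p_new
        p_new -= 0.5 * step * (Sigma_inv @ x_new)   # final half momentum step
        # Metropolis accept/reject with Hamiltonian H = U(x) + 0.5 * |p|^2.
        H_old = 0.5 * x @ Sigma_inv @ x + 0.5 * p @ p
        H_new = 0.5 * x_new @ Sigma_inv @ x_new + 0.5 * p_new @ p_new
        if rng.random() < np.exp(H_old - H_new):
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
samples = hmc_gaussian(np.linalg.inv(Sigma), np.zeros(2))
```

For a Gaussian target the leapfrog energy error is small, so nearly every proposal is accepted; the condition number κ of Σ governs how small the step must be, which is where the √κ factor in the query bound comes from.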

Ungeneralizable Contextual Logistic Bandit in Credit Scoring

It is shown that greedy algorithms consistently outperform algorithms with efficient exploration, such as Thompson sampling, given enough timesteps, where the number of timesteps required increases with the complexity of the underlying features.

Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs

This paper applies the Thompson Sampling algorithm for the disjoint model, and provides a comprehensive regret analysis for a variant of the proposed algorithm that holds with probability 1 − δ under the mean-variance criterion with risk tolerance ρ.

Thompson Sampling for Contextual Bandits with Linear Payoffs

A generalization of the Thompson Sampling algorithm is designed and analyzed for the stochastic contextual multi-armed bandit problem with linear payoff functions, in which the contexts are provided by an adaptive adversary.

On Approximate Thompson Sampling with Langevin Algorithms

This work proposes two efficient Langevin MCMC algorithms tailored to Thompson sampling and derives novel posterior concentration bounds and MCMC convergence rates for logconcave distributions which may be of independent interest.

Neural Thompson Sampling

This paper proposes a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation, using a novel posterior distribution of the reward whose mean is the neural network approximator and whose variance is built upon the neural tangent features of the corresponding network.

Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling

This work benchmarks well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems and finds that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario.

Neural Contextual Bandits with UCB-based Exploration

A new algorithm, NeuralUCB, is proposed, which leverages the representation power of deep neural networks and uses a neural network-based random feature mapping to construct an upper confidence bound (UCB) of reward for efficient exploration.
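The UCB construction described above can be sketched in its linear form (a hypothetical simplification: plain linear features stand in for NeuralUCB's neural-tangent feature mapping). The exploration weight `alpha`, horizon, and arm counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# LinUCB-style sketch: reward estimate plus a confidence width measured
# in the inverse-Gram-matrix metric.
d, T, alpha = 4, 300, 1.0
theta_star = rng.standard_normal(d)   # unknown true parameter
A = np.eye(d)                         # regularized Gram matrix
b = np.zeros(d)                       # accumulated reward-weighted features
for t in range(T):
    arms = rng.standard_normal((5, d))            # five candidate contexts
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                         # ridge estimate
    # UCB = estimated reward + alpha * feature norm under A^{-1}.
    ucb = arms @ theta_hat + alpha * np.sqrt(
        np.einsum('ij,jk,ik->i', arms, A_inv, arms))
    a = int(np.argmax(ucb))
    r = arms[a] @ theta_star + 0.1 * rng.standard_normal()
    A += np.outer(arms[a], arms[a])
    b += r * arms[a]
```

The bonus term shrinks in directions the algorithm has already pulled often, so arms in under-explored feature directions keep an inflated index until they are tried.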

Thompson Sampling via Local Uncertainty

A new probabilistic modeling framework for Thompson sampling is proposed, where local latent variable uncertainty is used to sample the mean reward, and semi-implicit structure is further introduced to enhance its expressiveness.

Learning to Optimize via Posterior Sampling

A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes; it depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.

Bayesian Learning via Stochastic Gradient Langevin Dynamics

In this paper we propose a new framework for learning from large scale datasets based on iterative learning from small mini-batches. By adding the right amount of noise to a standard stochastic gradient optimization algorithm, the iterates converge to samples from the true posterior distribution as the step size is annealed.
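The SGLD update can be sketched on a toy Bayesian linear regression, assuming unit-variance observation noise and a standard normal prior; the polynomial annealing schedule and all constants are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: y = X @ w_true + noise, with unit-variance noise (assumption).
N, d = 1000, 5
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = X @ w_true + rng.standard_normal(N)

w = np.zeros(d)
batch = 32
for t in range(2000):
    eps = 5e-4 / (1 + t) ** 0.55     # polynomially annealed step size
    idx = rng.integers(0, N, batch)  # random mini-batch
    # Mini-batch estimate of the negative log-posterior gradient:
    # likelihood term rescaled by N / batch, plus the N(0, I) prior term.
    grad = (N / batch) * X[idx].T @ (X[idx] @ w - y[idx]) + w
    # SGLD update: half gradient step plus Gaussian injected noise
    # whose variance matches the step size.
    w = w - 0.5 * eps * grad + np.sqrt(eps) * rng.standard_normal(d)
```

The injected noise term sqrt(eps) * N(0, I) is what distinguishes this from plain SGD: early on the gradient term dominates and the iterates behave like an optimizer, while as eps shrinks the noise dominates and the iterates behave like posterior samples.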

User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient

Thompson Sampling and Approximate Inference

It is shown that even small constant inference error can lead to poor performance (linear regret) due to under-exploration (for $\alpha < 0$) by the approximation; while for $\alpha > 0$ this is unavoidable, in the under-exploration regime the regret can be improved by adding a small amount of forced exploration.