Continuous-in-time Limit for Bayesian Bandits

@article{Zhu2022ContinuousintimeLF,
  title={Continuous-in-time Limit for Bayesian Bandits},
  author={Yuhua Zhu and Zachary Izzo and Lexing Ying},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.07513}
}
This paper revisits the bandit problem in the Bayesian setting. The Bayesian approach formulates the bandit problem as an optimization problem, and the goal is to find the optimal policy which minimizes the Bayesian regret. One of the main challenges facing the Bayesian approach is that computation of the optimal policy is often intractable, especially when the length of the problem horizon or the number of arms is large. In this paper, we first show that under a suitable rescaling, the Bayesian… 
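
For reference, the Bayesian regret that the optimal policy minimizes can be written as follows (a standard textbook definition; the prior Q, horizon T, arm means μ_a, and actions A_t are notation introduced here for illustration, not taken from the truncated abstract):

```latex
% Bayesian regret of a policy \pi over horizon T under a prior Q on
% environments \nu; \mu_a(\nu) is the mean reward of arm a and A_t is
% the arm the policy pulls at step t.
\[
  \mathrm{BR}(T, \pi)
  = \mathbb{E}_{\nu \sim Q}\,
    \mathbb{E}\!\left[ \sum_{t=1}^{T}
      \Bigl( \max_{a} \mu_a(\nu) - \mu_{A_t}(\nu) \Bigr) \right],
  \qquad
  \pi^{\star} = \operatorname*{arg\,min}_{\pi} \mathrm{BR}(T, \pi).
\]
```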

References

Showing 1-10 of 28 references

Bandit Algorithms

Let E and Π be sets of environments and policies, respectively, and ℓ : E × Π → [0, 1] a bounded loss function. Given a policy π, let ℓ(π) = (ℓ(ν₁, π), …, ℓ(ν_N, π)) be the loss vector resulting from policy π.
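
As a concrete reading of this notation, here is a minimal sketch (the environments, policy, and loss below are toy illustrations, not taken from the book) that builds the loss vector of a policy across a finite set of environments:

```python
# Minimal sketch of the loss-vector notation above (names hypothetical).

def loss_vector(ell, environments, policy):
    """Return (ell(nu_1, policy), ..., ell(nu_N, policy))."""
    return tuple(ell(env, policy) for env in environments)

# Two Bernoulli environments given by their arm means, and a policy that
# always pulls arm 0; the loss is the per-round regret, bounded in [0, 1].
environments = [(0.3, 0.7), (0.6, 0.4)]
policy = 0
ell = lambda env, arm: max(env) - env[arm]
print(loss_vector(ell, environments, policy))   # ≈ (0.4, 0.0)
```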

A New Approach to Drifting Games, Based on Asymptotically Optimal Potentials

This work develops a new approach to drifting games, a class of two-person games with many applications to boosting and online learning settings, including Prediction with Expert Advice and the Hedge game; the approach gives new potentials and derives corresponding upper and lower bounds that match each other in the asymptotic regime.

A PDE-Based Analysis of the Symmetric Two-Armed Bernoulli Bandit

This work explicitly computes the leading-order term of the optimal regret and pseudoregret in three different scaling regimes for the gap, in the limit where the gap between the arm means goes to zero and the number of prediction periods approaches infinity.

Diffusion Approximations for Thompson Sampling

The weak convergence theory covers both the classical multi-armed and linear bandit settings, and can be used to obtain insight into the characteristics of the regret distribution when there is information sharing among arms, as well as into the effects of variance estimation, model mis-specification, and batched updates in bandit learning.
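
For readers unfamiliar with the algorithm being approximated, here is a minimal Beta-Bernoulli Thompson sampling sketch (the textbook version, not this paper's diffusion setting); it tracks per-step regret, so one could inspect the regret distribution the summary refers to:

```python
import numpy as np

def thompson_sampling(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling; returns cumulative regret per step."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)                     # Beta(1, 1) prior on each arm
    beta = np.ones(k)
    best = max(true_means)
    regret = np.zeros(horizon)
    for t in range(horizon):
        theta = rng.beta(alpha, beta)      # one posterior sample per arm
        a = int(np.argmax(theta))          # pull the most promising arm
        reward = rng.binomial(1, true_means[a])
        alpha[a] += reward                 # conjugate posterior update
        beta[a] += 1 - reward
        regret[t] = best - true_means[a]   # per-step pseudoregret
    return np.cumsum(regret)

# Two arms with a small gap; print the final cumulative regret.
print(thompson_sampling([0.5, 0.55], horizon=1000)[-1])
```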

Diffusion Asymptotics for Sequential Experiments

This work proposes a new diffusion-asymptotic analysis for sequentially randomized experiments that lets the mean signal level scale as 1/√n so as to preserve the difficulty of the learning task as n gets large.
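
The role of the 1/√n scaling can be illustrated with a toy calculation (an illustrative sketch under Gaussian-noise assumptions, not the paper's construction): if the mean signal is c/√n, the z-statistic computed from n noisy samples stays O(1), so the detection task neither trivializes nor becomes hopeless as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 2.0                                   # fixed signal strength
for n in [100, 10_000, 1_000_000]:
    samples = rng.normal(loc=c / np.sqrt(n), scale=1.0, size=n)
    z = samples.mean() * np.sqrt(n)       # concentrates near c for every n
    print(f"n={n:>9}  z={z:.2f}")
```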

A Note on Optimization Formulations of Markov Decision Processes

This note summarizes the optimization formulations used in the study of Markov decision processes. We consider both the discounted and undiscounted processes under the standard and the …
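
As one example of such a formulation, the textbook linear-programming form of a discounted MDP is sketched below (assumed for illustration; the note's own conventions and coverage may differ):

```latex
% Textbook LP formulation of a discounted MDP (illustrative assumption).
% Here \gamma \in (0,1) is the discount, r(s,a) the reward,
% P(s'|s,a) the transition kernel, and \mu an initial state
% distribution with positive weights.
\[
  \min_{v} \; \sum_{s} \mu(s)\, v(s)
  \quad \text{s.t.} \quad
  v(s) \;\ge\; r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v(s'),
  \qquad \forall\, s, a,
\]
% whose optimal solution is the value function v^* of the discounted MDP.
```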

Diffusion Approximations for a Class of Sequential Experimentation Problems

A diffusion approximation is derived for the sequential experimentation problem of a seller who wants to select an optimal assortment of products to launch into the marketplace but is uncertain about consumers' preferences, and the effectiveness and robustness of the heuristics derived from this approximation are demonstrated.

Sequential Procurement with Contractual and Experimental Learning

This work identifies the effect that strategic sellers have on the buyer's optimal strategy relative to more traditional learning dynamics, and establishes that, paradoxically, when sellers are strategic, the ability to observe delivered quality is not always beneficial for the buyer.

Recommender Systems as Mechanisms for Social Learning

This article studies how a recommender system may incentivize users to learn about a product collaboratively, how to “seed” incentives for user exploration, and how these choices determine the speed and trajectory of social learning.