GuideBoot: Guided Bootstrap for Deep Contextual Bandits in Online Advertising

  • Feiyang Pan, Haoming Li, Xiang Ao, Wei Wang, Yanrong Kang, Ao Tan, Qingwei He
  • Proceedings of the Web Conference 2021
  • Published 19 April 2021
The exploration/exploitation (E&E) dilemma lies at the core of interactive systems such as online advertising, for which contextual bandit algorithms have been proposed. Bayesian approaches provide guided exploration via uncertainty estimation, but their applicability is often limited by over-simplified assumptions. Non-Bayesian bootstrap methods, on the other hand, can handle complex problems by using deep reward models, but lack clear guidance for the exploration behavior. It still…
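The bootstrap-based exploration the abstract contrasts with Bayesian methods can be sketched as follows. This is a generic online-bootstrap Thompson-sampling bandit, not the paper's GuideBoot algorithm; the ensemble size, ridge regularizer, and toy environment are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class OnlineBootstrapBandit:
    """Bootstrapped Thompson sampling for a linear contextual bandit.

    Maintains an ensemble of B ridge-regression reward models; each
    observation gets an independent Poisson(1) weight per model (the
    classic online-bootstrap trick), so the ensemble approximates a
    posterior over reward parameters without Bayesian assumptions.
    """

    def __init__(self, n_arms, dim, n_models=10, lam=1.0):
        self.n_arms = n_arms
        self.n_models = n_models
        # Per-model, per-arm sufficient statistics for weighted ridge regression.
        self.A = np.tile(lam * np.eye(dim), (n_models, n_arms, 1, 1))
        self.b = np.zeros((n_models, n_arms, dim))

    def select(self, context):
        # Thompson-style step: sample one bootstrap model, act greedily under it.
        m = rng.integers(self.n_models)
        theta = np.stack([np.linalg.solve(self.A[m, a], self.b[m, a])
                          for a in range(self.n_arms)])
        return int(np.argmax(theta @ context))

    def update(self, arm, context, reward):
        # Each model resamples the new observation with a Poisson(1) weight.
        for m in range(self.n_models):
            w = rng.poisson(1.0)
            if w > 0:
                self.A[m, arm] += w * np.outer(context, context)
                self.b[m, arm] += w * reward * context

# Toy run: 2 arms whose true reward is a noisy linear function of the context.
true_theta = np.array([[1.0, 0.0], [0.0, 1.0]])
bandit = OnlineBootstrapBandit(n_arms=2, dim=2)
for _ in range(500):
    x = rng.normal(size=2)
    a = bandit.select(x)
    r = true_theta[a] @ x + 0.1 * rng.normal()
    bandit.update(a, x, r)
```

The disagreement among ensemble members supplies the exploration signal; as data accumulates, the models converge and the policy becomes greedy, which is exactly the behavior the paper seeks to guide more explicitly.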

Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling
This work benchmarks well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems and finds that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario.
Thompson Sampling for Contextual Bandits with Linear Payoffs
This work designs and analyzes a generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, where the contexts are provided by an adaptive adversary.
Personalized Recommendation via Parameter-Free Contextual Bandits
This work proposes a parameter-free bandit strategy, which employs a principled resampling approach called online bootstrap, to derive the distribution of estimated models in an online manner and demonstrates the effectiveness of the proposed algorithm in terms of the click-through rate.
New Insights into Bootstrapping for Bandits
This work shows that the commonly used non-parametric bootstrapping (NPB) procedure can be provably inefficient, establishes a near-linear lower bound on the regret it incurs under the bandit model with Bernoulli rewards, and proposes a weighted bootstrapping (WB) procedure.
Contextual Gaussian Process Bandit Optimization
This work models the payoff function as a sample from a Gaussian process defined over the joint context-action space and develops CGP-UCB, an intuitive upper-confidence-style algorithm; the results show that context-sensitive optimization outperforms no or naive use of context.
A contextual-bandit approach to personalized news article recommendation
This work models personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks.
Ad click prediction: a view from the trenches
The goal of this paper is to highlight the close relationship between theoretical advances and practical engineering in this industrial setting, and to show the depth of challenges that appear when applying traditional machine learning methods in a complex dynamic system.
Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits
This work proposes a bandit algorithm that explores by randomizing its reward history: it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history augmented with pseudo-rewards, an approach that easily generalizes to structured problems.
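The pseudo-reward idea above can be sketched for Bernoulli arms. This is a minimal illustration in the spirit of that paper, not its exact algorithm; the padding constant and function name are illustrative assumptions.

```python
import random

def bootstrap_pseudo_reward_step(history, n_arms, a=1):
    """Pick an arm by bootstrapping each arm's 0/1 reward history,
    padded with `a` pseudo-reward pairs (one 0 and one 1) per real
    observation, so even a short, all-zero history can look optimistic
    in some bootstrap samples and still get explored."""
    means = []
    for arm in range(n_arms):
        h = list(history[arm])
        padded = h + [0, 1] * (a * max(len(h), 1))
        # Non-parametric bootstrap: resample the padded history with replacement.
        sample = random.choices(padded, k=len(padded))
        means.append(sum(sample) / len(sample))
    # Act greedily with respect to the bootstrap sample's means.
    return max(range(n_arms), key=lambda i: means[i])
```

The randomness of the resampling plays the role of posterior sampling, while the pseudo-rewards keep enough variance in the bootstrap distribution to avoid the under-exploration that plain non-parametric bootstrapping can suffer from.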
Bootstrapped Thompson Sampling and Deep Exploration
This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions.
Finite-time Analysis of the Multiarmed Bandit Problem
This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.