Corpus ID: 235828732

No Regrets for Learning the Prior in Bandits

Soumya Sankar Basu, Branislav Kveton, Manzil Zaheer, Csaba Szepesvári
We propose AdaTS, a Thompson sampling algorithm that adapts sequentially to the bandit tasks it interacts with. The key idea in AdaTS is to adapt to an unknown task prior distribution by maintaining a distribution over its parameters. When solving a bandit task, that uncertainty is marginalized out and properly accounted for. AdaTS is a fully Bayesian algorithm that can be implemented efficiently in several classes of bandit problems. We derive upper bounds on its Bayes regret that quantify…
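The hierarchical sampling idea in the abstract — maintain a distribution over the unknown prior's parameters, sample from it, then run Thompson sampling under the sampled prior — can be sketched for Gaussian bandits with conjugate updates. Everything below (variable names, the per-arm diagonal model, the end-of-task hyper-posterior update) is our own illustrative simplification, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3            # number of arms
sigma = 1.0      # known reward-noise std
sigma_0 = 0.5    # known task-prior std; the prior MEAN mu_* is unknown

# Hyper-posterior over the unknown prior mean, one Gaussian per arm.
q_mean = np.zeros(K)
q_var = np.full(K, 4.0)

def run_task(mu_task, n_rounds=100):
    """One bandit task: Thompson sampling that first samples the prior
    mean from the hyper-posterior, then the task parameter under it."""
    counts = np.zeros(K)
    sums = np.zeros(K)
    for _ in range(n_rounds):
        mu = rng.normal(q_mean, np.sqrt(q_var))          # sample a prior mean
        prec = 1.0 / sigma_0**2 + counts / sigma**2      # task posterior precision
        mean = (mu / sigma_0**2 + sums / sigma**2) / prec
        theta = rng.normal(mean, np.sqrt(1.0 / prec))    # sample task parameter
        a = int(np.argmax(theta))                        # act greedily on the sample
        r = rng.normal(mu_task[a], sigma)
        counts[a] += 1.0
        sums[a] += r
    # Fold the task's evidence back into the hyper-posterior. This is a
    # simplified conjugate update on the empirical arm means (it ignores
    # correlations induced by adaptive sampling); the paper maintains the
    # exact marginal posterior instead.
    for a in range(K):
        if counts[a] > 0:
            obs_var = sigma_0**2 + sigma**2 / counts[a]
            new_prec = 1.0 / q_var[a] + 1.0 / obs_var
            q_mean[a] = (q_mean[a] / q_var[a]
                         + (sums[a] / counts[a]) / obs_var) / new_prec
            q_var[a] = 1.0 / new_prec

# Tasks share the same (unknown) prior mean, so the hyper-posterior
# tightens across tasks and later tasks explore less.
true_prior_mean = np.array([0.8, 0.5, 0.2])
for _ in range(5):
    run_task(rng.normal(true_prior_mean, sigma_0))
```

The key design choice mirrored here is the two-level sampling: uncertainty about the prior is itself sampled rather than plugged in as a point estimate, which is what lets the uncertainty be "marginalized out and properly accounted for."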


Meta-Learning for Simple Regret Minimization
This work proposes the first Bayesian and frequentist algorithms for the meta-learning problem of simple regret minimization in bandits, and instantiates them for several classes of bandit problems.
Thompson Sampling for Robust Transfer in Multi-Task Bandits
This work presents a TS-type algorithm for a more general online multi-task learning protocol, which extends the concurrent setting and demonstrates that the algorithm enjoys superior empirical performance in comparison with the UCB-based algorithm and a baseline algorithm that performs TS for each individual task without transfer.
Metalearning Linear Bandits by Prior Update
This work proves, in the context of stochastic linear bandits and Gaussian priors, that as long as the prior used by the algorithm is sufficiently close to the true prior, its performance is close to that of the algorithm that uses the true prior.
Hierarchical Bayesian Bandits
This work proposes and analyzes a natural hierarchical Thompson sampling algorithm (HierTS) for this class of problems, and confirms that hierarchical Bayesian bandits are a universal and statistically-efficient tool for learning to act with similar bandit tasks.
Meta-Learning Hypothesis Spaces for Sequential Decision-making
This work proposes to meta-learn a kernel from offline data and demonstrates the approach on the kernelized bandit problem (a.k.a. Bayesian optimization), where regret bounds competitive with those obtained given the true kernel are established.
Generalizing Hierarchical Bayesian Bandits
A Thompson sampling algorithm, G-HierTS, is proposed that uses a hierarchical structure to explore efficiently; its Bayes regret is bounded, and its computational efficiency is improved with minimal impact on empirical regret.
Deep Hierarchy in Bandits
This work proposes a hierarchical Thompson sampling algorithm (HierTS) for this problem, shows how to implement it efficiently for Gaussian hierarchies, and uses the resulting exact posterior to analyze the Bayes regret of HierTS in Gaussian bandits.
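For a Gaussian hierarchy like the one HierTS exploits, the posterior of the hyper-mean is available in closed form. The helper below is a hedged sketch under standard conjugacy assumptions (scalar rewards, known variances, per-task empirical means treated as fixed-count observations); the function name and interface are our own, not the paper's:

```python
import numpy as np

def hyper_posterior(nu, q2, task_means, task_counts, sigma2, sigma0_2):
    """Posterior N(mean, var) of the hyper-mean mu_* in a Gaussian
    hierarchy: mu_* ~ N(nu, q2), theta_s | mu_* ~ N(mu_*, sigma0_2),
    and rewards ~ N(theta_s, sigma2). Each task s contributes its
    empirical mean, whose marginal given mu_* is
    N(mu_*, sigma0_2 + sigma2 / n_s)."""
    task_means = np.asarray(task_means, dtype=float)
    task_counts = np.asarray(task_counts, dtype=float)
    obs_var = sigma0_2 + sigma2 / task_counts   # per-task marginal variance
    prec = 1.0 / q2 + np.sum(1.0 / obs_var)     # posterior precision
    mean = (nu / q2 + np.sum(task_means / obs_var)) / prec
    return mean, 1.0 / prec

# With several tasks, the posterior concentrates near the average task mean.
m, v = hyper_posterior(nu=0.0, q2=10.0,
                       task_means=[0.9, 1.1, 1.0, 0.8, 1.2],
                       task_counts=[50, 50, 50, 50, 50],
                       sigma2=1.0, sigma0_2=0.25)
```

Note how each task is down-weighted by `sigma0_2 + sigma2 / n_s`: even a task with infinitely many observations contributes at most precision `1 / sigma0_2`, reflecting irreducible between-task variation.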
Meta-Learning Adversarial Bandits
A unified meta-algorithm is designed that yields setting-specific guarantees for two important cases, multi-armed bandits (MAB) and bandit linear optimization (BLO). It is proved that unregularized follow-the-leader combined with multiplicative weights suffices to online learn a non-smooth, non-convex sequence of affine functions of Bregman divergences that upper-bounds the regret of OMD.
Towards Scalable and Robust Structured Bandits: A Meta-Learning Framework
A unified meta-learning framework is proposed for a general class of structured bandit problems where the parameter space can be factorized to the item level, and a meta Thompson sampling algorithm is designed for it.
Multi-task Representation Learning with Stochastic Linear Bandits
This work proposes an efficient greedy policy that implicitly learns a low dimensional representation by encouraging the matrix formed by the task regression vectors to be of low rank, and derives an upper bound on the multi-task regret of this policy.


Szepesvári. Meta-Thompson Sampling. In Proceedings of the 38th International Conference on Machine Learning, 2021.
Learning to Optimize via Posterior Sampling
A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes; it depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.
Meta-Thompson Sampling
This work proposes a variant of Thompson sampling that learns to explore better as it interacts with bandit instances drawn from an unknown prior, and derives a novel prior-dependent Bayes regret bound for Thompson sampling.
Differentiable Meta-Learning of Bandit Policies
This work parameterize policies in a differentiable way and optimize them by policy gradients, an approach that is pleasantly general and easy to implement, and observes that neural network policies can learn implicit biases expressed only through the sampled instances.
Provable Benefits of Representation Learning in Linear Bandits
A new algorithm is presented that achieves a regret bound demonstrating the benefit of representation learning in certain regimes, and an $\Omega(T\sqrt{kN} + \sqrt{dkNT})$ regret lower bound is provided, showing that the algorithm is minimax-optimal up to poly-logarithmic factors.
Policy Gradient Optimization of Thompson Sampling Policies
This work views the posterior parameter sampled by Thompson sampling as a kind of pseudo-action, which allows policy gradients to be tractably applied to search over a class of sampling policies that determine a probability distribution over pseudo-actions as a function of observed data.
Latent Bandits Revisited
This work proposes general algorithms for the latent bandit problem, based on both upper confidence bounds (UCBs) and Thompson sampling, which have lower regret than classic bandit policies when the number of latent states is smaller than the number of actions.
Differentiable Meta-Learning in Contextual Bandits
The main idea in this work is to optimize differentiable bandit policies by policy gradients that reflect the structure of the problem; contextual policies are proposed that are parameterized in a differentiable way and have low regret.
Differentiable Linear Bandit Algorithm
This work proposes a novel differentiable linear bandit algorithm that achieves an $\tilde{\mathcal{O}}(\hat{\beta}\sqrt{dT})$ upper bound on the $T$-round regret, and introduces a gradient estimator that allows the confidence bound to be learned via gradient ascent.
Meta-learning with Stochastic Linear Bandits
This work considers a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector.