Corpus ID: 235731508

Bayesian decision-making under misspecified priors with applications to meta-learning

@article{Simchowitz2021BayesianDU,
  title={Bayesian decision-making under misspecified priors with applications to meta-learning},
  author={Max Simchowitz and Christopher Tosh and Akshay Krishnamurthy and Daniel J. Hsu and Thodoris Lykouris and Miroslav Dudík and Robert E. Schapire},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.01509}
}
Thompson sampling and other Bayesian sequential decision-making algorithms are among the most popular approaches to tackle explore/exploit trade-offs in (contextual) bandits. The choice of prior in these algorithms offers flexibility to encode domain knowledge but can also lead to poor performance when misspecified. In this paper, we demonstrate that performance degrades gracefully with misspecification. We prove that the expected reward accrued by Thompson sampling (TS) with a misspecified… 
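As a concrete illustration of the setting the abstract describes (a minimal sketch, not the paper's construction or analysis), the code below runs Beta-Bernoulli Thompson sampling on a two-armed bandit whose prior is deliberately misspecified; all arm probabilities and prior parameters are illustrative assumptions.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, prior_alpha, prior_beta, horizon, rng=None):
    """Beta-Bernoulli Thompson sampling with a (possibly misspecified) Beta prior.

    true_means : true success probability of each arm
    prior_alpha, prior_beta : the agent's Beta prior parameters per arm
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.array(prior_alpha, dtype=float)
    beta = np.array(prior_beta, dtype=float)
    total_reward = 0.0
    for _ in range(horizon):
        # Sample a mean for each arm from the current posterior and play the argmax.
        theta = rng.beta(alpha, beta)
        arm = int(np.argmax(theta))
        reward = float(rng.random() < true_means[arm])
        # Conjugate posterior update for the played arm only.
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

# Example: the prior wrongly favors arm 0, while arm 1 is actually better;
# with enough rounds Thompson sampling still concentrates on the better arm.
if __name__ == "__main__":
    cum_reward = thompson_sampling_bernoulli(
        true_means=[0.3, 0.7],
        prior_alpha=[8.0, 1.0],   # misspecified: optimistic about arm 0
        prior_beta=[1.0, 8.0],    # misspecified: pessimistic about arm 1
        horizon=2000,
    )
    print("cumulative reward:", cum_reward)
```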

Meta-Learning Hypothesis Spaces for Sequential Decision-making

This work proposes to meta-learn a kernel from offline data and demonstrates the approach on the kernelized bandit problem (a.k.a. Bayesian optimization), establishing regret bounds competitive with those attainable given the true kernel.

Generalizing Hierarchical Bayesian Bandits

A Thompson sampling algorithm, G-HierTS, is proposed that uses this structure to explore efficiently; its Bayes regret is bounded, and its computational efficiency is improved with minimal impact on empirical regret.

Mixed-Effect Thompson Sampling

A general framework for capturing correlations through a mixed-effect model where actions are related through multiple shared effect parameters is introduced and validated empirically using both synthetic and real-world problems.

Meta-Learning for Simple Regret Minimization

The first Bayesian and frequentist algorithms for this meta-learning problem of simple regret minimization in bandits are proposed and instantiated for several classes of bandit problems.

Tractable Optimality in Episodic Latent MABs

This work shows that learning with a number of samples polynomial in A is possible, and designs a procedure that provably learns a near-optimal policy with O(poly(A) + poly(M, H)^min(M, H)) interactions, formulating the moment matching via maximum likelihood estimation.

Hierarchical Bayesian Bandits

This work proposes and analyzes a natural hierarchical Thompson sampling algorithm (HierTS) for this class of problems, and confirms that hierarchical Bayesian bandits are a universal and statistically-efficient tool for learning to act with similar bandit tasks.
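The hierarchical papers above share the idea of pooling information across related tasks through a common hyper-parameter. The following is a simplified sketch in that spirit, assuming a Gaussian model with a scalar hyper-mean and an approximate hyper-posterior update; it is not the HierTS or G-HierTS algorithm from these papers, and every parameter is illustrative.

```python
import numpy as np

def gaussian_ts_task(true_means, prior_mean, prior_var, noise_var, horizon, rng):
    """Gaussian Thompson sampling within a single task; returns per-arm reward sums and counts."""
    k = len(true_means)
    sums, counts = np.zeros(k), np.zeros(k)
    post_mean = np.full(k, prior_mean)
    post_var = np.full(k, prior_var)
    for _ in range(horizon):
        theta = rng.normal(post_mean, np.sqrt(post_var))   # sample from each arm's posterior
        arm = int(np.argmax(theta))
        reward = rng.normal(true_means[arm], np.sqrt(noise_var))
        sums[arm] += reward
        counts[arm] += 1.0
        # Conjugate Gaussian posterior update for the played arm.
        post_var[arm] = 1.0 / (1.0 / prior_var + counts[arm] / noise_var)
        post_mean[arm] = post_var[arm] * (prior_mean / prior_var + sums[arm] / noise_var)
    return sums, counts

def hierarchical_ts(tasks, hyper_mean, hyper_var, task_var, noise_var, horizon, seed=0):
    """Sequential tasks whose arm means are drawn around a shared, unknown hyper-mean mu.

    Each task is played with a prior centred at the current hyper-posterior mean; afterwards,
    well-sampled arms are folded into the hyper-posterior as noisy observations of mu.
    This update is an approximation for illustration, not the exact HierTS posterior.
    """
    rng = np.random.default_rng(seed)
    m, q2 = hyper_mean, hyper_var                      # Gaussian hyper-posterior over mu
    for true_means in tasks:
        prior_var = task_var + q2                      # uncertainty about mu widens the arm prior
        sums, counts = gaussian_ts_task(true_means, m, prior_var, noise_var, horizon, rng)
        for k in range(len(true_means)):
            if counts[k] >= 5:                         # only well-estimated arms inform mu
                obs = sums[k] / counts[k]              # ~ N(mu, task_var + noise_var / counts[k])
                obs_var = task_var + noise_var / counts[k]
                precision = 1.0 / q2 + 1.0 / obs_var
                m = (m / q2 + obs / obs_var) / precision
                q2 = 1.0 / precision
    return m, q2

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    mu_star = 0.5
    tasks = [rng.normal(mu_star, 0.3, size=5) for _ in range(20)]   # 20 related 5-armed tasks
    m, q2 = hierarchical_ts(tasks, hyper_mean=0.0, hyper_var=1.0,
                            task_var=0.09, noise_var=1.0, horizon=500)
    print(f"estimated hyper-mean: {m:.3f} (truth {mu_star})")
```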

Adaptivity and Confounding in Multi-Armed Bandit Experiments

The main insight is that an algorithm called deconfounded Thompson sampling strikes a delicate balance between adaptivity and robustness: it attains optimal efficiency in easy stationary instances while displaying surprising resilience in hard nonstationary ones that cause other adaptive algorithms to fail.

Meta-Learning Adversarial Bandits

A unified meta-algorithm is designed that yields setting-specific guarantees for two important cases, multi-armed bandits (MAB) and bandit linear optimization (BLO), and it is proved that unregularized follow-the-leader combined with multiplicative weights suffices to learn online a non-smooth and non-convex sequence of affine functions of Bregman divergences that upper-bound the regret of OMD.

Metadata-based Multi-Task Bandits with Bayesian Hierarchical Models

This paper introduces the metadata-based multi-task bandit problem, where the agent needs to solve a large number of related multi-armed bandit tasks and can leverage some task-specific features to share knowledge across tasks.

Gaussian Imagination in Bandit Learning

The results formalize the folklore that so-called Bayesian agents remain effective when instantiated with diffuse misspecified distributions.

References

Showing 1-10 of 54 references

On the Prior Sensitivity of Thompson Sampling

This paper fully characterizes the Thompson sampling algorithm's worst-case dependence of regret on the choice of prior, focusing on a special yet representative case, and proves regret upper bounds for the bad-prior and good-prior cases.

Adapting to Misspecification in Contextual Bandits

This work introduces a new family of oracle-efficient algorithms for ε-misspecified contextual bandits that adapt to unknown model misspecification, in both finite and infinite action settings, and gives the first algorithm that achieves the optimal regret bound for unknown ε.

Meta Dynamic Pricing: Transfer Learning Across Experiments

A meta dynamic pricing algorithm is developed that learns a prior online while solving a sequence of Thompson sampling pricing experiments for N different products, demonstrating that the price of an unknown prior in Thompson sampling can be negligible in experiment-rich environments.
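As a rough illustration of learning a prior online across a sequence of Thompson sampling experiments (a hedged sketch, not the meta dynamic pricing algorithm from this reference), the code below moment-matches a Beta prior to purchase rates observed on earlier products and warm-starts later experiments with it; all prices, rates, and the moment-matching heuristic are assumptions.

```python
import numpy as np

def fit_beta_by_moments(rates):
    """Moment-match a Beta(a, b) prior to purchase rates seen on earlier products.
    A generic empirical-Bayes heuristic, not the estimator from the paper."""
    m = float(np.mean(rates))
    v = max(float(np.var(rates)), 1e-6)
    c = max(m * (1.0 - m) / v - 1.0, 1e-3)             # keep both parameters positive
    return m * c, (1.0 - m) * c

def price_with_ts(prices, buy_prob, prior_a, prior_b, horizon, rng):
    """Thompson sampling over candidate prices for one product.
    Each price is an arm; purchases are Bernoulli; revenue = price * purchase probability."""
    prices = np.asarray(prices, dtype=float)
    k = len(prices)
    a, b = np.full(k, float(prior_a)), np.full(k, float(prior_b))
    for _ in range(horizon):
        sampled_prob = rng.beta(a, b)
        arm = int(np.argmax(prices * sampled_prob))    # maximise sampled expected revenue
        buy = float(rng.random() < buy_prob[arm])
        a[arm] += buy
        b[arm] += 1.0 - buy
    return a, b

def meta_pricing(products, prices, horizon, seed=0):
    """Run sequential product experiments, re-fitting the shared Beta prior after each one."""
    rng = np.random.default_rng(seed)
    prior_a, prior_b = 1.0, 1.0                        # uninformative prior for the first product
    pooled_rates = []
    for buy_prob in products:
        a, b = price_with_ts(prices, buy_prob, prior_a, prior_b, horizon, rng)
        pooled_rates.extend((a / (a + b)).tolist())    # posterior-mean purchase rates per price
        if len(pooled_rates) >= 4:
            prior_a, prior_b = fit_beta_by_moments(pooled_rates)
    return prior_a, prior_b

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    prices = [5.0, 8.0, 12.0]
    # Illustrative per-product purchase probabilities, decreasing in price.
    products = [np.sort(rng.uniform(0.1, 0.6, size=3))[::-1] for _ in range(10)]
    print("learned Beta prior:", meta_pricing(products, prices, horizon=1000))
```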

Bayesian Optimal Control of Smoothly Parameterized Systems

A lazy version of the so-called posterior sampling method, which goes back to Thompson and Strens, is presented; it allows for a single algorithm and a single analysis for a wide range of problems, such as finite MDPs or linear quadratic regulation.

Hedging the Drift: Learning to Optimize under Non-Stationarity

This work introduces data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for a collection of non-stationary stochastic bandit settings and leverages the power of the "forgetting principle" in the learning processes, which is vital in changing environments.

Bayesian Reinforcement Learning: A Survey

An in-depth review of the role of Bayesian methods for the reinforcement learning (RL) paradigm, and a comprehensive survey on Bayesian RL algorithms and their theoretical and empirical properties.

Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach

We propose a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a (near-)stationary environment into another algorithm with optimal dynamic regret in a non-stationary environment.

On the sample complexity of reinforcement learning.

Novel algorithms with more restricted guarantees are suggested whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but have only a polynomial dependence on the horizon time.

Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards

This paper fully characterizes the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward "variation" and the minimal achievable regret, and by establishing a connection between the adversarial and the stochastic MAB frameworks.

Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems

A novel bound on the regret due to policy switches is obtained; it holds for LQ systems of any dimensionality and allows updating the parameters and the policy at each step, thus overcoming previous limitations due to lazy updates.
...