Corpus ID: 235435944

Thompson Sampling for Unimodal Bandits

Long Yang, Zhao Li, Zehong Hu, Shasha Ruan, Shijian Li, Gang Pan, Hongyang Chen
In this paper, we propose a Thompson Sampling algorithm for unimodal bandits, where the expected reward is unimodal over the partially ordered arms. To better exploit the unimodal structure, at each step, instead of exploring the entire decision space, our algorithm makes decisions according to the posterior distribution only in the neighborhood of the arm that has the highest empirical mean estimate. We theoretically prove that, for Bernoulli rewards, the regret of our algorithm reaches the…
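The neighborhood-restricted sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes arms arranged on a line graph (each arm's neighbors are the adjacent indices), Bernoulli rewards with uniform Beta(1, 1) priors, and one initial pull per arm. The names `neighborhood`, `choose_arm`, and `run` are hypothetical, and the paper's actual algorithm may differ in details the truncated abstract does not show.

```python
import random


def neighborhood(leader, n_arms):
    """Leader plus its adjacent arms on a line graph (assumed arm ordering)."""
    return [k for k in (leader - 1, leader, leader + 1) if 0 <= k < n_arms]


def choose_arm(successes, failures, rng):
    """Thompson Sampling restricted to the neighborhood of the empirical leader."""
    n_arms = len(successes)
    # Assumption: pull every arm once before trusting empirical means.
    for k in range(n_arms):
        if successes[k] + failures[k] == 0:
            return k
    means = [successes[k] / (successes[k] + failures[k]) for k in range(n_arms)]
    leader = max(range(n_arms), key=lambda k: means[k])
    # Draw one posterior sample per neighboring arm; Beta(1 + s, 1 + f)
    # corresponds to a uniform prior, which is an assumption of this sketch.
    samples = {
        k: rng.betavariate(1 + successes[k], 1 + failures[k])
        for k in neighborhood(leader, n_arms)
    }
    return max(samples, key=samples.get)


def run(true_means, horizon, seed=0):
    """Simulate Bernoulli rewards and return the number of pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(horizon):
        arm = choose_arm(successes, failures, rng)
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

For example, `run([0.1, 0.3, 0.5, 0.7, 0.9], horizon=2000)` simulates a unimodal instance whose peak is the last arm. Because posterior samples are drawn only for the leader and its neighbors, each step costs a constant number of draws on a line graph rather than one draw per arm.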

Unimodal Bandits: Regret Lower Bounds and Optimal Algorithms

It is shown that combining an appropriate discretization of the set of arms with the UCB algorithm yields an order-optimal regret, and in practice, outperforms recently proposed algorithms designed to exploit the unimodal structure.

Unimodal Bandits

We consider multiarmed bandit problems where the expected reward is unimodal over partially ordered arms. In particular, the arms may belong to a continuous interval or correspond to vertices in a

Unimodal Thompson Sampling for Graph-Structured Arms

A Thompson Sampling-based algorithm is presented whose asymptotic pseudo-regret matches the lower bound for the considered setting, and it is shown that Bayesian MAB algorithms dramatically outperform frequentist ones.

Solving Bernoulli Rank-One Bandits with Unimodal Thompson Sampling

An asymptotically optimal bound on the frequentist regret of UTS is proved, and simulations show a significant improvement of the method over the state of the art.

Thompson Sampling Algorithms for Mean-Variance Bandits

Thompson Sampling-style algorithms for mean-variance MAB and comprehensive regret analyses for Gaussian and Bernoulli bandits with fewer assumptions are developed and shown to significantly outperform existing LCB-based algorithms for all risk tolerances.

Further Optimal Regret Bounds for Thompson Sampling

A novel regret analysis for Thompson Sampling is provided that proves the first near-optimal problem-independent bound of $O(\sqrt{NT \ln T})$ on the expected regret of this algorithm, and simultaneously provides the optimal problem-dependent bound.

Learning to Optimize via Posterior Sampling

A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes; it depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.

Kullback–Leibler upper confidence bounds for optimal sequential allocation

The main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively.

Thompson Sampling for Complex Online Problems

A frequentist regret bound is proved for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them, and improved regret bounds are derived for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear reward feedback from subsets.

Finite-time Analysis of the Multiarmed Bandit Problem

This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.