Batched Bandit Problems

Vianney Perchet, Philippe Rigollet, Sylvain Chassang, Erik Snowberg
Princeton University William S. Dietrich II Economic Theory Center Research Paper Series

Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy and show that a very small number of batches suffices to achieve close to minimax-optimal regret bounds. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.
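To make the batching constraint concrete, here is a hedged sketch of a two-armed explore-then-commit scheme in which all pulls of a batch are fixed before any of that batch's rewards are observed. This is an illustration only, not the paper's actual policy; the commit threshold is a crude stand-in for the paper's carefully tuned tests.

```python
def batched_etc(pull, horizon, num_batches):
    """Illustrative batched explore-then-commit for a two-armed bandit.

    `pull(arm)` returns a stochastic reward in [0, 1]. Every pull of a
    batch is decided before any of that batch's rewards are observed.
    """
    counts = [0, 0]
    sums = [0.0, 0.0]
    committed = None  # arm we have committed to, if any
    pulls_done = 0
    batch_size = horizon // num_batches
    for b in range(num_batches):
        # last batch absorbs the remainder of the horizon
        size = batch_size if b < num_batches - 1 else horizon - pulls_done
        if committed is None:
            # explore: split the batch evenly between the two arms
            plan = [0] * (size // 2) + [1] * (size - size // 2)
        else:
            plan = [committed] * size
        rewards = [pull(a) for a in plan]  # observed only at batch end
        for a, r in zip(plan, rewards):
            counts[a] += 1
            sums[a] += r
        pulls_done += size
        if committed is None:
            means = [sums[a] / counts[a] for a in (0, 1)]
            # crude commit test (stand-in for the paper's threshold)
            if abs(means[0] - means[1]) > (2.0 / counts[0]) ** 0.5:
                committed = 0 if means[0] > means[1] else 1
    return committed, counts
```

With few batches, the policy can only adapt at batch boundaries, which is exactly the tension the paper's regret bounds quantify.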


Batched Thompson Sampling for Multi-Armed Bandits
This work analyzes Thompson Sampling algorithms for stochastic multi-armed bandits in the batched setting and obtains almost tight regret–batch trade-offs for the two-arm case.
Batched Neural Bandits
This work proposes the BatchNeuralUCB algorithm, which combines neural networks with optimism to address the exploration–exploitation trade-off while keeping the total number of batches limited, and proves that it achieves the same regret as the fully sequential version while considerably reducing the number of policy updates.
Regret Bounds for Batched Bandits
Algorithms for the batched stochastic multi-armed bandit and batched stochastic linear bandit problems are presented, and bounds on their expected regret are proved that improve over the best-known regret bounds for any number of batches.
The Impact of Batch Learning in Stochastic Bandits
This work provides a policy-agnostic regret analysis and demonstrates upper and lower bounds for the regret of a candidate policy and shows that the impact of batch learning can be measured in terms of online behavior.
Batched Multi-armed Bandits Problem
The BaSE (batched successive elimination) policy is proposed to achieve rate-optimal regret (within logarithmic factors) for batched multi-armed bandits, with matching lower bounds even if the batch sizes are determined adaptively.
The Impact of Batch Learning in Stochastic Linear Bandits
This work provides a policy-agnostic regret analysis, demonstrates upper and lower bounds for the regret of a candidate policy, and, as an important insight, provides a more robust result for the two-armed bandit problem.
Invariant description of UCB strategy for multi-armed bandits for batch processing scenario
  • S. Garbar
  • Computer Science
    2020 24th International Conference on Circuits, Systems, Communications and Computers (CSCC)
  • 2020
In this work, a set of Monte-Carlo simulations are performed for different horizon sizes, parameters of the strategy and batch sizes to determine the maximum regret for two-armed bandits.
A Sharp Memory-Regret Trade-Off for Multi-Pass Streaming Bandits
The main technical contribution is the lower bound which requires the use of information-theoretic techniques as well as ideas from round elimination to show that the residual problem remains challenging over subsequent passes.
Anytime optimal algorithms in stochastic multi-armed bandits
We introduce an anytime algorithm for stochastic multi-armed bandits with optimal distribution-free and distribution-dependent bounds (for a specific family of parameters).
Fast Rates for Bandit Optimization with Upper-Confidence Frank-Wolfe
The Upper-Confidence Frank-Wolfe algorithm is analyzed, inspired by techniques for bandits and convex optimization, and theoretical guarantees for the performance of this algorithm over various classes of functions are given.


The multi-armed bandit problem with covariates
This work introduces a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably "localized" static bandit problems, in a nonparametric model where the expected rewards are smooth functions of the covariate and the hardness of the problem is captured by a margin parameter.
Bounded regret in stochastic multi-armed bandits
A new randomized policy is proposed that attains a regret uniformly bounded over time in this setting, and several lower bounds are proved, showing in particular that bounded regret is not possible if one only knows $\Delta$, and that bounded regret of order $1/\Delta$ is not possible.
Finite-time Analysis of the Multiarmed Bandit Problem
This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
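The index policy analyzed in this paper (UCB1 of Auer, Cesa-Bianchi, and Fischer) is simple enough to sketch. Assuming rewards in [0, 1], a minimal implementation pulls each arm once and thereafter picks the arm with the largest empirical mean plus a confidence bonus:

```python
import math

def ucb1(pull, num_arms, horizon):
    """UCB1: after one initial pull per arm, always pull the arm
    maximizing  mean(a) + sqrt(2 * ln(t) / n(a))."""
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1  # initialization: one pull per arm
        else:
            arm = max(
                range(num_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return counts
```

The bonus term shrinks as an arm is pulled more often, which is what yields the logarithmic regret bound the paper proves.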
UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem
For this modified UCB algorithm, an improved bound on the regret is given with respect to the optimal reward for K-armed bandits after T trials.
Asymptotically optimal multistage tests of simple hypotheses
A family of variable-stage-size multistage tests of simple hypotheses, based on efficient multistage sampling procedures, is described.
Regret Bounds and Minimax Policies under Partial Monitoring
The stochastic bandit game is considered, and it is proved that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.
A Model for Selecting One of Two Medical Treatments
Abstract: A simple cost-function approach is proposed for designing an optimal clinical trial when a total of N patients with a disease are to be treated with one of two medical treatments.
Kullback–Leibler upper confidence bounds for optimal sequential allocation
The main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively.
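For Bernoulli rewards, the KL-UCB index discussed here can be computed by bisection. The sketch below uses a plain log t exploration budget as a simplification (the paper's analysis uses a slightly larger budget of the form log t + c log log t):

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, n, t, iters=32):
    """Largest q >= mean with n * KL(mean, q) <= log t, by bisection.

    `mean` is the empirical mean of an arm pulled `n` times after `t`
    total plays; KL(mean, .) is increasing on [mean, 1], so bisection
    converges to the boundary of the confidence region.
    """
    budget = math.log(max(t, 2)) / n
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

Replacing the quadratic (Hoeffding-style) bonus of UCB1 with this KL-based upper confidence bound is what lets the regret match the Lai–Robbins lower bound asymptotically.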
Sequential Experimentation in Clinical Trials: Design and Analysis
This book covers sequential testing theory and stochastic optimization over time in clinical trials with failure-time endpoints, including sequential methods for vaccine safety evaluation and surveillance in public health.
Randomized Allocation of Treatments in Sequential Experiments
Summary: Since the idea of sequential allocation was first studied, in a version of what is now called the multi-armed bandit problem, many investigations have examined the randomized allocation of treatments in sequential experiments.