# Pure Exploration for Multi-Armed Bandit Problems

@article{Bubeck2008PureEF,
  title   = {Pure Exploration for Multi-Armed Bandit Problems},
  author  = {S{\'e}bastien Bubeck and R{\'e}mi Munos and Gilles Stoltz},
  journal = {ArXiv},
  year    = {2008},
  volume  = {abs/0802.2655}
}

We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at…
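
For concreteness, simple regret and a uniform-exploration forecaster can be sketched as follows; the function names, the Bernoulli reward model, and the round-robin allocation are illustrative choices, not the paper's exact protocol:

```python
import random

def simple_regret(means, recommended_arm):
    # Simple regret: gap between the best arm's mean and the mean of the
    # arm the forecaster recommends once exploration ends.
    return max(means) - means[recommended_arm]

def uniform_forecaster(means, budget, rng=random.Random(0)):
    # Round-robin exploration over Bernoulli arms; recommend the arm with
    # the highest empirical mean after the budget is exhausted.
    n = len(means)
    pulls = [0] * n
    sums = [0.0] * n
    for t in range(budget):
        arm = t % n
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    empirical = [s / max(c, 1) for s, c in zip(sums, pulls)]
    return empirical.index(max(empirical))
```

Unlike cumulative regret, the rewards collected during exploration do not enter the objective at all; only the quality of the final recommendation matters.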

#### 29 Citations

Pure Exploration in Multi-armed Bandits Problems

- Mathematics, Computer Science
- ALT
- 2009

The main result is that the required exploration-exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.

Knapsack Based Optimal Policies for Budget-Limited Multi-Armed Bandits

- Computer Science
- AAAI
- 2012

Two pulling policies are developed, namely (i) KUBE and (ii) fractional KUBE, the latter being computationally less expensive; logarithmic upper bounds are proved for the regret of both policies, and these bounds are shown to be asymptotically optimal.

ε-first policies for budget-limited multi-armed bandits

- Computer Science
- AAAI 2010
- 2010

An ε-first algorithm is proposed, in which the first ε fraction of the budget is used solely to learn the arms' rewards (exploration), while the remaining 1 − ε fraction is used to maximise the received reward based on those estimates (exploitation).
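
A minimal sketch of the ε-first idea, assuming Bernoulli arms and ignoring the arm-specific pull costs that the budget-limited setting actually charges (`epsilon_first` and its defaults are illustrative names, not the paper's):

```python
import random

def epsilon_first(means, budget, eps=0.1, rng=random.Random(1)):
    # Phase 1 (exploration): spend eps * budget pulls sampling arms uniformly.
    # Phase 2 (exploitation): commit the remaining pulls to the empirically
    # best arm. Here every pull costs one unit of budget; the paper's setting
    # additionally attaches a cost to each arm.
    n = len(means)
    explore = max(n, int(eps * budget))
    pulls = [0] * n
    sums = [0.0] * n
    total_reward = 0.0
    for t in range(explore):
        arm = t % n
        r = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += r
        total_reward += r
    best = max(range(n), key=lambda i: sums[i] / pulls[i])
    for _ in range(budget - explore):
        total_reward += 1.0 if rng.random() < means[best] else 0.0
    return best, total_reward
```

The split is fixed in advance, so ε directly trades estimation accuracy against the number of rounds left for exploitation.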

ε-First Policies for Budget-Limited Multi-Armed Bandits (long version)

- 2010

We introduce the budget-limited multi-armed bandit (MAB), which captures situations where a learner's actions are costly and constrained by a fixed budget that is incommensurable with the rewards…

Multi-Bandit Best Arm Identification

- Computer Science
- NIPS
- 2011

This work proposes an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap), and introduces an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap.

Greedy Confidence Pursuit: A Pragmatic Approach to Multi-bandit Optimization

- Mathematics, Computer Science
- ECML/PKDD
- 2013

This work develops a method based on posterior sampling that outperforms existing methods for top-m selection in single bandits and improves on baseline methods for the full greedy confidence pursuit problem, which has not been studied previously.

Gamification of Pure Exploration for Linear Bandits

- Computer Science, Mathematics
- ICML
- 2020

This work designs the first asymptotically optimal algorithm for fixed-confidence pure exploration in linear bandits, which naturally bypasses the pitfall caused by a simple but difficult instance, that most prior algorithms had to be engineered to deal with explicitly.

Parallelizing Exploration-Exploitation Tradeoffs with Gaussian Process Bandit Optimization

- Computer Science, Mathematics
- ICML
- 2012

This work develops GP-BUCB, a principled algorithm for choosing batches, based on the GP-UCB algorithm for sequential GP optimization, and proves a surprising result; as compared to the sequential approach, the cumulative regret of the parallel algorithm only increases by a constant factor independent of the batch size B.

Sequential Resource Allocation in Linear Stochastic Bandits

- Computer Science
- 2015

This thesis studies the design of sequences of actions that the agent should take to reach objectives such as: identifying the best value with a fixed confidence and using a minimum number of pulls, or minimizing the prediction error on the value of each action.

#### References

Showing 1–10 of 22 references

Best Arm Identification in Multi-Armed Bandits

- Mathematics
- COLT 2010
- 2010

We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean…

Finite-time Analysis of the Multiarmed Bandit Problem

- Computer Science
- Machine Learning
- 2004

This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
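
The policy analyzed in that work, UCB1, admits a short sketch; Bernoulli rewards stand in for the bounded-support assumption, and the function name is ours:

```python
import math
import random

def ucb1(means, horizon, rng=random.Random(0)):
    # UCB1 index: empirical mean + sqrt(2 ln t / n_i). Each round, pull the
    # arm whose index is highest; the bonus shrinks as an arm accumulates
    # pulls, balancing exploration against exploitation.
    n = len(means)
    pulls = [0] * n
    sums = [0.0] * n
    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1  # initialization: pull each arm once
        else:
            arm = max(range(n),
                      key=lambda i: sums[i] / pulls[i]
                      + math.sqrt(2.0 * math.log(t) / pulls[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    return pulls
```

Suboptimal arms end up with only O(log n) pulls, which is what yields the logarithmic cumulative regret, uniformly over time.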

The non-stochastic multi-armed bandit problem

- Mathematics
- 2001

In the multi-armed bandit problem, a gambler must decide which arm of non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received…

Exploration-exploitation tradeoff using variance estimates in multi-armed bandits

- Computer Science, Mathematics
- Theor. Comput. Sci.
- 2009

A variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms is considered, providing the first analysis of the expected regret for such algorithms.

Tuning Bandit Algorithms in Stochastic Environments

- Mathematics, Computer Science
- ALT
- 2007

A variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms is considered and for the first time the concentration of the regret is analyzed.
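
The variance-aware index studied in this line of work can be illustrated with a Bernstein-type bonus; the exact constants and exploration function differ across the published variants, so treat this as a sketch rather than the exact UCB-V index:

```python
import math

def ucbv_index(emp_mean, emp_var, pulls, t, b=1.0):
    # Bernstein-style index in the spirit of UCB-V: the exploration bonus
    # scales with the empirical variance, so low-variance suboptimal arms
    # are abandoned faster than under the Hoeffding-based UCB1 bonus.
    # Rewards are assumed to lie in [0, b].
    log_t = math.log(t)
    return (emp_mean
            + math.sqrt(2.0 * emp_var * log_t / pulls)
            + 3.0 * b * log_t / pulls)
```

When the empirical variance is small, the first bonus term vanishes and only the lower-order 1/pulls term remains, which is the source of the improved regret bounds.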

PAC Bounds for Multi-armed Bandit and Markov Decision Processes

- Computer Science
- COLT
- 2002

The bandit problem is revisited and considered under the PAC model, and it is shown that given n arms, it suffices to pull the arms O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 − δ.
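
This bound can be approached (up to a log n factor) by the naive strategy of sampling every arm equally often; the following sketch assumes Bernoulli arms, with the sample size derived from Hoeffding's inequality plus a union bound:

```python
import math
import random

def naive_pac_best_arm(means, eps, delta, rng=random.Random(0)):
    # Pull every arm m times, with m chosen so that every empirical mean is
    # within eps/2 of its true mean with probability at least 1 - delta
    # (Hoeffding + union bound); the empirically best arm is then
    # eps-optimal. This costs O((n / eps^2) * log(n / delta)) pulls; the
    # median elimination algorithm removes the log n factor.
    n = len(means)
    m = math.ceil((2.0 / eps ** 2) * math.log(2.0 * n / delta))
    empirical = []
    for mu in means:
        wins = sum(1 for _ in range(m) if rng.random() < mu)
        empirical.append(wins / m)
    return empirical.index(max(empirical))
```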

The Sample Complexity of Exploration in the Multi-Armed Bandit Problem

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2003

This work considers the multi-armed bandit problem under the PAC ("probably approximately correct") model and generalizes the lower bound to a Bayesian setting, and to the case where the statistics of the arms are known but the identities of the arms are not.

Nearly Tight Bounds for the Continuum-Armed Bandit Problem

- Computer Science, Mathematics
- NIPS
- 2004

This work considers the case when the set of strategies is a subset of ℝ^d and the cost functions are continuous, and improves on the best-known upper and lower bounds, closing the gap to a sublogarithmic factor.

Online Optimization in X-Armed Bandits

- Computer Science, Mathematics
- NIPS
- 2008

The results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally Hölder with a known exponent, then the expected regret is bounded up to a logarithmic factor by √n.

The Budgeted Multi-armed Bandit Problem

- Computer Science
- COLT
- 2004

The following coins problem is a version of a multi-armed bandit problem where one has to select from among a set of objects, say classifiers, after an experimentation phase that is constrained by a…