Corpus ID: 51971065

Pure Exploration for Multi-Armed Bandit Problems

@article{Bubeck2008PureEF,
  title={Pure Exploration for Multi-Armed Bandit Problems},
  author={S{\'e}bastien Bubeck and R. Munos and Gilles Stoltz},
  journal={ArXiv},
  year={2008},
  volume={abs/0802.2655}
}
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time.
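To make the simple-regret criterion concrete, here is a minimal sketch of the setting described above (Bernoulli arms, a round-robin exploration rule, and all function and variable names are illustrative assumptions, not the authors' algorithm): the forecaster explores for a given number of rounds with no exploitation pressure, recommends a single arm at the end, and is scored only by the gap between the best mean and the mean of the recommended arm.

import random

def simple_regret_uniform(means, n_rounds, seed=0):
    # Uniform (round-robin) exploration of K Bernoulli arms for n_rounds pulls,
    # followed by recommending the empirically best arm.
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    for t in range(n_rounds):
        arm = t % k                        # pure exploration: every arm is sampled equally often
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    estimates = [sums[i] / pulls[i] if pulls[i] else 0.0 for i in range(k)]
    recommended = max(range(k), key=lambda i: estimates[i])
    return max(means) - means[recommended]  # simple regret of the final recommendation

# Example: three arms with means 0.4, 0.5, 0.6 and 300 exploration rounds.
print(simple_regret_uniform([0.4, 0.5, 0.6], 300))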
Pure Exploration in Multi-armed Bandits Problems
TLDR
The main result is that the required exploration-exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.
Knapsack Based Optimal Policies for Budget-Limited Multi-Armed Bandits
TLDR
Two pulling policies are developed, namely (i) KUBE and (ii) fractional KUBE, the latter being computationally less expensive; logarithmic upper bounds are proved for the regret of both policies and shown to be asymptotically optimal.
ε-first policies for budget-limited multi-armed bandits
TLDR
An ε-first algorithm is proposed, in which the first ε fraction of the budget is used solely to learn the arms' rewards (exploration), while the remaining 1 − ε fraction is used to maximise the received reward based on those estimates (exploitation); a sketch of this budget split appears after this list of citing papers.
Epsilon-First Policies for Budget-Limited Multi-Armed Bandits
TLDR
An ε-first algorithm is proposed, in which the first ε fraction of the budget is used solely to learn the arms' rewards (exploration), while the remaining 1 − ε fraction is used to maximise the received reward based on those estimates (exploitation).
ε-First Policies for Budget-Limited Multi-Armed Bandits (Long Version)
We introduce the budget-limited multi-armed bandit (MAB), which captures situations where a learner's actions are costly and constrained by a fixed budget that is incommensurable with the rewards…
Multi-Bandit Best Arm Identification
TLDR
This work proposes an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap), and introduces an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap.
Greedy Confidence Pursuit: A Pragmatic Approach to Multi-bandit Optimization
TLDR
This work develops a method based on posterior sampling that outperforms existing methods for top-m selection in single bandits and improves on baseline methods for the full greedy confidence pursuit problem, which has not been studied previously.
Gamification of Pure Exploration for Linear Bandits
TLDR
This work designs the first asymptotically optimal algorithm for fixed-confidence pure exploration in linear bandits, which naturally bypasses the pitfall caused by a simple but difficult instance that most prior algorithms had to be engineered to deal with explicitly.
Parallelizing Exploration-Exploitation Tradeoffs with Gaussian Process Bandit Optimization
TLDR
This work develops GP-BUCB, a principled algorithm for choosing batches, based on the GP-UCB algorithm for sequential GP optimization, and proves a surprising result: as compared to the sequential approach, the cumulative regret of the parallel algorithm only increases by a constant factor independent of the batch size B.
Sequential Resource Allocation in Linear Stochastic Bandits
TLDR
This thesis studies the design of sequences of actions that the agent should take to reach objectives such as identifying the best value with a fixed confidence and using a minimum number of pulls, or minimizing the prediction error on the value of each action.
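As referenced in the ε-first entries above, the following is a minimal sketch of the budget-split idea (Bernoulli arms, a unit cost per pull, and all names are illustrative assumptions rather than the cited authors' implementation): an ε fraction of the budget is spent uniformly on exploration, and the remainder is spent pulling the empirically best arm.

import random

def epsilon_first(means, budget, epsilon=0.1, seed=0):
    # Phase 1: spend roughly epsilon * budget pulls uniformly across the arms.
    # Phase 2: spend the remaining budget on the empirically best arm.
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    explore = int(epsilon * budget)
    total_reward = 0.0
    for t in range(explore):
        arm = t % k
        r = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += r
        total_reward += r
    best = max(range(k), key=lambda i: (sums[i] / pulls[i]) if pulls[i] else 0.0)
    for _ in range(budget - explore):
        total_reward += 1.0 if rng.random() < means[best] else 0.0
    return total_reward

# Example: budget of 1000 unit-cost pulls, 10% of which is used for exploration.
print(epsilon_first([0.4, 0.5, 0.6], 1000, epsilon=0.1))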

References

Showing 1-10 of 22 references
Best Arm Identification in Multi-Armed Bandits
We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm.
Finite-time Analysis of the Multiarmed Bandit Problem
TLDR
This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
The non-stochastic multi-armed bandit problem
In the multi-armed bandit problem, a gambler must decide which arm of several non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received…
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits
TLDR
A variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms is considered, providing the first analysis of the expected regret for such algorithms.
Tuning Bandit Algorithms in Stochastic Environments
TLDR
A variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms is considered, and for the first time the concentration of the regret is analyzed.
PAC Bounds for Multi-armed Bandit and Markov Decision Processes
TLDR
The bandit problem is revisited and considered under the PAC model, and it is shown that, given n arms, it suffices to pull the arms O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability of at least 1 − δ; a small worked instance of this bound appears after this reference list.
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem
TLDR
This work considers the multi-armed bandit problem under the PAC ("probably approximately correct") model and generalizes the lower bound to a Bayesian setting, and to the case where the statistics of the arms are known but the identities of the arms are not.
Nearly Tight Bounds for the Continuum-Armed Bandit Problem
TLDR
This work considers the case when the set of strategies is a subset of ℝ^d and the cost functions are continuous, and improves on the best-known upper and lower bounds, closing the gap to a sublogarithmic factor.
Online Optimization in X-Armed Bandits
TLDR
The results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally Hölder with a known exponent, then the expected regret is bounded up to a logarithmic factor by √n.
The Budgeted Multi-armed Bandit Problem
The following coins problem is a version of a multi-armed bandit problem where one has to select from among a set of objects, say classifiers, after an experimentation phase that is constrained by a…
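As referenced in the PAC-bounds entry above, here is a small worked instance of the O((n/ε²) log(1/δ)) sample-complexity bound; the concrete numbers and the helper name below are illustrative, and the bound hides an unspecified constant factor.

import math

def pac_pull_budget(n_arms, eps, delta):
    # Order of magnitude of the number of pulls suggested by the
    # O((n / eps^2) * log(1 / delta)) bound; the hidden constant is ignored.
    return (n_arms / eps**2) * math.log(1.0 / delta)

# Example: 10 arms, eps = 0.1, delta = 0.05 -> about 3,000 pulls (up to a constant).
print(round(pac_pull_budget(10, 0.1, 0.05)))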