An asymptotically optimal policy for finite support models in the multiarmed bandit problem

@article{Honda2011AnAO,
  title={An asymptotically optimal policy for finite support models in the multiarmed bandit problem},
  author={Junya Honda and Akimichi Takemura},
  journal={Machine Learning},
  year={2011},
  volume={85},
  pages={361--391}
}
In the multiarmed bandit problem the dilemma between exploration and exploitation in reinforcement learning is expressed as a model of a gambler playing a slot machine with multiple arms. A policy chooses an arm in each round so as to minimize the number of times that arms with suboptimal expected rewards are pulled. We propose the minimum empirical divergence (MED) policy and derive an upper bound on the finite-time regret which meets the asymptotic bound for the case of finite support models…
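The minimum empirical divergence idea above scores each arm by how far its empirical reward distribution is from any distribution whose mean matches the best arm's. The sketch below is illustrative only, not the authors' implementation: it computes the divergence K_inf via its dual representation for rewards assumed to lie in [0, b], using a simple grid search in place of a proper convex optimizer, and applies it in an IMED-style index rule (the deterministic variant mentioned in the related work below). The names `k_inf` and `imed_choose` are ours.

```python
import numpy as np

def k_inf(samples, mu, b=1.0, grid=1000):
    """Minimum KL divergence from the empirical distribution of `samples`
    to distributions supported on [0, b] with mean at least `mu`,
    computed via the dual form max_lam E[log(1 - lam * (X - mu))].
    Grid search over the dual variable stands in for convex optimization."""
    x = np.asarray(samples, dtype=float)
    if mu >= b:
        return np.inf  # no distribution on [0, b] has mean above b
    # dual variable ranges over [0, 1/(b - mu)); stop just short of the endpoint
    lam = np.linspace(0.0, (1.0 - 1e-9) / (b - mu), grid)
    vals = np.log1p(-np.outer(lam, x - mu)).mean(axis=1)
    return float(vals.max())

def imed_choose(history, b=1.0):
    """IMED-style choice: pull the arm minimizing
    N_a * K_inf(F_a, mu_star) + log N_a,
    where mu_star is the largest empirical mean."""
    mu_star = max(np.mean(h) for h in history)
    index = [len(h) * k_inf(h, mu_star, b) + np.log(len(h))
             for h in history]
    return int(np.argmin(index))
```

An arm whose empirical mean equals the best has K_inf = 0, so its index reduces to log N_a; arms that look clearly worse get a large index and are pulled rarely, which is what drives the logarithmic regret.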
An Asymptotically Optimal Bandit Algorithm for Bounded Support Models
The Deterministic Minimum Empirical Divergence (DMED) policy is proposed, and DMED is proved to achieve the asymptotic bound; the index used in DMED for choosing an arm can be computed easily by a convex optimization technique.
Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards
This paper modifies this policy and derives a finite-time regret bound for the new policy, Indexed Minimum Empirical Divergence (IMED), by refining large deviation probabilities to a simple non-asymptotic form, and shows that IMED much improves on DMED and performs competitively with other state-of-the-art policies.
Finite-time Regret Bound of a Bandit Algorithm for the Semi-bounded Support Model
In this paper we consider stochastic multiarmed bandit problems. Recently a policy, DMED, is proposed and proved to achieve the asymptotic bound for the model that each reward distribution is…
Minimal-Exploration Allocation Policies with Almost Sure, Arbitrarily Slow Growing Regret
This paper considers the problem of sampling with stochastic multi-armed bandits, when each arm generates i.i.d. rewards that follow an unknown distribution. Typical sampling policies in the…
Kullback–Leibler upper confidence bounds for optimal sequential allocation
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper…
ASYMPTOTICALLY OPTIMAL MULTI-ARMED BANDIT POLICIES UNDER A COST CONSTRAINT
We consider the multi-armed bandit problem under a cost constraint. Successive samples from each population are i.i.d. with unknown distribution and each sample incurs a known population-dependent…
Optimal Data Driven Resource Allocation under Multi-Armed Bandit Observations
This paper introduces the first asymptotically optimal strategy for a multi-armed bandit (MAB) model under side constraints and provides the explicit form of such policies for the case in which the unknown distributions are Normal with unknown means and known variances.
EXPLORATION–EXPLOITATION POLICIES WITH ALMOST SURE, ARBITRARILY SLOW GROWING ASYMPTOTIC REGRET
The purpose of this paper is to provide further understanding into the structure of the sequential allocation (“stochastic multi-armed bandit”) problem by establishing probability one finite horizon…
Sub-sampling for Multi-armed Bandits
A novel algorithm based on sub-sampling is introduced that demonstrates excellent empirical performance against state-of-the-art algorithms, including Thompson sampling and KL-UCB.
Asymptotic Behavior of Minimal-Exploration Allocation Policies: Almost Sure, Arbitrarily Slow Growing Regret
The purpose of this paper is to provide further understanding into the structure of the sequential allocation problem by establishing probability one finite horizon bounds and convergence rates for the sample (or "pseudo") regret associated with two simple classes of allocation policies.
