• Corpus ID: 227275518

From Finite to Countable-Armed Bandits

  Anand Kalvit, Assaf J. Zeevi
We consider a stochastic bandit problem with countably many arms that belong to a finite set of types, each characterized by a unique mean reward. In addition, there is a fixed distribution over types which sets the proportion of each type in the population of arms. The decision maker is oblivious to the type of any arm and to the aforementioned distribution over types, but perfectly knows the total number of types occurring in the population of arms. We propose a fully adaptive online learning… 
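The model described in the abstract can be sketched as a short simulation. All names and parameter values below are illustrative, not taken from the paper: a reservoir yields countably many arms, each new arm's type is drawn from a fixed distribution over finitely many types, each type has a distinct mean reward, and the learner sees neither the types nor the distribution.

```python
import random

class CountableArmedBandit:
    """Reservoir of countably many arms. Each freshly queried arm has a
    type drawn from a fixed distribution over finitely many types; each
    type has a distinct mean reward. Types are hidden from the learner."""

    def __init__(self, type_means, type_probs, seed=0):
        assert len(type_means) == len(type_probs)
        self.type_means = type_means
        self.type_probs = type_probs
        self.rng = random.Random(seed)
        self.arm_types = []  # hidden type index of each queried arm

    def new_arm(self):
        """Query a fresh arm from the reservoir; its type is hidden."""
        t = self.rng.choices(range(len(self.type_means)),
                             weights=self.type_probs)[0]
        self.arm_types.append(t)
        return len(self.arm_types) - 1  # arm index

    def pull(self, arm):
        """Bernoulli reward with the arm's (hidden) type mean."""
        mean = self.type_means[self.arm_types[arm]]
        return 1 if self.rng.random() < mean else 0

# Two types: the learner would know K = 2 but not the means or proportions.
bandit = CountableArmedBandit(type_means=[0.9, 0.4], type_probs=[0.3, 0.7])
a = bandit.new_arm()
rewards = [bandit.pull(a) for _ in range(1000)]
```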


Complexity Analysis of a Countable-armed Bandit Problem

While the order of regret and the complexity of the problem suggest a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct, as are the key primitives that determine complexity and the analysis tools needed to study them.

The Countable-armed Bandit with Vanishing Arms

Necessary and sufficient conditions for achievability of sub-linear regret are characterized in terms of a critical vanishing rate of optimal arms, and two reservoir-distribution-oblivious algorithms are discussed that are long-run-average optimal whenever sub-linear regret is statistically achievable.

Stochastic bandits with groups of similar arms

A lower-bound-inspired strategy, involving a computationally efficient relaxation based on a sorting mechanism, achieves regret close to the lower bound up to a controlled factor, attaining an asymptotic regret q times smaller than that of the unstructured problem.

Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

An algorithm based on a reduction to bandit submodular maximization is designed, and it is shown that, for T rounds comprising N tasks, in the regime of a large number of tasks and a small number M of optimal arms, its regret is smaller than the simple baseline of Õ(√(KNT)) obtainable by standard algorithms designed for non-stationary bandit problems.

Bandits with Dynamic Arm-acquisition Costs*

  • Anand Kalvit, Assaf J. Zeevi
  • Computer Science
    2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
  • 2022
A necessary condition for achievability of sub-linear regret in the problem is characterized, and a UCB-inspired adaptive algorithm is discussed that is long-run-average optimal whenever said condition is satisfied, thereby establishing its tightness.

Multi-Armed Bandits with Bounded Arm-Memory: Near-Optimal Guarantees for Best-Arm Identification and Regret Minimization

This work addresses the Stochastic Multi-armed Bandit problem from the perspective of two standard objectives, regret minimization and best-arm identification, and provides an algorithm with arm-memory size O(log* n) and optimal sample complexity O((n/ε²) · log(1/δ)).

A Closer Look at the Worst-case Behavior of Multi-armed Bandit Algorithms

It is shown that arm-sampling rates under UCB are asymptotically deterministic, regardless of the problem complexity, and the first complete process-level characterization of the MAB problem under UCB in the conventional diffusion scaling is provided.

Max-Utility Based Arm Selection Strategy For Sequential Query Recommendations

It is shown that in tasks like online information gathering, where sequential query recommendations are employed, the sequences of queries are correlated, and the number of potentially optimal queries can be reduced to a manageable size by selecting queries with maximum utility with respect to the currently executing query.



Simple regret for infinitely many armed bandits

This paper proposes an algorithm aiming at minimizing the simple regret, and proves that, depending on β, the algorithm is minimax optimal either up to a multiplicative constant or up to a log(n) factor.

The Continuum-Armed Bandit Problem

This paper constructs a class of certainty-equivalence-control-with-forcing schemes and derives asymptotic upper bounds on their learning loss that are stronger than the o(n) required for optimality with respect to the average-cost-per-unit-time criterion.

Online Optimization in X-Armed Bandits

The results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally Hölder with a known exponent, then the expected regret is bounded, up to a logarithmic factor, by √n.

Algorithms for Infinitely Many-Armed Bandits

A stochastic assumption is made on the mean reward of a newly selected arm, characterizing its probability of being near-optimal; algorithms based on upper confidence bounds applied to a restricted set of randomly selected arms are described, and bounds on the resulting expected regret are provided.
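The "UCB on a restricted set of randomly selected arms" idea summarized above can be sketched as follows. This is a generic UCB1-on-a-random-subset sketch, not the paper's exact algorithm; the reservoir, function names, and parameter values are illustrative assumptions:

```python
import math
import random

def ucb_on_random_subset(pull, new_arm, m, horizon):
    """Draw m arms from an infinite reservoir via new_arm(), then run
    UCB1 on that restricted subset for the given horizon.
    pull(a) must return a reward in [0, 1] for arm handle a."""
    arms = [new_arm() for _ in range(m)]
    counts = [0] * m
    sums = [0.0] * m
    for t in range(1, horizon + 1):
        if t <= m:                      # initialize: pull each arm once
            i = t - 1
        else:                           # then pick the highest UCB1 index
            i = max(range(m),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2 * math.log(t) / counts[j]))
        r = pull(arms[i])
        counts[i] += 1
        sums[i] += r
    return arms, counts, sums

# Illustrative Bernoulli reservoir: each fresh arm is "good" or "bad".
rng = random.Random(1)
means = []
def new_arm():
    means.append(rng.choice([0.2, 0.8]))
    return len(means) - 1
def pull(a):
    return 1.0 if rng.random() < means[a] else 0.0

arms, counts, sums = ucb_on_random_subset(pull, new_arm, m=5, horizon=500)
```

The subset size m trades off against the horizon: too few arms risks sampling no near-optimal arm at all, while too many dilutes exploration.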

Bandit problems with infinitely many arms

We consider a bandit problem consisting of a sequence of n choices from an infinite number of Bernoulli arms, with n → ∞. The objective is to minimize the long-run failure rate. The Bernoulli

Nearly Tight Bounds for the Continuum-Armed Bandit Problem

This work considers the case when the set of strategies is a subset of ℝ^d and the cost functions are continuous, and improves on the best-known upper and lower bounds, closing the gap to a sublogarithmic factor.

Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

A novel algorithm is proposed where the decision to exploit any arm is based on two successive targets, namely the total number of successes until the first failure and until the first m failures, respectively; it achieves a long-term average regret of √(2n) for a large parameter m and a known time horizon n.
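The two-target structure described above can be sketched loosely as follows. The target values, the arm representation (an arm is represented directly by its hidden Bernoulli success probability), and the stopping rules are illustrative assumptions, not the paper's exact calibration:

```python
import random

def two_target_explore(draw_arm, target1, target2, m, rng):
    """Loose sketch of the two-target idea: commit to an arm only if it
    passes two success-count checks, measured (1) up to its first
    failure and (2) up to its m-th failure; otherwise discard the arm
    and draw a fresh one from the reservoir."""
    while True:
        arm = draw_arm()          # hidden Bernoulli mean of a fresh arm
        successes, failures = 0, 0
        # Stage 1: play until the first failure.
        while True:
            if rng.random() < arm:
                successes += 1
            else:
                failures += 1
                break
        if successes < target1:
            continue              # first target missed: discard the arm
        # Stage 2: keep playing until the m-th failure.
        while failures < m:
            if rng.random() < arm:
                successes += 1
            else:
                failures += 1
        if successes >= target2:
            return arm            # both targets met: exploit this arm

rng = random.Random(0)
def draw_arm():
    return rng.choice([0.3, 0.9])  # reservoir of "bad" and "good" arms

chosen = two_target_explore(draw_arm, target1=3, target2=10, m=3, rng=rng)
```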

Multi-armed bandits in metric spaces

This work defines an isometry invariant, MaxMinCOV(X), which bounds from below the performance of Lipschitz MAB algorithms for X, and presents an algorithm which comes arbitrarily close to meeting this bound.

Improved Rates for the Stochastic Continuum-Armed Bandit Problem

It is shown that apart from logarithmic factors, the expected regret scales with the square-root of the number of trials, provided that the mean payoff function has finitely many maxima and its second derivatives are continuous and non-vanishing at the maxima.

On Explore-Then-Commit strategies

Existing deviation inequalities are refined, allowing the design of fully sequential strategies with finite-time regret guarantees that are asymptotically optimal as the horizon grows and order-optimal in the minimax sense.
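The fixed-design Explore-Then-Commit baseline that the paper's fully sequential strategies refine can be sketched in a few lines. The function name, parameters, and reward model below are illustrative assumptions:

```python
import random

def explore_then_commit(pull, n_arms, m, horizon):
    """Minimal ETC sketch: pull each of n_arms arms m times, then commit
    to the empirically best arm for the remainder of the horizon.
    pull(a) must return a reward in [0, 1] for arm index a."""
    means = []
    total = 0.0
    for a in range(n_arms):
        s = sum(pull(a) for _ in range(m))
        total += s
        means.append(s / m)
    best = max(range(n_arms), key=lambda a: means[a])
    for _ in range(horizon - n_arms * m):
        total += pull(best)
    return best, total

# Illustrative two-armed Bernoulli instance.
rng = random.Random(0)
true_means = [0.3, 0.7]
def pull(a):
    return 1.0 if rng.random() < true_means[a] else 0.0

best, total = explore_then_commit(pull, n_arms=2, m=50, horizon=1000)
```

The exploration length m is the design choice the ETC literature studies: too small and the commit step picks a suboptimal arm too often, too large and exploration itself dominates the regret.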