Approximation algorithms for restless bandit problems

@article{Guha2010ApproximationAF,
  title={Approximation algorithms for restless bandit problems},
  author={Sudipto Guha and Kamesh Munagala and Peng Shi},
  journal={Journal of the ACM},
  year={2010}
}
The restless bandit problem is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit (MAB) problem in decision theory. In its ultimate generality, the restless bandit problem is known to be PSPACE-hard to approximate to any nontrivial factor, and little progress has been made on this problem despite its significance in modeling activity allocation under uncertainty. In this article, we consider the Feedback MAB problem, where the reward obtained by…

Citations

Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret

TLDR
This paper proposes a learning algorithm with near-logarithmic regret uniformly over time with respect to the optimal (dynamic) finite-horizon policy, referred to as strong regret, to contrast with the commonly studied notion of weak regret.

Approximations of the Restless Bandit Problem

TLDR
It is shown that under some conditions on the φ-mixing coefficients, a modified version of UCB can prove effective, and a sub-class of the multi-armed restless bandit problem is characterized where approximate solutions can be found using tractable approaches.

Non-Stationary Bandits under Recharging Payoffs: Improved Planning with Sublinear Regret

TLDR
This work improves the best-known guarantees for the planning problem by developing a polynomial-time (1 − 1/e)-approximation algorithm (asymptotically and in expectation), based on a novel combination of randomized LP rounding and a time-correlated (interleaved) scheduling method.
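
The (1 − 1/e) guarantee rests on randomized LP rounding. The paper's LP and interleaved scheduling are specific to recharging payoffs, so the sketch below shows only the generic solve-then-sample pattern such results build on; the rewards, activation costs, and budget are made-up placeholders, not quantities from the paper.

import numpy as np
from scipy.optimize import linprog

def lp_round(rewards, costs, budget, rng=np.random.default_rng(0)):
    # Bare-bones randomized LP rounding: solve a fractional relaxation,
    # then read each x_i as the marginal probability of activating arm i
    # and sample the arms independently. In expectation the sampled set
    # matches the LP objective and respects the budget.
    res = linprog(c=-np.asarray(rewards, dtype=float),  # linprog minimizes
                  A_ub=[costs], b_ub=[budget],
                  bounds=[(0, 1)] * len(rewards), method="highs")
    x = res.x  # fractional activation probabilities
    return [i for i, xi in enumerate(x) if rng.random() < xi]

# Toy instance (all numbers illustrative):
print(lp_round(rewards=[0.9, 0.5, 0.8, 0.1], costs=[1, 1, 2, 1], budget=2))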

Optimality of Myopic Policy for Restless Multiarmed Bandit with Imperfect Observation

  • Kehao Wang
  • Mathematics
    2016 IEEE Global Communications Conference (GLOBECOM)
  • 2016
TLDR
This paper performs an analytical study of the considered RMAB problem and establishes a set of closed-form conditions that guarantee the optimality of the myopic policy.

Multi-policy posterior sampling for restless Markov bandits

  • Suleman Alnatheer, H. Man
  • Computer Science
    2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
  • 2014
TLDR
A polynomial-time algorithm is proposed that learns the transition parameters of each arm and selects the perceived optimal policy from a set of predefined policies, using beliefs (probability distributions over models) via randomized probability matching, better known as Thompson Sampling.
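
The cited algorithm maintains beliefs over each arm's unknown transition parameters and matches predefined policies to posterior draws; that full construction is involved, so the following is only a minimal sketch of the underlying randomized-probability-matching idea on a plain Bernoulli bandit with Beta posteriors (the Bernoulli model is an assumption for illustration, not the paper's restless setting).

import random

def thompson_sampling(pull, n_arms, horizon):
    # Randomized probability matching: keep a Beta(successes+1, failures+1)
    # posterior per arm, draw one sample from each posterior, and play the
    # arm whose sampled mean is largest.
    wins = [0] * n_arms
    losses = [0] * n_arms
    for _ in range(horizon):
        draws = [random.betavariate(wins[i] + 1, losses[i] + 1)
                 for i in range(n_arms)]
        arm = draws.index(max(draws))
        if pull(arm):          # pull(arm) returns True on a unit reward
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses

# Toy usage with made-up success probabilities:
probs = [0.2, 0.5, 0.8]
print(thompson_sampling(lambda i: random.random() < probs[i], 3, 5000)[0])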

Near-optimality for infinite-horizon restless bandits with many arms

TLDR
By replacing the single global Lagrange multiplier used by the Whittle index with a sequence of Lagrange multipliers, one per time period up to a finite truncation point, a class of policies is derived that has an O(√N) optimality gap and is demonstrated to provide state-of-the-art performance on specific problems.

The non-Bayesian restless multi-armed bandit: A case of near-logarithmic regret

TLDR
This work develops an original approach to the RMAB problem that is applicable when the corresponding Bayesian problem has the structure that the optimal solution is one of a prescribed finite set of policies, and uses it to obtain a novel sensing policy for opportunistic spectrum access over unknown dynamic channels.

The Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Strict Regret

TLDR
It is proved that the original approach to the non-Bayesian RMAB problem, in which the parameters of the Markov chain are assumed to be unknown, achieves near-logarithmic regret, which leads to the same average reward that can be achieved by the optimal policy under a known model.

Optimal Adaptive Learning in Uncontrolled Restless Bandit Problems

TLDR
This paper proposes a learning algorithm with logarithmic regret uniformly over time with respect to the optimal finite-horizon policy for uncontrolled restless bandit problems, and extends optimal adaptive learning from MDPs to POMDPs.
...

References

Showing 1-10 of 72 references

On Index Policies for Restless Bandit Problems

TLDR
This paper considers the restless bandit problem, one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit problem in decision theory, and shows that for an interesting and general subclass, which the authors term RECOVERING bandits, a surprisingly simple and intuitive greedy policy yields a factor-2 approximation.

Approximation Algorithms for Partial-Information Based Stochastic Control with Markovian Rewards

  • S. Guha, K. Munagala
  • Computer Science
    48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)
  • 2007
TLDR
A constant-factor approximation to the feedback MAB problem is designed by solving and rounding a natural LP relaxation of the problem; this is the first approximation algorithm for a POMDP problem.

The Nonstochastic Multiarmed Bandit Problem

TLDR
A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs.

Indexability of Restless Bandit Problems and Optimality of Whittle Index for Dynamic Multichannel Access

TLDR
This work establishes indexability, obviates the need to know the Markov transition probabilities in the Whittle index policy, and develops efficient algorithms for computing a performance upper bound given by the Lagrangian relaxation.
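
When no closed form is available, the Whittle index of a state can be computed numerically: binary-search over the passivity subsidy until the single-arm problem is indifferent between playing and resting in that state. Below is a minimal discounted, finite-state sketch of that generic recipe; the two-state transition matrices and rewards are invented placeholders (the cited paper works with belief states of a Markov channel and establishes indexability rather than assuming it).

import numpy as np

def arm_q_values(P_act, P_pas, r_act, subsidy, beta=0.95, tol=1e-9):
    # Discounted value iteration for a single restless arm when resting
    # ("passive") pays `subsidy` per step instead of the playing reward.
    V = np.zeros(len(r_act))
    while True:
        q_act = r_act + beta * (P_act @ V)    # value of playing the arm
        q_pas = subsidy + beta * (P_pas @ V)  # value of resting the arm
        V_new = np.maximum(q_act, q_pas)
        if np.max(np.abs(V_new - V)) < tol:
            return q_act, q_pas
        V = V_new

def whittle_index(state, P_act, P_pas, r_act, lo=-10.0, hi=10.0):
    # Binary-search the subsidy at which `state` is indifferent between
    # the two actions; assumes indexability, i.e. a unique crossing point.
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        q_act, q_pas = arm_q_values(P_act, P_pas, r_act, mid)
        if q_act[state] > q_pas[state]:
            lo = mid   # subsidy still too small: playing beats resting
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative two-state arm in the spirit of a Gilbert-Elliott channel
# (all numbers made up):
P_act = np.array([[0.7, 0.3], [0.4, 0.6]])  # transitions when played
P_pas = np.array([[0.9, 0.1], [0.2, 0.8]])  # transitions when rested
r_act = np.array([0.0, 1.0])                # reward 1 in the "good" state
print([whittle_index(s, P_act, P_pas, r_act) for s in (0, 1)])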

Gambling in a rigged casino: The adversarial multi-armed bandit problem

TLDR
A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs is given.

Approximation algorithms for budgeted learning problems

TLDR
The first approximation algorithms for a large class of budgeted learning problems, including the budgeted multi-armed bandit problem, are presented, providing approximate policies that achieve a reward within a constant factor of the reward of the optimal policy.

Finite-time Analysis of the Multiarmed Bandit Problem

TLDR
This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
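
The UCB1 policy analyzed in this reference fits in a few lines: play every arm once, then always pull the arm maximizing its empirical mean plus sqrt(2 ln t / n_i). A minimal sketch follows; the Bernoulli arms and their means are illustrative assumptions, not part of the paper.

import math
import random

def ucb1(pull, n_arms, horizon):
    # UCB1 (Auer, Cesa-Bianchi, Fischer): rewards must lie in [0, 1].
    counts = [0] * n_arms    # n_i: number of pulls of arm i
    means = [0.0] * n_arms   # empirical mean reward of arm i
    for t in range(1, horizon + 1):
        if t <= n_arms:      # initialization: one pull per arm
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean
    return means, counts

# Toy usage with Bernoulli arms (means made up for illustration):
probs = [0.3, 0.5, 0.7]
means, counts = ucb1(lambda i: 1.0 if random.random() < probs[i] else 0.0,
                     n_arms=3, horizon=10_000)
print(counts)  # most pulls should concentrate on the 0.7 arm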

Restless Bandits, Linear Programming Relaxations, and a Primal-Dual Index Heuristic

TLDR
A mathematical programming approach for the classical PSPACE-hard restless bandit problem in stochastic optimization is developed, and a priority-index heuristic scheduling policy is proposed from the solution to the first-order relaxation, where the indices are defined in terms of optimal dual variables.

A Restless Bandit Formulation of Multi-channel Opportunistic Access: Indexablity and Index Policy

TLDR
This work formulates the problem of optimal sequential channel selection as a restless multi-armed bandit process, for which a powerful index policy (Whittle's index policy) can be implemented with remarkably low complexity based on the indexability of the system.

Adapting to a Changing Environment: the Brownian Restless Bandits

TLDR
The goal here is to characterize the cost of learning and adapting to the changing environment, in terms of the stochastic rate of the change; the regret is over an infinite time horizon and is defined with respect to a hypothetical algorithm that at every step plays the arm with the maximum expected reward at that step.
...