# Approximation algorithms for restless bandit problems

@article{Guha2010ApproximationAF,
  title={Approximation algorithms for restless bandit problems},
  author={Sudipto Guha and Kamesh Munagala and Peng Shi},
  journal={Journal of the ACM},
  year={2010}
}
• Published in JACM 25 November 2007
• Computer Science, Mathematics
The restless bandit problem is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit (MAB) problem in decision theory. In its ultimate generality, the restless bandit problem is known to be PSPACE-hard to approximate to any nontrivial factor, and little progress has been made on this problem despite its significance in modeling activity allocation under uncertainty. In this article, we consider the Feedback MAB problem, where the reward obtained by…
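
For context, the planning problem the article targets can be stated in Whittle's generic restless-bandit form (the notation here is illustrative, not necessarily the article's): choose a policy $\pi$ that, at each step, plays $M$ of the $n$ arms so as to maximize long-run average reward,

$$
\max_{\pi}\; \liminf_{T\to\infty} \frac{1}{T}\,
\mathbb{E}^{\pi}\!\left[\sum_{t=1}^{T}\sum_{i=1}^{n} r_i\big(s_i(t)\big)\, a_i(t)\right]
\quad \text{s.t.} \quad \sum_{i=1}^{n} a_i(t) = M \;\; \forall t,
$$

where $a_i(t)\in\{0,1\}$ indicates whether arm $i$ is played at time $t$ and each arm's state $s_i(t)$ evolves under one Markov kernel when played and a different ("restless") kernel when passive.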
126 Citations

## Citations

### Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret

• Computer Science
• 2013
This paper proposes a learning algorithm whose regret, relative to the optimal (dynamic) finite-horizon policy, is near-logarithmic uniformly over time; this notion is referred to as strong regret, in contrast to the commonly studied notion of weak regret.

### Approximations of the Restless Bandit Problem

• Computer Science
J. Mach. Learn. Res.
• 2019
A sub-class of the multi-armed restless bandit problem is characterized where approximate solutions can be found using tractable approaches, and it is shown that under some conditions on the $\varphi$-mixing coefficients, a modified version of UCB can prove effective.

### Non-Stationary Bandits under Recharging Payoffs: Improved Planning with Sublinear Regret

• Computer Science
ArXiv
• 2022
This work improves the best-known guarantees for the planning problem by developing a polynomial-time (1 − 1/e)-approximation algorithm (asymptotically and in expectation), based on a novel combination of randomized LP rounding and a time-correlated (interleaved) scheduling method.

### Optimality of Myopic Policy for Restless Multiarmed Bandit with Imperfect Observation

• Kehao Wang
• Mathematics
2016 IEEE Global Communications Conference (GLOBECOM)
• 2016
This paper performs an analytical study on the considered RMAB problem, and establishes a set of closed-form conditions to guarantee the optimality of the myopic policy.

### Multi-policy posterior sampling for restless Markov bandits

• Computer Science
2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
• 2014
A polynomial-time algorithm is proposed that learns the transition parameters of each arm and selects the perceived optimal policy from a set of predefined policies using beliefs (probability distributions) via randomized probability matching, better known as Thompson Sampling.
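
The cited work applies posterior sampling to select among predefined policies for restless Markov arms; stripped to its core, randomized probability matching (Thompson Sampling) on independent Bernoulli arms looks like the sketch below. The arm means, horizon, and seed are invented for illustration, not taken from the paper:

```python
import random

def thompson_sampling(arm_means, horizon, seed=0):
    """Bernoulli Thompson Sampling: keep a Beta(s_i + 1, f_i + 1)
    posterior per arm, draw one sample from each posterior, and play
    the arm with the largest sample (randomized probability matching)."""
    rng = random.Random(seed)
    successes = [0] * len(arm_means)
    failures = [0] * len(arm_means)
    total_reward = 0
    for _ in range(horizon):
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(len(arm_means))]
        i = max(range(len(arm_means)), key=lambda k: samples[k])
        if rng.random() < arm_means[i]:   # simulate a Bernoulli pull
            successes[i] += 1
            total_reward += 1
        else:
            failures[i] += 1
    return total_reward, successes, failures

# Toy run: plays should concentrate on the 0.8 arm over time.
reward, s, f = thompson_sampling([0.2, 0.8], horizon=2000)
```

Sampling from the posterior, rather than acting on its mean, is what makes the exploration automatic: arms with few pulls produce high-variance samples and so still get played occasionally.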

### Near-optimality for infinite-horizon restless bandits with many arms

• Computer Science, Economics
ArXiv
• 2022
By replacing the single global Lagrange multiplier used by the Whittle index with a sequence of Lagrange multipliers, one per time period up to a finite truncation point, a class of policies is derived that has an O(√N) optimality gap and is demonstrated to provide state-of-the-art performance on specific problems.

### The non-Bayesian restless multi-armed bandit: A case of near-logarithmic regret

• Computer Science
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
• 2011
This work develops an original approach to the RMAB problem that is applicable when the corresponding Bayesian problem is structured so that the optimal solution lies in a prescribed finite set of policies, and applies it to obtain a novel sensing policy for opportunistic spectrum access over unknown dynamic channels.

### The Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Strict Regret

• Computer Science, Mathematics
ArXiv
• 2011
It is proved that the original approach to the non-Bayesian RMAB problem, in which the parameters of the Markov chain are assumed to be unknown, achieves near-logarithmic regret, which leads to the same average reward that can be achieved by the optimal policy under a known model.

### Optimal Adaptive Learning in Uncontrolled Restless Bandit Problems

• Computer Science
• 2012
This paper proposes a learning algorithm with logarithmic regret uniformly over time with respect to the optimal finite horizon policy for uncontrolled restless bandit problems, and extends the optimal adaptive learning of MDPs to POMDPs.

## References

Showing 1–10 of 72 references

### On Index Policies for Restless Bandit Problems

• Computer Science
• 2007
This paper considers the restless bandit problem, one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit problem in decision theory, and shows that for an interesting and general subclass, which the authors term RECOVERING bandits, a surprisingly simple and intuitive greedy policy yields a factor-2 approximation.

### Approximation Algorithms for Partial-Information Based Stochastic Control with Markovian Rewards

• Computer Science
48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)
• 2007
A constant factor approximation to the feedback MAB problem is designed by solving and rounding a natural LP relaxation to this problem, which is the first approximation algorithm for a POMDP problem.

### The Nonstochastic Multiarmed Bandit Problem

• Computer Science, Economics
SIAM J. Comput.
• 2002
A solution is given to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs.

### Indexability of Restless Bandit Problems and Optimality of Whittle Index for Dynamic Multichannel Access

• Mathematics
IEEE Transactions on Information Theory
• 2010
This work establishes the indexability and obviates the need to know the Markov transition probabilities in Whittle index policy, and develops efficient algorithms for computing a performance upper bound given by Lagrangian relaxation.

### Gambling in a rigged casino: The adversarial multi-armed bandit problem

• Computer Science
Proceedings of IEEE 36th Annual Foundations of Computer Science
• 1995
A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs is given.
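
The adversarial model in this reference is handled by exponential-weights algorithms such as Exp3. A minimal sketch follows; the reward function, horizon, and γ are made up for illustration and are not the paper's experiments:

```python
import math
import random

def exp3(reward_fn, n_arms, horizon, gamma=0.1, seed=0):
    """Exp3: exponential weights over arms with importance-weighted
    reward estimates; gamma mixes in uniform exploration so every
    arm keeps a positive probability of being sampled."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    total = 0.0
    for t in range(horizon):
        wsum = sum(weights)
        probs = [(1 - gamma) * w / wsum + gamma / n_arms for w in weights]
        # Sample an arm from the probability vector by inverse CDF.
        u, acc, chosen = rng.random(), 0.0, n_arms - 1
        for i, p in enumerate(probs):
            acc += p
            if u <= acc:
                chosen = i
                break
        r = reward_fn(t, chosen)              # adversarial reward in [0, 1]
        total += r
        # Importance-weighted update: only the played arm's estimate moves.
        weights[chosen] *= math.exp(gamma * r / (probs[chosen] * n_arms))
    return total

# An "adversary" that always pays arm 1 and nothing elsewhere.
payoff = exp3(lambda t, i: 1.0 if i == 1 else 0.0, n_arms=3, horizon=3000)
```

The importance-weighted estimate r / p keeps the reward estimates unbiased even though only one arm's payoff is observed per round, which is what lets the guarantee hold against an adaptive adversary.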

### Approximation algorithms for budgeted learning problems

• Computer Science
STOC '07
• 2007
The first approximation algorithms for a large class of budgeted learning problems, including the budgeted multi-armed bandit problem, are presented, providing approximate policies that achieve a reward within constant factor of the reward optimal policy.

### Finite-time Analysis of the Multiarmed Bandit Problem

• Computer Science
Machine Learning
• 2004
This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
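
The UCB1 policy from this analysis is short enough to state directly: play each arm once, then always play the arm maximizing empirical mean plus a √(2 ln t / n_i) confidence bonus. A self-contained sketch on simulated Bernoulli arms (the means, horizon, and seed are invented):

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """UCB1: after one warm-up pull per arm, pick the arm with the
    largest empirical mean + sqrt(2 * ln t / n_i) upper confidence bound."""
    rng = random.Random(seed)
    n = [0] * len(arm_means)       # pull counts
    mean = [0.0] * len(arm_means)  # running empirical means
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= len(arm_means):
            i = t - 1              # warm-up: play each arm once
        else:
            i = max(range(len(arm_means)),
                    key=lambda k: mean[k] + math.sqrt(2 * math.log(t) / n[k]))
        r = 1.0 if rng.random() < arm_means[i] else 0.0
        n[i] += 1
        mean[i] += (r - mean[i]) / n[i]  # incremental mean update
        total += r
    return total, n

total, pulls = ucb1([0.3, 0.7], horizon=5000)
```

The bonus term shrinks as an arm accumulates pulls, so suboptimal arms are only played O(log t) times, matching the logarithmic regret bound the paper proves for bounded rewards.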

### Restless Bandits, Linear Programming Relaxations, and a Primal-Dual Index Heuristic

• Computer Science
Oper. Res.
• 2000
A mathematical programming approach is developed for the classical PSPACE-hard restless bandit problem in stochastic optimization, and a priority-index heuristic scheduling policy is proposed from the solution to the first-order relaxation, where the indices are defined in terms of optimal dual variables.

### A Restless Bandit Formulation of Multi-channel Opportunistic Access: Indexability and Index Policy

• Computer Science
ArXiv
• 2008
This work formulates the problem of optimal sequential channel selection as a restless multi-armed bandit process, for which a powerful index policy (Whittle's index policy) can be implemented with remarkably low complexity, contingent on the indexability of the system.

### Adapting to a Changing Environment: the Brownian Restless Bandits

• Computer Science
COLT
• 2008
The goal here is to characterize the cost of learning and adapting to the changing environment in terms of the stochastic rate of the change; this cost is measured over an infinite time horizon and defined with respect to a hypothetical algorithm that at every step plays the arm with the maximum expected reward at that step.