Multi-Armed Bandits: Theory and Applications to Online Learning in Networks

  • Qing Zhao
  • Published in Multi-Armed Bandits, 21 November 2019
  • Computer Science
Abstract: Multi-armed bandit problems pertain to optimal sequential decision making and learning in unknown environments. Since the first bandit problem posed by Thompson in 1933 for the application... 
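The abstract's mention of Thompson's 1933 problem refers to the idea now known as Thompson sampling: pull the arm whose sampled posterior mean is largest. A minimal Beta-Bernoulli sketch (the function names and parameters here are illustrative, not taken from the book):

```python
import random

def thompson_sampling(rewards_fn, n_arms, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling: draw a mean from each arm's
    Beta(successes+1, failures+1) posterior, pull the argmax arm, and
    update that arm's posterior with the observed 0/1 reward."""
    rng = random.Random(seed)
    successes = [0] * n_arms
    failures = [0] * n_arms
    total = 0
    for _ in range(horizon):
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        r = rewards_fn(arm)
        total += r
        if r:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return total

# Toy environment: three Bernoulli arms with unknown means.
means = [0.2, 0.5, 0.8]
env_rng = random.Random(1)
reward = thompson_sampling(lambda a: 1 if env_rng.random() < means[a] else 0,
                           n_arms=3, horizon=2000)
```

Because the posterior concentrates on the best arm, the cumulative reward approaches what the best single arm (mean 0.8) would earn over the horizon.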

Distributed No-Regret Learning in Multi-Agent Systems

An overview of new challenges and representative results on distributed no-regret learning in multi-agent systems modeled as repeated unknown games is given.

Multi-Armed Bandits with Dependent Arms

Learning algorithms based on the UCB principle are developed that exploit the side observations induced by dependencies among arms while performing the exploration-exploitation trade-off in the classical multi-armed bandit problem.
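For context, the classical UCB principle that these algorithms build on can be sketched as follows. This is the standard UCB1 index (empirical mean plus a confidence bonus), not the paper's side-observation extension; names and parameters are illustrative:

```python
import math

def ucb1(rewards_fn, n_arms, horizon):
    """Classical UCB1: pull each arm once, then repeatedly pull the arm
    maximizing (empirical mean) + sqrt(2 ln t / n_pulls), which balances
    exploration (rarely pulled arms get a large bonus) and exploitation."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:  # initialization: pull each arm once
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        r = rewards_fn(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts

# Deterministic toy rewards: arm 1 is clearly better than arm 0.
total, counts = ucb1(lambda a: [0.1, 0.9][a], n_arms=2, horizon=1000)
```

After the confidence intervals separate, the inferior arm is pulled only occasionally, so its pull count grows logarithmically while the better arm absorbs almost the whole horizon.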

Memory-Constrained No-Regret Learning in Adversarial Multi-Armed Bandits

This work appears to be the first on memory-constrained bandit problems in the adversarial setting using a hierarchical learning framework that offers a sequence of operating points on the tradeoff curve between the regret order and memory complexity.

Regret of Age-of-Information Bandits in Nonstationary Wireless Networks

This work considers a wireless network in which a source periodically generates time-sensitive information and transmits it to a destination via one of $N$ non-stationary orthogonal wireless channels, and derives a lower bound on the AoI regret achievable by any policy.

A Fast-Pivoting Algorithm for Whittle’s Restless Bandit Index

A new fast-pivoting algorithm is obtained that computes the $n$ Whittle index values of an $n$-state restless bandit by performing, after an initialization stage, $n$ steps that entail $(2/3)n^3 + O(n^2)$ arithmetic operations.

NeurWIN: Neural Whittle Index Network For Restless Bandits Via Deep RL

It is shown that a neural network that produces the Whittle index is also one that produces the optimal control for a set of Markov decision problems, which motivates using deep reinforcement learning for the training of NeurWIN.

Optimal Order Simple Regret for Gaussian Process Bandits

This work proves an $\tilde{O}(\sqrt{\gamma_N/N})$ bound on the simple regret performance of a pure exploration algorithm that is significantly tighter than the existing bounds and is order optimal up to logarithmic factors in the cases where a lower bound on regret is known.

On Information Gain and Regret Bounds in Gaussian Process Bandits

General bounds on $\gamma_T$ are provided based on the decay rate of the eigenvalues of the GP kernel; their specialization for commonly used kernels improves the existing bounds on $\gamma_T$ and, consequently, the regret bounds relying on $\gamma_T$ in numerous settings.



Discrete multi-armed bandits and multi-parameter processes

The general multi-armed bandit problem is reformulated and solved as a control problem over a partially ordered set to provide a technically convenient framework for bandit-like problems and adds insight to the structure of strategies over partially ordered sets.

Linear Programming for Finite State Multi-Armed Bandit Problems

It is shown that when the state space is finite the computation of the dynamic allocation indices can be handled by linear programming methods.

Robust Risk-Averse Stochastic Multi-armed Bandits

An algorithm, called RA-UCB, is provided to solve a variant of the standard stochastic multi-armed bandit problem when one is not interested in the arm with the best mean, but instead in the arm maximizing some coherent risk measure criterion.

Bandits with budgets

This work derives regret bounds on the expected reward in such a bandit problem using a modification of the well-known upper confidence bound algorithm UCB1.

Generalized Bandit Problems

This chapter examines a number of extensions of the multi-armed bandit framework. We consider the possibility of an infinite number of available arms, we give conditions under which the Gittins index...

The non-Bayesian restless multi-armed bandit: A case of near-logarithmic regret

This work develops an original approach to the RMAB problem that is applicable when the corresponding Bayesian problem has the structure that the optimal solution is one of a prescribed finite set of policies, and develops a novel sensing policy for opportunistic spectrum access over unknown dynamic channels.

Bandit problems with side observations

An extension of the traditional two-armed bandit problem is considered, in which the decision maker has access to some side information before deciding which arm to pull, and the value of this additional information is quantified.

The Sample Complexity of Exploration in the Multi-Armed Bandit Problem

This work considers the multi-armed bandit problem under the PAC ("probably approximately correct") model and generalizes the lower bound to a Bayesian setting, and to the case where the statistics of the arms are known but the identities of the arms are not.

Decentralized learning for multi-player multi-armed bandits

An online index-based learning policy called the dUCB4 algorithm is proposed that trades off exploration vs. exploitation in the right way, achieving expected regret that grows at most as near-$O(\log^2 T)$.