Distributed Bandits: Probabilistic Communication on d-regular Graphs

Udari Madhushani and Naomi Ehrich Leonard, 2021 European Control Conference (ECC).
We study the decentralized multi-agent multi-armed bandit problem for agents that communicate probabilistically over a network defined by a d-regular graph. Every edge in the graph carries probabilistic weight p, accounting for the probability (1 − p) of a communication link failure. At each time step, each agent chooses an arm and receives a numerical reward associated with that arm. After each choice, each agent observes, with probability p, the last reward obtained by each of its neighbors. We…
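The communication model in the abstract can be sketched in simulation: agents on a d-regular graph each run a bandit policy, and after every round each agent receives each neighbor's latest arm choice and reward with probability p. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it uses a cycle (a 2-regular graph), Bernoulli arms with assumed means, and a plain UCB1 index over each agent's own and observed samples.

```python
import math
import random

def simulate(num_agents=5, num_arms=3, horizon=2000, p=0.7, seed=0):
    """Sketch of probabilistic communication on a 2-regular graph (cycle).

    Assumptions (not from the paper): Bernoulli arm means below, the
    cycle topology, and the UCB1 index as each agent's sampling rule.
    Returns the group cumulative regret."""
    rng = random.Random(seed)
    means = [0.3, 0.5, 0.8]          # assumed Bernoulli arm means
    best = max(means)
    # cycle graph: neighbors of agent i are i-1 and i+1 (mod n)
    neighbors = [((i - 1) % num_agents, (i + 1) % num_agents)
                 for i in range(num_agents)]
    counts = [[0] * num_arms for _ in range(num_agents)]
    sums = [[0.0] * num_arms for _ in range(num_agents)]
    regret = 0.0
    for t in range(1, horizon + 1):
        choices, rewards = [], []
        for i in range(num_agents):
            untried = [k for k in range(num_arms) if counts[i][k] == 0]
            if untried:
                arm = untried[0]     # play each arm once first
            else:
                # UCB1 index over own plus observed samples
                arm = max(range(num_arms), key=lambda k:
                          sums[i][k] / counts[i][k]
                          + math.sqrt(2 * math.log(t) / counts[i][k]))
            r = 1.0 if rng.random() < means[arm] else 0.0
            choices.append(arm)
            rewards.append(r)
            regret += best - means[arm]
        # update own estimates, then observe each neighbor w.p. p
        for i in range(num_agents):
            counts[i][choices[i]] += 1
            sums[i][choices[i]] += rewards[i]
            for j in neighbors[i]:
                if rng.random() < p:   # communication link succeeds
                    counts[i][choices[j]] += 1
                    sums[i][choices[j]] += rewards[j]
    return regret

print(simulate())
```

Setting p = 0 recovers independent single-agent UCB1 runs, while p = 1 gives every agent its neighbors' full reward history, which is the usual way such models interpolate between no communication and perfect local communication.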


One More Step Towards Reality: Cooperative Bandits with Imperfect Communication
This paper proposes decentralized algorithms that achieve competitive performance, along with near-optimal guarantees on the incurred group regret, and presents an improved delayed-update algorithm that outperforms the existing state-of-the-art on various network topologies.
Decentralized Cooperative Stochastic Bandits
A fully decentralized algorithm that uses an accelerated consensus procedure to compute (delayed) estimates of the average of rewards obtained by all the agents for each arm, and then uses an upper confidence bound (UCB) algorithm that accounts for the delay and error of the estimates.
Heterogeneous Stochastic Interactions for Multiple Agents in a Multi-armed Bandit Problem
An algorithm is designed for each agent to maximize its own expected cumulative reward and performance bounds that depend on the sociability of the agents and the network structure are proved.
A Dynamic Observation Strategy for Multi-agent Multi-armed Bandit Problem
A sampling algorithm and an observation protocol for each agent to maximize its own expected cumulative reward through minimizing expected cumulative sampling regret and expected cumulative observation regret is designed.
Social Imitation in Cooperative Multiarmed Bandits: Partition-Based Algorithms with Strictly Local Information
A novel policy based on partitions of the communication graph is developed and a distributed method for selecting an arbitrary number of leaders and partitions is proposed and evaluated using Monte-Carlo simulations.
Coordinated Versus Decentralized Exploration In Multi-Agent Multi-Armed Bandits
An algorithm for the decentralized setting is introduced that uses a value-of-information-based communication strategy and an exploration-exploitation strategy based on the centralized algorithm; it is shown experimentally to converge rapidly to the performance of the centralized method.
Collaborative learning of stochastic bandits over a social network
A key finding of this paper is that natural extensions of widely-studied single agent learning policies to the network setting need not perform well in terms of regret.
Optimal Algorithms for Multiplayer Multi-Armed Bandits
DPE1 (Decentralized Parsimonious Exploration), a decentralized algorithm that achieves the same asymptotic regret as that obtained by an optimal centralized algorithm for Multiplayer Multi-Armed Bandit.
Decentralized Exploration in Multi-Armed Bandits
A generic algorithm, Decentralized Elimination, is provided, which uses any best-arm identification algorithm as a subroutine; it is proved that this algorithm ensures privacy with a low communication cost, and that, compared to the lower bound of the best-arm identification problem, its sample complexity suffers a penalty depending on the inverse of the probability of the most frequent players.
Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-Part II: Markovian rewards
At each instant of time we are required to sample a fixed number m ≥ 1 out of N i.i.d. processes whose distributions belong to a family suitably parameterized by a real number θ.
Algorithms for Differentially Private Multi-Armed Bandits
This work shows that there exist differentially private variants of Upper Confidence Bound algorithms which have optimal regret, and substantially improves the bounds of previous family of algorithms which use a continual release mechanism.