• Corpus ID: 226282413

# Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions

@inproceedings{Bubeck2021CooperativeAS,
title={Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions},
author={S{\'e}bastien Bubeck and Thomas Budzinski and Mark Sellke},
booktitle={COLT},
year={2021}
}
• Published in COLT 8 November 2020
• Computer Science, Mathematics
We consider the cooperative multi-player version of the stochastic multi-armed bandit problem. We study the regime where the players cannot communicate but have access to shared randomness. In prior work by the first two authors, a strategy for this regime was constructed for two players and three arms, with regret $\tilde{O}(\sqrt{T})$, and with no collisions at all between the players (with very high probability). In this paper we show that these properties (near-optimal regret and no…
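The key property the abstract describes is that shared randomness alone lets players avoid collisions without exchanging any messages: if every player draws the *same* random permutation of the arms from a common seed and takes a distinct slot in it, no two players ever pull the same arm. The sketch below illustrates only this collision-free mechanism on a toy instance (2 players, 3 Bernoulli arms with made-up means); it is not the paper's regret-optimal strategy.

```python
import random

def run_round(arm_means, shared_rng, local_rngs):
    """One round: all players draw the SAME random permutation of arms
    from the shared randomness, then take distinct slots in it, so they
    can never collide -- without exchanging any messages."""
    perm = shared_rng.sample(range(len(arm_means)), k=len(arm_means))
    pulls = []
    for player, rng in enumerate(local_rngs):
        arm = perm[player]  # player i takes the i-th slot of the shared permutation
        reward = 1 if rng.random() < arm_means[arm] else 0  # Bernoulli arm
        pulls.append((arm, reward))
    return pulls

# Toy instance: 2 players, 3 Bernoulli arms (means are illustrative only).
arm_means = [0.9, 0.5, 0.1]
shared = random.Random(42)                     # common seed = shared randomness
players = [random.Random(i) for i in range(2)]  # private reward noise

for t in range(5):
    pulls = run_round(arm_means, shared, players)
    arms = [a for a, _ in pulls]
    assert len(set(arms)) == len(arms)  # no collision, ever
```

A full strategy would additionally bias these shared draws toward empirically good arms to obtain low regret; the point here is only that coordination needs no communication once randomness is shared.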
## Citations

Multi-Player Multi-Armed Bandits With Collision-Dependent Reward Distributions
• Computer Science, Engineering
IEEE Transactions on Signal Processing
• 2021
The Error-Correction Collision Communication (EC3) algorithm is proposed that models implicit communication as a reliable communication over noisy channel problem, for which random coding error exponent is used to establish the optimal regret that no communication protocol can beat.
An Instance-Dependent Analysis for the Cooperative Multi-Player Multi-Armed Bandit
• Computer Science, Mathematics
ArXiv
• 2021
This work shows that a simple modification to a successive elimination strategy can be used to allow the players to estimate their suboptimality gaps, up to constant factors, in the absence of collisions, and designs a communication protocol that successfully uses the small reward of collisions to coordinate among players, while preserving meaningful instance-dependent logarithmic regret guarantees.
Heterogeneous Multi-player Multi-armed Bandits: Closing the Gap and Generalization
• Chengshuai Shi
• Computer Science, Mathematics
ArXiv
• 2021
BEACON bridges the algorithm design and regret analysis of combinatorial MAB (CMAB) and MP-MAB, two largely disjointed areas in MAB, and the results suggest that this previously ignored connection is worth further investigation.
Decentralized Cooperative Reinforcement Learning with Hierarchical Information Structure
• Computer Science, Mathematics
ArXiv
• 2021
This work considers two-agent multi-armed bandits and Markov decision processes with a hierarchical information structure arising in applications to propose simpler and more efficient algorithms that require no coordination or communication.
Decentralized Learning in Online Queuing Systems
• Computer Science, Mathematics
ArXiv
• 2021
Cooperative queues are considered and the first learning decentralized algorithm guaranteeing stability of the system as long as the ratio of rates is larger than 1 is proposed, thus reaching performances comparable to centralized strategies.
Collaborative Pure Exploration in Kernel Bandit
• Yihan Du, Wei Chen
• Computer Science
ArXiv
• 2021
In this paper, we formulate a Collaborative Pure Exploration in Kernel Bandit problem (CoPE-KB), which provides a novel model for multi-agent multi-task decision making under limited communication.
Bandit Learning in Decentralized Matching Markets
• Computer Science, Mathematics
ArXiv
• 2020
This model extends the standard stochastic multi-armed bandit framework to a decentralized multiple player setting with competition and introduces a new algorithm for this setting that attains stable regret when preferences of the arms over players are shared.

## References

Showing 1–10 of 12 references.
Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without
• Computer Science, Mathematics
COLT
• 2020
The first $\sqrt{T}$-type regret guarantee for this problem is proved under the feedback model where collisions are announced to the colliding players, and a sublinear regret guarantee is given when no collision information is available.
Distributed Learning in Multi-Armed Bandit With Multiple Players
• Computer Science, Mathematics
IEEE Transactions on Signal Processing
• 2010
It is shown that the minimum system regret of the decentralized MAB grows with time at the same logarithmic order as in the centralized counterpart where players act collectively as a single entity by exchanging observations and making decisions jointly.
Multiplayer bandits without observing collision information
• Mathematics, Computer Science
ArXiv
• 2018
An algorithm for reaching approximate Nash equilibria quickly in stochastic anticoordination games and the first square-root regret bounds that do not depend on the gaps between the means are given.
Multi-Player Bandits - a Musical Chairs Approach
• Computer Science, Mathematics
ICML
• 2016
This work provides a communication-free algorithm (Musical Chairs) which attains constant regret with high probability, as well as a sublinear-regret, communication-free algorithm (Dynamic Musical Chairs) for the more difficult setting of players dynamically entering and leaving throughout the game.
Concurrent Bandits and Cognitive Radio Networks
• Computer Science
ECML/PKDD
• 2014
An algorithm is proposed that combines an ε-greedy learning rule with a collision-avoidance mechanism, showing that sublinear regret can be obtained in this setting and yielding a dramatic improvement over other algorithms for it.
SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits
• Computer Science, Mathematics
NeurIPS
• 2019
A decentralized algorithm is presented that achieves the same performance as a centralized one, contradicting the existing lower bounds for the stochastic multiplayer multi-armed bandit problem and showing that the logarithmic growth of the regret is still achievable for this model with a new algorithm.
• Computer Science, Mathematics
J. Mach. Learn. Res.
• 2020
This work designs the first Multi-player Bandit algorithm that provably works in arbitrarily changing environments, where the losses of the arms may even be chosen by an adversary.
Medium access in cognitive radio networks: A competitive multi-armed bandit framework
• Computer Science
2008 42nd Asilomar Conference on Signals, Systems and Computers
• 2008
Low complexity medium access protocols are developed which strike an optimal balance between exploration and exploitation in such competitive environments, and the operating points of these low complexity protocols are shown to converge to those of the scenario in which the parameters are known.
Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret
• Computer Science, Mathematics
IEEE Journal on Selected Areas in Communications
• 2011
This work proposes policies for distributed learning and access which achieve order-optimal cognitive system throughput under self play, i.e., when implemented at all the secondary users, and proposes a policy whose sum regret grows only slightly faster than logarithmic in the number of transmission slots.
Multi-Armed Bandit Learning in IoT Networks: Learning Helps Even in Non-stationary Settings
• Computer Science
CrownCom
• 2017
It is proved that intelligent devices in unlicensed bands can use Multi-Armed Bandit (MAB) learning algorithms to improve resource exploitation; stochastic MAB learning provides up to a 16% gain in terms of successful transmission probability and has near-optimal performance even in non-stationary and non-i.i.d. settings.