Corpus ID: 226282413

Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions

@inproceedings{Bubeck2021CooperativeAS,
  title={Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions},
  author={S{\'e}bastien Bubeck and Thomas Budzinski and Mark Sellke},
  booktitle={COLT},
  year={2021}
}
We consider the cooperative multi-player version of the stochastic multi-armed bandit problem. We study the regime where the players cannot communicate but have access to shared randomness. In prior work by the first two authors, a strategy for this regime was constructed for two players and three arms, with regret $\tilde{O}(\sqrt{T})$, and with no collisions at all between the players (with very high probability). In this paper we show that these properties (near-optimal regret and no… 
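The setting can be made concrete with a short simulation. The sketch below is an illustration only, not the paper's strategy: it implements the standard collision model in which colliding players receive zero reward, and stands in for the shared-randomness assumption with a common random seed. The arm means and helper names are made up for the example.

```python
import numpy as np

# Minimal sketch of the cooperative multi-player bandit setting under
# the standard collision model: players that pick the same arm in a
# round receive zero reward. All names and values are illustrative.

K, M, T = 3, 2, 10_000                 # arms, players, horizon
means = np.array([0.9, 0.5, 0.4])      # hypothetical Bernoulli means

# "Shared randomness": each player could regenerate this exact stream
# from a common seed, with no communication channel between them.
shared_rng = np.random.default_rng(0)

def play_round(choices, rng):
    """choices[i] is the arm pulled by player i; returns rewards."""
    rewards = np.zeros(M)
    for i, a in enumerate(choices):
        if sum(c == a for c in choices) == 1:   # no collision on arm a
            rewards[i] = float(rng.random() < means[a])
    return rewards

# With 2 players and 3 arms, the collision-free optimum occupies the
# two best arms, for an expected per-round collective reward of 1.4.
total = sum(play_round([0, 1], shared_rng).sum() for _ in range(T))
print(total / T)    # close to 0.9 + 0.5 = 1.4
```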
Multi-Player Multi-Armed Bandits With Collision-Dependent Reward Distributions
TLDR: The Error-Correction Collision Communication (EC3) algorithm is proposed, which models implicit communication as a reliable-communication-over-a-noisy-channel problem, for which the random coding error exponent is used to establish the optimal regret that no communication protocol can beat.
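The noisy-channel view of implicit communication can be illustrated with the simplest possible code, a repetition code: a sender signals a bit by colliding or not colliding with a listener on an agreed arm, and the listener majority-decodes its observations. The sketch below is a hedged, generic illustration under an assumed bit-flip noise model with made-up parameters; it is not the EC3 algorithm itself.

```python
import random

# Generic sketch: sending one bit over the "collision channel" with a
# repetition code. The sender collides with the listener's arm to send
# a 1 and stays away to send a 0; each round's collision observation
# is assumed to flip with probability p_flip (a stand-in for noisy
# collision feedback). Not the EC3 algorithm itself.

rng = random.Random(0)

def send_bit(bit, rounds=31, p_flip=0.2):
    observations = []
    for _ in range(rounds):
        observed = bit
        if rng.random() < p_flip:       # channel noise
            observed = 1 - observed
        observations.append(observed)
    return observations

def decode_bit(observations):
    # Majority vote: the decoding error probability decays
    # exponentially in the number of rounds, which is the simplest
    # instance of a coding error exponent.
    return int(sum(observations) > len(observations) / 2)

print(decode_bit(send_bit(1)), decode_bit(send_bit(0)))   # expect: 1 0
```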
An Instance-Dependent Analysis for the Cooperative Multi-Player Multi-Armed Bandit
TLDR: This work shows that a simple modification to a successive elimination strategy can be used to allow the players to estimate their suboptimality gaps, up to constant factors, in the absence of collisions, and designs a communication protocol that successfully uses the small reward of collisions to coordinate among players, while preserving meaningful instance-dependent logarithmic regret guarantees.
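The single-player primitive being modified here, successive elimination, is simple enough to sketch: sample all active arms in round-robin and drop any arm whose upper confidence bound falls below the best lower confidence bound; the surviving confidence intervals double as gap estimates up to constant factors. The sketch below is the textbook routine with a standard Hoeffding-style radius, not the paper's multi-player modification; all parameter values are assumptions.

```python
import math, random

# Textbook successive elimination for a single player (a sketch of
# the base routine, not the paper's multi-player variant).

def successive_elimination(means, horizon, delta=0.01, seed=1):
    rng = random.Random(seed)
    K = len(means)
    active = set(range(K))
    counts, sums = [0] * K, [0.0] * K
    for _ in range(horizon):
        for a in list(active):                     # round-robin pulls
            sums[a] += float(rng.random() < means[a])
            counts[a] += 1
        # Hoeffding-style radius: the true mean lies within +/- rad
        # of the empirical mean with high probability.
        rad = lambda a: math.sqrt(
            math.log(4 * K * counts[a] ** 2 / delta) / (2 * counts[a]))
        best_lcb = max(sums[a] / counts[a] - rad(a) for a in active)
        # Keep only arms whose UCB still reaches the best LCB.
        active = {a for a in active
                  if sums[a] / counts[a] + rad(a) >= best_lcb}
    return active, [sums[a] / max(counts[a], 1) for a in range(K)]

print(successive_elimination([0.9, 0.5, 0.4], horizon=2000))
# Typically only arm 0 survives; the widths at elimination time give
# the suboptimality gaps up to constant factors.
```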
Heterogeneous Multi-player Multi-armed Bandits: Closing the Gap and Generalization
TLDR: BEACON bridges the algorithm design and regret analysis of combinatorial MAB (CMAB) and MP-MAB, two largely disjoint areas in MAB, and the results suggest that this previously ignored connection is worth further investigation.
Decentralized Cooperative Reinforcement Learning with Hierarchical Information Structure
TLDR: This work considers two-agent multi-armed bandits and Markov decision processes with a hierarchical information structure arising in applications, and proposes simpler and more efficient algorithms that require no coordination or communication.
Decentralized Learning in Online Queuing Systems
TLDR: Cooperative queues are considered, and the first decentralized learning algorithm guaranteeing stability of the system as long as the ratio of rates is larger than 1 is proposed, reaching performance comparable to centralized strategies.
Collaborative Pure Exploration in Kernel Bandit
In this paper, we formulate a Collaborative Pure Exploration in Kernel Bandit problem (CoPE-KB), which provides a novel model for multi-agent multi-task decision making under limited communication
Bandit Learning in Decentralized Matching Markets
TLDR: This model extends the standard stochastic multi-armed bandit framework to a decentralized multi-player setting with competition, and introduces a new algorithm for this setting that attains stable regret when preferences of the arms over players are shared.

References

SHOWING 1-10 OF 12 REFERENCES
Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without
TLDR: The first $\sqrt{T}$-type regret guarantee for this problem is proved, under the feedback model where collisions are announced to the colliding players, together with the first sublinear guarantee, $T^{1-\frac{1}{2m}}$ where $m$ is the number of players, for the model without collision information.
Distributed Learning in Multi-Armed Bandit With Multiple Players
  • K. Liu, Q. Zhao
  • Computer Science, Mathematics
    IEEE Transactions on Signal Processing
  • 2010
TLDR: It is shown that the minimum system regret of the decentralized MAB grows with time at the same logarithmic order as in the centralized counterpart, where players act collectively as a single entity by exchanging observations and making decisions jointly.
Multiplayer bandits without observing collision information
TLDR: An algorithm is given for quickly reaching approximate Nash equilibria in stochastic anticoordination games, together with the first square-root regret bounds that do not depend on the gaps between the means.
Multi-Player Bandits - a Musical Chairs Approach
TLDR: This work provides a communication-free algorithm (Musical Chairs) that attains constant regret with high probability, as well as a sublinear-regret, communication-free algorithm (Dynamic Musical Chairs) for the more difficult setting of players dynamically entering and leaving throughout the game.
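The musical-chairs step itself admits a short sketch: assuming an exploration phase has already identified the top-M arms, each unseated player picks one of them uniformly at random every round and permanently "sits" on the first arm where it observes no collision; once everyone is seated, collisions stop, which is what makes constant regret possible. The function and parameter names below are illustrative, and the top-M set is taken as given.

```python
import random

# Sketch of the Musical Chairs fixation step, assuming each player
# already knows the set of top-M arms from a prior exploration phase.

def musical_chairs(M, top_arms, rng, max_rounds=10_000):
    seat = [None] * M                 # seat[i]: arm player i fixed on
    for t in range(max_rounds):
        choices = [seat[i] if seat[i] is not None
                   else rng.choice(top_arms) for i in range(M)]
        for i in range(M):
            collided = sum(c == choices[i] for c in choices) > 1
            if seat[i] is None and not collided:
                seat[i] = choices[i]  # sit here and never move again
        if all(s is not None for s in seat):
            return seat, t + 1        # everyone seated after t+1 rounds
    return seat, max_rounds

seats, rounds = musical_chairs(M=3, top_arms=[0, 1, 2],
                               rng=random.Random(42))
print(seats, rounds)   # a collision-free assignment, reached quickly
```

Note that a player can never sit on an occupied arm: the seated occupant keeps choosing it, so the newcomer always registers a collision there, which is what guarantees the final assignment is collision-free.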
Concurrent Bandits and Cognitive Radio Networks
TLDR: An algorithm is proposed that combines an ε-greedy learning rule with a collision avoidance mechanism; it is shown that sub-linear regret can be obtained in this setting, with dramatic improvement compared to other algorithms for this setting.
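A minimal version of this combination can be sketched as follows: each player runs ε-greedy on its own empirical means, and as a collision-avoidance rule backs off to a uniformly random arm in the round after any collision. This is a generic illustration of the mechanism, not the paper's protocol; the arm means and all parameter values are assumptions.

```python
import random

# Illustrative sketch (not the paper's exact protocol): epsilon-greedy
# per player, plus a one-round random back-off after each collision.

def simulate(K=5, M=2, T=20_000, eps=0.05, seed=7):
    rng = random.Random(seed)
    means = [0.9, 0.8, 0.5, 0.3, 0.1]       # hypothetical arm means
    counts = [[0] * K for _ in range(M)]
    sums = [[0.0] * K for _ in range(M)]
    backoff = [True] * M                    # start with a random pull
    total = 0.0
    for _ in range(T):
        choices = []
        for i in range(M):
            if backoff[i] or rng.random() < eps:
                choices.append(rng.randrange(K))   # explore / back off
            else:                                  # exploit own estimates
                est = [sums[i][a] / max(counts[i][a], 1) for a in range(K)]
                choices.append(max(range(K), key=est.__getitem__))
        for i in range(M):
            a = choices[i]
            collided = sum(c == a for c in choices) > 1
            backoff[i] = collided
            r = 0.0 if collided else float(rng.random() < means[a])
            counts[i][a] += 1
            sums[i][a] += r
            total += r
    return total / T

# Collisions yield zero reward, which drags down a contested arm's
# estimate, so players tend to spread over different arms.
print(simulate())   # average per-round collective reward
```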
SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits
TLDR: A decentralized algorithm is presented that achieves the same performance as a centralized one, contradicting the existing lower bounds for the stochastic multiplayer multi-armed bandit problem, and a new algorithm shows that the logarithmic growth of the regret is still achievable for this model.
Multi-Player Bandits: The Adversarial Case
TLDR: This work designs the first multi-player bandit algorithm that provably works in arbitrarily changing environments, where the losses of the arms may even be chosen by an adversary.
Medium access in cognitive radio networks: A competitive multi-armed bandit framework
  • L. Lai, H. Jiang, H. Poor
  • Computer Science
    2008 42nd Asilomar Conference on Signals, Systems and Computers
  • 2008
TLDR: Low-complexity medium access protocols are developed which strike an optimal balance between exploration and exploitation in such competitive environments, and the operating points of these low-complexity protocols are shown to converge to those of the scenario in which the parameters are known.
Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret
TLDR: This work proposes policies for distributed learning and access which achieve order-optimal cognitive system throughput under self play, i.e., when implemented at all the secondary users, and proposes a policy whose sum regret grows only slightly faster than logarithmically in the number of transmission slots.
Multi-Armed Bandit Learning in IoT Networks: Learning Helps Even in Non-stationary Settings
TLDR: It is proved that intelligent devices in unlicensed bands can use multi-armed bandit (MAB) learning algorithms to improve resource exploitation; stochastic MAB learning provides up to a 16% gain in terms of successful transmission probability and has near-optimal performance even in non-stationary and non-i.i.d. settings.