Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms
@article{Kwon2022CoordinatedAA, title={Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms}, author={Jeongyeol Kwon and Yonathan Efroni and Constantine Caramanis and Shie Mannor}, journal={ArXiv}, year={2022}, volume={abs/2201.12700} }
Motivated by online recommendation systems, we propose the problem of finding the optimal policy in multitask contextual bandits when a small fraction α < 1/2 of tasks (users) are arbitrary and adversarial. The remaining fraction of good users share the same instance of contextual bandits with S contexts and A actions (items). Naturally, whether a user is good or adversarial is not known in advance. The goal is to robustly learn the policy that maximizes rewards for good users with as few…
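To make the setup concrete, below is a minimal simulation sketch of the problem described in the abstract. It is illustrative only: the reward model, noise level, adversary behavior, and all names (`mean_reward`, `pull`, etc.) are assumptions for exposition, not the paper's construction.

```python
# Hypothetical sketch of the multitask contextual bandit setup with an
# adversarial fraction of users; all modeling choices here are assumptions.
import numpy as np

rng = np.random.default_rng(0)

M, S, A = 100, 5, 4   # users (tasks), contexts, actions (items)
alpha = 0.2           # fraction of adversarial users (alpha < 1/2)

# Shared instance for good users: mean reward of each (context, action) pair.
mean_reward = rng.uniform(size=(S, A))
optimal_policy = mean_reward.argmax(axis=1)   # best action per context

is_adversarial = rng.random(M) < alpha        # unknown to the learner

def pull(user, context, action):
    """Return one noisy reward for an interaction with `user`."""
    if is_adversarial[user]:
        # Adversarial users may return arbitrary feedback; as one possible
        # adversary, report the opposite of the shared instance.
        return 1.0 - mean_reward[context, action] + 0.1 * rng.standard_normal()
    return mean_reward[context, action] + 0.1 * rng.standard_normal()

# A robust learner must recover `optimal_policy` from such interactions
# without knowing which users are adversarial.
```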
2 Citations
Tractable Optimality in Episodic Latent MABs
- Computer Science, ArXiv
- 2022
This work shows that learning with polynomial samples in A is possible, and designs a procedure that provably learns a near-optimal policy with O(poly(A) + poly(M, H)^{min(M, H)}) interactions; the moment-matching step can be formulated via maximum likelihood estimation.
Reward-Mixing MDPs with a Few Latent Contexts are Learnable
- Computer Science, ArXiv
- 2022
This work provides a lower bound of (SA)^{Ω(√M)}/ε² for a general instance of RMMDP, supporting that super-polynomial sample complexity in M is necessary.
References
SHOWING 1-10 OF 47 REFERENCES
Stochastic bandits robust to adversarial corruptions
- Computer Science, STOC
- 2018
We introduce a new model of stochastic bandits with adversarial corruptions which aims to capture settings where most of the input follows a stochastic pattern but some fraction of it can be…
The Best of Both Worlds: Stochastic and Adversarial Bandits
- Computer Science, COLT
- 2012
SAO (Stochastic and Adversarial Optimal) combines the O(√n) worst-case regret of Exp3 for adversarial rewards with the (poly)logarithmic regret of UCB1 for stochastic rewards.
Corruption Robust Exploration in Episodic Reinforcement Learning
- Computer Science, COLT
- 2021
This work provides the first sublinear regret guarantee which accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning, and derives results for both tabular and linear-function-approximation settings.
Contextual Multi-Armed Bandits
- Computer Science, AISTATS
- 2010
A lower bound is proved for the regret of any algorithm in terms of the packing dimensions of the query space and the ad space, giving almost matching upper and lower bounds for finite spaces or convex bounded subsets of Euclidean spaces.
Cooperative Stochastic Multi-agent Multi-armed Bandits Robust to Adversarial Corruptions
- Computer Science, ArXiv
- 2021
This work proposes a new algorithm that not only achieves near-optimal regret in the stochastic setting, but also obtains a regret bound with an additive corruption term in the corrupted setting, while maintaining efficient communication.
Collaborative Learning and Personalization in Multi-Agent Stochastic Linear Bandits
- Computer Science, ArXiv
- 2021
It is shown that an agent i whose parameter deviates from the population average by ε_i attains a regret scaling of Õ(ε_i √T), demonstrating that if the user representations are close, the resulting regret is low, and vice versa.
Impact of Representation Learning in Linear Bandits
- Computer Science, ICLR
- 2021
A new algorithm is presented which achieves Õ(T√(kN) + √(dkNT)) regret, where N is the number of rounds played for each bandit, and an Ω(T√(kN) + √(dkNT)) regret lower bound is provided, showing that the algorithm is minimax-optimal up to poly-logarithmic factors.
RL for Latent MDPs: Regret Guarantees and a Lower Bound
- Computer Science, NeurIPS
- 2021
This work considers the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs), shows that the key link is a notion of separation between the MDP system dynamics, and provides an efficient algorithm with local guarantees.
Low-rank Bandits with Latent Mixtures
- Computer Science, ArXiv
- 2016
The first rigorous regret analysis of the Robust Tensor Power Method and the OFUL linear bandit algorithm is provided, showing that the regret after T user interactions is Õ(C√(BT)), with B the number of users.
Multi-Task Learning for Contextual Bandits
- Computer Science, NIPS
- 2017
An upper confidence bound-based multi-task learning algorithm for contextual bandits is proposed, a corresponding regret bound is established, and this bound is interpreted to quantify the advantages of learning in the presence of high task (arm) similarity.