Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms
@article{Kwon2022CoordinatedAA,
  title   = {Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms},
  author  = {Jeongyeol Kwon and Yonathan Efroni and Constantine Caramanis and Shie Mannor},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2201.12700}
}
Motivated by online recommendation systems, we propose the problem of finding the optimal policy in multitask contextual bandits when a small fraction α < 1/2 of tasks (users) are arbitrary and adversarial. The remaining fraction of good users share the same instance of contextual bandits with S contexts and A actions (items). Naturally, whether a user is good or adversarial is not known in advance. The goal is to robustly learn the policy that maximizes rewards for good users with as few…
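To make the setting concrete, here is a minimal Python sketch of the multitask environment the abstract describes; the constants, the Bernoulli reward model, and the inverted-mean adversarial behavior are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A = 5, 4          # contexts and actions of the shared bandit instance
ALPHA = 0.2          # fraction of adversarial users, assumed < 1/2
N_USERS = 50

# All good users share one table of mean rewards over (context, action).
shared_means = rng.uniform(0.0, 1.0, size=(S, A))
# Whether a user is adversarial is hidden from the learner.
is_adversarial = rng.random(N_USERS) < ALPHA

def interact(user, context, action):
    """One round with one task (user): good users draw from the shared
    instance; adversarial users may return arbitrary, coordinated feedback
    (here, inverted means, purely as an illustration)."""
    if is_adversarial[user]:
        return 1.0 - shared_means[context, action]
    return float(rng.random() < shared_means[context, action])  # Bernoulli
```

A learner interacting only through interact(...) must estimate shared_means from all users while the α-fraction of adversarial tasks tries to bias those estimates.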
2 Citations
Tractable Optimality in Episodic Latent MABs
- Computer Science, ArXiv
- 2022
This work shows that learning with polynomial samples in A is possible, and designs a procedure that provably learns a near-optimal policy with O(poly(A) + poly(M, H)^min(M, H)) interactions, formulating the moment-matching step as maximum likelihood estimation.
Reward-Mixing MDPs with a Few Latent Contexts are Learnable
- Computer Science, ArXiv
- 2022
This work provides a lower bound of (SA)^Ω(√M) / ε² for a general instance of RMMDP, supporting that super-polynomial sample complexity in M is necessary.
References
Showing 1-10 of 47 references
Stochastic bandits robust to adversarial corruptions
- Computer Science, STOC
- 2018
We introduce a new model of stochastic bandits with adversarial corruptions which aims to capture settings where most of the input follows a stochastic pattern but some fraction of it can be…
The Best of Both Worlds: Stochastic and Adversarial Bandits
- Computer Science, COLT
- 2012
SAO (Stochastic and Adversarial Optimal) combines the O(√n) worst-case regret of Exp3 for adversarial rewards with the (poly)logarithmic regret of UCB1 for stochastic rewards.
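The switching idea can be sketched in a few lines of Python; the consistency test below is a caller-supplied stand-in, since SAO's actual statistical test and its Exp3 fallback are more involved than this skeleton.

```python
import math
import random

def ucb1_index(mean, n_arm, t):
    """UCB1 optimism index for an arm pulled n_arm times by round t."""
    return mean + math.sqrt(2.0 * math.log(t) / n_arm)

def best_of_both_worlds(arms, horizon, looks_stochastic):
    """Skeleton of the SAO-style switch: play a stochastic-optimal rule
    while a consistency test passes, else fall back to an adversarial-safe
    one. `arms[i]()` returns a reward in [0, 1]; `looks_stochastic` is a
    test on the history (a placeholder for SAO's test)."""
    k = len(arms)
    counts, sums, history = [0] * k, [0.0] * k, []
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1                              # pull each arm once
        elif looks_stochastic(history):
            i = max(range(k), key=lambda j:        # stochastic phase: UCB1
                    ucb1_index(sums[j] / counts[j], counts[j], t))
        else:
            # Adversarial phase: an Exp3-style sampler belongs here;
            # uniform play keeps the sketch self-contained.
            i = random.randrange(k)
        r = arms[i]()
        counts[i] += 1
        sums[i] += r
        history.append((i, r))
    return history
```

Passing looks_stochastic=lambda h: True recovers plain UCB1, while a test that trips on distribution shift triggers the adversarial fallback.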
Contextual Multi-Armed Bandits
- Computer Science, AISTATS
- 2010
A lower bound is proved for the regret of any algorithm in terms of the packing dimensions of the query space and the ad space, giving almost matching upper and lower bounds for finite spaces or convex bounded subsets of Euclidean spaces.
Cooperative Stochastic Multi-agent Multi-armed Bandits Robust to Adversarial Corruptions
- Computer Science, ArXiv
- 2021
This work proposes a new algorithm that not only achieves near-optimal regret in the stochastic setting, but also obtains regret with an additive corruption term in the corrupted setting, while maintaining efficient communication.
Collaborative Learning and Personalization in Multi-Agent Stochastic Linear Bandits
- Computer Science, ArXiv
- 2021
It is shown that an agent i whose parameter deviates from the population average by ε_i attains a regret scaling of Õ(ε_i √T), which demonstrates that if the user representations are close the resulting regret is low, and vice versa.
Impact of Representation Learning in Linear Bandits
- Computer Science, ICLR
- 2021
A new algorithm is presented which achieves regret Õ(T√(kN) + √(dkNT)), where N is the number of rounds played for each bandit, and an Ω(T√(kN) + √(dkNT)) regret lower bound is provided, showing that the algorithm is minimax-optimal up to poly-logarithmic factors.
RL for Latent MDPs: Regret Guarantees and a Lower Bound
- Computer Science, NeurIPS
- 2021
This work considers the regret minimization problem for reinforcement learning in latent Markov decision processes (LMDPs), showing that the key link is a notion of separation between the MDP system dynamics and providing an efficient algorithm with a local guarantee.
Low-rank Bandits with Latent Mixtures
- Computer Science, ArXiv
- 2016
The first rigorous regret analysis of the Robust Tensor Power Method and the OFUL linear bandit algorithm is provided, showing that its regret after T user interactions is Õ(C√(BT)), with B the number of users.
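For reference, the OFUL rule this entry builds on is a ridge-regression estimate plus an ellipsoidal optimism bonus. The sketch below fixes the confidence radius beta as a tunable constant; theory prescribes a time-growing radius from a self-normalized concentration bound.

```python
import numpy as np

class OFULSketch:
    """OFUL-style linear bandit (sketch): choose the arm maximizing the
    optimistic value x @ theta_hat + beta * ||x||_{V^{-1}}."""
    def __init__(self, dim, lam=1.0, beta=1.0):
        self.V = lam * np.eye(dim)   # regularized Gram matrix
        self.b = np.zeros(dim)       # sum of reward-weighted features
        self.beta = beta             # stand-in for the confidence radius

    def choose(self, arm_features):
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b   # ridge-regression estimate
        scores = [x @ theta_hat + self.beta * np.sqrt(x @ V_inv @ x)
                  for x in arm_features]
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.V += np.outer(x, x)
        self.b += reward * np.asarray(x, dtype=float)
```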
Multi-Task Learning for Contextual Bandits
- Computer Science, NIPS
- 2017
An upper confidence bound-based multi-task learning algorithm for contextual bandits is proposed, a corresponding regret bound is established, and this bound is interpreted to quantify the advantages of learning in the presence of high task (arm) similarity.
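One generic way task similarity can tighten a UCB index is to shrink each arm's estimate toward a pooled cross-task mean; the snippet below is only an illustration of that intuition, not the paper's kernel-based estimator, and the weight w is an assumed knob.

```python
import math

def pooled_ucb_index(own_mean, own_n, pooled_mean, pooled_n, t, w=0.5):
    """UCB index with the point estimate shrunk toward a pooled mean over
    similar tasks; w trades off own-task data against pooled data, and the
    pooled samples enlarge the effective count, shrinking the bonus."""
    est = w * own_mean + (1.0 - w) * pooled_mean
    n_eff = w * own_n + (1.0 - w) * pooled_n
    return est + math.sqrt(2.0 * math.log(t) / max(n_eff, 1.0))
```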
Minimax Policies for Adversarial and Stochastic Bandits
- Computer Science, Mathematics, COLT
- 2009
This work fills in a long open gap in the characterization of the minimax rate for the multi-armed bandit problem and proposes a new family of randomized algorithms based on an implicit normalization, as well as a new analysis.
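The "implicit normalization" can be illustrated as follows: arm probabilities are a fixed function of a shifted gain vector, with the shift C solved numerically so that they sum to one. The polynomial potential, eta, q, and the bisection tolerance below are assumptions in the spirit of the INF family, not the paper's exact potential or tuning.

```python
import numpy as np

def inf_probs(gain_estimates, eta=0.5, q=2.0, tol=1e-10):
    """Implicit-normalization step (INF-style): p_i = (eta / (C - G_i))**q,
    with the constant C > max(G) found by bisection so the p_i sum to 1."""
    G = np.asarray(gain_estimates, dtype=float)
    lo, step = G.max() + tol, 1.0        # at lo the sum is huge (> 1)
    while ((eta / (lo + step - G)) ** q).sum() > 1.0:
        step *= 2.0                      # expand until the sum drops below 1
    hi = lo + step
    for _ in range(100):                 # bisection: the sum decreases in C
        mid = 0.5 * (lo + hi)
        if ((eta / (mid - G)) ** q).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    p = (eta / (hi - G)) ** q
    return p / p.sum()                   # tiny renormalization for safety
```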