Corpus ID: 246430299

Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms

@article{Kwon2022CoordinatedAA,
  title={Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms},
  author={Jeongyeol Kwon and Yonathan Efroni and Constantine Caramanis and Shie Mannor},
  journal={ArXiv},
  year={2022},
  volume={abs/2201.12700}
}
Motivated by online recommendation systems, we propose the problem of finding the optimal policy in multitask contextual bandits when a small fraction α < 1/2 of tasks (users) are arbitrary and adversarial. The remaining fraction of good users share the same instance of contextual bandits with S contexts and A actions (items). Naturally, whether a user is good or adversarial is not known in advance. The goal is to robustly learn the policy that maximizes rewards for good users with as few…
2 Citations

Tractable Optimality in Episodic Latent MABs

This work shows that learning with polynomial samples in A is possible, designs a procedure that provably learns a near-optimal policy with O(poly(A) + poly(M, H)^{min(M, H)}) interactions, and shows that the moment-matching can be formulated via maximum likelihood estimation.

Reward-Mixing MDPs with a Few Latent Contexts are Learnable

This work provides a lower bound of (SA)^{Ω(√M)} / ε^2 for a general instance of RMMDP, supporting that super-polynomial sample complexity in M is necessary.

References

Stochastic bandits robust to adversarial corruptions

We introduce a new model of stochastic bandits with adversarial corruptions which aims to capture settings where most of the input follows a stochastic pattern but some fraction of it can be

The Best of Both Worlds: Stochastic and Adversarial Bandits

SAO (Stochastic and Adversarial Optimal) combines the O(√n) worst-case regret of Exp3 for adversarial rewards and the (poly)logarithmic regret of UCB1 for stochastic rewards.

Corruption Robust Exploration in Episodic Reinforcement Learning

This work provides the first sublinear regret guarantee which accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning, and derives results for both tabular and linear-function-approximation settings.

Contextual Multi-Armed Bandits

A lower bound is proved for the regret of any algorithm in terms of the packing dimensions of the query space and the ad space, and this gives almost matching upper and lower bounds for finite spaces or convex bounded subsets of Euclidean spaces.

Cooperative Stochastic Multi-agent Multi-armed Bandits Robust to Adversarial Corruptions

This work proposes a new algorithm that not only achieves near-optimal regret in the stochastic setting, but also obtains a regret with an additive term of corruption in the corrupted setting, while maintaining efficient communication.

Collaborative Learning and Personalization in Multi-Agent Stochastic Linear Bandits

It is shown that an agent i whose parameter deviates from the population average by ε_i attains a regret scaling of Õ(ε_i √T), which demonstrates that if the user representations are close, the resulting regret is low, and vice versa.

Impact of Representation Learning in Linear Bandits

A new algorithm is presented which achieves Õ(T√(kN) + √(dkNT)) regret, where N is the number of rounds the authors play for each bandit, and an Ω(T√(kN) + √(dkNT)) regret lower bound is provided, showing that the algorithm is minimax-optimal up to poly-logarithmic factors.

RL for Latent MDPs: Regret Guarantees and a Lower Bound

This work considers the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDP) and shows that the key link is a notion of separation between the MDP system dynamics, providing an efficient algorithm with a local guarantee.

Low-rank Bandits with Latent Mixtures

The first rigorous regret analysis of the Robust Tensor Power Method and OFUL linear bandit algorithm is provided, showing that its regret after T user interactions is Õ(C√(BT)), with B the number of users.

Multi-Task Learning for Contextual Bandits

An upper confidence bound-based multi-task learning algorithm for contextual bandits is proposed, a corresponding regret bound is established, and this bound is interpreted to quantify the advantages of learning in the presence of high task (arm) similarity.