• Corpus ID: 231925038

# Improved Corruption Robust Algorithms for Episodic Reinforcement Learning

@inproceedings{Chen2021ImprovedCR,
title={Improved Corruption Robust Algorithms for Episodic Reinforcement Learning},
author={Yifang Chen and Simon Shaolei Du and Kevin G. Jamieson},
booktitle={International Conference on Machine Learning},
year={2021}
}
• Published in International Conference on Machine Learning, 13 February 2021
• Computer Science
We study episodic reinforcement learning under unknown adversarial corruptions in both the rewards and the transition probabilities of the underlying system. We propose new algorithms which, compared to the existing results in (Lykouris et al., 2020), achieve strictly better regret bounds in terms of total corruptions for the tabular setting. To be specific, firstly, our regret bounds depend on more precise numerical values of total rewards corruptions and transition corruptions, instead of…
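As a rough illustration of the setting described here (not the paper's algorithm), the sketch below is a toy episodic MDP wrapper that separately tracks the total reward corruption and total transition corruption injected by an adversary, mirroring the paper's separate accounting of the two corruption types. All names and dynamics are our own illustrative choices, not the paper's notation.

```python
import random

class CorruptedEpisodicMDP:
    """Toy episodic MDP whose rewards and transitions an adversary may corrupt.

    C_r and C_p accumulate the total reward corruption and total transition
    corruption actually injected, so the two budgets can be tracked separately.
    Illustrative model only; names and dynamics are ours.
    """

    def __init__(self, n_states=3, horizon=5, seed=0):
        self.n, self.H = n_states, horizon
        self.rng = random.Random(seed)
        self.C_r = 0.0   # cumulative reward corruption injected so far
        self.C_p = 0.0   # cumulative transition corruption injected so far

    def step(self, s, a, corrupt_reward=0.0, corrupt_transition=False):
        r = 1.0 if (s + a) % self.n == 0 else 0.0   # nominal reward
        s_next = (s + a + 1) % self.n               # nominal deterministic transition
        if corrupt_reward:
            self.C_r += abs(corrupt_reward)          # charge the reward budget
            r = max(0.0, min(1.0, r + corrupt_reward))
        if corrupt_transition:
            self.C_p += 1.0                          # adversary replaces the next state
            s_next = self.rng.randrange(self.n)
        return s_next, r
```

A learner interacting with such an environment never observes which steps were corrupted, which is what makes the regret bounds in terms of the total corruption quantities nontrivial.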
• Computer Science · AISTATS 2022
It is shown that a worst-case $\Omega(Hd\epsilon)$ optimality gap is unavoidable in linear MDPs of dimension $d$, even if the adversary only corrupts the reward element in a tuple, which implies that corruption-robust offline RL is a strictly harder problem.
• Computer Science · ALT 2022
A model-selection approach is proposed to tackle reinforcement learning with adversarial corruption in both transitions and rewards; it can be easily applied to other settings including linear bandits, linear contextual bandits, and MDPs with general function approximation, leading to several improved or new results.
This paper investigates policy resilience to training-environment poisoning attacks on reinforcement learning (RL) policies, with the goal of recovering the deployment performance of a poisoned RL policy, and proposes a policy-resilience mechanism based on an idea of knowledge sharing.
• Computer Science · ArXiv 2022
Surprisingly, this paper shows that the stochastic contextual problem can be solved as if it were a linear bandit problem, and establishes a novel reduction framework that converts every stochastic contextual linear bandit instance to a linear bandit instance when the context distribution is known.
• Computer Science · ArXiv 2022
This work considers online robust MDPs, learned by interacting with an unknown nominal system, and proposes a robust optimistic policy optimization algorithm that is provably efficient under a more realistic online setting.
• Computer Science · IJCAI 2022
This paper studies poisoning attacks to manipulate any order-optimal learning algorithm towards a targeted policy in episodic RL and examines the potential damage of two natural types of poisoning attacks, i.e., the manipulation of reward or action.
• Computer Science · ArXiv 2022
A new User-Oriented Robustness (UOR) metric for RL is proposed; it allocates different weights to the environments according to user preference and generalizes the max-min robustness metric.
An algorithm for combinatorial semi-bandits with a hybrid regret bound that includes a best-of-three-worlds guarantee and multiple data-dependent regret bounds is proposed, which implies that the algorithm will perform better as long as the environment is "easy" in terms of certain metrics.
• Shinji Ito · Computer Science, Mathematics · NeurIPS 2021
It is shown that optimal robustness can be expressed by a square-root dependency on the amount of corruption, and that two classes of algorithms, anytime Hedge with a decreasing learning rate and algorithms with second-order regret bounds, achieve $O\big(\frac{\log N}{\Delta} + \sqrt{\frac{C \log N}{\Delta}}\big)$-regret.
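A minimal sketch of the first algorithm class mentioned in that result: anytime Hedge over $N$ experts with the decreasing learning rate $\eta_t = \sqrt{\log N / t}$. The function name and the toy loss sequence are our own; this illustrates the update rule, not the cited analysis.

```python
import math

def hedge(losses, n_experts):
    """Anytime Hedge with decreasing learning rate eta_t = sqrt(log N / t).

    `losses` is a list of per-round loss vectors (one entry per expert).
    Returns the learner's expected regret against the best fixed expert.
    """
    cum = [0.0] * n_experts          # cumulative loss per expert
    total = 0.0                      # learner's expected cumulative loss
    for t, loss in enumerate(losses, start=1):
        eta = math.sqrt(math.log(n_experts) / t)
        m = min(cum)                 # shift for numerical stability
        w = [math.exp(-eta * (c - m)) for c in cum]
        z = sum(w)
        p = [wi / z for wi in w]     # exponential-weights distribution
        total += sum(pi * li for pi, li in zip(p, loss))
        cum = [c + li for c, li in zip(cum, loss)]
    return total - min(cum)
```

On an "easy" instance with a clear gap between the best and second-best expert, the regret of this scheme stays small, consistent with the gap-dependent flavor of the bound above.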

## References

Showing 1–10 of 29 references.

• Computer Science · AISTATS 2021
In a contextual setting, a setup of diverse contexts is revisited, and it is shown that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
• Computer Science · COLT 2019
A new algorithm is presented whose regret is nearly optimal, substantially improving upon previous work and can tolerate a significant amount of corruption with virtually no degradation in performance.
• Computer Science · ArXiv 2020
It is proved that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon; when the values are appropriately normalized, this result shows that long-horizon RL is no more difficult than short-horizon RL, at least in a minimax sense.
• Computer Science · COLT 2021
This work provides the first sublinear regret guarantee which accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning, and derives results for both tabular and linear-function-approximation settings.
• Computer Science · STOC 2018
We introduce a new model of stochastic bandits with adversarial corruptions which aims to capture settings where most of the input follows a stochastic pattern but some fraction of it can be…
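A toy rendering of that corruption model (our own illustrative code, not the paper's): rewards are drawn stochastically, but an adversary may alter them on some rounds, subject to a total corruption budget $C$ measured as the sum of absolute changes.

```python
import random

def corrupted_bandit_run(T, C, p=(0.9, 0.5), seed=0):
    """Stochastic two-armed bandit with adversarial corruption.

    Arm a yields Bernoulli(p[a]) rewards; the adversary zeroes out the
    best arm's reward while its budget C lasts. Returns the play history
    and the corruption actually spent (always at most C).
    """
    rng = random.Random(seed)
    budget = C
    history = []
    for t in range(T):
        a = t % 2                                  # round-robin, just to exercise the model
        r = 1.0 if rng.random() < p[a] else 0.0    # stochastic reward
        if a == 0 and budget >= 1.0:               # adversary attacks the best arm
            budget -= abs(r - 0.0)                 # pay |corrupted - stochastic|
            r = 0.0
        history.append((a, r))
    return history, C - budget
```

The learner only ever sees the (possibly corrupted) rewards, so an algorithm must degrade gracefully as a function of the unknown total corruption $C$.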
• Computer Science · ICML 2017
We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of…
• Computer Science · NeurIPS 2018
Q-learning with UCB exploration achieves sublinear regret in an episodic MDP setting, and this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."
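In the spirit of that result, here is a one-step ($H = 1$) toy sketch of Q-learning with optimistic initialization, the learning rate $\alpha_t = (H+1)/(H+t)$, and a Hoeffding-style UCB bonus of order $\sqrt{H^3 \log T / t}$. The deterministic toy rewards and the constant $c$ are our own simplifications, not the cited paper's construction.

```python
import math

def q_learning_ucb(rewards, T=500, c=1.0):
    """Q-learning with a UCB exploration bonus on a one-step toy problem.

    `rewards[a]` is the deterministic reward of action a (a stand-in for
    an episodic MDP with horizon H = 1). Returns the learned Q-values.
    """
    H = 1
    Q = [float(H)] * len(rewards)   # optimistic initialization at H
    n = [0] * len(rewards)          # visit count per action
    for _ in range(T):
        a = max(range(len(Q)), key=lambda i: Q[i])   # greedy w.r.t. optimistic Q
        n[a] += 1
        t = n[a]
        alpha = (H + 1) / (H + t)                    # the (H+1)/(H+t) step size
        bonus = c * math.sqrt(H ** 3 * math.log(T) / t)
        Q[a] = (1 - alpha) * Q[a] + alpha * (rewards[a] + bonus)
    return Q
```

The decaying bonus keeps the Q-values optimistic while they converge, so the greedy action with respect to the optimistic estimates eventually matches the truly best action.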
• Computer Science · J. Mach. Learn. Res. 2008
This work presents a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: An MDP has diameter D if for any pair of states s,s' there is a policy which moves from s to s' in at most D steps.
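For intuition about that diameter parameter: in the special case of deterministic dynamics, the diameter reduces to the worst-case shortest path between any pair of states, which a BFS from every state computes. This is our own illustrative code for that special case, not the cited algorithm.

```python
from collections import deque

def deterministic_diameter(transitions):
    """Diameter of a deterministic MDP via BFS from every state.

    `transitions[s]` lists the next state for each action in state s.
    Returns the maximum over pairs (s, s') of the shortest action-sequence
    length from s to s', or infinity if some state is unreachable.
    """
    n = len(transitions)
    diam = 0
    for s in range(n):
        dist = {s: 0}
        q = deque([s])
        while q:                       # standard BFS over action edges
            u = q.popleft()
            for v in transitions[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        if len(dist) < n:
            return float("inf")        # some state cannot be reached from s
        diam = max(diam, max(dist.values()))
    return diam
```

For stochastic dynamics the definition uses the expected travel time under the best policy, but the deterministic case already conveys why a small diameter makes exploration easier.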
• Computer Science, Mathematics · NeurIPS 2019
This paper establishes that optimistic algorithms attain gap-dependent and non-asymptotic logarithmic regret for episodic MDPs. In contrast to prior work, our bounds do not suffer a dependence on…
• Computer Science · ICML 2021
This work develops linear bandit algorithms that automatically adapt to different environments and additionally enjoy minimax-optimal regret in completely adversarial environments, which is the first result of its kind to the authors' knowledge.