Improved Corruption Robust Algorithms for Episodic Reinforcement Learning
@inproceedings{Chen2021ImprovedCR,
  title     = {Improved Corruption Robust Algorithms for Episodic Reinforcement Learning},
  author    = {Yifang Chen and Simon Shaolei Du and Kevin G. Jamieson},
  booktitle = {International Conference on Machine Learning},
  year      = {2021}
}
We study episodic reinforcement learning under unknown adversarial corruptions in both the rewards and the transition probabilities of the underlying system. We propose new algorithms which, compared to the existing results of Lykouris et al. (2020), achieve strictly better regret bounds in terms of total corruption for the tabular setting. Specifically, first, our regret bounds depend on more precise numerical values of the total reward corruption and transition corruption, instead of…
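The recipe sketched in the abstract, optimism with confidence bonuses widened to absorb adversarial corruption, can be illustrated concretely. The following Python is a hypothetical sketch of that general idea, not the paper's actual algorithm; in particular `C_hat`, a posited upper bound on total corruption, and all interface names are assumptions.

```python
import numpy as np

def optimistic_value_iteration(P_hat, R_hat, counts, H, C_hat, delta=0.05):
    """Illustrative optimistic planning with a corruption-inflated bonus.

    P_hat:  empirical transition kernel, shape (S, A, S)
    R_hat:  empirical mean rewards, shape (S, A)
    counts: visit counts n(s, a), shape (S, A)
    C_hat:  assumed upper bound on total corruption (hypothetical input)
    """
    S, A, _ = P_hat.shape
    n = np.maximum(counts, 1)
    # Hoeffding-style exploration bonus, plus a C_hat / n term that guards
    # against adversarially corrupted samples in each (s, a) count.
    bonus = H * np.sqrt(np.log(S * A * H / delta) / n) + C_hat / n
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        # One step of optimistic Bellman backup, clipped to the valid range.
        Q[h] = np.clip(R_hat + P_hat @ V[h + 1] + bonus, 0.0, H)
        V[h] = Q[h].max(axis=1)
    return Q  # acting greedily w.r.t. Q[h] gives the optimistic policy
```

Widening the bonus by roughly C_hat / n(s, a) is the standard way such analyses absorb corruption: once a state-action pair has been visited often, a bounded adversary can no longer move its empirical estimates by much.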
12 Citations
Corruption-Robust Offline Reinforcement Learning
- Computer Science, AISTATS
- 2022
It is shown that a worst-case $\Omega(Hd\epsilon)$ optimality gap is unavoidable in a linear MDP of dimension $d$, even if the adversary only corrupts the reward element in a tuple, which implies that corruption-robust offline RL is a strictly harder problem.
A Model Selection Approach for Corruption Robust Reinforcement Learning
- Computer Science, ALT
- 2022
A model selection approach is proposed to tackle reinforcement learning with adversarial corruption in both transitions and rewards; it can be easily applied to other settings, including linear bandits, linear contextual bandits, and MDPs with general function approximation, leading to several improved or new results.
Policy Resilience to Environment Poisoning Attacks on Reinforcement Learning
- Computer Science
- 2022
This paper investigates policy resilience to training-environment poisoning attacks on reinforcement learning (RL) policies, with the goal of recovering the deployment performance of a poisoned RL policy, and proposes a policy-resilience mechanism based on an idea of knowledge sharing.
Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms
- Computer Science, ArXiv
- 2022
Surprisingly, in this paper, it is shown that the stochastic contextual bandit problem can be solved as if it were a linear bandit problem, and a novel reduction framework is established that converts every stochastic contextual linear bandit instance to a linear bandit instance when the context distribution is known.
Online Policy Optimization for Robust MDP
- Computer Science, ArXiv
- 2022
This work considers online robust MDPs, where the learner interacts with an unknown nominal system, and proposes a robust optimistic policy optimization algorithm that is provably efficient under this more realistic online setting.
Understanding the Limits of Poisoning Attacks in Episodic Reinforcement Learning
- Computer Science, IJCAI
- 2022
This paper studies poisoning attacks to manipulate any order-optimal learning algorithm towards a targeted policy in episodic RL and examines the potential damage of two natural types of poisoning attacks, i.e., the manipulation of reward or action.
User-Oriented Robust Reinforcement Learning
- Computer Science, ArXiv
- 2022
A new User-Oriented Robustness (UOR) metric for RL, which allocates different weights to the environments according to user preference and generalizes the max-min robustness metric, is proposed.
Hybrid Regret Bounds for Combinatorial Semi-Bandits and Adversarial Linear Bandits
- Computer Science, NeurIPS
- 2021
An algorithm for combinatorial semi-bandits with a hybrid regret bound that includes a best-of-three-worlds guarantee and multiple data-dependent regret bounds is proposed, which implies that the algorithm will perform better as long as the environment is "easy" in terms of certain metrics.
On Optimal Robustness to Adversarial Corruption in Online Decision Problems
- Computer Science, Mathematics, NeurIPS
- 2021
It is shown that optimal robustness can be expressed by a square-root dependency on the amount of corruption, and that two classes of algorithms, anytime Hedge with decreasing learning rate and algorithms with second-order regret bounds, achieve $O\left(\frac{\log N}{\Delta} + \sqrt{\frac{C \log N}{\Delta}}\right)$ regret.
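As a concrete instance of the first algorithm class named above, here is a minimal Python sketch of anytime Hedge over $N$ experts with the decreasing learning rate $\eta_t = \sqrt{\log N / t}$; the paper's exact step-size tuning may differ, and the streaming interface is hypothetical.

```python
import numpy as np

def anytime_hedge(loss_stream, N, seed=0):
    """Multiplicative weights with a decreasing learning rate.

    loss_stream yields loss vectors in [0, 1]^N, one per round;
    returns the sequence of chosen expert indices.
    """
    rng = np.random.default_rng(seed)
    cum_loss = np.zeros(N)
    choices = []
    for t, loss in enumerate(loss_stream, start=1):
        eta = np.sqrt(np.log(N) / t)   # anytime: eta shrinks with t
        # Recompute weights from cumulative losses; subtracting the min
        # is a numerical-stability trick that leaves probabilities unchanged.
        w = np.exp(-eta * (cum_loss - cum_loss.min()))
        p = w / w.sum()
        choices.append(int(rng.choice(N, p=p)))
        cum_loss += np.asarray(loss, dtype=float)
    return choices
```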
Adversarial Parameter Defense by Multi-Step Risk Minimization
- Computer Science, Neural Networks
- 2021
References
Stochastic Linear Bandits Robust to Adversarial Attacks
- Computer Science, AISTATS
- 2021
In the contextual setting, the case of diverse contexts is revisited, and it is shown that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
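The greedy strategy referenced above fits in a few lines: always play the arm maximizing the current ridge-regression estimate, with no exploration bonus and no knowledge of $C$. The sketch below uses a hypothetical `rounds` interface.

```python
import numpy as np

def greedy_linear_bandit(rounds, d, lam=1.0):
    """Purely greedy play from a ridge estimate (no explicit exploration).

    rounds yields (contexts, reward_fn) pairs: contexts is a (K, d) array
    of arm features, reward_fn returns the observed reward for the chosen
    feature vector.  Both names are hypothetical.
    """
    V = lam * np.eye(d)                 # regularized design matrix
    b = np.zeros(d)
    rewards = []
    for contexts, reward_fn in rounds:
        theta_hat = np.linalg.solve(V, b)                   # ridge estimate
        x = contexts[int(np.argmax(contexts @ theta_hat))]  # greedy arm
        r = reward_fn(x)
        V += np.outer(x, x)
        b += r * x
        rewards.append(r)
    return rewards
```

Context diversity is what substitutes for exploration here: sufficiently varied contexts keep the design matrix well conditioned without any bonus term.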
Better Algorithms for Stochastic Bandits with Adversarial Corruptions
- Computer Science, COLT
- 2019
A new algorithm is presented whose regret is nearly optimal, substantially improving upon previous work and can tolerate a significant amount of corruption with virtually no degradation in performance.
Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?
- Computer Science, ArXiv
- 2020
It is proved that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon; when the values are appropriately normalized, this result shows that long-horizon RL is no more difficult than short-horizon RL, at least in a minimax sense.
Corruption Robust Exploration in Episodic Reinforcement Learning
- Computer Science, COLT
- 2021
This work provides the first sublinear regret guarantee which accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning, and derives results for both tabular and linear-function-approximation settings.
Stochastic bandits robust to adversarial corruptions
- Computer Science, STOC
- 2018
We introduce a new model of stochastic bandits with adversarial corruptions which aims to capture settings where most of the input follows a stochastic pattern but some fraction of it can be…
Minimax Regret Bounds for Reinforcement Learning
- Computer Science, ICML
- 2017
We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of…
Is Q-learning Provably Efficient?
- Computer Science, NeurIPS
- 2018
Q-learning with UCB exploration achieves $\tilde{O}(\sqrt{H^3 SAT})$ regret in an episodic MDP setting, and this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."
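For reference, a minimal Python sketch of tabular episodic Q-learning with a UCB-style bonus in the spirit of this paper: the learning rate $\alpha_t = (H+1)/(H+t)$ follows the paper, while the `env` interface and the constant `c` are assumptions.

```python
import numpy as np

def ucb_q_learning(env, S, A, H, K, c=1.0):
    """Model-free episodic Q-learning with optimistic initialization
    and a UCB bonus; env.reset() -> s and env.step(a) -> (s', r) are
    assumed interfaces."""
    Q = np.full((H, S, A), float(H))   # optimistic initialization
    N = np.zeros((H, S, A))
    for _ in range(K):                 # K episodes
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            s_next, r = env.step(a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)  # learning rate from the paper
            bonus = c * np.sqrt(H**3 * np.log(S * A * H * K) / t)
            V_next = min(H, Q[h + 1, s_next].max()) if h + 1 < H else 0.0
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V_next + bonus)
            s = s_next
    return Q
```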
Near-optimal Regret Bounds for Reinforcement Learning
- Computer Science, J. Mach. Learn. Res.
- 2008
This work presents a reinforcement learning algorithm with total regret $\tilde{O}(DS\sqrt{AT})$ after $T$ steps for any unknown MDP with $S$ states, $A$ actions per state, and diameter $D$, and proposes a new parameter: an MDP has diameter $D$ if for any pair of states $s, s'$ there is a policy which moves from $s$ to $s'$ in at most $D$ steps on average.
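To make the diameter parameter concrete, the following sketch computes $D$ for a known tabular MDP by solving, for each target state, the minimal expected hitting time as a Bellman-style fixed point; the interface is illustrative.

```python
import numpy as np

def mdp_diameter(P, max_iters=100_000, tol=1e-8):
    """Diameter D = max over (s, s') of the minimal expected number of
    steps to reach s' from s.  P is a known transition kernel of shape
    (S, A, S); assumes the MDP is communicating so the fixed point exists."""
    S, _, _ = P.shape
    D = 0.0
    for target in range(S):
        T = np.zeros(S)                 # expected hitting times to target
        for _ in range(max_iters):
            # Bellman update for the minimal expected hitting time; the
            # target itself is absorbing with hitting time zero.
            T_new = 1.0 + (P @ T).min(axis=1)
            T_new[target] = 0.0
            done = np.max(np.abs(T_new - T)) < tol
            T = T_new
            if done:
                break
        D = max(D, T.max())
    return D
```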
Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs
- Computer Science, Mathematics, NeurIPS
- 2019
This paper establishes that optimistic algorithms attain gap-dependent and non-asymptotic logarithmic regret for episodic MDPs. In contrast to prior work, our bounds do not suffer a dependence on…
Achieving Near Instance-Optimality and Minimax-Optimality in Stochastic and Adversarial Linear Bandits Simultaneously
- Computer Science, ICML
- 2021
This work develops linear bandit algorithms that automatically adapt to different environments and additionally enjoy minimax-optimal regret in completely adversarial environments, the first result of this kind to the authors' knowledge.