• Corpus ID: 231925038

Improved Corruption Robust Algorithms for Episodic Reinforcement Learning

Yifang Chen, Simon Shaolei Du, Kevin G. Jamieson
International Conference on Machine Learning
We study episodic reinforcement learning under unknown adversarial corruptions in both the rewards and the transition probabilities of the underlying system. We propose new algorithms which, compared to the existing results in (Lykouris et al., 2020), achieve strictly better regret bounds in terms of total corruptions for the tabular setting. To be specific, firstly, our regret bounds depend on more precise numerical values of total rewards corruptions and transition corruptions, instead of… 

Corruption-Robust Offline Reinforcement Learning

It is shown that a worst-case Ω(Hdε) optimality gap is unavoidable in a linear MDP of dimension d, even if the adversary only corrupts the reward element in a tuple, which implies that corruption-robust offline RL is a strictly harder problem.

A Model Selection Approach for Corruption Robust Reinforcement Learning

A model selection approach is proposed to tackle reinforcement learning with adversarial corruption in both transition and reward; it can be easily applied to other settings including linear bandits, linear contextual bandits, and MDPs with general function approximation, leading to several improved or new results.

Policy Resilience to Environment Poisoning Attacks on Reinforcement Learning

This paper investigates policy resilience to training-environment poisoning attacks on reinforcement learning (RL) policies, with the goal of recovering the deployment performance of a poisoned RL policy, and proposes a policy-resilience mechanism based on an idea of knowledge sharing.

Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms

Surprisingly, in this paper, it is shown that the stochastic contextual problem can be solved as if it is a linear bandit problem, and a novel reduction framework is established that converts every stochastic contextual linear bandit instance to a linear bandit instance, when the context distribution is known.

Online Policy Optimization for Robust MDP

This work considers online robust MDP by interacting with an unknown nominal system, and proposes a robust optimistic policy optimization algorithm that is provably efficient under a more realistic online setting.

Understanding the Limits of Poisoning Attacks in Episodic Reinforcement Learning

This paper studies poisoning attacks to manipulate any order-optimal learning algorithm towards a targeted policy in episodic RL and examines the potential damage of two natural types of poisoning attacks, i.e., the manipulation of reward or action.

User-Oriented Robust Reinforcement Learning

A new User-Oriented Robustness (UOR) metric for RL, which allocates different weights to the environments according to user preference and generalizes the max-min robustness metric, is proposed.

Hybrid Regret Bounds for Combinatorial Semi-Bandits and Adversarial Linear Bandits

An algorithm for combinatorial semi-bandits with a hybrid regret bound that includes a best-of-three-worlds guarantee and multiple data-dependent regret bounds is proposed, which implies that the algorithm will perform better as long as the environment is "easy" in terms of certain metrics.

On Optimal Robustness to Adversarial Corruption in Online Decision Problems

  • Shinji Ito
  • Computer Science, Mathematics
  • 2021
It is shown that optimal robustness can be expressed by a square-root dependency on the amount of corruption, and two classes of algorithms, anytime Hedge with decreasing learning rate and algorithms with second-order regret bounds, achieve $O\left(\frac{\log N}{\Delta} + \sqrt{\frac{C \log N}{\Delta}}\right)$-regret.



Stochastic Linear Bandits Robust to Adversarial Attacks

In a contextual setting, a setup of diverse contexts is revisited, and it is shown that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.

Better Algorithms for Stochastic Bandits with Adversarial Corruptions

A new algorithm is presented whose regret is nearly optimal, substantially improving upon previous work, and which can tolerate a significant amount of corruption with virtually no degradation in performance.

Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

It is proved that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon, and when the values are appropriately normalized, this result shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense.

Corruption Robust Exploration in Episodic Reinforcement Learning

This work provides the first sublinear regret guarantee which accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning, and derives results for both tabular and linear-function-approximation settings.

Stochastic bandits robust to adversarial corruptions

We introduce a new model of stochastic bandits with adversarial corruptions which aims to capture settings where most of the input follows a stochastic pattern but some fraction of it can be

Minimax Regret Bounds for Reinforcement Learning

We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of

Is Q-learning Provably Efficient?

Q-learning with UCB exploration achieves regret in an episodic MDP setting, and this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."

Near-optimal Regret Bounds for Reinforcement Learning

This work presents a reinforcement learning algorithm with total regret $O(DS\sqrt{AT})$ after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: an MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps.

Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs

This paper establishes that optimistic algorithms attain gap-dependent and non-asymptotic logarithmic regret for episodic MDPs. In contrast to prior work, our bounds do not suffer a dependence on

Achieving Near Instance-Optimality and Minimax-Optimality in Stochastic and Adversarial Linear Bandits Simultaneously

This work develops linear bandit algorithms that automatically adapt to different environments and additionally enjoy minimax-optimal regret in completely adversarial environments, which is the first of this kind to the authors' knowledge.