Corpus ID: 232046338

No-Regret Reinforcement Learning with Heavy-Tailed Rewards

Vincent Zhuang and Yanan Sui
Reinforcement learning algorithms typically assume rewards to be sampled from light-tailed distributions, such as Gaussian or bounded distributions. However, a wide variety of real-world systems generate rewards that follow heavy-tailed distributions. We consider such scenarios in the setting of undiscounted reinforcement learning. By constructing a lower bound, we show that the difficulty of learning heavy-tailed rewards asymptotically dominates the difficulty of learning transition probabilities. Leveraging…
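A standard tool in the heavy-tailed setting is a robust mean estimator in place of the empirical average. As a minimal sketch (not the paper's specific algorithm), a median-of-means estimator splits the samples into blocks and takes the median of the block averages, so a few extreme draws cannot dominate the estimate:

```python
import numpy as np

def median_of_means(samples, num_blocks=5):
    """Robust mean estimate for heavy-tailed samples.

    Splits the samples into blocks, averages each block, and returns
    the median of the block means; blocks polluted by rare extreme
    values are simply outvoted by the others.
    """
    samples = np.asarray(samples, dtype=float)
    blocks = np.array_split(samples, num_blocks)
    block_means = [block.mean() for block in blocks]
    return float(np.median(block_means))
```

With nine samples equal to 1 and one extreme sample of 1000, the empirical mean is about 101, while the median-of-means estimate stays at 1.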



Optimal Algorithms for Lipschitz Bandits with Heavy-tailed Rewards
The key idea is to exploit the Lipschitz property of the expected reward function by adaptively discretizing the arm set, and to employ upper confidence bound policies with robust mean estimators designed for heavy-tailed distributions.
Reinforcement Learning with Perturbed Rewards
This work develops a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed, and shows that policies trained on the estimated surrogate reward achieve higher expected rewards and converge faster than existing baselines.
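The surrogate-reward idea can be illustrated with a small sketch. Assuming the reward channel flips between a finite set of reward levels according to a known confusion matrix C (with C[i, j] the probability of observing level j when the true level is i), solving a linear system against C yields surrogate levels whose expectation under the noise equals the true reward. The function name and two-level setup here are illustrative, not the paper's exact implementation:

```python
import numpy as np

def surrogate_rewards(C, true_levels):
    """Unbiased surrogate reward levels for a noisy reward channel.

    Solves C @ r_hat = true_levels, so that for each true level i,
    sum_j C[i, j] * r_hat[j] = true_levels[i], i.e. the surrogate
    observed reward is unbiased for the true reward.
    """
    C = np.asarray(C, dtype=float)
    r = np.asarray(true_levels, dtype=float)
    return np.linalg.solve(C, r)
```

For a binary reward flipped with probability 0.2, the surrogate levels come out as -1/3 and 4/3: an observed 0 is treated as mildly negative evidence and an observed 1 as amplified positive evidence, which exactly cancels the noise in expectation.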
Is Q-learning Provably Efficient?
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…
Reinforcement Learning with a Corrupted Reward Channel
This work formalises this problem as a generalised Markov Decision Problem called Corrupt Reward MDP, and finds that by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.
Pure Exploration of Multi-Armed Bandits with Heavy-Tailed Payoffs
This paper derives theoretical guarantees for the two proposed bandit algorithms, and demonstrates their effectiveness in pure exploration of MAB with heavy-tailed payoffs on synthetic data and real-world financial data.
Why is Posterior Sampling Better than Optimism for Reinforcement Learning?
A Bayesian expected regret bound for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm.
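The core loop of posterior sampling for RL (PSRL) is short: maintain a posterior over the MDP, sample one plausible model per episode, and act greedily in the sample. A toy tabular sketch, assuming Dirichlet posteriors over transitions and known rewards (function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mdp(counts):
    """Draw one plausible transition model from a Dirichlet posterior.

    counts[s, a] holds observed next-state visit counts; drawing from
    Dirichlet(1 + counts[s, a]) yields a posterior sample of P(.|s, a).
    """
    S, A, _ = counts.shape
    P = np.empty_like(counts, dtype=float)
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(1.0 + counts[s, a])
    return P

def greedy_policy(P, r, H):
    """Finite-horizon value iteration on the sampled MDP (H >= 1)."""
    V = np.zeros(P.shape[0])
    for _ in range(H):
        Q = r + P @ V  # Q[s, a] = r[s, a] + sum_s' P[s, a, s'] * V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)  # greedy first-step policy
```

Each episode, the agent would call `sample_mdp`, compute `greedy_policy` on the sample, run it, and update `counts` with the observed transitions; randomness in the posterior draw supplies the exploration that optimism-based methods get from confidence bonuses.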
Distributionally Robust Reinforcement Learning
This work considers risk-averse exploration in the approximate RL setting and proposes a distributionally robust policy iteration scheme that provides a lower-bound guarantee on state values, along with a practical algorithm implementing an exploration strategy that acts conservatively in the short term and explores optimistically in the long run.
Bandits With Heavy Tail
This paper examines the bandit problem under the weaker assumption that the distributions have moments of order 1 + ε, and derives matching lower bounds showing that the best achievable regret deteriorates when ε < 1.
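Under only a 1 + ε moment bound, one estimator used in robust-UCB-style analyses is the truncated empirical mean: each sample is kept only if it falls below a threshold that grows with the sample index, so rare extreme draws are clipped out. This is a sketch of the standard construction with illustrative constants, not a verbatim implementation from the paper:

```python
import math

def truncated_mean(samples, epsilon, u, delta):
    """Truncated empirical mean for heavy-tailed samples.

    Assumes E[|X|^(1 + epsilon)] <= u. Sample i (1-indexed) is kept
    only if |x_i| <= (u * i / log(1/delta)) ** (1 / (1 + epsilon)),
    trading a small bias for much lighter estimator tails.
    """
    n = len(samples)
    log_term = math.log(1.0 / delta)
    total = 0.0
    for i, x in enumerate(samples, start=1):
        threshold = (u * i / log_term) ** (1.0 / (1.0 + epsilon))
        if abs(x) <= threshold:
            total += x
    return total / n
```

Plugging such an estimator into a UCB index (with a correspondingly widened confidence radius) recovers logarithmic regret whenever ε ≥ 1 and gracefully degrading rates for ε < 1, matching the lower bounds discussed above.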
Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning
The upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs, which have a time-horizon dependency of at least $H^3$.
Robust Adversarial Reinforcement Learning
RARL is proposed, where an agent is trained to operate in the presence of a destabilizing adversary that applies disturbance forces to the system, and the jointly trained adversary is itself reinforced, learning an optimal destabilization policy.