Corpus ID: 232046338

No-Regret Reinforcement Learning with Heavy-Tailed Rewards

@inproceedings{Zhuang2021NoRegretRL,
title={No-Regret Reinforcement Learning with Heavy-Tailed Rewards},
author={Vincent Zhuang and Yanan Sui},
booktitle={AISTATS},
year={2021}
}
• Published in AISTATS, 25 February 2021
• Computer Science, Mathematics
Reinforcement learning algorithms typically assume rewards to be sampled from light-tailed distributions, such as Gaussian or bounded distributions. However, a wide variety of real-world systems generate rewards that follow heavy-tailed distributions. We consider such scenarios in the setting of undiscounted reinforcement learning. By constructing a lower bound, we show that the difficulty of learning heavy-tailed rewards asymptotically dominates the difficulty of learning transition probabilities. Leveraging…

References

Showing 1–10 of 37 references
Optimal Algorithms for Lipschitz Bandits with Heavy-tailed Rewards
• Mathematics, Computer Science
• ICML
• 2019
The key idea is to exploit the Lipschitz property of the expected reward function by adaptively discretizing the arm set, and to employ upper confidence bound policies with robust mean estimators designed for heavy-tailed distributions.
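One standard robust mean estimator for heavy-tailed data is the truncated empirical mean, which clips rare extreme observations so they cannot dominate the average. A minimal sketch (the function name, the moment bound `u`, and the threshold schedule are illustrative, not taken from this paper):

```python
import math

def truncated_mean(samples, u, eps, delta):
    """Truncated empirical mean for heavy-tailed samples.

    Assumes the (1+eps)-th raw moment is bounded by u, i.e.
    E|X|^(1+eps) <= u. Sample i is discarded (treated as 0) when its
    magnitude exceeds a slowly growing threshold, which limits the
    influence of rare extreme values while keeping the estimate
    nearly unbiased with probability at least 1 - delta.
    """
    n = len(samples)
    total = 0.0
    for i, x in enumerate(samples, start=1):
        threshold = (u * i / math.log(1.0 / delta)) ** (1.0 / (1.0 + eps))
        if abs(x) <= threshold:
            total += x
    return total / n
```

In a UCB-style bandit, this estimator would replace the plain sample mean of each arm before the confidence bonus is added; the clipping makes the concentration of the estimate comparable to the bounded-reward case.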
Reinforcement Learning with Perturbed Rewards
This work develops a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed, and shows that trained policies based on the estimated surrogate reward can achieve higher expected rewards and converge faster than existing baselines.
Is Q-learning Provably Efficient?
• Computer Science, Mathematics
• NeurIPS
• 2018
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…
Reinforcement Learning with a Corrupted Reward Channel
• Computer Science, Mathematics
• IJCAI
• 2017
This work formalises this problem as a generalised Markov Decision Problem called Corrupt Reward MDP, and finds that by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.
Pure Exploration of Multi-Armed Bandits with Heavy-Tailed Payoffs
• Computer Science
• UAI
• 2018
This paper derives theoretical guarantees for the two proposed bandit algorithms, and demonstrates their effectiveness for pure exploration of MABs with heavy-tailed payoffs on synthetic data and real-world financial data.
Why is Posterior Sampling Better than Optimism for Reinforcement Learning?
• Mathematics, Computer Science
• ICML
• 2017
A Bayesian expected regret bound for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm.
Distributionally Robust Reinforcement Learning
• Computer Science, Mathematics
• ArXiv
• 2019
This work considers risk-averse exploration in the approximate RL setting and proposes a distributionally robust policy iteration scheme that provides a lower-bound guarantee on state values. It also presents a practical algorithm implementing an exploration strategy that acts conservatively in the short term and explores optimistically in the long run.
Bandits With Heavy Tail
• Mathematics, Computer Science
• IEEE Transactions on Information Theory
• 2013
This paper examines the bandit problem under the weaker assumption that the distributions have moments of order 1 + ε, and derives matching lower bounds showing that the best achievable regret deteriorates when ε < 1.
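Alongside truncation, the median-of-means estimator is a classical way to get sub-Gaussian-like concentration from distributions with only low-order moments: split the data into blocks, average each block, and take the median of the block means. A hedged sketch (the block count `k` is a free parameter, typically set from the confidence level):

```python
import statistics

def median_of_means(samples, k):
    """Median-of-means estimator: partition samples into k equal blocks,
    average each block, and return the median of the block means.
    A few extreme values can corrupt at most a few block means, and the
    median ignores those, so the estimator stays close to the true mean
    even under heavy-tailed noise.
    """
    n = len(samples)
    block = n // k  # any leftover samples beyond k*block are dropped
    means = [sum(samples[i * block:(i + 1) * block]) / block
             for i in range(k)]
    return statistics.median(means)
```

For a target failure probability δ, choosing k on the order of log(1/δ) blocks is the usual recipe; larger k tolerates more outliers at the cost of noisier block means.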
Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning
• Computer Science, Mathematics
• NIPS
• 2015
The upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs, which have a time-horizon dependency of at least $H^3$.
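The advantage of Bernstein-type bounds over Hoeffding-type bounds is that they scale with the empirical variance rather than the worst-case range. A small sketch of the empirical Bernstein confidence width for samples bounded in [0, b] (this is the standard Audibert-style form, shown only to illustrate the variance sensitivity, not this paper's exact bound):

```python
import math
import statistics

def empirical_bernstein_width(samples, b, delta):
    """Empirical Bernstein confidence width for samples in [0, b]:

        sqrt(2 * Vn * ln(3/delta) / n) + 3 * b * ln(3/delta) / n

    where Vn is the empirical variance. When the observed variance is
    small, the first term shrinks and the width beats the Hoeffding
    width b * sqrt(ln(2/delta) / (2n)), which depends only on the range.
    """
    n = len(samples)
    vn = statistics.pvariance(samples)
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * vn * log_term / n) + 3.0 * b * log_term / n
```

In model-based RL analyses, applying such variance-aware bounds to value estimates is what removes a factor of the horizon H from the sample complexity.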