• Corpus ID: 235435953

# Randomized Exploration for Reinforcement Learning with General Value Function Approximation

@article{Ishfaq2021RandomizedEF,
title={Randomized Exploration for Reinforcement Learning with General Value Function Approximation},
author={Haque Ishfaq and Qiwen Cui and Viet Huy Nguyen and Alex Ayoub and Zhuoran Yang and Zhaoran Wang and Doina Precup and Lin F. Yang},
journal={ArXiv},
year={2021},
volume={abs/2106.07841}
}
• Haque Ishfaq, +5 authors Lin F. Yang
• Published 15 June 2021
• Computer Science, Mathematics
• ArXiv
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we…
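The abstract describes driving exploration by perturbing the regression targets with i.i.d. scalar noise and acting on an optimistic estimate, instead of adding a UCB-style bonus. A minimal sketch of that idea for a single least-squares solve with linear features (all data, dimensions, and parameter values below are illustrative toys, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: state-action features and regression targets
# (immediate reward plus next-state value) from an episodic MDP.
d, n = 4, 200                       # feature dimension, number of transitions
Phi = rng.normal(size=(n, d))       # one feature row per observed (s, a) pair
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)  # noisy targets

lam, sigma, M = 1.0, 0.5, 10        # ridge parameter, noise scale, ensemble size

def perturbed_least_squares(Phi, y, lam, sigma):
    """One randomized solve: perturb each target with i.i.d. Gaussian
    noise, then fit ordinary ridge regression to the perturbed targets."""
    y_tilde = y + sigma * rng.normal(size=y.shape)
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y_tilde)

# Draw an ensemble of perturbed solutions and take the most optimistic
# value estimate for a candidate action; the randomness in the targets,
# not an explicit bonus term, is what drives exploration.
thetas = np.stack([perturbed_least_squares(Phi, y, lam, sigma)
                   for _ in range(M)])
phi_new = rng.normal(size=d)            # features of a candidate (s, a) pair
optimistic_value = (thetas @ phi_new).max()
```

In a full value-iteration loop this solve would be repeated backwards over the horizon, with the fitted values at step h+1 feeding the targets at step h; the sketch shows only the randomized regression step.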

## Citations

Anti-Concentrated Confidence Bonuses for Scalable Exploration
• Computer Science
• ArXiv
• 2021
This work introduces anti-concentrated confidence bounds for efficiently approximating the elliptical bonus, and develops a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic reward heuristics on Atari benchmarks.
Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP
• Liyu Chen, Rahul Jain, Haipeng Luo
• Computer Science
• ArXiv
• 2021
We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first…
Improved Algorithms for Misspecified Linear Markov Decision Processes
• Computer Science, Mathematics
• ArXiv
• 2021
This work proposes an algorithm with three desirable properties of misspecified linear Markov decision process (MLMDP) and generalizes and refines the Sup-Lin-UCB algorithm, which Takemura et al. recently showed satisfies (P3) in the contextual bandit setting.

## References

Showing 1-10 of 37 references
Provably Efficient Reinforcement Learning with Linear Function Approximation
• Computer Science, Mathematics
• COLT
• 2020
This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves Õ(√(d³H³T)) regret, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.
Is Q-learning Provably Efficient?
• Computer Science, Mathematics
• NeurIPS
• 2018
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…
Generalization and Exploration via Randomized Value Functions
• Mathematics, Computer Science
• ICML
• 2016
The results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.
(More) Efficient Reinforcement Learning via Posterior Sampling
• Computer Science, Mathematics
• NIPS
• 2013
An Õ(τS√AT) bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.
R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning
• Mathematics, Computer Science
• J. Mach. Learn. Res.
• 2002
R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time, and it formally justifies the "optimism under uncertainty" bias used in many RL algorithms.
Near-optimal Regret Bounds for Reinforcement Learning
• Computer Science, Mathematics
• J. Mach. Learn. Res.
• 2008
This work presents a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: An MDP has diameter D if for any pair of states s,s' there is a policy which moves from s to s' in at most D steps.
Optimism in Reinforcement Learning with Generalized Linear Function Approximation
• Computer Science, Mathematics
• ICLR
• 2021
This work designs a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation that enjoys a regret bound of $\tilde{O}(\sqrt{d^3 T})$ where d is the dimensionality of the state-action features and T is the number of episodes.
Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches
• Computer Science
• COLT
• 2019
Focusing on the special case of factored MDPs, this work proves an exponential lower bound for a general class of model-free approaches, including OLIVE, which, when combined with the algorithmic results, demonstrates exponential separation between model-based and model-free RL in some rich-observation settings.
Model-based Reinforcement Learning and the Eluder Dimension
• Computer Science, Mathematics
• NIPS
• 2014
This work shows that, if the MDP can be parameterized within some known function class, it can obtain regret bounds that scale with the dimensionality, rather than cardinality, of the system.
Worst-Case Regret Bounds for Exploration via Randomized Value Functions
By providing a worst-case regret bound for tabular finite-horizon Markov decision processes, it is shown that planning with respect to randomized value functions can induce provably efficient exploration.