# Randomized Exploration for Reinforcement Learning with General Value Function Approximation

@article{Ishfaq2021RandomizedEF, title={Randomized Exploration for Reinforcement Learning with General Value Function Approximation}, author={Haque Ishfaq and Qiwen Cui and Viet Huy Nguyen and Alex Ayoub and Zhuoran Yang and Zhaoran Wang and Doina Precup and Lin F. Yang}, journal={ArXiv}, year={2021}, volume={abs/2106.07841} }

We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we…
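The core idea described above, perturbing the regression targets with i.i.d. scalar noise before a least-squares value-iteration step, can be sketched in a few lines. This is a minimal illustration assuming linear features; the function name and parameters are chosen here for exposition and are not taken from the paper:

```python
import numpy as np

def perturbed_lsvi_step(phi, targets, lam=1.0, sigma=1.0, rng=None):
    """One randomized least-squares value-iteration step: ridge regression
    on targets perturbed with i.i.d. Gaussian noise (RLSVI-style exploration).

    phi     : (n, d) feature matrix of observed state-action pairs
    targets : (n,) regression targets (reward plus next-state value estimate)
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = targets + sigma * rng.standard_normal(len(targets))  # i.i.d. scalar noises
    A = phi.T @ phi + lam * np.eye(phi.shape[1])                 # regularized Gram matrix
    w = np.linalg.solve(A, phi.T @ noisy)                        # perturbed value weights
    return w
```

Resampling the noise each episode yields a randomized value estimate whose variability plays the role a UCB bonus would otherwise play.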

## 3 Citations

Anti-Concentrated Confidence Bonuses for Scalable Exploration

- Computer Science, ArXiv
- 2021

This work introduces anti-concentrated confidence bounds for efficiently approximating the elliptical bonus, and develops a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic reward heuristics on Atari benchmarks.
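For context, the elliptical bonus that this work approximates is the familiar quantity from linear bandits, the feature norm under the inverse regularized Gram matrix. A minimal sketch of the exact (unapproximated) bonus, with hypothetical names, assuming NumPy:

```python
import numpy as np

def elliptical_bonus(phi, features_seen, lam=1.0):
    """Exact elliptical exploration bonus ||phi||_{A^{-1}}, where A is the
    regularized Gram matrix of previously observed features. The cited work
    approximates this quantity to avoid maintaining A^{-1} at scale.

    phi           : (d,) feature vector of the candidate state-action pair
    features_seen : (n, d) matrix of previously observed features
    """
    d = phi.shape[0]
    A = lam * np.eye(d) + features_seen.T @ features_seen  # regularized Gram matrix
    return float(np.sqrt(phi @ np.linalg.solve(A, phi)))   # ||phi||_{A^{-1}}
```

The bonus is large for directions of feature space rarely visited and shrinks as similar features accumulate, which is what makes it useful as an exploration signal.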

Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP

- Computer Science, ArXiv
- 2021

We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of Vial et al. (2021). Our first…

Improved Algorithms for Misspecified Linear Markov Decision Processes

- Computer Science, Mathematics, ArXiv
- 2021

This work proposes an algorithm with three desirable properties for misspecified linear Markov decision processes (MLMDPs), and generalizes and refines the Sup-Lin-UCB algorithm, which Takemura et al. recently showed satisfies the third property (P3) in the contextual bandit setting.

## References

Showing 1-10 of 37 references.

Provably Efficient Reinforcement Learning with Linear Function Approximation

- Computer Science, Mathematics, COLT
- 2020

This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.

Is Q-learning Provably Efficient?

- Computer Science, Mathematics, NeurIPS
- 2018

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…

Generalization and Exploration via Randomized Value Functions

- Mathematics, Computer Science, ICML
- 2016

The results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.

(More) Efficient Reinforcement Learning via Posterior Sampling

- Computer Science, Mathematics, NIPS
- 2013

An $O(\tau S\sqrt{AT})$ bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning

- Mathematics, Computer Science, J. Mach. Learn. Res.
- 2002

R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time, and it formally justifies the "optimism under uncertainty" bias used in many RL algorithms.

Near-optimal Regret Bounds for Reinforcement Learning

- Computer Science, Mathematics, J. Mach. Learn. Res.
- 2008

This work presents a reinforcement learning algorithm with total regret $O(DS\sqrt{AT})$ after T steps for any unknown MDP with S states, A actions per state, and diameter D. It also proposes a new parameter: an MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps.
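For intuition about the diameter, in a deterministic MDP it reduces to the largest shortest-path distance between any ordered pair of states. The following is a toy sketch under that simplifying assumption (general MDPs use expected hitting times under the best policy), with a hypothetical helper name:

```python
from collections import deque

def mdp_diameter(transitions):
    """Diameter of a deterministic MDP: the largest, over all ordered state
    pairs (s, s'), of the fewest steps needed to reach s' from s.
    `transitions[s]` lists the states reachable from s in one step (any action).
    Assumes a communicating MDP: every state can reach every other state."""
    states = list(transitions)
    diameter = 0
    for src in states:
        dist = {src: 0}
        queue = deque([src])
        while queue:                      # BFS: shortest reach times from src
            s = queue.popleft()
            for nxt in transitions[s]:
                if nxt not in dist:
                    dist[nxt] = dist[s] + 1
                    queue.append(nxt)
        diameter = max(diameter, max(dist[t] for t in states))
    return diameter
```

For example, a one-directional ring of four states has diameter 3, since reaching the predecessor state requires traversing the whole ring.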

Optimism in Reinforcement Learning with Generalized Linear Function Approximation

- Computer Science, Mathematics, ICLR
- 2021

This work designs a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation that enjoys a regret bound of $\tilde{O}(\sqrt{d^3 T})$ where d is the dimensionality of the state-action features and T is the number of episodes.

Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches

- Computer Science, COLT
- 2019

Focusing on the special case of factored MDPs, this work proves an exponential lower bound for a general class of model-free approaches, including OLIVE, which, combined with the algorithmic results, demonstrates an exponential separation between model-based and model-free RL in some rich-observation settings.

Model-based Reinforcement Learning and the Eluder Dimension

- Computer Science, Mathematics, NIPS
- 2014

This work shows that, if the MDP can be parameterized within some known function class, it can obtain regret bounds that scale with the dimensionality, rather than cardinality, of the system.

Worst-Case Regret Bounds for Exploration via Randomized Value Functions

- Computer Science, Mathematics, NeurIPS
- 2019

By providing a worst-case regret bound for tabular finite-horizon Markov decision processes, it is shown that planning with respect to randomized value functions can induce provably efficient exploration.