Corpus ID: 235435953

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Haque Ishfaq, Qiwen Cui, Viet Huy Nguyen, Alex Ayoub, Zhuoran Yang, Zhaoran Wang, Doina Precup, Lin F. Yang
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we… 
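The perturbation idea described in the abstract can be illustrated with a toy least-squares value update: rather than adding a UCB-style bonus, each regression target is perturbed with i.i.d. Gaussian noise before solving the ridge regression. This is a minimal sketch, not the paper's algorithm; the function name, `sigma`, and `reg` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_lsvi_step(features, targets, sigma=1.0, reg=1.0):
    """One randomized least-squares value-iteration update (sketch).

    features: (n, d) state-action feature matrix
    targets:  (n,) regression targets, e.g. r + max_a Q(s', a)
    sigma:    std of the i.i.d. scalar perturbation noise (assumption)
    reg:      ridge regularization strength (assumption)
    """
    # Perturb each target with independent Gaussian noise; the randomness
    # in the resulting value estimate is what drives exploration.
    noisy = targets + rng.normal(0.0, sigma, size=targets.shape)
    d = features.shape[1]
    # Solve the regularized least-squares problem on the perturbed targets.
    A = features.T @ features + reg * np.eye(d)
    b = features.T @ noisy
    return np.linalg.solve(A, b)
```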


Anti-Concentrated Confidence Bonuses for Scalable Exploration
This work introduces anti-concentrated confidence bounds for efficiently approximating the elliptical bonus, and develops a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic reward heuristics on Atari benchmarks.
Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP
  • Liyu Chen, Rahul Jain, Haipeng Luo
  • Computer Science
  • 2021
We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first…
Improved Algorithms for Misspecified Linear Markov Decision Processes
This work proposes an algorithm with three desirable properties for misspecified linear Markov decision processes (MLMDPs), and generalizes and refines the Sup-Lin-UCB algorithm, which Takemura et al. recently showed satisfies (P3) in the contextual bandit setting.


Provably Efficient Reinforcement Learning with Linear Function Approximation
This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{O}(\sqrt{d^3 H^3 T})$ regret, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.
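The UCB-style approaches contrasted with the randomized method above rely on an elliptical confidence bonus of the form beta * sqrt(phi^T Lambda^{-1} phi). A minimal sketch of that bonus, assuming a feature vector `phi` and inverse Gram matrix `Lambda_inv` (all names and `beta` are illustrative, not the papers' notation):

```python
import numpy as np

def ucb_bonus(phi, Lambda_inv, beta=1.0):
    """Elliptical exploration bonus: beta * sqrt(phi^T Lambda^{-1} phi).

    phi:        (d,) feature vector of a state-action pair
    Lambda_inv: (d, d) inverse of the regularized Gram matrix
    beta:       confidence-width multiplier (assumption)
    """
    return beta * float(np.sqrt(phi @ Lambda_inv @ phi))
```

With `Lambda_inv = I`, the bonus reduces to `beta * ||phi||`; maintaining and inverting the Gram matrix is the computational cost that bonus-free randomized exploration avoids.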
Is Q-learning Provably Efficient?
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…
Generalization and Exploration via Randomized Value Functions
The results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.
(More) Efficient Reinforcement Learning via Posterior Sampling
An O(τS√(AT)) bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.
R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning
R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time and formally justifies the ``optimism under uncertainty'' bias used in many RL algorithms.
Near-optimal Regret Bounds for Reinforcement Learning
This work presents a reinforcement learning algorithm with total regret O(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: an MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps.
Optimism in Reinforcement Learning with Generalized Linear Function Approximation
This work designs a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation that enjoys a regret bound of $\tilde{O}(\sqrt{d^3 T})$ where d is the dimensionality of the state-action features and T is the number of episodes.
Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches
Focusing on the special case of factored MDPs, this work proves an exponential lower bound for a general class of model-free approaches, including OLIVE, which, when combined with the algorithmic results, demonstrates exponential separation between model-based and model-free RL in some rich-observation settings.
Model-based Reinforcement Learning and the Eluder Dimension
This work shows that, if the MDP can be parameterized within some known function class, it can obtain regret bounds that scale with the dimensionality, rather than cardinality, of the system.
Worst-Case Regret Bounds for Exploration via Randomized Value Functions
By providing a worst-case regret bound for tabular finite-horizon Markov decision processes, it is shown that planning with respect to randomized value functions can induce provably efficient exploration.