• Corpus ID: 235435953

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Haque Ishfaq, Qiwen Cui, Viet Huy Nguyen, Alex Ayoub, Zhuoran Yang, Zhaoran Wang, Doina Precup, Lin F. Yang
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we… 
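The idea of driving exploration by perturbing the training data can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a linear function class, ridge regression, and Gaussian noise, and the function name `perturbed_least_squares` and the parameters `noise_std` and `reg` are hypothetical choices made only for illustration.

```python
import numpy as np


def perturbed_least_squares(features, targets, noise_std=1.0, reg=1.0, rng=None):
    """Ridge regression on targets perturbed by i.i.d. scalar Gaussian noise.

    Illustrative only: the linear model, the Gaussian noise, and all
    parameter names are assumptions, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = features.shape
    # Add one independent scalar noise term to each regression target.
    noisy_targets = targets + rng.normal(0.0, noise_std, size=n)
    # Solve the regularized normal equations (X^T X + reg * I) w = X^T y_noisy.
    gram = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(gram, features.T @ noisy_targets)


# Toy data: 50 samples, 3 features, noiseless linear targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = perturbed_least_squares(X, y, noise_std=0.1, rng=rng)
```

Each call with fresh noise yields a different weight vector, so acting greedily with respect to the resulting value estimate varies from draw to draw; that variation is what produces exploration in this kind of scheme.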

HyperDQN: A Randomized Exploration Method for Deep Reinforcement Learning

A practical algorithm named HyperDQN is presented to address the above issues under deep RL, which outperforms several exploration bonus and randomized exploration methods on 5 out of 9 games and outperforms DQN on the Atari suite.

Anti-Concentrated Confidence Bonuses for Scalable Exploration

A practical variant for deep reinforcement learning that is competitive with contemporary intrinsic reward heuristics on Atari benchmarks is developed, using an ensemble of regressors trained to predict random noise from policy network-derived features.

Understanding Deep Neural Function Approximation in Reinforcement Learning via $\epsilon$-Greedy Exploration

An initial attempt at a theoretical understanding of deep RL from the perspective of function classes and neural network architectures beyond the “linear” regime. It is proved that, with T episodes, scaling the width m and the depth L of the neural network is sufficient for learning with sublinear regret in Besov spaces.

Scalable Exploration for Neural Online Learning to Rank with Perturbed Feedback

This work proposes an efficient, bootstrapping-based exploration strategy for online interactive neural ranker learning. It eliminates explicit confidence set construction and the associated computational overhead, so that online neural rankers can be trained efficiently in practice with theoretical guarantees.

Improved Algorithms for Misspecified Linear Markov Decision Processes

This work proposes an algorithm with three desirable properties of the misspecified linear Markov decision process (MLMDP) model and provides an intuitive interpretation of their result, which informs the design of the algorithm.

Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration and Planning

We study the problem of episodic reinforcement learning in continuous state-action spaces with unknown rewards and transitions. Specifically, we consider the setting where the rewards and transitions

Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees

An optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of posterior sampling that only needs a number of posterior samples logarithmic in H, S, A, and T per state-action pair, and matches the lower bound of order $\Omega(\sqrt{H^3SAT})$, thereby answering the open problems raised by Agrawal and Jia [2017b].

Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP

We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first

Understanding the Eluder Dimension

For binary-valued function classes, a characterization of the eluder dimension is obtained in terms of star number and threshold dimension, quantities which are relevant in active learning and online learning respectively.



Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

It is proved that the frequentist regret of RLSVI is upper-bounded by $\widetilde{O}(d^2 H^2 \sqrt{T})$, where d is the feature dimension, H is the horizon, and T is the total number of steps.

Provably Efficient Reinforcement Learning with Linear Function Approximation

This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\widetilde{O}(\sqrt{d^3H^3T})$ regret, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps. The bound is independent of the number of states and actions.

Provably Efficient Exploration in Policy Optimization

This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves $\widetilde{O}(\sqrt{d^2H^3T})$ regret.

Is Q-learning Provably Efficient?

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically

Generalization and Exploration via Randomized Value Functions

The results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.

(More) Efficient Reinforcement Learning via Posterior Sampling

An $\widetilde{O}(\tau S\sqrt{AT})$ bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension

This paper establishes a provably efficient RL algorithm with general value function approximation that achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ and provides a framework to justify the effectiveness of algorithms used in practice.

R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning

R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time and formally justifies the “optimism under uncertainty” bias used in many RL algorithms.

SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning

SUNRISE is a simple unified ensemble method that is compatible with various off-policy RL algorithms and significantly improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, on both continuous and discrete control tasks in low-dimensional and high-dimensional environments.

Learning Near Optimal Policies with Low Inherent Bellman Error

We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to