• Corpus ID: 36086749

On Optimistic versus Randomized Exploration in Reinforcement Learning

@article{Osband2017OnOV,
  title={On Optimistic versus Randomized Exploration in Reinforcement Learning},
  author={Ian Osband and Benjamin Van Roy},
  journal={ArXiv},
  year={2017},
  volume={abs/1706.04241}
}
We discuss the relative merits of optimistic and randomized approaches to exploration in reinforcement learning. Optimistic approaches presented in the literature apply an optimistic boost to the value estimate at each state-action pair and select actions that are greedy with respect to the resulting optimistic value function. Randomized approaches sample from among statistically plausible value functions and select actions that are greedy with respect to the random sample. Prior computational… 
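
To make the contrast concrete, here is a minimal, illustrative sketch (not from the paper) of the two greedy action-selection rules over per-action value estimates with uncertainty: the optimistic rule adds a bonus to each estimate, while the randomized (Thompson-style) rule acts on one plausible sample of the values. The arrays `value_mean`, `value_std` and the bonus scale `beta` are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative estimates of each action's (or state-action pair's) value,
# together with a rough measure of uncertainty.
value_mean = np.array([1.0, 0.8, 0.5])   # hypothetical value estimates
value_std = np.array([0.1, 0.4, 0.9])    # hypothetical per-action uncertainty

def optimistic_action(beta=2.0):
    """Greedy with respect to an optimistically boosted value estimate."""
    return int(np.argmax(value_mean + beta * value_std))

def randomized_action():
    """Greedy with respect to one statistically plausible sample of the values."""
    return int(np.argmax(rng.normal(value_mean, value_std)))

print("optimistic choice:", optimistic_action())
print("randomized choice:", randomized_action())
```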

Citations

Scalable Coordinated Exploration in Concurrent Reinforcement Learning

We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of…

Coordinated Exploration in Concurrent Reinforcement Learning

Simulations examine how per-agent regret decreases as the number of agents grows, establishing substantial advantages of seed sampling over alternative exploration schemes.

Time Adaptive Reinforcement Learning

Two model-free, value-based algorithms are introduced that allow zero-shot adaptation between different time restrictions and serve as general mechanisms for handling time-adaptive tasks, making them compatible with many existing RL methods, algorithms, and scenarios.

A Tutorial on Thompson Sampling

This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes.
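
As a concrete instance of the Bernoulli bandit example mentioned in this entry, here is a standard Beta-Bernoulli Thompson sampling loop (a textbook-style sketch, not code from the tutorial); the arm success probabilities in `true_probs` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_probs = np.array([0.4, 0.5, 0.6])   # unknown arm success probabilities (illustrative)
n_arms = len(true_probs)

# Beta(1, 1) prior on each arm's success probability.
successes = np.ones(n_arms)
failures = np.ones(n_arms)

for t in range(2000):
    # Sample one plausible success probability per arm and act greedily on the sample.
    theta = rng.beta(successes, failures)
    arm = int(np.argmax(theta))
    reward = rng.random() < true_probs[arm]
    # Conjugate posterior update.
    successes[arm] += reward
    failures[arm] += 1 - reward

print("posterior means:", successes / (successes + failures))
```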

Stochastic matrix games with bandit feedback

A version of the classical zero-sum matrix game with an unknown payoff matrix and bandit feedback is considered, where the players only observe each other's actions and a noisy payoff, and it is shown that Thompson sampling fails catastrophically in this setting.

Matrix games with bandit feedback

A version of the classical zero-sum matrix game with an unknown payoff matrix and bandit feedback is studied, where the players only observe each other's actions and a noisy payoff, yielding the surprising result that there is no advantage to knowing your opponent's strategy in advance if their strategy is optimal.

A Bayesian Nonparametric Approach to Multi-Task Learning for Contextual Bandits in Mobile Health

This work proposes DPMM-Pooling, an integrated intervention algorithm to learn clusters among users and share data within clusters, in order to speed the learning of optimal and individualized treatment policies.

Personalized HeartSteps: A Reinforcement Learning Algorithm for Optimizing Physical Activity

A reinforcement learning (RL) algorithm is developed that continuously learns and improves the treatment policy embedded in the just-in-time adaptive intervention (JITAI) as data is collected from the user.

Seamlessly Unifying Attributes and Items: Conversational Recommendation for Cold-start Users

The Conversational Thompson Sampling (ConTS) model holistically solves all questions in conversational recommendation by choosing the arm with the maximal reward to play, seamlessly unifying attributes and items in the same arm space and achieving their exploration-exploitation (EE) trade-offs automatically within the framework of Thompson Sampling.

References


Generalization and Exploration via Randomized Value Functions

The results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.

(More) Efficient Reinforcement Learning via Posterior Sampling

An O(τS√AT) bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.
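
To make "posterior sampling for reinforcement learning" concrete, the following tabular, episodic sketch illustrates the idea under simplifying assumptions (it is not the paper's exact algorithm): draw one MDP from Dirichlet/Gaussian posteriors, solve it by finite-horizon value iteration, and act greedily on the sample for the episode. All sizes, priors, and counts below are hypothetical, and the interaction loop with posterior updates is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, H = 4, 2, 5   # illustrative numbers of states, actions, and horizon length

# Posterior sufficient statistics, assumed accumulated over past episodes:
# Dirichlet counts for transitions, Gaussian summaries for mean rewards.
trans_counts = np.ones((S, A, S))   # Dirichlet(1, ..., 1) prior plus observed counts
reward_sum = np.zeros((S, A))
reward_n = np.ones((S, A))          # a pseudo-count of 1 keeps the sketch simple

def sample_mdp():
    """Draw one statistically plausible MDP from the posterior."""
    P = np.stack([[rng.dirichlet(trans_counts[s, a]) for a in range(A)] for s in range(S)])
    R = rng.normal(reward_sum / reward_n, 1.0 / np.sqrt(reward_n))
    return P, R

def solve(P, R):
    """Finite-horizon value iteration; returns a greedy policy for each step."""
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                  # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] * V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

P, R = sample_mdp()
policy = solve(P, R)                   # followed greedily for one episode
print(policy)
```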

Model based Bayesian Exploration

This paper explicitly represents uncertainty about the parameters of the model and builds probability distributions over Q-values based on these; the distributions are used to compute a myopic approximation to the value of information for each action and hence to select the action that best balances exploration and exploitation.
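
A rough bandit-style sketch of the myopic value-of-information idea described here (a deliberate simplification, not the paper's model-based algorithm): given a distribution over each action's Q-value, estimate by Monte Carlo how much could be gained by learning that action's true value, and choose the action with the best mean-value-plus-gain score. The Gaussian Q-value posteriors are placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative Gaussian posteriors over each action's Q-value.
q_mean = np.array([1.0, 0.9, 0.3])
q_std = np.array([0.05, 0.5, 1.0])

def myopic_voi_action(n_samples=10_000):
    """Pick the action maximizing expected Q-value plus a myopic value-of-information term."""
    order = np.argsort(q_mean)[::-1]
    best, runner_up = order[0], order[1]
    scores = np.empty(len(q_mean))
    for a in range(len(q_mean)):
        q = rng.normal(q_mean[a], q_std[a], size=n_samples)
        if a == best:
            # Learning the apparent best action's value helps only if it turns out
            # to be worse than the current runner-up.
            gain = np.maximum(q_mean[runner_up] - q, 0.0).mean()
        else:
            # Learning another action's value helps only if it beats the current best.
            gain = np.maximum(q - q_mean[best], 0.0).mean()
        scores[a] = q_mean[a] + gain
    return int(np.argmax(scores))

print("action chosen:", myopic_voi_action())
```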

Learning to Optimize via Posterior Sampling

A Bayesian regret bound for posterior sampling is established that applies broadly, can be specialized to many model classes, and depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.
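
For readers unfamiliar with the term, here is a paraphrase of the eluder dimension (stated from memory of Russo and Van Roy's definition, so the exact phrasing should be checked against the paper):

```latex
An action $a$ is $\epsilon$-dependent on $\{a_1,\dots,a_n\}$ with respect to a
function class $\mathcal{F}$ if any pair $f, \tilde f \in \mathcal{F}$ satisfying
\[
  \sqrt{\textstyle\sum_{i=1}^{n} \big(f(a_i) - \tilde f(a_i)\big)^2} \le \epsilon
\]
also satisfies $|f(a) - \tilde f(a)| \le \epsilon$; otherwise $a$ is
$\epsilon$-independent of $\{a_1,\dots,a_n\}$. The eluder dimension
$\dim_E(\mathcal{F}, \epsilon)$ is the length of the longest sequence of actions
such that every action is $\epsilon'$-independent of its predecessors for some
$\epsilon' \ge \epsilon$.
```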

#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

A simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks, and it is found that simple hash functions can achieve surprisingly good results on many challenging tasks.
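
A minimal sketch of the count-based bonus with state hashing described here (illustrative only; the SimHash construction below and the bonus coefficient `BETA` are assumptions, not the paper's exact configuration): hash each continuous state to a short binary code, count code visits, and grant a reward bonus proportional to 1/sqrt(count).

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(4)

STATE_DIM, N_BITS, BETA = 8, 16, 0.1
projection = rng.normal(size=(N_BITS, STATE_DIM))   # fixed random projection for SimHash
counts = defaultdict(int)

def simhash(state):
    """Map a continuous state to a short binary code via random projections."""
    return ((projection @ state) > 0).tobytes()

def exploration_bonus(state):
    """Count-based bonus BETA / sqrt(n(hash(state)))."""
    code = simhash(state)
    counts[code] += 1
    return BETA / np.sqrt(counts[code])

# The bonus shrinks as (approximately) the same state is revisited.
state = rng.normal(size=STATE_DIM)
print([round(exploration_bonus(state + 1e-3 * rng.normal(size=STATE_DIM)), 4) for _ in range(5)])
```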

Eluder Dimension and the Sample Complexity of Optimistic Exploration

A regret bound is developed that holds for both classes of algorithms, applies broadly, can be specialized to many model classes, and depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.

Deep Exploration via Randomized Value Functions

A regret bound that establishes statistical efficiency with a tabular representation is proved, and the approach offers an elegant means of synthesizing statistically and computationally efficient exploration with common practical approaches to value function learning.

Deep Exploration via Bootstrapped DQN

Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and…
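
The core mechanism can be sketched in tabular form rather than with deep networks (an illustration of the ensemble idea, not the paper's DQN implementation): maintain K independently initialized Q-estimates, sample one at the start of each episode, and act greedily with respect to it for the whole episode, which gives temporally consistent exploration. All sizes and the bootstrap probability below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
K, S, A = 10, 6, 3          # illustrative ensemble size, state count, action count
ALPHA, GAMMA = 0.1, 0.99

# K Q-tables with randomized initializations playing the role of the bootstrap ensemble.
q_ensemble = rng.normal(0.0, 1.0, size=(K, S, A))

def begin_episode():
    """Sample one ensemble member to follow greedily for the whole episode."""
    return int(rng.integers(K))

def act(k, s):
    return int(np.argmax(q_ensemble[k, s]))

def update(s, a, r, s_next, done):
    """Q-learning update applied to a bootstrapped subset of the ensemble."""
    for k in range(K):
        if rng.random() < 0.5:   # double-or-nothing bootstrap mask (assumed p = 0.5)
            target = r + (0.0 if done else GAMMA * q_ensemble[k, s_next].max())
            q_ensemble[k, s, a] += ALPHA * (target - q_ensemble[k, s, a])

# Usage within one episode (environment interaction omitted):
k = begin_episode()
a = act(k, s=0)
update(s=0, a=a, r=1.0, s_next=1, done=False)
```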

Near-optimal Regret Bounds for Reinforcement Learning

This work presents a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: an MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps on average.
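
For reference, the diameter mentioned above can be written out as follows (a paraphrase of the standard definition, where the expectation is over the random travel time under the chosen policy):

```latex
D(M) \;=\; \max_{s \neq s'}\; \min_{\pi}\; \mathbb{E}\big[\, T(s' \mid M, \pi, s) \,\big]
```

Here T(s' | M, π, s) denotes the number of steps needed to reach state s' from state s when following policy π in MDP M.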

PAC model-free reinforcement learning

This result proves efficient reinforcement learning is possible without learning a model of the MDP from experience, and Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.