Corpus ID: 1998270

Better Optimism By Bayes: Adaptive Planning with Rich Models

@article{Guez2014BetterOB,
  title={Better Optimism By Bayes: Adaptive Planning with Rich Models},
  author={Arthur Guez and David Silver and Peter Dayan},
  journal={ArXiv},
  year={2014},
  volume={abs/1402.1958}
}
The computational costs of inference and planning have confined Bayesian model-based reinforcement learning to one of two dismal fates: powerful Bayes-adaptive planning but only for simplistic models, or powerful, Bayesian non-parametric models but using simple, myopic planning strategies such as Thompson sampling. We ask whether it is feasible and truly beneficial to combine rich probabilistic models with a closer approximation to fully Bayesian planning. First, we use a collection of… 
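
As context for the Thompson-sampling baseline the abstract contrasts with fuller Bayes-adaptive planning, the following is a minimal sketch of that myopic strategy on a Bernoulli bandit: sample one hypothesis from the posterior, act greedily with respect to it, and update. The problem, priors, and variable names are illustrative assumptions, not material from the paper.

```python
import numpy as np

# Thompson sampling on a 3-armed Bernoulli bandit (illustrative only).
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # unknown to the agent
alpha = np.ones(3)                        # Beta(1, 1) prior per arm
beta = np.ones(3)

for t in range(1000):
    theta = rng.beta(alpha, beta)         # one posterior sample per arm
    arm = int(np.argmax(theta))           # act greedily w.r.t. the sample
    reward = float(rng.random() < true_means[arm])
    alpha[arm] += reward                  # conjugate posterior update
    beta[arm] += 1.0 - reward

print("posterior means:", alpha / (alpha + beta))
```

Because each decision is greedy with respect to a single posterior draw, this strategy never reasons about the information that future actions will gather, which is what motivates the closer approximation to fully Bayesian planning studied here.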


Bayesian Optimal Control of Smoothly Parameterized Systems

TLDR
A lazy version of the posterior sampling method, which goes back to Thompson and Strens, is presented; it allows a single algorithm and a single analysis to cover a wide range of problems, such as finite MDPs or linear quadratic regulation.

Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

TLDR
A Bayesian expected regret bound for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm.

Bayesian Optimal Control of Smoothly Parameterized Systems: The Lazy Posterior Sampling Algorithm

TLDR
An algorithm that trades off performance for computational efficiency is designed: a lazy posterior sampling method that maintains a distribution over the unknown parameter and is shown to be effective on a web server control application.

Information theoretic learning methods for Markov decision processes with parametric uncertainty

TLDR
This dissertation contributes to this area by developing a rigorous framework, rooted in information theory, for solving MDPs with model uncertainty; the problem formulation is grounded entirely in system and informational dynamics, without the use of ad-hoc heuristics.

Reinforcement learning approaches to the analysis of the emergence of goal-directed behaviour

TLDR
It is concluded that computational descriptions of the developing decision making functions provide one plausible avenue by which to normatively characterize and define the functions that control action selection.

Optimal treatment allocations in space and time for on‐line control of an emerging infectious disease

TLDR
A Bayesian on-line estimator of the optimal allocation strategy that combines simulation-optimization with Thompson sampling is derived and performs favourably in simulation experiments.

Automated experiment design for drug development

TLDR
Results show that the bandits perform significantly better than random selection and that the feature compression probably does not decrease the overall accuracy of the predictions.

References

SHOWING 1-10 OF 35 REFERENCES

Scalable and Efficient Bayes-Adaptive Reinforcement Learning Based on Monte-Carlo Tree Search

TLDR
This paper introduces a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search and enables it to outperform previous Bayesian model-based reinforcement learning algorithms by a significant margin on several well-known benchmark problems.

Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

TLDR
This paper introduces a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search and shows it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration.
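
A key ingredient in sample-based Bayes-adaptive planners of this kind is root sampling: at the start of every simulation one model is drawn from the current posterior and the whole simulation runs inside that model, so the search implicitly averages over the posterior. Below is a minimal sketch of that idea; for brevity it uses flat Monte-Carlo evaluation of the first action rather than a full search tree, and the `sample_model` / `model(s, a)` interfaces are hypothetical, not the authors' BAMCP code.

```python
import random
from collections import defaultdict

def plan(sample_model, state, actions, n_sims=2000, horizon=15, gamma=0.95):
    """Root sampling: one posterior model per simulation (illustrative sketch)."""
    total = defaultdict(float)
    count = defaultdict(int)
    for _ in range(n_sims):
        model = sample_model()             # one MDP hypothesis for this simulation
        first = random.choice(actions)     # candidate first action to evaluate
        s, a = state, first
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            s, r = model(s, a)             # step inside the sampled model
            ret += disc * r
            disc *= gamma
            a = random.choice(actions)     # uniform rollout policy (a simplification)
        total[first] += ret
        count[first] += 1
    return max(actions, key=lambda a: total[a] / max(count[a], 1))
```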

A Bayesian Sampling Approach to Exploration in Reinforcement Learning

TLDR
This work presents a modular approach to reinforcement learning that uses a Bayesian representation of the uncertainty over models and achieves near-optimal reward with high probability with a sample complexity that is low relative to the speed at which the posterior distribution converges during learning.

Model-Based Bayesian Reinforcement Learning in Large Structured Domains

TLDR
A Bayesian framework for learning the structure and parameters of a dynamical system, while also simultaneously planning a (near-)optimal sequence of actions is proposed.

Bayesian sparse sampling for on-line reward optimization

TLDR
The idea is to grow a sparse lookahead tree intelligently by exploiting information in a Bayesian posterior, rather than enumerating action branches (standard sparse sampling) or compensating myopically (value of perfect information).
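
For contrast with the posterior-guided tree growth described above, here is a minimal sketch of the standard sparse-sampling lookahead mentioned in parentheses, with successor states drawn from a posterior predictive rather than a known model; the `predictive(s, a)` interface and the branching constants are illustrative assumptions.

```python
def sparse_value(state, predictive, actions, depth, width=3, gamma=0.95):
    """Standard sparse sampling: enumerate actions, sample a few outcomes each."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:                          # every action branch is enumerated
        total = 0.0
        for _ in range(width):                 # a handful of sampled successor states
            s_next, r = predictive(state, a)   # draw from the posterior predictive
            total += r + gamma * sparse_value(s_next, predictive, actions,
                                              depth - 1, width, gamma)
        best = max(best, total / width)
    return best
```

The Bayesian sparse-sampling approach replaces this uniform expansion with posterior-informed choices about which branches to grow.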

Bayesian Policy Search with Policy Priors

TLDR
This work casts Markov Chain Monte Carlo as a stochastic, hill-climbing policy search algorithm that learns to learn a structured policy efficiently and shows how inference over the latent variables in these policy priors enables intra- and intertask transfer of abstract knowledge.

A Bayesian Framework for Reinforcement Learning

TLDR
It is proposed that the learning process estimate the full posterior distribution over models online; to determine behavior, a hypothesis is sampled from this distribution and the greedy policy with respect to that hypothesis is obtained by dynamic programming.
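
A minimal sketch of that sample-then-solve loop under stated assumptions: one MDP hypothesis is drawn from the posterior and its greedy policy is computed by value iteration. The Dirichlet posterior over transition rows and the function names are illustrative, not taken from the paper.

```python
import numpy as np

def sample_mdp(transition_counts, reward_means, rng=None):
    """One posterior draw: an independent Dirichlet over each (state, action) row."""
    rng = rng or np.random.default_rng()
    S, A, _ = transition_counts.shape
    P = np.array([[rng.dirichlet(transition_counts[s, a]) for a in range(A)]
                  for s in range(S)])
    return P, reward_means

def greedy_policy(P, R, gamma=0.95, tol=1e-6):
    """Dynamic programming (value iteration) on the sampled MDP.
    P: (S, A, S) transition probabilities, R: (S, A) expected rewards."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V              # (S, A) action values
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)        # greedy action for every state
        V = V_new
```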

(More) Efficient Reinforcement Learning via Posterior Sampling

TLDR
An $\tilde{O}(\tau S \sqrt{AT})$ bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

Bandit Based Monte-Carlo Planning

TLDR
A new algorithm, UCT, is introduced that applies bandit ideas to guide Monte-Carlo planning; it is shown to be consistent, and finite-sample bounds are derived on the estimation error due to sampling.
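
To make the bandit connection concrete, the following is a minimal sketch of the UCB1 rule that UCT applies at each node of the search tree, picking the child with the best mean value plus an exploration bonus; the child-node representation and the exploration constant are illustrative assumptions.

```python
import math

def uct_select(children, c=math.sqrt(2)):
    """children: list of dicts with 'visits' and 'total_value' (illustrative)."""
    parent_visits = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")            # try unvisited children first
        mean = ch["total_value"] / ch["visits"]
        return mean + c * math.sqrt(math.log(parent_visits) / ch["visits"])
    return max(children, key=score)
```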

Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-like Exploration

TLDR
An implementation of model-based online reinforcement learning for continuous domains with deterministic transitions, specifically designed to achieve low sample complexity, is presented; it implements the "optimism in the face of uncertainty" principle.