Jonathan Baxter

Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate …
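For illustration, a minimal sketch of a GPOMDP-style estimator of the average-reward gradient, assuming a generic environment exposing reset()/step() methods and a tabular softmax policy; the interface and names here (env, gpomdp_estimate, theta) are illustrative assumptions, not taken from the paper:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gpomdp_estimate(env, theta, beta=0.95, T=10000, seed=0):
    # theta: (n_observations, n_actions) parameters of a tabular softmax policy.
    # beta:  discount factor trading bias against variance of the estimate.
    rng = np.random.default_rng(seed)
    z = np.zeros_like(theta)      # eligibility trace z_t
    delta = np.zeros_like(theta)  # running gradient estimate Delta_t
    obs = env.reset()             # assumed: returns an integer observation index
    for t in range(T):
        probs = softmax(theta[obs])
        action = rng.choice(len(probs), p=probs)
        grad_log = np.zeros_like(theta)   # gradient of log pi(action | obs, theta)
        grad_log[obs] = -probs
        grad_log[obs, action] += 1.0
        obs, reward = env.step(action)    # assumed: returns (next observation, reward)
        z = beta * z + grad_log           # discounted eligibility trace
        delta += (reward * z - delta) / (t + 1)   # running average of reward * trace
    return delta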
A major problem in machine learning is that of inductive bias: how to choose a learner’s hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonably-sized training sets. Typically such bias is supplied by hand through the skill and insights of experts. In …
Much recent attention, both experimental and theoretical, has been focussed on classification algorithms which produce voted combinations of classifiers. Recent theoretical work has shown that the impressive generalization performance of algorithms like AdaBoost can be attributed to the classifier having large margins on the training data. We present an …
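As a small illustration of the central quantity, a sketch of the normalized margin of a voted combination of binary base classifiers (labels and base-classifier outputs in {-1, +1}); the classifiers and weights stand in for the output of a boosting run such as AdaBoost and are assumptions made here for illustration:

import numpy as np

def voted_margins(classifiers, weights, X, y):
    # Margin of the voted classifier f(x) = sum_t w_t h_t(x) / sum_t |w_t| on each
    # example, i.e. y * f(x).  Large positive margins mean the example is classified
    # correctly by a confident weighted vote.
    w = np.asarray(weights, dtype=float)
    votes = np.array([h(X) for h in classifiers])   # shape (num_classifiers, num_examples)
    f = (w[:, None] * votes).sum(axis=0) / np.abs(w).sum()
    return y * f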
In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. The algorithm’s chief …
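A minimal sketch of the resulting ascent loop, with the gradient estimator passed in as a function (for instance, a GPOMDP-style estimator); the fixed step size and iteration count are illustrative assumptions rather than the step-size and search procedures analysed in the paper:

def policy_gradient_ascent(estimate_gradient, theta, n_iters=100, step_size=0.1):
    # estimate_gradient(theta) -> a (possibly biased) estimate of the gradient of
    # the average reward at the policy parameters theta.
    for _ in range(n_iters):
        theta = theta + step_size * estimate_gradient(theta)
    return theta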
A Bayesian model of learning to learn by sampling from multiple tasks is presented. The multiple tasks are themselves generated by sampling from a distribution over an environment of related tasks. Such an environment is shown to be naturally modelled within a Bayesian context by the concept of an objective prior distribution. It is argued that for many …
We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths, …
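A minimal sketch of the baseline idea in one common form, where a state-independent baseline is subtracted from each sampled return: the expected gradient estimate is unchanged while its variance can be reduced; the constant mean-return baseline used here is just one illustrative choice:

import numpy as np

def baselined_gradient(grad_logs, returns, baseline=None):
    # grad_logs: one grad-log-probability array per sampled trajectory (or step).
    # returns:   the matching sampled returns.
    returns = np.asarray(returns, dtype=float)
    if baseline is None:
        baseline = returns.mean()   # simple constant baseline
    g = sum(gl * (G - baseline) for gl, G in zip(grad_logs, returns))
    return g / len(grad_logs)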
Despite their many empirical successes, approximate value-function based approaches to reinforcement learning suffer from a paucity of theoretical guarantees on the performance of the policy generated by the value-function. In this paper we pursue an alternative approach: first compute the gradient of the average reward with respect to the …
Probably the most important problem in machine learning is the preliminary biasing of a learner’s hypothesis space so that it is small enough to ensure good generalisation from reasonable training sets, yet large enough that it contains a good solution to the problem being learnt. In this paper a mechanism for automatically learning or biasing the …