- Published 2003 in IJCAI

We address the problem of optimally controlling stochastic environments that are partially observable. The standard method for tackling such problems is to define and solve a Partially Observable Markov Decision Process (POMDP). However, it is well known that exactly solving POMDPs is very costly computationally. Recently, Littman, Sutton and Singh (2002) have proposed an alternative representation of partially observable environments, called predictive state representations (PSRs). PSRs are grounded in the sequence of actions and observations of the agent, and hence relate the state representation directly to the agent's experience. In this paper, we present a policy iteration algorithm for finding policies using PSRs. In preliminary experiments, our algorithm produced good solutions.

1 Predictive State Representation

We assume that we are given a system consisting of a discrete, finite set of $n$ states $S$, a discrete, finite set of actions $A$, and a discrete, finite set of observations $O$. The interaction with the system takes place at discrete time intervals. The initial state of the system $s_0$ is drawn from an initial probability distribution over states $I$. On every time step $t$, an action $a_t$ is chosen according to some policy. Then the underlying state changes to $s_{t+1}$ and a next observation $o_{t+1}$ is generated. The system is Markovian, in the sense that for every action $a$, the transition to the next state is generated according to a probability distribution described by an $(n \times n)$ transition matrix $T^a$. Similarly, for a given observation $o$ and action $a$, the next observation is generated according to an $(n \times n)$ diagonal observation matrix $O^{a,o}$, where $O^{a,o}_{ii}$ is the probability of observation $o$ when action $a$ is selected and state $i$ is reached. Since we are interested in optimal control, rather than prediction, we also assume that there exists a set of reward vectors $R^a$ for each action $a$, where $R^a_i$ is the reward for taking action $a$ in underlying state $i$.

PSRs are based on the notion of tests.
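To make the model described above concrete, the following is a minimal sketch of its generative process, assuming the transition matrices $T^a$ and diagonal observation matrices $O^{a,o}$ are stored as NumPy arrays in dictionaries. The function name `sample_trajectory`, the dictionary layout, and the two-state example (actions `"stay"` and `"flip"`, observations `0` and `1`) are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def sample_trajectory(I, T, O, actions, rng):
    """Sample hidden states and observations for a fixed action sequence.

    I         : (n,) initial state distribution
    T[a]      : (n, n) matrix, T[a][i, j] = P(next state j | state i, action a)
    O[(a, o)] : (n, n) diagonal matrix, entry (j, j) = P(o | action a, state j reached)
    """
    obs_labels = sorted({o for (_, o) in O})
    state = int(rng.choice(len(I), p=I))                  # s_0 ~ I
    states, observations = [state], []
    for a in actions:
        # s_{t+1} is drawn from row s_t of the transition matrix T^a
        state = int(rng.choice(len(I), p=T[a][state]))
        # o_{t+1} is drawn using the diagonal entries of O^{a,o} at the new state
        obs_probs = [O[(a, o)][state, state] for o in obs_labels]
        obs = obs_labels[int(rng.choice(len(obs_labels), p=obs_probs))]
        states.append(state)
        observations.append(obs)
    return states, observations

# Hypothetical two-state system (assumption, for illustration only):
# "stay" keeps the state, "flip" swaps it, and the observation reports
# the reached state correctly with probability 0.85.
I = np.array([1.0, 0.0])
T = {"stay": np.eye(2), "flip": np.array([[0.0, 1.0], [1.0, 0.0]])}
O = {(a, o): np.diag([0.85, 0.15] if o == 0 else [0.15, 0.85])
     for a in T for o in (0, 1)}

rng = np.random.default_rng(0)
states, observations = sample_trajectory(I, T, O, ["flip", "stay", "flip"], rng)
print(states, observations)
```

In this toy system the transitions are deterministic, so only the observations vary between runs; the agent never sees `states`, which is what makes the environment partially observable.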
A test is an ordered sequence of action-observation pairs $q = a^1 o^1 \ldots a^k o^k$. The prediction for test $q$ is the probability of the sequence of observations $o^1 \ldots o^k$ being generated, given the sequence of actions $a^1 \ldots a^k$. The prediction for a test $q$ given prior history $h$, denoted $p(q \mid h)$, is the probability of seeing the sequence of observations of $q$ after seeing history $h$ and taking the sequence of actions specified by $q$. For any set of tests $Q = \{q_1, \ldots, q_k\}$, its prediction vector is:

$$p(Q \mid h) = [p(q_1 \mid h), p(q_2 \mid h), \ldots, p(q_k \mid h)]$$

A set of tests $Q$ is a PSR if its prediction vector forms a sufficient statistic for the dynamical system, i.e., if all tests can be predicted based on $p(Q \mid h)$. Of particular interest is the case of linear PSRs, in which there exists a projection vector $m_q$ for any test $q$ such that $p(q \mid h) = p(Q \mid h)^T m_q$. Littman et al. also define an outcome function $u$ mapping tests into $n$-dimensional vectors, defined recursively by $u(\varepsilon) = e_n^T$ and $u(aoq) = T^a O^{a,o} u(q)$, where $\varepsilon$ represents the null test and $e_n$ is the $(1 \times n)$ vector of all 1s. Each component $u_i(q)$ indicates the probability of the test $q$ when its sequence of actions is applied from state $s_i$. A set of tests $Q = \{q_i \mid i = 1, 2, \ldots, k\}$ is called linearly independent if the outcome vectors of its tests are linearly independent. Using this definition, such a set $Q$ can be found by a simple search algorithm in polynomial time, given the POMDP model of the environment. Littman, Sutton and Singh (2002) showed that the outcome vectors of the tests in $Q$ can be linearly combined to produce the outcome vector for any test.

2 Policy evaluation using PSRs

We assume that we are given a policy $\pi$ and that the initial state of the system, $s_0$, is drawn according to the starting probability distribution $I$. If we consider a given horizon $t$, only a finite number of tests of length $t$ are possible when starting from $I$. Let $Q^t$ be this set of possible tests.
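As an illustration of how the set of possible tests of a given length might be enumerated, the sketch below computes outcome vectors with the recursion $u(\varepsilon) = e_n^T$, $u(aoq) = T^a O^{a,o} u(q)$ and keeps every test whose prediction from the start distribution, $I\,u(q)$, is non-zero. The function names, the dictionary-based model layout, and the two-state example are assumptions made for this sketch; brute-force enumeration is exponential in the horizon and is not presented as the paper's procedure.

```python
from itertools import product
import numpy as np

def outcome_vector(test, T, O, n):
    """u(q): entry i is the probability of q's observations when q's
    actions are executed starting from state i."""
    u = np.ones(n)                       # u(epsilon) = e_n^T
    for a, o in reversed(test):          # u(aoq) = T^a O^{a,o} u(q)
        u = T[a] @ O[(a, o)] @ u
    return u

def possible_tests(I, T, O, horizon, tol=1e-12):
    """Map each test of length `horizon` with non-zero prediction from I
    to that prediction, p(q | I) = I u(q)."""
    actions = sorted(T)
    observations = sorted({o for (_, o) in O})
    pairs = list(product(actions, observations))
    result = {}
    for test in product(pairs, repeat=horizon):
        p = float(I @ outcome_vector(test, T, O, len(I)))
        if p > tol:
            result[test] = p
    return result

# Hypothetical two-state system (assumption, for illustration only).
I = np.array([1.0, 0.0])
T = {"stay": np.eye(2), "flip": np.array([[0.0, 1.0], [1.0, 0.0]])}
O = {(a, o): np.diag([0.85, 0.15] if o == 0 else [0.15, 0.85])
     for a in T for o in (0, 1)}

tests = possible_tests(I, T, O, horizon=2)
for test, p in tests.items():
    print(test, round(p, 4))
```

For any fixed action sequence the predictions of its tests sum to one, which gives a quick sanity check on a model specified this way.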
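The value of a memoryless policy $\pi$ with respect to a given start state distribution $I$ is the expected return over all possible tests that can occur when the starting state is drawn from $I$ and then behavior is generated according to policy $\pi$:

$$V^\pi(I) = \sum_{q \in Q^t} P(q \mid I, \pi)\, V(q \mid I, \pi)$$

where $P(q \mid I, \pi)$ is the probability that test $q$ occurs and $V(q \mid I, \pi)$ is the expected return for test $q$ given that the initial state is drawn from $I$ and policy $\pi$ is followed.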
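The sum over tests can be sketched by brute force when a POMDP model is available. The code below is an illustration under several assumptions that are not taken from the paper: the memoryless policy is represented as `policy[prev_obs][action]` (with `prev_obs = None` before the first observation), the return is the undiscounted sum of rewards over the horizon, and each test's contribution is computed with forward and backward (outcome) vectors. The function name `policy_value` and the two-state example are hypothetical.

```python
from itertools import product
import numpy as np

def policy_value(I, T, O, R, policy, horizon):
    """Expected undiscounted return of a memoryless policy, computed as a
    sum over all tests of length `horizon`."""
    actions = sorted(T)
    observations = sorted({o for (_, o) in O})
    pairs = list(product(actions, observations))
    n = len(I)
    value = 0.0
    for test in product(pairs, repeat=horizon):
        # Probability that the policy selects exactly this test's actions,
        # given that the test's observations are the ones seen.
        pi_prob, prev_obs = 1.0, None
        for a, o in test:
            pi_prob *= policy[prev_obs][a]
            prev_obs = o
        if pi_prob == 0.0:
            continue
        # fwd[k][i] = P(first k observations, state i after k steps | I, actions)
        fwd = [np.asarray(I, dtype=float)]
        for a, o in test:
            fwd.append(fwd[-1] @ T[a] @ O[(a, o)])
        # bwd[k] = outcome vector u of the suffix of the test starting at step k
        bwd = [np.ones(n)]
        for a, o in reversed(test):
            bwd.insert(0, T[a] @ O[(a, o)] @ bwd[0])
        # P(q | I) times the expected sum of rewards given that q occurred
        weighted_return = sum(
            float(np.sum(fwd[k] * R[a] * bwd[k])) for k, (a, _) in enumerate(test)
        )
        value += pi_prob * weighted_return
    return value

# Hypothetical two-state system and rewards (assumption, for illustration only).
I = np.array([1.0, 0.0])
T = {"stay": np.eye(2), "flip": np.array([[0.0, 1.0], [1.0, 0.0]])}
O = {(a, o): np.diag([0.85, 0.15] if o == 0 else [0.15, 0.85])
     for a in T for o in (0, 1)}
R = {"stay": np.array([1.0, 0.0]), "flip": np.array([0.0, 2.0])}
uniform = {prev: {"stay": 0.5, "flip": 0.5} for prev in (None, 0, 1)}

value = policy_value(I, T, O, R, uniform, horizon=2)
print(value)
```

The number of tests grows exponentially with the horizon, so this enumeration is only practical for very small systems; it is meant to show what the sum ranges over, not how to evaluate policies efficiently.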

@inproceedings{Izadi2003APA,
title={A Planning Algorithm for Predictive State Representations},
author={Masoumeh T. Izadi and Doina Precup},
booktitle={IJCAI},
year={2003}
}