Efficient PAC-learning for episodic tasks with acyclic state spaces and the optimal node visitation problem in acyclic stochastic digraphs

This paper considers the problem of computing an optimal policy for a Markov Decision Process (MDP), under lack of complete a priori knowledge of (i) the branching probability distributions determining the evolution of the process state upon the execution of the different actions, and (ii) the probability distributions characterizing the immediate rewards…