Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding


On large problems, reinforcement learning systems must use parame-terized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparse-coarse-coded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes (" rollouts "), as in classical Monte Carlo methods, and as in the TD(λ) algorithm when λ = 1. However, in our experiments this always resulted in substantially poorer performance. We conclude that reinforcement learning can work robustly in conjunction with function approximators, and that there is little justification at present for avoiding the case of general λ. Reinforcement learning is a broad class of optimal control methods based on estimating Many of these methods, e.g., dynamic programming and temporal-difference learning, build their estimates in part on the basis of other estimates. This may be worrisome because, in practice, the estimates never become exact; on large problems, parameterized function approximators such as neural networks must be used. Because the estimates are imperfect, and because they in turn are used as the targets for other estimates, it seems possible that the ultimate result might be very poor estimates, or even divergence. Indeed some such methods have been shown to be unstable in theory What are the key requirements of a method or task in order to obtain good performance? The experiments in this paper are part of narrowing the answer to this question. The reinforcement learning methods we use are variations of the sarsa algorithm (Rum-, except applied to state-action pairs instead of states, and where the predictions are used as the basis for selecting actions. The learning agent estimates action-values, Q π (s, a), defined as the expected future reward starting in state s, taking action a, and thereafter following policy π. These are estimated …

Extracted Key Phrases

5 Figures and Tables

Showing 1-10 of 471 extracted citations
Citations per Year

922 Citations

Semantic Scholar estimates that this publication has received between 777 and 1,093 citations based on the available data.

See our FAQ for additional information.