Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic


Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is the high sample complexity of such methods. Unbiased batch policy-gradient methods offer stable learning, but at the cost of high variance, which often requires large batches, while TD…





Semantic Scholar estimates that this publication has 72 citations based on the available data.
