Using Policy Gradients to Account for Changes in Behaviour Policies Using Policy Gradients to Account for Changes in Behaviour Policies under Off-policy Control

Abstract

Off-policy learning refers to the problem of learning the value function of a behaviour, or policy, while selecting actions with a different policy. Gradient-based off-policy learning algorithms, such as GTD (Sutton et al., 2009b) and TDC/GQ (Sutton et al., 2009a), converge when selecting actions with a fixed policy even when using function approximation… (More)

3 Figures and Tables

Topics

  • Presentations referencing similar topics