• Corpus ID: 10513082

Off-Policy Actor-Critic

Thomas Degris, Martha White, Richard S. Sutton
This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and… 
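The per-step structure described above (online, incremental, cost linear in the number of learned weights) can be sketched roughly as follows. The linear features, softmax policy parameterization, plain TD(0) critic, and step sizes are illustrative assumptions for this sketch, not the paper's exact Off-PAC updates (which use gradient-TD methods and eligibility traces):

```python
import numpy as np

def softmax_policy(theta, x):
    # theta: (n_actions, d) policy weights; x: (d,) feature vector
    prefs = theta @ x
    prefs -= prefs.max()               # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def off_policy_ac_step(theta, w, x, a, r, x_next, b_prob,
                       alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    """One online update from a transition generated by a behavior policy.

    b_prob: probability the behavior policy assigned to the taken action a.
    Every operation below is linear in the number of learned weights.
    """
    pi = softmax_policy(theta, x)
    rho = pi[a] / b_prob               # importance-sampling ratio

    # Critic: TD(0) on a linear state-value function v(x) = w @ x
    delta = r + gamma * (w @ x_next) - (w @ x)
    w = w + alpha_critic * rho * delta * x

    # Actor: importance-weighted policy-gradient step using the TD error.
    # grad log pi(a|x) for a softmax-linear policy: x on row a, minus pi-weighted x
    grad_log = -np.outer(pi, x)
    grad_log[a] += x
    theta = theta + alpha_actor * rho * delta * grad_log
    return theta, w
```

Each transition comes from the behavior policy; the ratio rho reweights both updates so that, in expectation, they follow the target policy's action distribution.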


Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm
This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency, and illustrates the benefit of the proposed off-policy natural gradient algorithm by comparing it with the vanilla gradient actor-critic algorithm on benchmark RL tasks.
Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation
This work presents the first class of policy-gradient algorithms that work with both state-value and policy function-approximation, and are guaranteed to converge to its optimal solution, while maintaining all the desirable properties of classical Actor-Critic methods with no additional hyper-parameters.
Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality
This paper develops a doubly robust off-policy AC (DR-Off-PAC) for discounted MDPs, which can take advantage of learned nuisance functions to reduce estimation errors, and establishes the first overall sample complexity analysis for a single-time-scale off-policy AC algorithm.
Off-Policy Correction for Actor-Critic Algorithms in Deep Reinforcement Learning
An alternative off-policy correction algorithm for continuous action spaces, Actor-Critic Off-Policy Correction (AC-Off-POC), mitigates the potential drawbacks introduced by previously collected data and can attain faster convergence and optimal policies by disregarding transitions executed by behavioral policies that deviate strongly from the current policy.
Guiding Evolutionary Strategies with Off-Policy Actor-Critic
Across a wide range of benchmark control tasks, it is shown that CEM-ACER balances the strengths of CEM and ACER, leading to an algorithm that consistently outperforms its individual building blocks, as well as other competitive baseline algorithms.
Noisy Importance Sampling Actor-Critic: An Off-Policy Actor-Critic With Experience Replay
Noisy Importance Sampling Actor-Critic (NISAC) is a set of empirically validated modifications to the advantage actor-critic algorithm (A2C) that allow off-policy reinforcement learning and increased performance.
An Off-policy Policy Gradient Theorem Using Emphatic Weightings
This work develops a new actor-critic algorithm called Actor Critic with Emphatic weightings (ACE) that approximates the simplified gradients provided by the theorem, and demonstrates in a simple counterexample that previous off-policy policy gradient methods converge to the wrong solution whereas ACE finds the optimal solution.
Combining policy gradient and Q-learning
A new technique is described that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer, and establishing an equivalency between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms.
Parameter-based Value Functions
This work introduces a class of value functions called Parameter-based Value Functions (PVFs) whose inputs include the policy parameters that can generalize across different policies and shows how learned PVFs can zero-shot learn new policies that outperform any policy seen during training.


Toward Off-Policy Learning Control with Function Approximation
The Greedy-GQ algorithm is an extension of recent work on gradient temporal-difference learning to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function.
Natural actor-critic algorithms
Natural Actor-Critic
Least-Squares Policy Iteration
The new algorithm, least-squares policy iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental policy improvement within a policy-iteration framework.
Off-policy Learning with Recognizers
A new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods is introduced, and a full algorithm for linear function approximation is developed and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence.
Off-policy Learning with Options and Recognizers
A new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods is introduced and it is proved that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence.
Experiments in Off-policy Reinforcement Learning with the GQ(λ) Algorithm
Overall, this work finds GQ(λ) to be a promising algorithm for use with large real-world continuous learning tasks, and believes it could be the base algorithm of an autonomous sensorimotor robot.
A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation
The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm; the algorithm is proved stable and convergent, under the usual stochastic approximation conditions, to the same least-squares solution found by LSTD, but without LSTD's quadratic computational complexity.
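As a rough sketch (not the paper's exact pseudocode; the step sizes and linear-feature setup are assumptions), the O(n) update described above maintains a second weight vector that tracks the expected TD update and descends the gradient of its squared norm:

```python
import numpy as np

def gtd_step(theta, u, x, r, x_next, alpha=0.05, beta=0.05, gamma=0.9):
    """One O(n) stochastic GTD-style update with linear features.

    theta: value-function weights; u: auxiliary estimate of the
    expected TD-update vector E[delta * x].
    """
    # TD(0) error for the linear value function v(x) = theta @ x
    delta = r + gamma * (theta @ x_next) - (theta @ x)
    # Main weights move along (x - gamma * x_next), scaled by u's projection on x
    theta = theta + alpha * (x - gamma * x_next) * (u @ x)
    # Auxiliary weights track delta * x in expectation
    u = u + beta * (delta * x - u)
    return theta, u
```

The auxiliary vector u estimates the expected TD-update direction; minimizing the norm of that expectation is what keeps the method stable under off-policy sampling at only linear cost per step.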
Reinforcement Learning in Continuous Time and Space
  • K. Doya, Neural Computation, 2000
This article presents a reinforcement learning framework for continuous-time dynamical systems, without a priori discretization of time, state, and action, based on the Hamilton-Jacobi-Bellman (HJB) equation.
Policy Gradient Methods for Reinforcement Learning with Function Approximation
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.