• Corpus ID: 16073320

Linear Off-Policy Actor-Critic

  Thomas Degris, Martha White, and Richard S. Sutton
This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and… 
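The linear per-time-step complexity can be illustrated with a minimal sketch of an off-policy actor-critic update using linear features, a softmax target policy, and a per-step importance sampling ratio. All names, step sizes, and the simplified TD(0) critic below are illustrative assumptions; the paper's Off-PAC algorithm uses a gradient-TD critic and eligibility traces rather than this plain critic.

```python
import numpy as np

n_features, n_actions = 4, 2
v_w = np.zeros(n_features)                  # critic weights (linear value function)
theta = np.zeros((n_actions, n_features))   # actor weights (softmax-linear policy)
alpha_v, alpha_u, gamma = 0.1, 0.01, 0.9    # illustrative step sizes and discount

def target_pi(x):
    """Softmax target policy over linear action preferences."""
    prefs = theta @ x
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def step(x, a, r, x_next, b_prob):
    """One sketched off-policy actor-critic update (lambda = 0).

    rho = pi(a|x) / b(a|x) corrects for acting under the behavior
    policy b; delta is the one-step TD error of the linear critic.
    Both updates touch each weight once, hence O(n) per step.
    """
    global v_w, theta
    pi = target_pi(x)
    rho = pi[a] / b_prob
    delta = r + gamma * (v_w @ x_next) - v_w @ x
    v_w += alpha_v * rho * delta * x            # critic: TD(0) scaled by rho
    grad_log = -np.outer(pi, x)                 # actor: grad of log pi(a|x)
    grad_log[a] += x                            # for a softmax-linear policy
    theta += alpha_u * rho * delta * grad_log
    return delta
```

Every quantity above is a vector or matrix of fixed size, so no operation exceeds O(number of weights) per time step, matching the complexity claim in the abstract.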


The Actor-Advisor: Policy Gradient With Off-Policy Advice
This paper proposes an elegant solution, the Actor-Advisor architecture, in which a Policy Gradient actor learns from unbiased Monte-Carlo returns, while being shaped (or advised) by the Softmax policy arising from an off-policy critic.
Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
This work proposes Bootstrapped Dual Policy Iteration (BDPI), a novel model-free actor-critic reinforcement-learning algorithm for continuous states and discrete actions, with off-policy critics, which is remarkably stable and, contrary to other state-of-the-art algorithms, unusually forgiving of poorly-configured hyper-parameters.
Stable, Practical and On-line Bootstrapped Dual Policy Iteration
This work considers on-line model-free reinforcement learning with discrete actions and proposes a new non-parametric actor learning rule, Dual Policy Iteration (DPI), that motivates the use of aggressive off-policy critics.
Relative Importance Sampling For Off-Policy Actor-Critic in Deep Reinforcement Learning
This work presents the first relative importance sampling off-policy actor-critic (RIS-Off-PAC) model-free algorithms in RL, which use the action value generated by the behavior policy in the reward function to train the algorithm, rather than that from the target policy.
Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch
In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting without using density ratio to correct the discrepancy between…
On the Study of Cooperative Multi-Agent Policy Gradient
This work investigates the foundations of policy gradient methods within the centralized training for decentralized control (CTDC) paradigm, and establishes policy gradient theorem and compatible function approximations for decentralized multi-agent systems.
Hybrid Policy Gradient for Deep Reinforcement Learning
H-DDPG achieves higher reward than DDPG by pushing the policy parameters in a direction such that actions with higher reward become more likely to occur than the others; in the hybrid update, the policy gradients are weighted by the TD-error.
A Survey on Policy Search Algorithms for Learning Robot Controllers in a Handful of Trials
This article shows that a first strategy is to leverage prior knowledge of the policy structure, and a second is to create data-driven surrogate models of the expected reward or the dynamical model, so that the policy optimizer queries the model instead of the real system.
Regularized Off-Policy TD-Learning
A novel l1-regularized off-policy convergent TD-learning method, which is able to learn sparse representations of value functions with low computational complexity, is presented.
Introduction to Reinforcement Learning
This chapter introduces the fundamentals of classical reinforcement learning, including the agent, environment, action, and state, as well as the reward function, and introduces the Markov process, which produces the core results used in most reinforcement learning methods: the Bellman equations.


Toward Off-Policy Learning Control with Function Approximation
The Greedy-GQ algorithm is an extension of recent work on gradient temporal-difference learning to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function.
Natural actor-critic algorithms
Natural Actor-Critic
A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation
The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm, and it is proved that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without LSTD's quadratic computational complexity.
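The O(n) cost can be seen from a sketch of the GTD(0) update, which maintains a second weight vector estimating the expected TD update. Variable names and step sizes below are assumptions for illustration, not the paper's notation.

```python
import numpy as np

n = 3
w = np.zeros(n)   # value-function weights (linear approximation)
u = np.zeros(n)   # auxiliary weights estimating the expected TD update
alpha, beta, gamma = 0.05, 0.1, 0.9   # illustrative step sizes and discount

def gtd0_update(phi, r, phi_next):
    """One GTD(0) step: each line touches every weight at most once,
    so the per-step cost is O(n) in the number of features."""
    global w, u
    delta = r + gamma * (w @ phi_next) - w @ phi      # TD(0) error
    w += alpha * (phi - gamma * phi_next) * (phi @ u) # descend norm of expected update
    u += beta * (delta * phi - u)                     # track E[delta * phi]
    return delta
```

Compared with LSTD, which maintains and solves an n-by-n system, this keeps only two length-n vectors, which is the complexity advantage the snippet refers to.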
Reinforcement Learning in Continuous Time and Space
  • K. Doya
  • Computer Science
    Neural Computation
  • 2000
This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB)…
Policy Gradient Methods for Reinforcement Learning with Function Approximation
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Q-Learning
This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989), showing that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
This work presents a Bellman error objective function and two gradient-descent TD algorithms that optimize it, and proves the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution.
Residual Algorithms: Reinforcement Learning with Function Approximation
Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction
Results using Horde on a multi-sensored mobile robot to successfully learn goal-oriented behaviors and long-term predictions from off-policy experience are presented.