# Provably Convergent Off-Policy Actor-Critic with Function Approximation

@article{Zhang2019ProvablyCO, title={Provably Convergent Off-Policy Actor-Critic with Function Approximation}, author={Shangtong Zhang and Bo Liu and Hengshuai Yao and Shimon Whiteson}, journal={ArXiv}, year={2019}, volume={abs/1911.04384} }

We present the first provably convergent off-policy actor-critic algorithm (COF-PAC) with function approximation in a two-timescale form. Key to COF-PAC is the introduction of a new critic, the emphasis critic, which is trained via Gradient Emphasis Learning (GEM), a novel combination of the key ideas of Gradient Temporal Difference Learning and Emphatic Temporal Difference Learning. With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC…

#### 6 Citations

Improving Sample Complexity Bounds for Actor-Critic Algorithms

- Computer Science, Mathematics
- ArXiv
- 2020

This study develops several novel techniques for finite-sample analysis of RL algorithms including handling the bias error due to mini-batch Markovian sampling and exploiting the self variance reduction property to improve the convergence analysis of NAC.

A Unified Off-Policy Evaluation Approach for General Value Function

- Computer Science, Mathematics
- ArXiv
- 2021

A new algorithm called GenTD is proposed for off-policy GVF evaluation and it is shown that GenTD learns multiple interrelated multi-dimensional GVFs as efficiently as a single canonical scalar value function.

Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality

- Computer Science, Mathematics
- ICML
- 2021

This paper develops a doubly robust off-policy AC (DR-Off-PAC) for discounted MDPs, which can take advantage of learned nuisance functions to reduce estimation errors, and establishes the first overall sample complexity analysis for a single time-scale off-policy AC algorithm.

A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic

- Computer Science, Mathematics
- ArXiv
- 2020

These are the first convergence rate results for using nonlinear TTSA algorithms on the concerned class of bilevel optimization problems and it is shown that a two-timescale actor-critic proximal policy optimization algorithm can be viewed as a special case of the framework.

GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values

- Computer Science, Mathematics
- ICML
- 2020

GradientDICE optimizes a different objective, derived using the Perron-Frobenius theorem, and eliminates GenDICE's use of divergence. Nonlinearity in the parameterization is therefore not necessary, and GradientDICE is provably convergent under linear function approximation.

Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

- Computer Science
- NeurIPS
- 2020

This is the first theoretical study establishing that AC and NAC attain orderwise performance improvement over PG and NPG under infinite horizon due to the incorporation of the critic.

#### References

SHOWING 1-10 OF 59 REFERENCES

Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation

- Computer Science
- ArXiv
- 2018

This work presents the first class of policy-gradient algorithms that work with both state-value and policy function approximation and are guaranteed to converge to their optimal solution, while maintaining all the desirable properties of classical Actor-Critic methods with no additional hyper-parameters.

Generalized Off-Policy Actor-Critic

- Computer Science, Mathematics
- NeurIPS
- 2019

The Generalized Off-Policy Policy Gradient Theorem is proved to compute the policy gradient of the counterfactual objective and an emphatic approach is used to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm.

Off-Policy Actor-Critic

- Computer Science
- ICML 2012
- 2012

This paper derives an incremental, linear time and space complexity algorithm that includes eligibility traces, proves convergence under assumptions similar to previous off-policy algorithms, and empirically shows better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.

An Off-policy Policy Gradient Theorem Using Emphatic Weightings

- Computer Science, Mathematics
- NeurIPS
- 2018

This work develops a new actor-critic algorithm called Actor Critic with Emphatic weightings (ACE) that approximates the simplified gradients provided by the theorem, and demonstrates in a simple counterexample that previous off-policy policy gradient methods converge to the wrong solution whereas ACE finds the optimal solution.

Policy Gradient Methods for Reinforcement Learning with Function Approximation

- Mathematics, Computer Science
- NIPS
- 1999

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

- Computer Science, Mathematics
- NIPS
- 2008

The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its $L_2$ norm. The paper proves that the expected update of GTD is in the direction of the gradient, assuring convergence under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without its quadratic computational complexity.
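The two-part structure described in this summary (an auxiliary estimate of the expected TD(0) update, plus a descent step on its norm) can be sketched for linear value functions as follows. This is a minimal illustration, assuming the commonly stated linear GTD updates; the function name `gtd_step`, the helper `dot`, and the step-size names `alpha` and `beta` are hypothetical choices made here, and off-policy sampling corrections are omitted.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def gtd_step(theta, u, phi, reward, phi_next, gamma, alpha, beta):
    """One GTD-style update for a linear value estimate v(s) = theta . phi(s).

    theta    : value-function weights
    u        : auxiliary weights tracking the expected TD(0) update E[delta * phi]
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    alpha    : step size for theta (assumed name)
    beta     : step size for u (assumed name)
    """
    # TD error under the current weights.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)

    # Descent direction for theta uses the current auxiliary estimate u.
    phi_u = dot(phi, u)
    new_theta = [
        t + alpha * (p - gamma * pn) * phi_u
        for t, p, pn in zip(theta, phi, phi_next)
    ]

    # Move u toward the sampled TD(0) update vector delta * phi.
    new_u = [ui + beta * (delta * p - ui) for ui, p in zip(u, phi)]
    return new_theta, new_u
```

Each call costs time linear in the number of features, which is the O(n) property the title refers to. Starting from zero weights, the first call leaves `theta` unchanged because `u` is still zero; `theta` only begins to move once `u` has accumulated an estimate of the expected update.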

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

- Computer Science, Mathematics
- ICML
- 2018

This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.

Bridging the Gap Between Value and Policy Based Reinforcement Learning

- Computer Science, Mathematics
- NIPS
- 2017

A new RL algorithm, Path Consistency Learning (PCL), is developed that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces and significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.

On Convergence of Emphatic Temporal-Difference Learning

- Mathematics, Computer Science
- COLT
- 2015

This paper presents the first convergence proofs for two emphatic algorithms, ETD and ELSTD, and proves, under general off-policy conditions, the convergence in $L^1$ of the ELSTD iterates and the almost sure convergence of the approximate value functions calculated by both algorithms using a single infinitely long trajectory.

On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

- Mathematics, Computer Science
- ArXiv
- 2017

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of…