Corpus ID: 207852631

Provably Convergent Off-Policy Actor-Critic with Function Approximation

Shangtong Zhang, Bo Liu, Hengshuai Yao, Shimon Whiteson
We present the first provably convergent off-policy actor-critic algorithm (COF-PAC) with function approximation in a two-timescale form. Key to COF-PAC is the introduction of a new critic, the emphasis critic, which is trained via Gradient Emphasis Learning (GEM), a novel combination of the key ideas of Gradient Temporal Difference Learning and Emphatic Temporal Difference Learning. With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC… 
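The two-timescale form mentioned in the abstract (actor and critic updated with stepsize schedules that decay at different rates, so the critic effectively tracks the current policy) can be illustrated with a generic sketch. This is not COF-PAC itself; the toy MDP, stepsize schedules, and tabular parameterization below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 2-action MDP (made up for this sketch).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] transition probabilities
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                  # R[s, a] rewards
              [0.5, 0.0]])
gamma = 0.9

theta = np.zeros((2, 2))   # softmax policy parameters, one per (s, a)
v = np.zeros(2)            # tabular critic (linear with one-hot features)

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for t in range(20000):
    # Two-timescale stepsizes: the critic's schedule decays more slowly,
    # so the critic moves on the "faster" timescale relative to the actor.
    alpha = 0.5 / (1 + t) ** 0.6   # critic stepsize
    beta = 0.1 / (1 + t) ** 0.9    # actor stepsize

    pi = policy(s)
    a = rng.choice(2, p=pi)
    s2 = rng.choice(2, p=P[s, a])
    delta = R[s, a] + gamma * v[s2] - v[s]  # TD error

    v[s] += alpha * delta                   # critic update
    grad_log = -pi                          # grad of log softmax: e_a - pi
    grad_log[a] += 1.0
    theta[s] += beta * delta * grad_log     # actor update
    s = s2
```

In this toy MDP, action 1 in state 0 yields reward 1, so the learned policy should come to prefer it; the separation of the two stepsize decay rates is what the two-timescale analyses in this listing exploit.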


Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality

This paper develops a doubly robust off-policy AC (DR-Off-PAC) for discounted MDPs, which leverages learned nuisance functions to reduce estimation error, and establishes the first overall sample-complexity analysis for a single-timescale off-policy AC algorithm.

Improving Sample Complexity Bounds for Actor-Critic Algorithms

This study develops several novel techniques for finite-sample analysis of RL algorithms, including handling the bias error due to mini-batch Markovian sampling and exploiting the self-variance-reduction property to improve the convergence analysis of NAC.

A Unified Off-Policy Evaluation Approach for General Value Function

A new algorithm called GenTD is proposed for off-policy GVF evaluation and it is shown that GenTD learns multiple interrelated multi-dimensional GVFs as efficiently as a single canonical scalar value function.

Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

This is the first theoretical study establishing that AC and NAC attain order-wise performance improvements over PG and NPG in the infinite-horizon setting, due to the incorporation of a critic.

A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic

These are the first convergence-rate results for nonlinear TTSA algorithms applied to this class of bilevel optimization problems, and it is shown that a two-timescale actor-critic proximal policy optimization algorithm can be viewed as a special case of the framework.

GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values

In GradientDICE, a different objective is optimized by using the Perron-Frobenius theorem and eliminating GenDICE's use of divergence; as a result, nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation.

Adaptive Interest for Emphatic Reinforcement Learning

Adaptive methods that allow the interest function to dynamically vary over states and iterations are investigated and it is shown that adapting the interest is key to provide significant gains.



Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation

This work presents the first class of policy-gradient algorithms that work with both state-value and policy function approximation and are guaranteed to converge to their optimal solutions, while maintaining all the desirable properties of classical actor-critic methods with no additional hyperparameters.

Generalized Off-Policy Actor-Critic

The Generalized Off-Policy Policy Gradient Theorem is proved to compute the policy gradient of the counterfactual objective, and an emphatic approach is used to obtain an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm.

Off-Policy Actor-Critic

This paper derives an incremental algorithm with linear time and space complexity that includes eligibility traces, proves convergence under assumptions similar to previous off-policy algorithms, and empirically shows better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.

Policy Gradient Methods for Reinforcement Learning with Function Approximation

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

An Off-policy Policy Gradient Theorem Using Emphatic Weightings

This work develops a new actor-critic algorithm called Actor-Critic with Emphatic weightings (ACE) that approximates the simplified gradients provided by the theorem, and demonstrates on a simple counterexample that previous off-policy policy gradient methods converge to the wrong solution whereas ACE finds the optimal solution.

A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L_2 norm; its expected update is proven to be in the direction of the gradient, assuring convergence under the usual stochastic-approximation conditions to the same least-squares solution as found by LSTD, but without its quadratic computational complexity.
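The GTD(0) update just described (maintain an auxiliary estimate w of the expected TD update and descend the gradient of its squared norm) can be sketched under linear function approximation. The five-state chain, behavior and target policies, and constant stepsizes below are illustrative assumptions (the analysis above uses decaying stepsizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 5-state chain (an assumption for this sketch): actions move
# left/right with clamping at the ends; continuing task with gamma < 1.
n_states, gamma = 5, 0.9
phi = np.eye(n_states)          # one-hot features (linear approximation)

b_probs = np.array([0.5, 0.5])  # behavior policy: uniform left/right
pi_probs = np.array([0.3, 0.7]) # target policy: prefers moving right

theta = np.zeros(n_states)      # value-function weights
w = np.zeros(n_states)          # estimate of the expected update E[rho*delta*phi]

s = 2
for t in range(50000):
    a = rng.choice(2, p=b_probs)                 # act from the behavior policy
    rho = pi_probs[a] / b_probs[a]               # importance-sampling ratio
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0       # reward for reaching the right end
    delta = r + gamma * phi[s2] @ theta - phi[s] @ theta

    alpha, beta = 0.05, 0.25
    # GTD(0): descend the gradient of ||E[rho*delta*phi]||^2 using the
    # auxiliary estimate w of that expectation.
    theta += alpha * rho * (phi[s] - gamma * phi[s2]) * (phi[s] @ w)
    w += beta * (rho * delta * phi[s] - w)
    s = s2
```

With one-hot features the fixed point coincides with the target policy's values, so the learned weights should rank the rewarding right end of the chain above the left end.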

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, which achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.

Bridging the Gap Between Value and Policy Based Reinforcement Learning

A new RL algorithm, Path Consistency Learning (PCL), is developed that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces and significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.

On Convergence of Emphatic Temporal-Difference Learning

This paper presents the first convergence proofs for two emphatic algorithms, ETD and ELSTD: it proves, under general off-policy conditions, convergence in $L^1$ of the ELSTD iterates, and the almost sure convergence of the approximate value functions calculated by both algorithms using a single infinitely long trajectory.
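The emphatic scheme analyzed above weights each update by a followon trace. A minimal ETD(0) sketch with unit interest can illustrate this; the small off-policy chain, the two policies, and the stepsize below are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 5-state chain (an assumption): left/right actions, clamped ends.
n_states, gamma = 5, 0.9
phi = np.eye(n_states)          # one-hot features
b_probs = np.array([0.5, 0.5])  # behavior policy: uniform
pi_probs = np.array([0.3, 0.7]) # target policy: prefers moving right

theta = np.zeros(n_states)      # value-function weights
F = 0.0                         # followon trace
rho_prev = 1.0
s = 2
for t in range(50000):
    a = rng.choice(2, p=b_probs)
    rho = pi_probs[a] / b_probs[a]               # importance-sampling ratio
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0       # reward at the right end
    delta = r + gamma * phi[s2] @ theta - phi[s] @ theta

    F = gamma * rho_prev * F + 1.0               # followon trace, interest i(s) = 1
    theta += 0.01 * F * rho * delta * phi[s]     # ETD(0) emphatic update
    rho_prev = rho
    s = s2
```

The followon trace F reweights states by how much the target policy would have visited them, which is the mechanism behind the convergence results discussed above.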

On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of…