Corpus ID: 207852631

Provably Convergent Off-Policy Actor-Critic with Function Approximation

@article{Zhang2019ProvablyCO,
  title={Provably Convergent Off-Policy Actor-Critic with Function Approximation},
  author={Shangtong Zhang and Bo Liu and Hengshuai Yao and Shimon Whiteson},
  journal={ArXiv},
  year={2019},
  volume={abs/1911.04384}
}
We present the first provably convergent off-policy actor-critic algorithm (COF-PAC) with function approximation in a two-timescale form. Key to COF-PAC is the introduction of a new critic, the emphasis critic, which is trained via Gradient Emphasis Learning (GEM), a novel combination of the key ideas of Gradient Temporal Difference Learning and Emphatic Temporal Difference Learning. With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC…
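The abstract describes a two-timescale scheme: two linear critics (a value critic and an emphasis critic trained GEM-style) on the fast timescale, and the actor on the slow timescale. The sketch below only illustrates that structure on a toy MDP; the simplified TD-style value critic, the bootstrapped follow-on-style emphasis target, the step sizes, and all variable names are assumptions of this sketch, not the paper's exact updates.

```python
import numpy as np

# Illustrative two-timescale off-policy actor-critic skeleton in the spirit of
# COF-PAC on a small random MDP. The simplified TD-style value critic and the
# bootstrapped "follow-on" target for the emphasis critic are stand-ins for the
# paper's gradient-TD and GEM updates.

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
phi = np.eye(n_states)                                             # one-hot state features
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition kernel
R = rng.normal(size=(n_states, n_actions))                         # rewards
gamma, interest = 0.95, 1.0

theta = np.zeros((n_states, n_actions))    # softmax actor parameters
v = np.zeros(n_states)                     # value-critic weights
m = np.zeros(n_states)                     # emphasis-critic weights
mu = np.full(n_actions, 1.0 / n_actions)   # fixed behavior policy
alpha_fast, alpha_slow = 0.05, 0.005       # critics run on the faster timescale

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

s = 0
for t in range(50_000):
    a = rng.choice(n_actions, p=mu)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    pi_s = softmax(theta[s])
    rho = pi_s[a] / mu[a]                  # importance-sampling ratio

    # Fast timescale: value critic (simplified off-policy TD update).
    delta = r + gamma * v @ phi[s_next] - v @ phi[s]
    v += alpha_fast * rho * delta * phi[s]

    # Fast timescale: emphasis critic, regressed toward a bootstrapped
    # follow-on-style target (a GEM-like stand-in).
    m_target = interest + gamma * rho * (m @ phi[s])
    m += alpha_fast * (m_target - m @ phi[s_next]) * phi[s_next]

    # Slow timescale: actor ascends an emphasis-weighted gradient estimate.
    grad_log = -pi_s.copy()
    grad_log[a] += 1.0                     # gradient of log softmax at (s, a)
    theta[s] += alpha_slow * max(m @ phi[s], 0.0) * rho * delta * grad_log

    s = s_next
```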
Improving Sample Complexity Bounds for Actor-Critic Algorithms
TLDR
This study develops several novel techniques for finite-sample analysis of RL algorithms, including handling the bias error due to mini-batch Markovian sampling and exploiting the self-variance-reduction property to improve the convergence analysis of NAC.
A Unified Off-Policy Evaluation Approach for General Value Function
TLDR
A new algorithm called GenTD is proposed for off-policy GVF evaluation and it is shown that GenTD learns multiple interrelated multi-dimensional GVFs as efficiently as a single canonical scalar value function.
Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality
TLDR
This paper develops a doubly robust off-policy AC (DR-Off-PAC) for discounted MDPs, which can take advantage of learned nuisance functions to reduce estimation errors, and establishes the first overall sample complexity analysis for a single time-scale off-policy AC algorithm.
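For context, the "doubly robust" idea combines a learned value estimate with importance-weighted corrections so that the estimator remains consistent if either component is accurate. A standard step-wise form of the doubly robust off-policy value estimator (in the style of Jiang & Li, 2016), written in our own notation, is

$$\hat V_{\mathrm{DR}}(s_t) = \hat V(s_t) + \rho_t \Big( r_t + \gamma\, \hat V_{\mathrm{DR}}(s_{t+1}) - \hat Q(s_t, a_t) \Big), \qquad \rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)},$$

where $\hat V$ and $\hat Q$ are learned nuisance functions and the recursion is applied backward along a behavior-policy trajectory.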
A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic
TLDR
These are the first convergence rate results for using nonlinear TTSA algorithms on the concerned class of bilevel optimization problems and it is shown that a two-timescale actor-critic proximal policy optimization algorithm can be viewed as a special case of the framework.
GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values
TLDR
In GradientDICE, a different objective is optimized by using the Perron-Frobenius theorem and eliminating GenDICE's use of divergence, so nonlinearity in the parameterization is not necessary and GradientDICE is provably convergent under linear function approximation.
Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms
TLDR
This is the first theoretical study establishing that AC and NAC attain order-wise performance improvement over PG and NPG under infinite horizon due to the incorporation of the critic.

References

SHOWING 1-10 OF 59 REFERENCES
Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation
TLDR
This work presents the first class of policy-gradient algorithms that work with both state-value and policy function approximation and are guaranteed to converge to their optimal solution, while maintaining all the desirable properties of classical actor-critic methods with no additional hyperparameters.
Generalized Off-Policy Actor-Critic
TLDR
The Generalized Off-Policy Policy Gradient Theorem is proved to compute the policy gradient of the counterfactual objective, and an emphatic approach is used to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm.
Off-Policy Actor-Critic
TLDR
This paper derives an incremental, linear time and space complexity algorithm that includes eligibility traces, proves convergence under assumptions similar to previous off-policy algorithms, and empirically shows better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.
Policy Gradient Methods for Reinforcement Learning with Function Approximation
TLDR
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
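The theorem referenced here, in its standard discounted form (our notation), states that

$$\nabla_\theta J(\theta) = \sum_{s} d^{\pi}(s) \sum_{a} \nabla_\theta \pi(a \mid s; \theta)\, Q^{\pi}(s, a),$$

where $d^{\pi}$ is the (discounted) state distribution under $\pi$; the key point is that no gradient of $d^{\pi}$ appears, which is what makes compatible function approximation of $Q^{\pi}$ possible.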
An Off-policy Policy Gradient Theorem Using Emphatic Weightings
TLDR
This work develops a new actor-critic algorithm called Actor-Critic with Emphatic weightings (ACE) that approximates the simplified gradients provided by the theorem, and demonstrates in a simple counterexample that previous off-policy policy gradient methods converge to the wrong solution whereas ACE finds the optimal solution.
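In rough form (our paraphrase; see the paper for the precise statement and conditions), the theorem replaces the excursion objective's state weighting with an emphatic weighting $m$:

$$\nabla_\theta J_{\mu}(\theta) = \sum_{s} m(s) \sum_{a} \nabla_\theta \pi(a \mid s; \theta)\, q^{\pi}(s, a),$$

where $m$ reweights the behavior distribution $d_\mu$ to account for how states are reached under the target policy $\pi$, which is the term earlier off-policy actor-critic methods drop.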
A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
TLDR
The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its $L_2$ norm; its expected update is proven to be in the direction of the gradient, assuring convergence under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without its quadratic computational complexity.
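As described, the algorithm maintains an auxiliary vector $w$ tracking the expected TD(0) update $\mathbb{E}[\delta \phi]$ and moves the value weights along a gradient estimate of the squared norm of that expectation. A minimal on-policy sketch with linear features is below; the toy ring chain, step sizes, and variable names are illustrative, and the off-policy form would additionally multiply the updates by importance-sampling ratios.

```python
import numpy as np

# Per-transition GTD(0) updates with linear function approximation: w tracks
# E[delta * phi], and theta follows the resulting gradient estimate of the
# squared norm of that expected update.

rng = np.random.default_rng(1)
n_states, gamma = 6, 0.9
phi = np.eye(n_states)              # one-hot features
theta = np.zeros(n_states)          # value-function weights
w = np.zeros(n_states)              # auxiliary weights estimating E[delta * phi]
alpha, beta = 0.01, 0.05            # illustrative step sizes

s = 0
for t in range(20_000):
    s_next = (s + rng.choice([-1, 1])) % n_states        # random walk on a ring
    r = 1.0 if s_next == 0 else 0.0
    x, x_next = phi[s], phi[s_next]

    delta = r + gamma * theta @ x_next - theta @ x       # TD(0) error
    theta += alpha * (x - gamma * x_next) * (x @ w)      # GTD(0) main update
    w += beta * (delta * x - w)                          # track E[delta * phi]
    s = s_next
```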
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
TLDR
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
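The core of the maximum-entropy formulation is the entropy-augmented Bellman target used to train the critic. A minimal sketch of that target computation is below, in the commonly used twin-critic variant; the `policy.sample(s) -> (action, log_prob)` interface and the module names are assumptions of this sketch, not a fixed API.

```python
import torch

# Entropy-regularized (soft) Bellman target for SAC-style critic training.
# q1_targ and q2_targ are target Q-networks; alpha is the entropy temperature.

def soft_q_target(r, s_next, done, policy, q1_targ, q2_targ, alpha, gamma=0.99):
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)        # assumed interface
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        # Subtracting alpha * log pi adds the entropy bonus that keeps the
        # learned policy stochastic.
        return r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
```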
Bridging the Gap Between Value and Policy Based Reinforcement Learning
TLDR
A new RL algorithm, Path Consistency Learning (PCL), is developed that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces and significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.
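The consistency error that PCL penalizes can be written, in our paraphrase, as the violation of the multi-step softmax consistency

$$C(s_{t:t+d}) = -V(s_t) + \gamma^{d} V(s_{t+d}) + \sum_{i=0}^{d-1} \gamma^{i} \big( r_{t+i} - \tau \log \pi(a_{t+i} \mid s_{t+i}) \big),$$

which equals zero for the optimal value function and policy under entropy regularization with temperature $\tau$; PCL minimizes the squared error $C^2$ over sub-trajectories drawn from both on-policy and off-policy data.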
On Convergence of Emphatic Temporal-Difference Learning
TLDR
This paper presents the first convergence proofs for two emphatic algorithms, ETD and ELSTD, and proves, under general off-policy conditions, the convergence in $L^1$ for ELSTD iterates, and the almost sure convergence of the approximate value functions calculated by both algorithms using a single infinitely long trajectory.
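For reference, the per-transition ETD($\lambda$) update with linear function approximation takes roughly the following form (constant discount, $\lambda$, and interest; the variable-discount generalizations are omitted, and variable names are ours).

```python
import numpy as np

# One ETD(lambda) update with linear features: F is the follow-on trace, M the
# emphasis, e the emphatic eligibility trace. rho and rho_prev are the
# importance-sampling ratios of the current and previous actions.

def etd_step(theta, e, F, x, r, x_next, rho, rho_prev,
             alpha=0.01, gamma=0.99, lam=0.9, interest=1.0):
    F = rho_prev * gamma * F + interest          # follow-on trace
    M = lam * interest + (1.0 - lam) * F         # emphasis
    e = rho * (gamma * lam * e + M * x)          # emphatic eligibility trace
    delta = r + gamma * theta @ x_next - theta @ x
    theta = theta + alpha * delta * e
    return theta, e, F
```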
On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning
We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of …