• Corpus ID: 235293923

On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction

Authors: Jiawei Huang and Nan Jiang
In this paper, we study the convergence properties of off-policy policy optimization algorithms with state-action density-ratio correction in the function approximation setting, where the objective is formulated as a max-max-min problem. We first clearly characterize the bias of the learning objective, and then present two strategies with finite-time convergence guarantees. In our first strategy, we propose an algorithm called P-SREDA with convergence rate $O(\epsilon^{-3})$, whose dependency on…
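The density-ratio-corrected objective underlying this line of work can be sketched as follows (notation illustrative, not taken from the paper):

```latex
\max_{\pi}\; J(\pi)
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{(s,a)\sim d^{D}}\!\left[ w_{\pi/D}(s,a)\, r(s,a) \right],
\qquad
w_{\pi/D}(s,a) = \frac{d^{\pi}(s,a)}{d^{D}(s,a)},
```

where $d^{\pi}$ is the discounted state-action occupancy of $\pi$ and $d^{D}$ is the data distribution. Since $w_{\pi/D}$ must itself be estimated, typically via a max-min program over function classes, the overall problem takes the max-max-min form mentioned in the abstract.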
2 Citations


Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch
In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting, without using density ratios to correct the discrepancy between the behavior and target state distributions.
An Approach for Non-Convex Uniformly Concave Structured Saddle Point Problem
Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, 141701, Russia; National Research University Higher School of Economics, 101000, Moscow, Russia.

References


Off-Policy Policy Gradient with State Distribution Correction
This work builds on recent progress for estimating the ratio of the state distributions under behavior and evaluation policies for policy evaluation, and presents an off-policy policy gradient optimization technique that can account for this mismatch in distributions.
Variance-Reduced Off-Policy Memory-Efficient Policy Search
This work proposes an algorithm family that is memory-efficient, stochastically variance-reduced, and capable of learning from off-policy samples, and empirical studies validate the effectiveness of the proposed approaches.
Stochastic Variance Reduction Methods for Policy Evaluation
This paper first transforms the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then presents a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem.
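The primal-dual variance-reduction idea can be illustrated on a synthetic finite-sum saddle point of the same convex-concave shape (a minimal sketch, not the paper's MSPBE formulation; all names and matrices here are hypothetical):

```python
import numpy as np

def svrg_saddle(A_list, b_list, lr=0.05, epochs=50, seed=0):
    """SVRG-style primal-dual updates for the finite-sum saddle point
        min_x max_y (1/n) sum_i [ y^T (A_i x - b_i) - 0.5 ||y||^2 ],
    whose primal solution x satisfies (mean A_i) x = mean b_i."""
    rng = np.random.default_rng(seed)
    n, d = len(A_list), len(b_list[0])
    x, y = np.zeros(d), np.zeros(d)
    for _ in range(epochs):
        xs, ys = x.copy(), y.copy()              # snapshot point
        gx_full = np.mean([A.T @ ys for A in A_list], axis=0)
        gy_full = np.mean([A @ xs - b for A, b in zip(A_list, b_list)],
                          axis=0) - ys
        for _ in range(n):
            i = int(rng.integers(n))
            A, b = A_list[i], b_list[i]
            # variance-reduced gradients: stochastic gradient at the current
            # point, corrected by the same sample's gradient at the snapshot
            gx = A.T @ y - A.T @ ys + gx_full
            gy = (A @ x - b - y) - (A @ xs - b - ys) + gy_full
            x = x - lr * gx                      # descend in the primal
            y = y + lr * gy                      # ascend in the dual
    return x, y
```

The correction terms vanish as the iterates approach the snapshot, which is what lets variance-reduced methods converge to the exact saddle point with constant step sizes.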
Stochastic Variance-Reduced Policy Gradient
A novel reinforcement-learning algorithm consisting of a stochastic variance-reduced version of policy gradient (SVRPG) for solving Markov Decision Processes (MDPs), with convergence guarantees and a convergence rate that is linear under increasing batch sizes.
An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient
An improved convergence analysis of SVRPG is provided, showing that it can find an $\epsilon$-approximate stationary point of the performance function within $O(1/\epsilon^{5/3})$ trajectories, a sample complexity that improves upon the best known result.
Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
A new off-policy estimation method that applies importance sampling directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators is proposed.
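The core estimator can be sketched as follows, assuming the stationary-distribution ratio w(s) = d_π(s)/d_μ(s) is given (in practice it is exactly what must be estimated; all names here are hypothetical):

```python
import numpy as np

def stationary_is_estimate(states, actions, rewards, w, pi, mu):
    """Off-policy value estimate that weights each transition by
    w(s) * pi(a|s) / mu(a|s) instead of a cumulative product of per-step
    ratios over the whole trajectory, so the importance weight (and hence
    the variance) does not blow up with the horizon."""
    weights = w[states] * pi[states, actions] / mu[states, actions]
    return float(np.mean(weights * rewards))
```

Because each transition carries a single bounded weight, the estimator's variance is horizon-independent, which is the "curse of horizon" the title refers to.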
Momentum-Based Policy Gradient Methods
A class of efficient momentum-based policy gradient methods for model-free reinforcement learning, which use adaptive learning rates, require no large batches, and reach the best known sample complexity of $O(\epsilon^{-3})$.
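A momentum-based variance-reduced gradient estimator of this flavor (a STORM-style recursion; a toy sketch on a noisy quadratic, not the paper's policy-gradient estimator, and all names are hypothetical) can be written as:

```python
import numpy as np

def momentum_vr_minimize(stoch_grad, x0, steps=2000, lr=0.05, eta=0.2, seed=0):
    """Momentum-based variance reduction: one fresh sample per step, with
    no large batches and no snapshot/full-gradient passes.
    Recursion: d_t = g(x_t; z_t) + (1 - eta) * (d_{t-1} - g(x_{t-1}; z_t)),
    where the *same* sample z_t is evaluated at both iterates."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    z = rng.standard_normal(x.shape)
    d = stoch_grad(x, z)                    # initial estimate from one sample
    for _ in range(steps):
        x_prev, x = x, x - lr * d
        z = rng.standard_normal(x.shape)    # single fresh sample
        d = stoch_grad(x, z) + (1 - eta) * (d - stoch_grad(x_prev, z))
    return x
```

With, e.g., `stoch_grad = lambda x, z: (x - c) + 0.5 * z` (the gradient of `0.5 * ||x - c||^2` plus Gaussian noise), the iterates settle near `c`: evaluating the same sample at consecutive iterates makes most of the noise cancel in the correction term.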
Off-Policy Actor-Critic
This paper derives an incremental, linear time and space complexity algorithm that includes eligibility traces, proves convergence under assumptions similar to previous off-policy algorithms, and empirically shows better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.
Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms
Novel techniques are developed for bounding the bias error of the actor due to dynamically changing Markovian sampling, and for analyzing the convergence rate of the linear critic with dynamically changing basis functions and transition kernel.
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.
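DualDICE's behavior-agnostic trick can be summarized by the following objective (a sketch with adapted notation, not copied from the paper):

```latex
\min_{\nu}\;
  \tfrac{1}{2}\,\mathbb{E}_{(s,a)\sim d^{D}}\!\left[
    \big(\nu(s,a) - \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a),\,a'\sim\pi}[\nu(s',a')]\big)^{2}
  \right]
  - (1-\gamma)\,\mathbb{E}_{s_{0}\sim\beta,\,a_{0}\sim\pi}\!\left[\nu(s_{0},a_{0})\right],
```

whose minimizer $\nu^{*}$ satisfies $\nu^{*}(s,a) - \gamma\,\mathbb{E}[\nu^{*}(s',a')] = d^{\pi}(s,a)/d^{D}(s,a)$, so the stationary distribution correction is recovered without ever evaluating the behavior policy's action probabilities.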