
# On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction

@inproceedings{Huang2022OnTC,
  title={On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction},
  author={Jiawei Huang and Nan Jiang},
  booktitle={AISTATS},
  year={2022}
}
• Published in AISTATS, 2 June 2021
• Computer Science
In this paper, we study the convergence properties of off-policy policy optimization algorithms with state-action density ratio correction under the function approximation setting, where the objective function is formulated as a max-max-min problem. We first clearly characterize the bias of the learning objective, and then present two strategies with finite-time convergence guarantees. In our first strategy, we propose an algorithm called P-SREDA with convergence rate O(ε−3), whose dependency on…
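The abstract is truncated above; for context, a density-ratio-corrected off-policy objective of the kind this line of work studies (e.g., "Off-Policy Policy Gradient with State Distribution Correction" and "DualDICE", cited below) can be sketched as follows. Here $d^{\pi}$ and $d^{\mu}$ denote the discounted state-action occupancy measures of the target and behavior policies; the notation is illustrative, not necessarily the paper's own:

```latex
% A sketch of the density-ratio-corrected off-policy objective:
% the ratio w re-weights behavior-policy samples toward the target policy.
J(\pi) \;=\; \frac{1}{1-\gamma}\,
  \mathbb{E}_{(s,a)\sim d^{\mu}}\!\left[ w_{\pi/\mu}(s,a)\, r(s,a) \right],
\qquad
w_{\pi/\mu}(s,a) \;=\; \frac{d^{\pi}(s,a)}{d^{\mu}(s,a)}.
% Since w is itself typically estimated by solving a max-min problem
% (as in DualDICE), maximizing J over \pi yields a max-max-min structure
% like the one described in the abstract.
```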
## 2 Citations

Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch
• Mathematics
ArXiv
• 2021
In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting, without using density ratios to correct the discrepancy between…
An Approach for Non-Convex Uniformly Concave Structured Saddle Point Problem
• Economics
• 2022
Moscow Institute of Physics and Technology, Russia, 141701, Moscow Region, Dolgoprudny, Institutskiy per. 9; National Research University Higher School of Economics, Russia, 101000, …

## References

SHOWING 1-10 OF 49 REFERENCES
Off-Policy Policy Gradient with State Distribution Correction
• Political Science, Economics
UAI
• 2019
This work builds on recent progress for estimating the ratio of the state distributions under behavior and evaluation policies for policy evaluation, and presents an off-policy policy gradient optimization technique that can account for this mismatch in distributions.
Variance-Reduced Off-Policy Memory-Efficient Policy Search
• Computer Science
ArXiv
• 2020
This work proposes an algorithm family that is memory-efficient, stochastically variance-reduced, and capable of learning from off-policy samples, and empirical studies validate the effectiveness of the proposed approaches.
Stochastic Variance Reduction Methods for Policy Evaluation
• Computer Science
ICML
• 2017
This paper first transforms the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then presents a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem.
Stochastic Variance-Reduced Policy Gradient
• Computer Science
ICML
• 2018
A novel reinforcement-learning algorithm, a stochastic variance-reduced version of policy gradient (SVRPG), for solving Markov Decision Processes (MDPs), with convergence guarantees and a convergence rate that is linear under increasing batch sizes.
An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient
• Computer Science
UAI
• 2019
An improved convergence analysis of SVRPG is provided, showing that it can find an $\epsilon$-approximate stationary point of the performance function within $O(1/\epsilon^{5/3})$ trajectories; this sample complexity improves upon the best known result.
Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
• Mathematics
NeurIPS
• 2018
A new off-policy estimation method that applies importance sampling directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators is proposed.
Momentum-Based Policy Gradient Methods
• Computer Science
ICML
• 2020
A class of efficient momentum-based policy gradient methods for model-free reinforcement learning, which use adaptive learning rates, do not require any large batches, and reach the best known sample complexity of $O(\epsilon^{-3})$.
Off-Policy Actor-Critic
• Computer Science
ICML
• 2012
This paper derives an incremental, linear time and space complexity algorithm that includes eligibility traces, proves convergence under assumptions similar to previous off-policy algorithms, and empirically shows better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.
Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms
• Tengyu Xu, Zhe Wang
• Computer Science
ArXiv
• 2020
Novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling and for analyzing the convergence rate of the linear critic with dynamically changing base functions and transition kernel are developed.
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
• Computer Science
NeurIPS
• 2019
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.