• Corpus ID: 239016322

# Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs

@article{Zhong2021OptimisticPO,
title={Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs},
author={Han Zhong and Zhuoran Yang and Zhaoran Wang and Csaba Szepesv{\'a}ri},
journal={ArXiv},
year={2021},
volume={abs/2110.08984}
}
• Published 18 October 2021
• Computer Science
• ArXiv
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs). In this setting, both the reward function and the transition kernel are linear with respect to the given feature maps and are allowed to vary over time, as long as their respective parameter variations do not exceed certain variation budgets. We propose the periodically restarted optimistic policy optimization algorithm (PROPO), which is an optimistic policy optimization algorithm…
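The abstract's key device is a periodic restart that discards stale estimates once the environment may have drifted. As a hedged illustration only (this is not PROPO itself, which combines optimistic policy optimization with sliding-window regularization; all function names and the restart interval here are hypothetical), a minimal restart schedule might look like:

```python
# Hypothetical sketch of a periodic-restart training loop for
# non-stationary RL. PROPO's actual algorithm is more involved;
# this only illustrates the restart mechanism.

def run_with_restarts(num_episodes, restart_interval,
                      init_policy, update_policy, play_episode):
    """Reset the policy every `restart_interval` episodes so data
    gathered before the environment drifted cannot dominate."""
    policy = init_policy()
    rewards = []
    for k in range(num_episodes):
        if k > 0 and k % restart_interval == 0:
            policy = init_policy()  # periodic restart
        trajectory, total_reward = play_episode(policy)
        policy = update_policy(policy, trajectory)
        rewards.append(total_reward)
    return rewards
```

The restart interval trades off adaptivity to drift against sample efficiency within each stationary stretch, which is why the paper calibrates it to the variation budget.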

## Citations

• Computer Science • ArXiv • 2022
A dynamic regret bound and a constraint violation bound are established for the proposed algorithm in both the linear kernel CMDP function approximation setting and the tabular CMDP setting under two alternative assumptions.
• Computer Science • ArXiv • 2020
This work develops the first dynamic regret analysis in nonstationary reinforcement learning with function approximation in episodic Markov decision processes with linear function approximation under drifting environment and proposes a parameter-free algorithm that works without knowing the variation budgets but with a slightly worse dynamic regret bound.
• Computer Science • ICML • 2022
RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL and features the property "Stable at Any Time".
• Computer Science • ArXiv • 2022
A meta-algorithm is presented that does not require any prior knowledge of the variation budget and can adaptively detect the non-stationarity on the exponential value functions and a dynamic regret lower bound is established for nonstationary risk-sensitive RL to certify the near-optimality of the proposed algorithms.
• Computer Science • ArXiv • 2022
This paper proposes an original algorithm to determine the "best data chunks" that display similar dynamics over time and across individuals for policy learning, which alternates between most recent change point detection and cluster identification.

## References

Showing 1–10 of 81 references

• Computer Science • ICML • 2020
This paper proves that, in the problem of episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves a sublinear regret bound.
• Computer Science • ArXiv • 2020
This work develops the first dynamic regret analysis in nonstationary reinforcement learning with function approximation in episodic Markov decision processes with linear function approximation under drifting environment and proposes a parameter-free algorithm that works without knowing the variation budgets but with a slightly worse dynamic regret bound.
• Computer Science • NeurIPS • 2020
This work proposes two model-free policy optimization algorithms, POWER and POWER++, and establishes guarantees for their dynamic regret, and shows that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction.
• Computer Science • J. Mach. Learn. Res. • 2021
This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs), and shows an important interplay between estimation error, approximation error, and exploration.
• Computer Science • 2019
This work develops the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening, and proposes the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner.
• Computer Science • AAAI • 2020
This work shows that the adaptive scaling mechanism used in TRPO is in fact the natural “RL version” of traditional trust-region methods from convex analysis, and proves fast rates of Õ(1/N), much like results in convex optimization.
• Computer Science • ArXiv • 2020
This work shows that the OPT-WLSVI algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{7/6}H^2 \Delta^{1/3} K^{2/3})$, the first regret bound for non-stationary reinforcement learning with linear function approximation.
• Computer Science • NeurIPS • 2020
This work introduces the Policy Cover-Policy Gradient algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover) and complements the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
• Computer Science • ICML • 2020
This paper considers model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback, and proposes an optimistic trust region policy optimization (TRPO) algorithm, establishing regret bounds for both stochastic and adversarial rewards.
• Computer Science • COLT • 2021
We propose a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a (near-)stationary environment into another algorithm with optimal dynamic regret in a…