Corpus ID: 239016322

Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs

@article{Zhong2021OptimisticPO,
  title={Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs},
  author={Han Zhong and Zhuoran Yang and Zhaoran Wang and Csaba Szepesv{\'a}ri},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.08984}
}
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs). In this setting, both the reward function and the transition kernel are linear with respect to the given feature maps and are allowed to vary over time, as long as their respective parameter variations do not exceed certain variation budgets. We propose the periodically restarted optimistic policy optimization algorithm (PROPO), which is an optimistic policy optimization algorithm… 
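To make the setting concrete, here is a minimal sketch, in notation of our own choosing, of one standard way to formalize the non-stationary linear kernel MDP, its variation budgets, and the dynamic regret objective; the exact feature maps, norms, and constants used in the paper may differ.

% Illustrative notation (ours, not necessarily the paper's): $\phi$, $\varphi$ are the
% given feature maps and $\theta_h^k$, $\omega_h^k$ are the episode-$k$ parameters at step $h$.
\[
  \mathbb{P}_h^{k}(s' \mid s, a) = \langle \phi(s, a, s'), \theta_h^{k} \rangle,
  \qquad
  r_h^{k}(s, a) = \langle \varphi(s, a), \omega_h^{k} \rangle,
\]
\[
  \sum_{k=2}^{K} \sum_{h=1}^{H} \lVert \theta_h^{k} - \theta_h^{k-1} \rVert_2 \le B_P,
  \qquad
  \sum_{k=2}^{K} \sum_{h=1}^{H} \lVert \omega_h^{k} - \omega_h^{k-1} \rVert_2 \le B_r,
\]
\[
  \mathrm{D\text{-}Regret}(K) = \sum_{k=1}^{K} \bigl( V_1^{*,k}(s_1^{k}) - V_1^{\pi^{k},k}(s_1^{k}) \bigr),
\]
where $B_P$ and $B_r$ are the transition and reward variation budgets, and $V_1^{*,k}$ and $V_1^{\pi^{k},k}$ are the optimal and learned-policy values in episode $k$.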

Citations

Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints

A dynamic regret bound and a constraint violation bound are established for the proposed algorithm in both the linear kernel CMDP function approximation setting and the tabular CMDP setting under two alternative assumptions.

Nonstationary Reinforcement Learning with Linear Function Approximation

This work develops the first dynamic regret analysis for non-stationary reinforcement learning with function approximation, focusing on episodic Markov decision processes with linear function approximation under a drifting environment, and proposes a parameter-free algorithm that works without knowing the variation budgets, at the cost of a slightly worse dynamic regret bound.

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL and features the property “Stable at Any Time”.

Non-stationary Risk-sensitive Reinforcement Learning: Near-optimal Dynamic Regret, Adaptive Detection, and Separation Design

A meta-algorithm is presented that does not require any prior knowledge of the variation budget and can adaptively detect non-stationarity in the exponential value functions, and a dynamic regret lower bound is established for non-stationary risk-sensitive RL to certify the near-optimality of the proposed algorithms.

Doubly Inhomogeneous Reinforcement Learning

This paper proposes an original algorithm for policy learning that determines the “best data chunks” displaying similar dynamics over time and across individuals, alternating between most-recent-change-point detection and cluster identification.

References

Showing 1-10 of 81 references

Provably Efficient Exploration in Policy Optimization

This paper proves that, in episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves sublinear regret.
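For orientation, a schematic of the mirror-descent style policy improvement step used by optimistic policy optimization methods of this kind is given below; the notation and the step size $\alpha$ are ours, and the exact bonus construction is an assumption of the sketch.

\[
  % schematic update; $\alpha$ and the bonus construction are illustrative assumptions
  \pi_h^{k+1}(a \mid s) \;\propto\; \pi_h^{k}(a \mid s)\,\exp\bigl(\alpha\, Q_h^{k}(s, a)\bigr),
\]
where $Q_h^{k}$ is an optimistic, bonus-augmented estimate of the action-value function and $\alpha > 0$ is a step size.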

Nonstationary Reinforcement Learning with Linear Function Approximation

This work develops the first dynamic regret analysis for non-stationary reinforcement learning with function approximation, focusing on episodic Markov decision processes with linear function approximation under a drifting environment, and proposes a parameter-free algorithm that works without knowing the variation budgets, at the cost of a slightly worse dynamic regret bound.

Dynamic Regret of Policy Optimization in Non-stationary Environments

This work proposes two model-free policy optimization algorithms, POWER and POWER++, and establishes guarantees for their dynamic regret, and shows that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction.

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs), and shows an important interplay between estimation error, approximation error, and exploration.

Non-Stationary Reinforcement Learning: The Blessing of (More) Optimism

This work develops the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and proposes the Bandit-over-Reinforcement Learning (BORL) algorithm that adaptively tunes SWUCRL2-CW to achieve the same dynamic regret bound in a parameter-free manner.
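The sliding-window idea behind this line of work can be illustrated with a short Python sketch; the class name, the widening constant, and the Hoeffding-style radius below are illustrative assumptions, not the actual SWUCRL2-CW construction.

import numpy as np
from collections import deque

class SlidingWindowUCB:
    """Toy sliding-window mean estimator with a widened confidence radius.

    Illustrative only: SWUCRL2-CW maintains analogous sliding-window estimates
    of rewards and transition kernels; the names and constants here are made up.
    """

    def __init__(self, window_size: int, widening: float = 0.1):
        self.window = deque(maxlen=window_size)  # keep only the W most recent samples
        self.widening = widening                 # extra slack to absorb drift inside the window

    def update(self, observation: float) -> None:
        self.window.append(observation)

    def optimistic_estimate(self, delta: float = 0.05) -> float:
        if not self.window:
            return 1.0  # optimistic default before any data is observed
        n = len(self.window)
        mean = float(np.mean(self.window))
        # Hoeffding-style radius on the windowed samples, plus the widening term
        radius = float(np.sqrt(np.log(1.0 / delta) / (2.0 * n))) + self.widening
        return mean + radius

Forgetting samples older than the window keeps the estimate anchored to the recent environment, while the widening term absorbs the bias that drift can introduce even within the window; roughly speaking, BORL then selects the window-related parameters online with a bandit algorithm, which is how the parameter-free guarantee is obtained.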

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

This work shows that the adaptive scaling mechanism used in TRPO is in fact the natural “RL version” of traditional trust-region methods from convex analysis, and proves fast rates of Õ(1/N), much like results in convex optimization.

Efficient Learning in Non-Stationary Linear Markov Decision Processes

This work shows that the OPT-WLSVI algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{7/6}H^2 \Delta^{1/3} K^{2/3})$, which is the first regret bound for non-stationary reinforcement learning with linear function approximation.
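The weighting idea can be illustrated with a short Python sketch of exponentially weighted ridge regression, the regression step at the heart of weighted least-squares value iteration; the function name and the default values of discount and reg are assumptions for illustration, not the tuned quantities from the OPT-WLSVI analysis.

import numpy as np

def exp_weighted_ridge(features: np.ndarray,
                       targets: np.ndarray,
                       discount: float = 0.99,
                       reg: float = 1.0) -> np.ndarray:
    """Exponentially weighted ridge regression (illustrative sketch only).

    Older samples receive geometrically smaller weights, so the fitted linear
    parameter tracks a drifting environment instead of averaging all history.
    """
    n, d = features.shape
    # weight of sample t is discount**(n - 1 - t): the newest sample has weight 1
    weights = discount ** np.arange(n - 1, -1, -1)
    gram = (features * weights[:, None]).T @ features + reg * np.eye(d)
    moment = (features * weights[:, None]).T @ targets
    return np.linalg.solve(gram, moment)

For example, with features of shape (n, d) and scalar regression targets of shape (n,), the call exp_weighted_ridge(features, targets) returns a d-dimensional parameter estimate that down-weights stale episodes.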

PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning

This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover), and complements the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.

Optimistic Policy Optimization with Bandit Feedback

This paper considers model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback, and proposes an optimistic trust region policy optimization (TRPO) algorithm, establishing regret bounds for both stochastic and adversarial rewards.

Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach

We propose a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a (near-)stationary environment into another algorithm with optimal dynamic regret in a non-stationary environment.
...