# Provably Efficient Exploration in Policy Optimization

```bibtex
@inproceedings{Cai2020ProvablyEE,
  title     = {Provably Efficient Exploration in Policy Optimization},
  author    = {Qi Cai and Zhuoran Yang and Chi Jin and Zhaoran Wang},
  booktitle = {ICML},
  year      = {2020}
}
```

While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an ``optimistic version'' of the policy gradient direction…
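The abstract describes OPPO as following an "optimistic version" of the policy gradient direction. A minimal tabular sketch of that idea, under assumed names and a count-based bonus standing in for the paper's feature-covariance bonus, is a KL-regularized (mirror-descent) policy step on a bonus-augmented Q-estimate:

```python
import numpy as np

# Hypothetical sketch of one optimistic policy-optimization iteration.
# All names (n_states, alpha, beta, counts) are illustrative, not the
# paper's notation; the bonus here is count-based for simplicity.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
alpha, beta = 1.0, 0.5                                # step size, bonus scale

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # current policy
q_hat = rng.normal(size=(n_states, n_actions))        # estimated Q-values
counts = rng.integers(1, 20, size=(n_states, n_actions))  # visit counts

# Optimism: inflate the Q-estimate by an exploration bonus.
q_opt = q_hat + beta / np.sqrt(counts)

# Mirror-descent / soft policy improvement:
#   pi_new(a|s) proportional to pi(a|s) * exp(alpha * Q_opt(s, a))
logits = np.log(pi) + alpha * q_opt
pi_new = np.exp(logits - logits.max(axis=1, keepdims=True))
pi_new /= pi_new.sum(axis=1, keepdims=True)
```

The exponentiated update keeps each row a valid distribution while shifting probability mass toward optimistically valued actions.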

## 148 Citations

Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints

- Computer Science · ArXiv
- 2022

A dynamic regret bound and a constraint violation bound are established for the proposed algorithm in both the linear kernel CMDP function approximation setting and the tabular CMDP setting under two alternative assumptions.

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

- Computer Science · ICML
- 2022

A novel algorithm Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT), which features the property “Stable at any Time”.

Near-optimal Policy Optimization Algorithms for Learning Adversarial Linear Mixture MDPs

- Computer Science · AISTATS
- 2022

This paper proposes an optimistic policy optimization algorithm, POWERS, shows that it achieves a near-optimal regret bound, and proves a matching lower bound of $\widetilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.

Online Apprenticeship Learning

- Computer Science · AAAI
- 2022

An online variant of apprenticeship learning (Online Apprenticeship Learning; OAL), in which the agent is expected to perform comparably to the expert while interacting with the environment, together with a convergent algorithm achieving $O(\sqrt{K})$ regret, where $K$ is the number of interactions with the MDP, plus an additional linear error term that depends on the number of expert trajectories available.

An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

- Computer Science · NeurIPS
- 2021

An exponential sample complexity lower bound is shown to hold even when a constant suboptimality gap is assumed; two positive results then show that provably sample-efficient RL is possible under either an additional low-variance assumption or a novel hypercontractivity assumption.

Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation

- Computer Science, Mathematics · ArXiv
- 2021

An optimistic policy optimization algorithm with a Bernstein bonus is proposed and shown to achieve a near-optimal regret bound, yielding the first computationally efficient, nearly minimax-optimal algorithm for adversarial Markov decision processes with linear function approximation.

Efficient Learning in Non-Stationary Linear Markov Decision Processes

- Computer Science · ArXiv
- 2020

This work shows that the OPT-WLSVI algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{7/6}H^2 \Delta^{1/3} K^{2/3})$, which is the first regret bound for non-stationary reinforcement learning with linear function approximation.

Optimistic Policy Optimization with Bandit Feedback

- Computer Science · ICML
- 2020

This paper considers model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback, and proposes an optimistic trust-region policy optimization (TRPO) algorithm, establishing regret bounds for both stochastic and adversarial rewards.

Provably Efficient Generative Adversarial Imitation Learning for Online and Offline Setting with Linear Function Approximation

- Computer Science · ArXiv
- 2021

This paper proposes an optimistic generative adversarial policy optimization algorithm (OGAP) and proves that it achieves a regret guarantee; for the offline setting, it obtains the optimality gap of PGAP, achieving the minimax lower bound in the utilization of the additional dataset.

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

- Computer Science · ICML
- 2021

A model-free reinforcement learning algorithm, inspired by the popular randomized least-squares value iteration (RLSVI) algorithm as well as the optimism principle, that drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
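The perturbation idea in this line of work can be sketched in a few lines: instead of an explicit optimism bonus, add i.i.d. Gaussian noise to the regression targets before the least-squares value fit, so the randomness of the fitted estimate drives exploration. A minimal sketch under assumed dimensions and noise scale:

```python
import numpy as np

# Hypothetical RLSVI-style perturbation sketch; d, n, sigma, lam are
# illustrative choices, not the paper's constants.
rng = np.random.default_rng(1)
d, n = 5, 200
phi = rng.normal(size=(n, d))                # features of visited (s, a) pairs
targets = phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)  # regression targets

sigma = 0.3                                  # perturbation scale (assumed)
noisy_targets = targets + sigma * rng.normal(size=n)  # i.i.d. scalar noises

# Ridge least-squares on the perturbed targets yields a randomized
# value estimate; its dispersion across episodes induces exploration.
lam = 1.0
w = np.linalg.solve(phi.T @ phi + lam * np.eye(d), phi.T @ noisy_targets)
```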

## References

Showing 1–10 of 93 references

Sample-Optimal Parametric Q-Learning Using Linearly Additive Features

- Computer Science · ICML
- 2019

This work proposes a parametric Q-learning algorithm that finds an approximate-optimal policy using a sample size proportional to the feature dimension $K$ and invariant with respect to the size of the state space, and exploits the monotonicity property and intrinsic noise structure of the Bellman operator.

Trust Region Policy Optimization

- Computer Science · ICML
- 2015

A method for optimizing control policies, with guaranteed monotonic improvement, by making several approximations to the theoretically-justified scheme, called Trust Region Policy Optimization (TRPO).

A Natural Policy Gradient

- Computer Science · NIPS
- 2001

This work provides a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space and shows drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
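The steepest-direction idea in this snippet can be illustrated concretely: for a softmax policy, precondition the vanilla policy gradient by the inverse Fisher information of the action distribution. A minimal sketch, with assumed action values and a small regularizer (the softmax Fisher is singular along the all-ones direction):

```python
import numpy as np

# Hypothetical natural-gradient step for a single-state softmax policy;
# q (action values), step size, and regularizer are illustrative.
rng = np.random.default_rng(2)
n_actions = 4
theta = rng.normal(size=n_actions)           # policy logits
q = rng.normal(size=n_actions)               # action values (assumed known)

p = np.exp(theta - theta.max())
p /= p.sum()                                 # softmax policy

grad = p * (q - p @ q)                       # vanilla policy gradient of E_p[q]
fisher = np.diag(p) - np.outer(p, p)         # Fisher information of the softmax

# Natural gradient: F^{-1} g, with a tiny ridge term for invertibility.
nat_grad = np.linalg.solve(fisher + 1e-6 * np.eye(n_actions), grad)
theta_new = theta + 0.1 * nat_grad
```

Unlike the vanilla gradient, this step is invariant (up to the regularizer) to smooth reparameterizations of the policy.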

Reinforcement Learning: An Introduction

- Computer Science · IEEE Transactions on Neural Networks
- 2005

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

- Economics, Computer Science · Found. Trends Mach. Learn.
- 2012

The focus is on two extreme cases in which the analysis of regret is particularly simple and elegant: independent and identically distributed payoffs and adversarial payoffs.

Is Q-learning Provably Efficient?

- Computer Science · NeurIPS
- 2018

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…

Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

- Computer Science · ICML
- 2021

This paper proposes a novel algorithm that makes use of the feature mapping and obtains the first polynomial regret bound, and suggests that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.

Comments on the Du-Kakade-Wang-Yang Lower Bounds

- Computer Science · ArXiv
- 2019

Another line of work, which centers around a statistic called the eluder dimension, establishes tractability of problems similar to those considered in the Du-Kakade-Wang-Yang paper; this note compares the results and reconciles the interpretations.

Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound

- Computer Science · ICML
- 2020

These results are the first regret bounds that are near-optimal in time $T$ and dimension $d$ (or $\widetilde{d}$) and polynomial in the planning horizon $H$.

Proximal Policy Optimization Algorithms

- Computer Science · ArXiv
- 2017

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective…