• Corpus ID: 209323916

# Provably Efficient Exploration in Policy Optimization

@article{Cai2019ProvablyEE,
title={Provably Efficient Exploration in Policy Optimization},
author={Qi Cai and Zhuoran Yang and Chi Jin and Zhaoran Wang},
journal={ArXiv},
year={2019},
volume={abs/1912.05830}
}
• Published 12 December 2019
• Computer Science
• ArXiv
While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction…
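The idea sketched in the abstract, following an optimistic version of the policy gradient direction, can be illustrated as a mirror-descent policy update against optimistically bonused action-value estimates. The softmax update form, the bonus, and the step size `eta` below are illustrative assumptions for a tabular toy setting, not the paper's exact construction:

```python
import numpy as np

def oppo_step(policy, q_hat, bonus, eta=0.1):
    """One optimistic policy-improvement step (sketch).

    policy : (S, A) array of per-state action probabilities
    q_hat  : (S, A) estimated action values
    bonus  : (S, A) optimism bonus added to the value estimates
    Mirror-descent / multiplicative-weights update: pi ∝ pi * exp(eta * (Q + bonus)).
    """
    q_opt = q_hat + bonus                          # optimistic value estimate
    logits = np.log(policy) + eta * q_opt          # multiplicative-weights update
    new_policy = np.exp(logits - logits.max(axis=1, keepdims=True))
    return new_policy / new_policy.sum(axis=1, keepdims=True)

# toy usage: uniform policy over 2 states, 3 actions
pi = np.full((2, 3), 1.0 / 3.0)
q = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
b = np.zeros((2, 3))
pi = oppo_step(pi, q, b, eta=1.0)
```

The update shifts probability mass toward actions whose optimistic value is higher, while the multiplicative form keeps the new policy close to the old one, mirroring the proximal flavor of PPO.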
185 Citations
• Computer Science
ArXiv
• 2022
A dynamic regret bound and a constraint violation bound are established for the proposed algorithm in both the linear kernel CMDP function approximation setting and the tabular CMDP setting under two alternative assumptions.
• Computer Science
AISTATS
• 2022
This paper proposes an optimistic policy optimization algorithm, POWERS, shows that it achieves a near-optimal regret bound, and proves a matching lower bound of $\widetilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
• Computer Science
NeurIPS
• 2021
Two positive results show that provably sample-efficient RL is possible under either an additional low-variance assumption or a novel hypercontractivity assumption, while an exponential sample-complexity lower bound is shown to hold even if a constant suboptimality gap is assumed.
• Computer Science, Mathematics
ArXiv
• 2021
An optimistic policy optimization algorithm with a Bernstein bonus is proposed and shown to achieve a regret bound that makes it the first computationally efficient, nearly minimax optimal algorithm for adversarial Markov decision processes with linear function approximation.
• Computer Science
ArXiv
• 2020
This work shows that the OPT-WLSVI algorithm, when competing against the best policy at each time, achieves regret upper bounded by $\widetilde{\mathcal{O}}(d^{7/6}H^2 \Delta^{1/3} K^{2/3})$, the first regret bound for non-stationary reinforcement learning with linear function approximation.
• Computer Science
ICML
• 2020
This paper considers model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback, and proposes an optimistic trust region policy optimization (TRPO) algorithm, establishing regret bounds for both stochastic and adversarial rewards.
• Computer Science
ArXiv
• 2021
This paper proposes an optimistic generative adversarial policy optimization algorithm (OGAP), proves a regret bound for OGAP, and obtains the optimality gap of PGAP, achieving the minimax lower bound in the utilization of the additional dataset.
• Computer Science
ICML
• 2021
A model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle that drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
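The perturbation idea behind RLSVI-style exploration can be sketched as a least-squares value fit on noise-perturbed regression targets; the noise scale and ridge regularizer below are illustrative assumptions, not the paper's tuned constants:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_lsvi_fit(features, targets, noise_std=1.0, reg=1.0):
    """Least-squares value fit on noise-perturbed targets (RLSVI-style sketch).

    features : (n, d) feature matrix for visited state-action pairs
    targets  : (n,) regression targets (e.g., reward plus next-state value)
    Each target is perturbed with independent Gaussian noise, so the fitted
    weights behave like a posterior sample and randomize the greedy policy.
    """
    noisy = targets + rng.normal(0.0, noise_std, size=targets.shape)
    d = features.shape[1]
    # ridge regression: w = (X^T X + reg * I)^{-1} X^T y
    gram = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(gram, features.T @ noisy)

# toy usage: recover weights of a noiseless linear value function
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = perturbed_lsvi_fit(X, y, noise_std=0.1)
```

Because each episode refits with fresh noise, the agent's greedy policy varies between episodes, which is what drives exploration without an explicit bonus term.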
• Computer Science
• 2021
An optimistic model-based policy optimization algorithm is proposed, which allows general function approximations while incorporating exploration in the episodic setting and establishes a $\sqrt{T}$-regret that scales polynomially in the eluder dimension of the general model class.
• Computer Science
ICML
• 2020
This analysis establishes the global optimality and convergence rate of GAIL with neural networks for the first time, analyzing a gradient-based algorithm with alternating updates and establishing its sublinear convergence to the globally optimal solution.

## References

SHOWING 1-10 OF 85 REFERENCES

• Computer Science
ICML
• 2019
This work proposes a parametric Q-learning algorithm that finds an approximately optimal policy using a sample size proportional to the feature dimension $K$ and invariant with respect to the size of the state space, and exploits the monotonicity property and intrinsic noise structure of the Bellman operator.
• Computer Science
ICML
• 2015
A method for optimizing control policies, with guaranteed monotonic improvement, by making several approximations to the theoretically-justified scheme, called Trust Region Policy Optimization (TRPO).
• Computer Science
• 2006
This chapter discusses prediction with expert advice, efficient forecasters for large classes of experts, and randomized prediction for specific losses.
This work provides a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space and shows drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
• Computer Science
IEEE Transactions on Neural Networks
• 2005
This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
• Economics, Computer Science
Found. Trends Mach. Learn.
• 2012
The focus is on two extreme cases in which the analysis of regret is particularly simple and elegant: independent and identically distributed payoffs and adversarial payoffs.
• Computer Science
NeurIPS
• 2018
Q-learning with UCB exploration is shown to achieve $\sqrt{T}$ regret in an episodic MDP setting; this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."
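The UCB-exploration mechanism referenced here can be sketched as a tabular Q-learning update with a count-based optimism bonus; the step-size schedule and bonus constant `c` below are simplified illustrative choices, not the paper's exact ones:

```python
import math

def ucb_q_update(q, counts, s, a, r, s_next, actions, gamma=0.99, c=1.0):
    """One tabular Q-learning update with a count-based UCB bonus (sketch).

    q      : dict mapping (state, action) -> value estimate
    counts : dict mapping (state, action) -> visit count
    The bonus c / sqrt(n) shrinks as (s, a) is visited more often,
    encouraging the agent to try rarely taken actions.
    """
    counts[(s, a)] = counts.get((s, a), 0) + 1
    n = counts[(s, a)]
    alpha = 1.0 / n                                  # decaying step size
    bonus = c / math.sqrt(n)                         # optimism bonus
    target = r + bonus + gamma * max(q.get((s_next, b), 0.0) for b in actions)
    q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * target
    return q[(s, a)]

# toy usage: first visit to (state 0, action 0)
q, counts = {}, {}
v = ucb_q_update(q, counts, s=0, a=0, r=1.0, s_next=1, actions=[0, 1])
```

On the first visit the bonus is largest and the update fully trusts the optimistic target; as counts grow, both the bonus and the step size decay, so estimates concentrate on the true values.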
• Computer Science
ICML
• 2021
This paper proposes a novel algorithm which makes use of the feature mapping and obtains a first polynomial regret bound, and suggests that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
• Computer Science
ArXiv
• 2019
Another line of work, which centers around a statistic called the eluder dimension, establishes tractability of problems similar to those considered in the Du-Kakade-Wang-Yang paper; this work compares results and reconciles interpretations.
• Computer Science
NeurIPS
• 2019
This work develops no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes.