Is Pessimism Provably Efficient for Offline RL?
- Ying Jin, Zhuoran Yang, Zhaoran Wang
- Computer Science, International Conference on Machine Learning
- 30 December 2020
This paper proposes a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as a penalty function, and establishes a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs).
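For intuition, a minimal tabular sketch of the pessimistic value iteration idea is below; the count-based penalty Gamma(s, a) = beta / sqrt(N(s, a)) and all names are illustrative assumptions, while the paper works with general uncertainty quantifiers (including the linear-MDP case).

```python
import numpy as np

def pevi_sketch(r_hat, P_hat, N, H, beta):
    """Pessimistic value iteration, tabular finite-horizon sketch.

    r_hat: (S, A) estimated rewards; P_hat: (S, A, S) estimated transitions;
    N: (S, A) visit counts in the offline dataset; H: horizon; beta: penalty scale.
    """
    S, A = r_hat.shape
    Gamma = beta / np.sqrt(np.maximum(N, 1))   # assumed count-based uncertainty quantifier
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r_hat + P_hat @ V - Gamma          # penalize uncertain (s, a) pairs
        Q = np.clip(Q, 0.0, H - h)             # truncate to the valid value range
        policy[h] = Q.argmax(axis=1)           # greedy w.r.t. the pessimistic Q
        V = Q.max(axis=1)
    return policy
```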
Provably Efficient Exploration in Policy Optimization
- Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang
- Computer Science, International Conference on Machine Learning
- 12 December 2019
This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, the Optimistic Proximal Policy Optimization (OPPO) algorithm achieves a $\tilde{O}(\sqrt{d^2 H^3 T})$ regret, where $d$ is the feature dimension, $H$ the episode horizon, and $T$ the total number of steps.
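The policy improvement step behind such optimistic policy optimization is a mirror-descent (KL-regularized) update with an optimism bonus added to the estimated action values; a schematic sketch is below, where the bonus, step size, and tabular representation are assumptions for illustration.

```python
import numpy as np

def optimistic_policy_update(pi_old, Q_hat, bonus, alpha):
    """One KL-regularized (exponentiated-gradient) policy improvement step.

    pi_old: (S, A) current policy (strictly positive); Q_hat: (S, A) value
    estimates; bonus: (S, A) optimism bonus; alpha: step size.
    Returns pi_new(a|s) proportional to pi_old(a|s) * exp(alpha * (Q_hat + bonus)).
    """
    logits = np.log(pi_old) + alpha * (Q_hat + bonus)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)
```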
Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium
- Qiaomin Xie, Yudong Chen, Zhaoran Wang, Zhuoran Yang
- Computer Science, Annual Conference on Computational Learning Theory
- 17 February 2020
This work develops provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves and proposes an optimistic variant of the least-squares minimax value iteration algorithm.
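At each state, minimax value iteration reduces to solving a zero-sum matrix game over the two players' simultaneous actions; a minimal LP sketch of that inner step is below. The paper's algorithm instead settles for a coarse correlated equilibrium of the optimistic value estimates for computational efficiency, and the least-squares estimation and bonus are omitted here.

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(M):
    """Max-player value and mixed strategy for the zero-sum matrix game M (m x n).

    LP: maximize v subject to sum_i x_i * M[i, j] >= v for every column j,
    with x a probability distribution over the max-player's actions.
    """
    m, n = M.shape
    c = np.zeros(m + 1); c[-1] = -1.0                      # minimize -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])              # v - x^T M[:, j] <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # sum_i x_i = 1
    b_eq = np.ones(1)
    bounds = [(0, None)] * m + [(None, None)]              # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                            # strategy, game value
```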
Neural Trust Region/Proximal Policy Optimization Attains Globally Optimal Policy
- Boyi Liu, Qi Cai, Zhuoran Yang, Zhaoran Wang
- Computer Science, Neural Information Processing Systems
- 2019
It is proved that variants of proximal policy optimization (PPO) and trust region policy optimization (TRPO) equipped with overparametrized neural networks converge to the globally optimal policy at a sublinear rate.
Neural Temporal-Difference Learning Converges to Global Optima
- Qi Cai, Zhuoran Yang, J. Lee, Zhaoran Wang
- Computer Science, Neural Information Processing Systems
- 24 May 2019
This paper proves for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. It also establishes the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.
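A minimal numpy sketch of the analyzed update, semi-gradient TD(0) on a two-layer ReLU value network, is below; the width, step size, and initialization are illustrative, and the paper's guarantees hold in the overparametrized (wide-network) regime.

```python
import numpy as np

class NeuralTD:
    """Semi-gradient TD(0) with a two-layer ReLU value network (sketch)."""

    def __init__(self, dim, width, lr=0.01, gamma=0.99, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=1.0 / np.sqrt(dim), size=(width, dim))
        self.w2 = rng.normal(scale=1.0 / np.sqrt(width), size=width)
        self.lr, self.gamma = lr, gamma

    def value(self, s):
        return self.w2 @ np.maximum(self.W1 @ s, 0.0)

    def update(self, s, r, s_next, done):
        h = np.maximum(self.W1 @ s, 0.0)
        target = r + (0.0 if done else self.gamma * self.value(s_next))
        delta = target - self.w2 @ h               # TD error
        # semi-gradient: differentiate V(s) only, not the bootstrapped target
        grad_w2 = h
        grad_W1 = np.outer(self.w2 * (h > 0.0), s)
        self.w2 += self.lr * delta * grad_w2
        self.W1 += self.lr * delta * grad_W1
```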
A Near-Optimal Algorithm for Stochastic Bilevel Optimization via Double-Momentum
- Prashant Khanduri, Siliang Zeng, Mingyi Hong, Hoi-To Wai, Zhaoran Wang, Zhuoran Yang
- Computer Science, Neural Information Processing Systems
- 15 February 2021
This work proposes a new algorithm, the Single-timescale Double-momentum Stochastic Approximation (SUSTAIN), for tackling stochastic unconstrained bilevel optimization problems in which the lower-level subproblem is strongly convex and the upper-level objective function is smooth.
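A schematic of the single-timescale double-momentum loop is below; the plain exponential-moving-average momentum and the two gradient oracles are simplifying assumptions, and the hypergrad oracle abstracts away the paper's estimator for the hypergradient, which involves the lower-level Hessian inverse.

```python
import numpy as np

def sustain_sketch(hypergrad, grad_g_y, x0, y0,
                   alpha=0.01, beta=0.01, eta=0.9, steps=1000):
    """Single-timescale double-momentum bilevel loop (schematic).

    hypergrad(x, y): stochastic estimate of d f(x, y*(x)) / dx at (x, y);
    grad_g_y(x, y): stochastic gradient of the lower-level objective in y.
    Momentum is applied to BOTH levels, and x and y move at the same timescale.
    """
    x, y = x0.copy(), y0.copy()
    m_x, m_y = hypergrad(x, y), grad_g_y(x, y)
    for _ in range(steps):
        x = x - alpha * m_x                            # upper-level step
        y = y - beta * m_y                             # lower-level step
        m_x = eta * m_x + (1 - eta) * hypergrad(x, y)
        m_y = eta * m_y + (1 - eta) * grad_g_y(x, y)
    return x, y
```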
On Finite-Time Convergence of Actor-Critic Algorithm
- Shuang Qiu, Zhuoran Yang, Jieping Ye, Zhaoran Wang
- Computer Science, IEEE Journal on Selected Areas in Information Theory
- 1 June 2021
This work appears to provide the first finite-time convergence analysis of an online actor-critic algorithm with TD learning, and it gives a theoretical analysis of the TD(0) algorithm for the average-reward setting with dependent, online data.
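A compact sketch of such an online actor-critic loop, one transition per step in the average-reward setting, is below; the softmax-tabular actor, the env_step interface, and the step sizes are assumptions for illustration.

```python
import numpy as np

def online_actor_critic(env_step, S, A, s0, lr_a=0.01, lr_c=0.1, steps=10_000):
    """Online actor-critic sketch: softmax actor, TD(0) critic, average reward.

    env_step(s, a) -> (r, s_next) is an assumed environment interface.
    Every iteration uses a single (dependent) online transition.
    """
    theta = np.zeros((S, A))                   # actor logits
    V, rho, s = np.zeros(S), 0.0, s0           # critic values, avg-reward estimate
    rng = np.random.default_rng(0)
    for _ in range(steps):
        pi = np.exp(theta[s] - theta[s].max()); pi /= pi.sum()
        a = rng.choice(A, p=pi)
        r, s_next = env_step(s, a)
        delta = r - rho + V[s_next] - V[s]     # average-reward TD error
        rho += lr_c * delta                    # critic: running reward rate
        V[s] += lr_c * delta                   # critic: TD(0) value update
        grad_log = -pi; grad_log[a] += 1.0     # d log pi(a|s) / d theta[s]
        theta[s] += lr_a * delta * grad_log    # actor: policy-gradient step
        s = s_next
    return theta, V
```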
Convergent Policy Optimization for Safe Reinforcement Learning
- Ming Yu, Zhuoran Yang, M. Kolar, Zhaoran Wang
- Mathematics, Computer Science, Neural Information Processing Systems
- 26 October 2019
This work constructs a sequence of surrogate convex constrained optimization problems by locally replacing the nonconvex functions with convex quadratic functions obtained from policy gradient estimators, and proves that the solutions to these surrogate problems converge to a stationary point of the original nonconvex problem.
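One iteration of that successive convex approximation scheme might look like the sketch below; the quadratic surrogate form with proximal weight tau and the SLSQP solver are illustrative assumptions, and in the paper the gradients come from policy gradient estimators.

```python
import numpy as np
from scipy.optimize import minimize

def sca_step(theta_k, g_obj, c_k, g_cons, tau=1.0):
    """One surrogate convex constrained step (sketch).

    g_obj: objective gradient estimate at theta_k; c_k, g_cons: constraint
    value and gradient estimate at theta_k; tau: strong-convexity weight.
    """
    def surr_obj(t):                           # convex quadratic objective model
        d = t - theta_k
        return g_obj @ d + 0.5 * tau * d @ d

    def surr_cons(t):                          # SLSQP treats fun(t) >= 0 as feasible
        d = t - theta_k
        return -(c_k + g_cons @ d + 0.5 * tau * d @ d)

    res = minimize(surr_obj, theta_k, method="SLSQP",
                   constraints=[{"type": "ineq", "fun": surr_cons}])
    return res.x                               # next iterate theta_{k+1}
```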
Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning
- Shuang Qiu, Zhuoran Yang, Xiaohan Wei, Jieping Ye, Zhaoran Wang
- Computer Science, arXiv
- 23 August 2020
This paper proposes two single-timescale single-loop algorithms that require only one data point per step and implement momentum updates on both primal and dual variables, achieving an $O(\varepsilon^{-4})$ sample complexity, which shows the important role of momentum in obtaining a single-timescale algorithm.
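A schematic of the single-loop primal-dual structure is below; the exponential-moving-average momentum form, fixed step sizes, and projection oracle are assumptions, with the paper's parameter schedules delivering the stated $O(\varepsilon^{-4})$ sample complexity.

```python
import numpy as np

def primal_dual_momentum(grad_x, grad_y, x0, y0, project_y,
                         lr_x=0.01, lr_y=0.01, eta=0.9, steps=10_000):
    """Single-timescale single-loop sketch for min-max problems min_x max_y f.

    grad_x(x, y), grad_y(x, y): one-sample stochastic gradients of f;
    project_y: projection onto the (concave) dual variable's constraint set.
    Each iteration touches ONE data point and updates x and y together.
    """
    x, y = x0.copy(), y0.copy()
    m_x, m_y = grad_x(x, y), grad_y(x, y)
    for _ in range(steps):
        x = x - lr_x * m_x                             # primal descent
        y = project_y(y + lr_y * m_y)                  # dual ascent
        m_x = eta * m_x + (1 - eta) * grad_x(x, y)     # momentum on the primal
        m_y = eta * m_y + (1 - eta) * grad_y(x, y)     # momentum on the dual
    return x, y
```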
Randomized Exploration for Reinforcement Learning with General Value Function Approximation
- Haque Ishfaq, Qiwen Cui, Lin F. Yang
- Computer Science, International Conference on Machine Learning
- 15 June 2021
This work proposes a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle, which drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
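The perturbation idea reduces to refitting the value regression on noisy targets; a minimal linear-features sketch is below, where the ridge form and Gaussian noise scale are assumptions (the paper handles general value function approximation by retraining on the perturbed data).

```python
import numpy as np

def perturbed_lsvi_fit(Phi, targets, sigma, lam=1.0, rng=None):
    """One perturbed least-squares value regression (RLSVI-style sketch).

    Phi: (n, d) features of visited state-action pairs;
    targets: (n,) Bellman targets r + max_a Q_next; sigma: noise scale.
    Exploration comes from i.i.d. scalar noise injected into the targets.
    """
    rng = rng or np.random.default_rng()
    noisy = targets + rng.normal(scale=sigma, size=targets.shape)
    d = Phi.shape[1]
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ noisy)
    return w                                   # randomized value-function weights
```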
...
...