# Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings

@inproceedings{Yin2021OptimalUO, title={Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings}, author={Ming Yin and Yu-Xiang Wang}, booktitle={Neural Information Processing Systems}, year={2021} }

This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework towards optimal learning for several well-motivated offline tasks. We establish an Ω(H²S/dmε²) lower bound (over the model-based family) for global uniform OPE, and our main result establishes an upper bound of Õ(H²/dmε²) for local uniform convergence. The highlight in achieving the optimal rate Õ(H²/dmε²) is our…
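The model-based (plug-in) approach the paper analyzes can be sketched in the tabular, time-homogeneous case. The sketch below is illustrative, not the authors' code: it builds the empirical MDP from offline (s, a, r, s′) tuples and evaluates a fixed target policy on it by backward induction.

```python
import numpy as np

def plugin_ope(transitions, policy, S, A, H):
    """Model-based (plug-in) OPE sketch: build the empirical MDP from
    offline (s, a, r, s') tuples, then evaluate `policy` on it.
    Time-homogeneous: one shared transition kernel for all steps."""
    counts = np.zeros((S, A, S))
    rewards = np.zeros((S, A))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        rewards[s, a] += r
    n = counts.sum(axis=2)                  # visit counts n(s, a)
    n_safe = np.maximum(n, 1)
    P_hat = counts / n_safe[:, :, None]     # empirical kernel P-hat(s'|s,a)
    r_hat = rewards / n_safe                # empirical mean reward r-hat(s,a)
    V = np.zeros(S)                         # V_{H+1} = 0
    for _ in range(H):                      # backward induction on P-hat
        Q = r_hat + P_hat @ V               # Q(s,a) = r-hat + sum_{s'} P-hat V
        V = (policy * Q).sum(axis=1)        # V(s) = sum_a pi(a|s) Q(s,a)
    return V
```

The returned V is the value of `policy` under the empirical model; uniform OPE asks that this estimate be accurate simultaneously over a whole policy class.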

## 11 Citations

### On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL

- Computer Science, ArXiv
- 2022

The RFOlive (Reward-Free Olive) algorithm is proposed for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs, linear completeness, and low-rank MDPs with unknown representation.

### Offline Reinforcement Learning with Differential Privacy

- Computer Science, ArXiv
- 2022

This work designs RL algorithms with provable privacy guarantees that enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings, and suggests that the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart for a medium-size dataset.

### Near-Optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism

- Computer Science
- 2021

This work proposes variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs) and provides improved offline learning bounds over the existing best-known results.

### On Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks in Besov Spaces

- Computer Science, Mathematics
- 2022

The statistical theory of offline RL with deep ReLU network function approximation is studied, and the sample complexity of offline reinforcement learning in this setting is established.

### Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity

- Computer Science, ArXiv
- 2022

A model-based algorithm that combines distributionally robust value iteration with the principle of pessimism in the face of uncertainty is proposed, by penalizing the robust value estimates with a carefully designed data-driven penalty term.
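The pessimism-in-the-face-of-uncertainty principle mentioned above can be sketched in the plain tabular case (the distributional-robustness machinery is omitted, and the count-based penalty form below is an illustrative choice, not the cited paper's exact term):

```python
import numpy as np

def pessimistic_vi(P_hat, r_hat, n, H, c=1.0, delta=0.1):
    """Pessimistic value iteration sketch: subtract a data-driven penalty
    b(s,a) = c * sqrt(log(1/delta) / n(s,a)) from the empirical Q-values,
    so rarely visited state-action pairs are valued conservatively."""
    b = c * np.sqrt(np.log(1.0 / delta) / np.maximum(n, 1))  # penalty term
    V = np.zeros(r_hat.shape[0])
    for _ in range(H):
        Q = np.clip(r_hat + P_hat @ V - b, 0.0, None)  # penalized backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V  # greedy policy w.r.t. pessimistic Q
```

With this penalty, an action supported by many samples is preferred over an equally rewarding but rarely observed one, which is the behavior pessimism is designed to induce.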

### Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward

- Computer Science, ArXiv
- 2022

This work proposes a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
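The least-squares reward-redistribution step can be illustrated with a minimal linear sketch (the feature map `phi` and the ridge parameter are hypothetical ingredients for illustration, not PARTED's actual design): fit θ so that the summed per-step proxies match each trajectory's total return.

```python
import numpy as np

def redistribute_rewards(trajs, returns, phi, d, lam=1e-3):
    """Least-squares reward redistribution sketch: fit theta so that
    sum_t phi(s_t, a_t)^T theta matches each trajectory's total return,
    then use r_hat(s, a) = phi(s, a)^T theta as a per-step proxy reward."""
    # One row per trajectory: the summed feature vector over its steps.
    X = np.array([sum(phi(s, a) for s, a in traj) for traj in trajs])
    G = X.T @ X + lam * np.eye(d)        # ridge-regularized Gram matrix
    theta = np.linalg.solve(G, X.T @ np.array(returns))
    return lambda s, a: phi(s, a) @ theta
```

The returned function supplies per-step proxy rewards on which a pessimistic value iteration can then be run.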

### Provable Benefit of Multitask Representation Learning in Reinforcement Learning

- Computer Science, ArXiv
- 2022

This paper theoretically characterizes the benefit of representation learning in exploration-based reward-free multitask RL for both upstream and downstream tasks, and demonstrates that multitask representation learning is provably more sample-efficient than learning each task individually, as long as the total number of tasks is above a certain threshold.

### Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism

- Computer Science, ICLR
- 2022

This work proposes variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs) and provides improved offline learning bounds over the best-known existing results.
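The variance-aware ingredient can be illustrated by a weighted least-squares sketch (illustrative only, not VAPVI itself): regression targets with high estimated conditional variance are down-weighted, which is what sharpens the resulting value estimates.

```python
import numpy as np

def variance_weighted_lsq(X, y, var, lam=1e-3):
    """Variance-weighted ridge regression sketch: each target y_i is
    weighted by 1/var_i, so high-variance (noisy) Bellman targets
    contribute less to the fitted value-function parameters."""
    w = 1.0 / np.maximum(var, 1e-8)              # inverse-variance weights
    G = (X * w[:, None]).T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(G, (X * w[:, None]).T @ y)
```

A point with tiny variance dominates the fit, while a very noisy point is effectively ignored.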

### The Curse of Passive Data Collection in Batch Reinforcement Learning

- Computer Science, AISTATS
- 2022

This paper shows that even with the best (but passively chosen) logging policy, Ω(A^{min(S−1,H)}/ε²) episodes are necessary to obtain an ε-optimal policy, where H is the length of episodes.

### Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

- Computer Science, NeurIPS
- 2021

This work analyzes the Adaptive Pessimistic Value Iteration (APVI) algorithm and derives a suboptimality upper bound that nearly matches the instance-dependent (intrinsic) lower bound.

## References

SHOWING 1-10 OF 86 REFERENCES

### Is Q-learning Provably Efficient?

- Computer Science, NeurIPS
- 2018

Q-learning with UCB exploration achieves Õ(√(H³SAT)) regret in an episodic MDP setting; this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."
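A minimal sketch of episodic Q-learning with UCB-style Hoeffding bonuses (constants and the bonus form are simplified from the cited analysis, not a faithful reproduction):

```python
import numpy as np

def ucb_q_learning(env_step, env_reset, S, A, H, K, c=0.1, delta=0.1):
    """Sketch of episodic Q-learning with UCB-Hoeffding-style bonuses:
    optimistic initialization at H, learning rate alpha_t = (H+1)/(H+t),
    and an exploration bonus shrinking like sqrt(H^3 * iota / t)."""
    iota = np.log(S * A * H * K / delta)   # log factor in the bonus
    Q = np.full((H, S, A), float(H))       # optimistic init: values lie in [0, H]
    N = np.zeros((H, S, A))                # visit counts per (h, s, a)
    for _ in range(K):
        s = env_reset()
        for h in range(H):
            a = int(Q[h, s].argmax())      # act greedily w.r.t. optimistic Q
            r, s2 = env_step(s, a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)      # rate tuned for H-step credit
            bonus = c * np.sqrt(H**3 * iota / t)
            v_next = Q[h + 1, s2].max() if h + 1 < H else 0.0
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + min(v_next, H) + bonus)
            s = s2
    return Q
```

The optimistic initialization plus shrinking bonus is what drives exploration without a model or a simulator.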

### Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

- Computer Science, Mathematics, NeurIPS
- 2021

This paper proposes Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm for offline RL, and establishes an information-theoretic lower bound of Ω(H²/dmε²) which certifies that OPDVR is optimal up to logarithmic factors.

### Near Optimal Provable Uniform Convergence in Off-Policy Evaluation for Reinforcement Learning

- Computer Science, AISTATS
- 2021

This work reveals, for the first time, the comprehensive relationship between OPE and offline learning, and shows that uniform convergence guarantees in OPE can be obtained efficiently.

### Adaptive Reward-Free Exploration

- Computer Science, ALT
- 2021

This work proves that RF-UCRL needs O((SAH⁴/ε²) log(1/δ)) episodes to output, with probability 1 − δ, an ε-approximation of the optimal policy for any reward function, and empirically compares it to oracle strategies using a generative model.

### Fitted Q-iteration in continuous action-space MDPs

- Computer Science, NIPS
- 2007

A rigorous analysis is provided of a variant of fitted Q-iteration in which greedy action selection is replaced by searching for a policy, within a restricted set of candidate policies, that maximizes the average action values; this yields the first finite-time bound for value-function-based algorithms for continuous state and action problems.
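A minimal fitted Q-iteration sketch with linear function approximation (the `featurize` helper and the regularization are illustrative assumptions, not the cited paper's construction): each iteration regresses Bellman targets onto state-action features.

```python
import numpy as np

def fitted_q_iteration(data, featurize, actions, gamma=0.9, iters=50, lam=1e-3):
    """Fitted Q-iteration sketch: each round regresses the Bellman targets
    r + gamma * max_a' Q(s', a') onto features of (s, a) via ridge
    regression, then repeats with the newly fitted Q."""
    s, a, r, s2 = data                         # arrays of offline transitions
    X = np.stack([featurize(si, ai) for si, ai in zip(s, a)])
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        # Bellman targets using the current Q estimate
        q_next = np.stack([
            np.stack([featurize(s2i, an) for an in actions]) @ theta
            for s2i in s2
        ])
        y = r + gamma * q_next.max(axis=1)
        theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return theta
```

Restricting the max over `actions` to a candidate policy set, as the cited variant does, is what makes the continuous-action case tractable.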

### Reinforcement Learning: An Introduction

- Computer Science, IEEE Transactions on Neural Networks
- 2005

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

### (2020) is not sharp for the finite-horizon stationary setting, as it requires s-absorbing MDPs with an H-dimensional cover (which has size ≈ e^H and is not optimal)

- 2020

### Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

- Computer Science, NeurIPS
- 2021

This work analyzes the Adaptive Pessimistic Value Iteration (APVI) algorithm and derives a suboptimality upper bound that nearly matches the instance-dependent (intrinsic) lower bound.

### On the Optimality of Batch Policy Optimization Algorithms

- Computer Science, ICML
- 2021

This work introduces a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, enabling a general analysis, and introduces a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.