# Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings

@inproceedings{Yin2021OptimalUO, title={Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings}, author={Ming Yin and Yu-Xiang Wang}, booktitle={Neural Information Processing Systems}, year={2021} }

This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework towards optimal learning for several well-motivated offline tasks. We establish an Ω(H²S/dmε²) lower bound (over the model-based family) for global uniform OPE, and our main result establishes an upper bound of Õ(H²/dmε²) for local uniform convergence. The highlight in achieving the optimal rate Õ(H²/dmε²) is our…
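The model-based (plug-in) approach the paper analyzes can be sketched in the tabular, time-homogeneous case. The sketch below is illustrative, not the authors' code: it builds the empirical MDP from offline (s, a, r, s′) tuples and evaluates a fixed target policy on it by backward induction.

```python
import numpy as np

def plugin_ope(transitions, policy, S, A, H):
    """Model-based (plug-in) OPE sketch: build the empirical MDP from
    offline (s, a, r, s') tuples, then evaluate `policy` on it.
    Time-homogeneous: one shared transition kernel for all steps."""
    counts = np.zeros((S, A, S))
    rewards = np.zeros((S, A))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        rewards[s, a] += r
    n = counts.sum(axis=2)                  # visit counts n(s, a)
    n_safe = np.maximum(n, 1)
    P_hat = counts / n_safe[:, :, None]     # empirical kernel P-hat(s'|s,a)
    r_hat = rewards / n_safe                # empirical mean reward r-hat(s,a)
    V = np.zeros(S)                         # V_{H+1} = 0
    for _ in range(H):                      # backward induction on P-hat
        Q = r_hat + P_hat @ V               # Q(s,a) = r-hat + sum_{s'} P-hat V
        V = (policy * Q).sum(axis=1)        # V(s) = sum_a pi(a|s) Q(s,a)
    return V
```

The returned V is the value of `policy` under the empirical model; uniform OPE asks that this estimate be accurate simultaneously over a whole policy class.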

## 11 Citations

### On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL

- Computer Science, ArXiv
- 2022

The RFOlive (Reward-Free Olive) algorithm is proposed for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs, linear completeness, and low-rank MDPs with unknown representation.

### Offline Reinforcement Learning with Differential Privacy

- Computer Science, ArXiv
- 2022

This work designs RL algorithms with provable privacy guarantees that enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings, and suggests that the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart for a medium-size dataset.

### Near-Optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism

- Computer Science
- 2021

This work proposes variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs) and provides improved offline learning bounds over the existing best-known results.

### On Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks in Besov Spaces

- Computer Science, Mathematics
- 2022

The statistical theory of offline RL with deep ReLU network function approximation is studied, and the sample complexity of offline reinforcement learning in this setting is established.

### Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity

- Computer Science, ArXiv
- 2022

A model-based algorithm that combines distributionally robust value iteration with the principle of pessimism in the face of uncertainty is proposed, by penalizing the robust value estimates with a carefully designed data-driven penalty term.
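The pessimism-in-the-face-of-uncertainty principle mentioned above can be sketched in the plain tabular case (the distributional-robustness machinery is omitted, and the count-based penalty form below is an illustrative choice, not the cited paper's exact term):

```python
import numpy as np

def pessimistic_vi(P_hat, r_hat, n, H, c=1.0, delta=0.1):
    """Pessimistic value iteration sketch: subtract a data-driven penalty
    b(s,a) = c * sqrt(log(1/delta) / n(s,a)) from the empirical Q-values,
    so rarely visited state-action pairs are valued conservatively."""
    b = c * np.sqrt(np.log(1.0 / delta) / np.maximum(n, 1))  # penalty term
    V = np.zeros(r_hat.shape[0])
    for _ in range(H):
        Q = np.clip(r_hat + P_hat @ V - b, 0.0, None)  # penalized backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V  # greedy policy w.r.t. pessimistic Q
```

With this penalty, an action supported by many samples is preferred over an equally rewarding but rarely observed one, which is the behavior pessimism is designed to induce.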

### Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward

- Computer Science, ArXiv
- 2022

This work proposes a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
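The least-squares reward-redistribution step can be illustrated with a minimal linear sketch (the feature map `phi` and the ridge parameter are hypothetical ingredients for illustration, not PARTED's actual design): fit θ so that the summed per-step proxies match each trajectory's total return.

```python
import numpy as np

def redistribute_rewards(trajs, returns, phi, d, lam=1e-3):
    """Least-squares reward redistribution sketch: fit theta so that
    sum_t phi(s_t, a_t)^T theta matches each trajectory's total return,
    then use r_hat(s, a) = phi(s, a)^T theta as a per-step proxy reward."""
    # One row per trajectory: the summed feature vector over its steps.
    X = np.array([sum(phi(s, a) for s, a in traj) for traj in trajs])
    G = X.T @ X + lam * np.eye(d)        # ridge-regularized Gram matrix
    theta = np.linalg.solve(G, X.T @ np.array(returns))
    return lambda s, a: phi(s, a) @ theta
```

The returned function supplies per-step proxy rewards on which a pessimistic value iteration can then be run.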

### Provable Benefit of Multitask Representation Learning in Reinforcement Learning

- Computer Science, ArXiv
- 2022

This paper theoretically characterizes the benefit of representation learning in exploration-based reward-free multitask RL for both upstream and downstream tasks, and demonstrates that multitask representation learning is provably more sample-efficient than learning each task individually, as long as the total number of tasks is above a certain threshold.

### Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism

- Computer Science, ICLR
- 2022

This work proposes variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs) and provides improved offline learning bounds over the best-known existing results.
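The variance-aware ingredient can be illustrated by a weighted least-squares sketch (illustrative only, not VAPVI itself): regression targets with high estimated conditional variance are down-weighted, which is what sharpens the resulting value estimates.

```python
import numpy as np

def variance_weighted_lsq(X, y, var, lam=1e-3):
    """Variance-weighted ridge regression sketch: each target y_i is
    weighted by 1/var_i, so high-variance (noisy) Bellman targets
    contribute less to the fitted value-function parameters."""
    w = 1.0 / np.maximum(var, 1e-8)              # inverse-variance weights
    G = (X * w[:, None]).T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(G, (X * w[:, None]).T @ y)
```

A point with tiny variance dominates the fit, while a very noisy point is effectively ignored.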

### The Curse of Passive Data Collection in Batch Reinforcement Learning

- Computer Science, AISTATS
- 2022

This paper shows that even with the best (but passively chosen) logging policy, Ω(A^{min(S−1,H)}/ε²) episodes are necessary to obtain an ε-optimal policy, where H is the length of episodes.

### Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

- Computer Science, NeurIPS
- 2021

This work analyzes the Adaptive Pessimistic Value Iteration (APVI) algorithm and derives a suboptimality upper bound that nearly matches the instance-dependent (intrinsic) lower bound.

## References

SHOWING 1-10 OF 86 REFERENCES

### Is Q-learning Provably Efficient?

- Computer Science, NeurIPS
- 2018

Q-learning with UCB exploration achieves Õ(√(H³SAT)) regret in an episodic MDP setting; this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."
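A minimal sketch of episodic Q-learning with UCB-style Hoeffding bonuses (constants and the bonus form are simplified from the cited analysis, not a faithful reproduction):

```python
import numpy as np

def ucb_q_learning(env_step, env_reset, S, A, H, K, c=0.1, delta=0.1):
    """Sketch of episodic Q-learning with UCB-Hoeffding-style bonuses:
    optimistic initialization at H, learning rate alpha_t = (H+1)/(H+t),
    and an exploration bonus shrinking like sqrt(H^3 * iota / t)."""
    iota = np.log(S * A * H * K / delta)   # log factor in the bonus
    Q = np.full((H, S, A), float(H))       # optimistic init: values lie in [0, H]
    N = np.zeros((H, S, A))                # visit counts per (h, s, a)
    for _ in range(K):
        s = env_reset()
        for h in range(H):
            a = int(Q[h, s].argmax())      # act greedily w.r.t. optimistic Q
            r, s2 = env_step(s, a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)      # rate tuned for H-step credit
            bonus = c * np.sqrt(H**3 * iota / t)
            v_next = Q[h + 1, s2].max() if h + 1 < H else 0.0
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + min(v_next, H) + bonus)
            s = s2
    return Q
```

The optimistic initialization plus shrinking bonus is what drives exploration without a model or a simulator.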

### Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

- Computer Science, Mathematics, NeurIPS
- 2021

This paper proposes Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm for offline RL, and establishes an information-theoretic lower bound of Ω(H²/dmε²) which certifies that OPDVR is optimal up to logarithmic factors.

### Near Optimal Provable Uniform Convergence in Off-Policy Evaluation for Reinforcement Learning

- Computer Science, AISTATS
- 2021

This work reveals, for the first time, the comprehensive relationship between OPE and offline learning, and shows that uniform convergence guarantees in OPE can be obtained efficiently.

### Adaptive Reward-Free Exploration

- Computer Science, ALT
- 2021

This work proves that RF-UCRL needs O((SAH⁴/ε²) log(1/δ)) episodes to output, with probability 1 − δ, an ε-approximation of the optimal policy for any reward function, and empirically compares it to oracle strategies using a generative model.

### Fitted Q-iteration in continuous action-space MDPs

- Computer Science, NIPS
- 2007

A rigorous analysis is provided of a variant of fitted Q-iteration in which greedy action selection is replaced by searching for a policy, within a restricted set of candidate policies, that maximizes the average action values; this yields the first finite-time bound for value-function-based algorithms for continuous state and action problems.
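A minimal fitted Q-iteration sketch with linear function approximation (the `featurize` helper and the regularization are illustrative assumptions, not the cited paper's construction): each iteration regresses Bellman targets onto state-action features.

```python
import numpy as np

def fitted_q_iteration(data, featurize, actions, gamma=0.9, iters=50, lam=1e-3):
    """Fitted Q-iteration sketch: each round regresses the Bellman targets
    r + gamma * max_a' Q(s', a') onto features of (s, a) via ridge
    regression, then repeats with the newly fitted Q."""
    s, a, r, s2 = data                         # arrays of offline transitions
    X = np.stack([featurize(si, ai) for si, ai in zip(s, a)])
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        # Bellman targets using the current Q estimate
        q_next = np.stack([
            np.stack([featurize(s2i, an) for an in actions]) @ theta
            for s2i in s2
        ])
        y = r + gamma * q_next.max(axis=1)
        theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return theta
```

Restricting the max over `actions` to a candidate policy set, as the cited variant does, is what makes the continuous-action case tractable.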

### Reinforcement Learning: An Introduction

- Computer Science, IEEE Transactions on Neural Networks
- 2005

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

### (2020) is not sharp for the finite-horizon stationary setting, as it requires s-absorbing MDPs with an H-dimensional cover (which has size ≈ e^H and is not optimal)

- 2020

### Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

- Computer Science, NeurIPS
- 2021

This work analyzes the Adaptive Pessimistic Value Iteration (APVI) algorithm and derives a suboptimality upper bound that nearly matches the instance-dependent (intrinsic) lower bound.

### On the Optimality of Batch Policy Optimization Algorithms

- Computer Science, ICML
- 2021

This work introduces a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, enabling a general analysis, and introduces a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.