• Corpus ID: 235165713

# Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings

@inproceedings{Yin2021OptimalUO,
title={Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings},
author={Ming Yin and Yu-Xiang Wang},
booktitle={Neural Information Processing Systems},
year={2021}
}
• Published in Neural Information Processing Systems (NeurIPS)
• 13 May 2021
• Computer Science
This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework towards optimal learning for several well-motivated offline tasks. We establish an Ω(H²S/d_mε²) lower bound (over the model-based family) for global uniform OPE, and our main result establishes an upper bound of Õ(H²/d_mε²) for local uniform convergence. The highlight in achieving the optimal rate Õ(H²/d_mε²) is our…
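As a concrete illustration of the model-based approach the abstract refers to, the following minimal tabular sketch (not the paper's implementation; `build_empirical_mdp`, `evaluate_policy`, and the self-loop fallback for unvisited pairs are all illustrative choices) fits an empirical MDP from offline data and then evaluates an arbitrary target policy by backward induction in it. Uniform OPE asks that the resulting error sup_π |V̂^π − V^π| be small simultaneously over a whole policy class, not just for one fixed policy.

```python
import numpy as np

def build_empirical_mdp(data, S, A):
    """Estimate transitions and rewards from offline (s, a, r, s') tuples."""
    counts = np.zeros((S, A, S))
    rew_sum = np.zeros((S, A))
    for s, a, r, s2 in data:
        counts[s, a, s2] += 1
        rew_sum[s, a] += r
    n_sa = counts.sum(axis=2)
    safe = np.maximum(n_sa, 1)
    P_hat = counts / safe[:, :, None]
    for s in range(S):                  # unvisited (s, a): self-loop fallback
        for a in range(A):
            if n_sa[s, a] == 0:
                P_hat[s, a, s] = 1.0
    r_hat = rew_sum / safe
    return P_hat, r_hat

def evaluate_policy(P_hat, r_hat, pi, H):
    """Backward induction in the empirical MDP for a policy pi[h, s, a]."""
    V = np.zeros(r_hat.shape[0])
    for h in reversed(range(H)):
        Q = r_hat + P_hat @ V           # Q[s, a] = r + E_{s'~P_hat}[V(s')]
        V = (pi[h] * Q).sum(axis=1)     # average over pi(a | s)
    return V
```

Because the same empirical model is reused for every policy, a single model-fitting pass supports evaluating arbitrarily many policies, which is what makes uniform-convergence guarantees the natural object of study.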

## Citations

• Computer Science
ArXiv
• 2022
The RFOLIVE (Reward-Free OLIVE) algorithm is proposed for sample-efficient reward-free exploration under minimal structural assumptions, covering the previously studied settings of linear MDPs, linear completeness, and low-rank MDPs with unknown representation.
• Computer Science
ArXiv
• 2022
This work designs RL algorithms with provable privacy guarantees which enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings and suggests that the privacy guarantee comes at (almost) no drop in utility comparing to the non-private counterpart for a medium-size dataset.
• Computer Science, Mathematics
• 2022
The statistical theory of offline RL with deep ReLU network function approximation is studied, and the sample complexity of offline reinforcement learning is established.
• Computer Science
ArXiv
• 2022
A model-based algorithm combining distributionally robust value iteration with the principle of pessimism in the face of uncertainty is proposed, penalizing the robust value estimates with a carefully designed data-driven penalty term.
• Computer Science
ArXiv
• 2022
This work proposes a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution and then performs pessimistic value iteration based on the learned proxy rewards.
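The least-squares reward-redistribution idea can be illustrated in tabular form (a hedged sketch, not PARTED itself; `redistribute_rewards` and the visit-count design matrix are illustrative assumptions): regress each trajectory's total return on its state-action visit counts, so that the fitted coefficients serve as per-step proxy rewards.

```python
import numpy as np

def redistribute_rewards(trajs, returns, S, A):
    """Fit per-step proxy rewards by least squares: regress each trajectory's
    total return on its (s, a) visit counts (tabular illustration)."""
    X = np.zeros((len(trajs), S * A))
    for i, traj in enumerate(trajs):
        for s, a in traj:
            X[i, s * A + a] += 1.0      # design matrix of visit counts
    r_flat, *_ = np.linalg.lstsq(X, np.asarray(returns, dtype=float), rcond=None)
    return r_flat.reshape(S, A)
```

When trajectories cover the state-action space well, the regression pins down rewards that are consistent with every observed return, which is exactly the signal a downstream pessimistic value-iteration step would consume.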
• Computer Science
ArXiv
• 2022
This paper theoretically characterizes the benefit of representation learning in exploration-based reward-free multitask RL for both upstream and downstream tasks, and demonstrates that multitask representation learning is provably more sample-efficient than learning each task individually, as long as the total number of tasks is above a certain threshold.
• Computer Science
ICLR
• 2022
The variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs), and provides improved offline learning bounds over the best-known existing results.
• Computer Science
AISTATS
• 2022
This paper shows that even with the best (but passively chosen) logging policy, Ω(A^{min(S−1, H)}/ε²) episodes are necessary to obtain an ε-optimal policy, where H is the length of episodes.
• Computer Science
NeurIPS
• 2021
This work analyzes the Adaptive Pessimistic Value Iteration (APVI) algorithm and derives a suboptimality upper bound that nearly matches the intrinsic instance-dependent offline learning limit.

## References

Showing 1–10 of 86 references.

• Computer Science
NeurIPS
• 2018
Q-learning with UCB exploration is shown to achieve $\sqrt{T}$-type regret in the episodic MDP setting; this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."
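The flavor of optimistic model-free learning described in that entry can be sketched as follows (illustrative only: the learning rate α_t = (H+1)/(H+t) follows the style of that line of analysis, but `ucb_q_learning`, the bonus constant, and the toy environment interface are assumptions, not the paper's exact algorithm):

```python
import numpy as np

def ucb_q_learning(env_step, S, A, H, K, c=1.0, delta=0.1):
    """Tabular Q-learning with a Hoeffding-style UCB bonus (illustrative sketch)."""
    Q = np.full((H, S, A), float(H))          # optimistic initialization at H
    N = np.zeros((H, S, A), dtype=int)        # visit counts per (h, s, a)
    iota = np.log(S * A * H * K / delta)      # log factor inside the bonus
    for _ in range(K):
        s = 0                                  # episodes start in state 0
        for h in range(H):
            a = int(np.argmax(Q[h, s]))        # greedy w.r.t. optimistic Q
            r, s2 = env_step(h, s, a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)          # rate from this style of analysis
            bonus = c * np.sqrt(H**3 * iota / t)
            v_next = np.max(Q[h + 1, s2]) if h + 1 < H else 0.0
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
            s = s2
    return Q
```

The shrinking bonus lets an initially over-explored suboptimal arm lose its optimistic advantage, after which the greedy choice settles on the truly better action.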
• Computer Science, Mathematics
NeurIPS
• 2021
This paper proposes Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm for offline RL, and establishes an information-theoretic lower bound of Ω(H²/d_mε²), which certifies that OPDVR is optimal up to logarithmic factors.
• Computer Science
AISTATS
• 2021
This work reveals for the first time a comprehensive relationship between OPE and offline learning, and shows that uniform convergence guarantees in OPE can be obtained efficiently.
• Computer Science
ALT
• 2021
This work proves that RF-UCRL needs O((SAH⁴/ε²) log(1/δ)) episodes to output, with probability 1 − δ, an ε-approximation of the optimal policy for any reward function, and empirically compares it to oracle strategies using a generative model.
• Computer Science
NIPS
• 2007
A rigorous analysis is provided of a variant of fitted Q-iteration in which greedy action selection is replaced by searching for a policy in a restricted set of candidate policies that maximizes the average action values; this yields the first finite-time bound for value-function-based algorithms for continuous state and action problems.
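A generic fitted Q-iteration loop looks as follows (a simplified linear-regression sketch, not the cited variant with restricted policy search; `fitted_q_iteration` and the one-hot featurization in the usage below are illustrative assumptions): repeatedly regress one-step Bellman targets onto a function class.

```python
import numpy as np

def fitted_q_iteration(data, featurize, n_actions, n_iters=50, gamma=0.9):
    """Repeatedly regress Bellman targets onto a linear function class."""
    X = np.array([featurize(s, a) for s, a, _, _ in data])
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        targets = []
        for s, a, r, s2 in data:
            q_next = max(featurize(s2, b) @ w for b in range(n_actions))
            targets.append(r + gamma * q_next)   # one-step Bellman backup
        w, *_ = np.linalg.lstsq(X, np.array(targets), rcond=None)
    return w
```

On a one-state, two-action toy problem with rewards 1 and 0 and γ = 0.9, the fitted values approach the fixed point Q* = 10 for the rewarding action, which is the kind of convergence the finite-time analysis quantifies.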
• Computer Science
IEEE Transactions on Neural Networks
• 2005
This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

### 2020) is not sharp for the finite-horizon stationary setting, as it requires s-absorbing MDPs with an H-dimensional cover (which has size ≈ e^H and is not optimal)

• 2020
• Computer Science
NeurIPS
• 2021
This work analyzes the Adaptive Pessimistic Value Iteration (APVI) algorithm and derives a suboptimality upper bound that nearly matches the intrinsic instance-dependent offline learning limit.
• Computer Science
ICML
• 2021
This work introduces a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, enabling a general analysis, and introduces a new weighted-minimax criterion that accounts for the inherent difficulty of optimal value prediction.
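The unifying idea, a single index that interpolates between optimism and pessimism via the sign on a count-based confidence bonus, can be sketched as follows (a hypothetical minimal form, assuming a simple 1/√n bonus; not the paper's actual algorithm):

```python
import numpy as np

def confidence_adjusted_index(q_hat, counts, beta, c=1.0):
    """Q~ = Q_hat + beta * bonus: beta = +1 is optimistic (online exploration),
    beta = -1 is pessimistic (offline learning); intermediate beta interpolates."""
    bonus = c / np.sqrt(np.maximum(counts, 1))
    return q_hat + beta * bonus
```

Choosing β = +1 recovers the UCB-style indices used for exploration, while β = −1 recovers the lower-confidence-bound penalization common in pessimistic offline RL, which is why one analysis can cover both regimes.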