Corpus ID: 235165713

Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings

@inproceedings{Yin2021OptimalUO,
  title={Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings},
  author={Ming Yin and Yu-Xiang Wang},
  booktitle={Neural Information Processing Systems},
  year={2021}
}
This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework towards optimal learning for several well-motivated offline tasks. We establish an Ω(H^2S/d_mε^2) lower bound (over the model-based family) for the global uniform OPE, and our main result establishes an upper bound of Õ(H^2/d_mε^2) for the local uniform convergence. The highlight in achieving the optimal rate Õ(H^2/d_mε^2) is our…
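
A compact restatement of the uniform OPE criterion and the rates quoted above, in standard episodic-MDP notation; the definitions of the policy class Π, the value estimates, and d_m below follow common usage in this literature and are an editorial aid rather than text quoted from the paper.

```latex
% Uniform OPE asks for accuracy simultaneously over a policy class \Pi,
% which is stronger than point-wise OPE for a single target policy \pi:
\[
  \sup_{\pi \in \Pi} \bigl|\widehat{V}^{\pi} - V^{\pi}\bigr| \;\le\; \epsilon .
\]
% "Global" uniform OPE takes \Pi to be all policies; "local" uniform
% convergence restricts \Pi to policies near the empirically optimal one.
% With H the horizon, S the number of states, and d_m the minimal marginal
% state-action probability of the behavior policy, the quoted rates read
\[
  n_{\mathrm{global}} = \Omega\!\Bigl(\frac{H^{2}S}{d_m\,\epsilon^{2}}\Bigr)
  \ \text{(lower bound)},
  \qquad
  n_{\mathrm{local}} = \widetilde{O}\!\Bigl(\frac{H^{2}}{d_m\,\epsilon^{2}}\Bigr)
  \ \text{(upper bound)}.
\]
```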

On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL

The RFOLIVE (Reward-Free OLIVE) algorithm is proposed for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs, linear completeness, and low-rank MDPs with unknown representation.

Offline Reinforcement Learning with Differential Privacy

This work designs RL algorithms with provable privacy guarantees that enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings, and suggests that the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart for a medium-sized dataset.

On Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks in Besov Spaces

This work studies the statistical theory of offline RL with deep ReLU network function approximation and establishes its sample complexity.

Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity

A model-based algorithm is proposed that combines distributionally robust value iteration with the principle of pessimism in the face of uncertainty, penalizing the robust value estimates with a carefully designed data-driven penalty term.

Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward

This work proposes a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
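
As a rough, generic illustration of the least-squares reward-redistribution step described above (not the PARTED algorithm itself), the sketch below fits per-step proxy rewards from trajectory-level returns; the feature map `phi`, the dataset layout, and the ridge parameter are assumptions made for this sketch.

```python
import numpy as np

def redistribute_rewards(trajectories, phi, dim, ridge=1e-3):
    """Least-squares reward redistribution (illustrative sketch).

    trajectories: list of (steps, total_return), where steps is a list of
                  (state, action) pairs and total_return is the trajectory-
                  level reward observed once per episode.
    phi:          feature map, phi(state, action) -> np.ndarray of shape (dim,)
    """
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for steps, total_return in trajectories:
        # The regression target is the whole-trajectory return, so the
        # covariate is the feature sum over the trajectory's steps.
        x = sum((phi(s, a) for s, a in steps), start=np.zeros(dim))
        A += np.outer(x, x)
        b += x * total_return
    theta = np.linalg.solve(A + ridge * np.eye(dim), b)
    # Per-step proxy reward, usable by a downstream (pessimistic) value iteration.
    return lambda s, a: float(phi(s, a) @ theta)
```

A downstream pessimistic value-iteration step would then treat the returned proxy reward as the per-step reward when forming empirical Bellman backups.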

Provable Benefit of Multitask Representation Learning in Reinforcement Learning

This paper theoretically characterizes the benefit of representation learning in exploration-based reward-free multitask RL for both upstream and downstream tasks, and demonstrates that multitask representation learning is provably more sample-efficient than learning each task individually, as long as the total number of tasks is above a certain threshold.

Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism

The variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs), and provides improved offline learning bounds over the best-known existing results.

The Curse of Passive Data Collection in Batch Reinforcement Learning

This paper shows that even with the best (but passively chosen) logging policy, Ω(A^{min(S−1,H)}/ε^2) episodes are necessary to obtain an ε-optimal policy, where H is the length of episodes.

Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

This work analyzes the Adaptive Pessimistic Value Iteration (APVI) algorithm and derives a suboptimality upper bound that nearly matches O(∑_{h=1}^H ∑_{s,a} d^{π*}_h(s,a)·√(Var_{P_{s,a}}(V^*_{h+1}+r_h)/n)).
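
For readability, the instance-dependent quantity referenced in the summary above is written out below with its notation spelled out; the symbol definitions follow standard usage in this line of work and are an editorial aid, not text quoted from the paper.

```latex
\[
  O\!\left(
    \sum_{h=1}^{H} \sum_{s,a}
      d^{\pi^{\star}}_{h}(s,a)\,
      \sqrt{\frac{\operatorname{Var}_{P_{s,a}}\!\bigl(V^{\star}_{h+1} + r_{h}\bigr)}{n}}
  \right)
\]
% d^{\pi^\star}_h(s,a): state-action occupancy of an optimal policy at step h
% Var_{P_{s,a}}:        variance under the transition kernel at (s,a)
% n:                    number of offline episodes in the dataset
```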

References

SHOWING 1-10 OF 86 REFERENCES

Is Q-learning Provably Efficient?

Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$ in an episodic MDP setting, and this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."
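
For concreteness, below is a minimal tabular sketch of episodic Q-learning with a Hoeffding-style UCB bonus in the spirit of the summary above; the environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward, done)`), the constant `c`, and the exact bonus form are assumptions of this sketch rather than the paper's algorithm verbatim.

```python
import numpy as np

def q_learning_ucb(env, S, A, H, K, c=1.0, delta=0.01):
    """Tabular episodic Q-learning with a UCB-style exploration bonus (sketch)."""
    Q = np.full((H, S, A), float(H))        # optimistic initialization
    N = np.zeros((H, S, A), dtype=int)      # per-(step, state, action) visit counts
    iota = np.log(S * A * H * K / delta)    # log factor entering the bonus
    for _ in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            s_next, r, done = env.step(a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)                 # step size used in the analysis
            bonus = c * np.sqrt(H ** 3 * iota / t)    # Hoeffding-type UCB bonus
            v_next = 0.0 if (h == H - 1 or done) else min(H, Q[h + 1, s_next].max())
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
            if done:
                break
            s = s_next
    return Q
```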

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

This paper proposes Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm for offline RL, and establishes an information-theoretic lower bound of Ω(H^2/d_mε^2), which certifies that OPDVR is optimal up to logarithmic factors.

Near Optimal Provable Uniform Convergence in Off-Policy Evaluation for Reinforcement Learning

This work reveals, for the first time, the comprehensive relationship between OPE and offline learning, and shows that uniform convergence guarantees in OPE can be obtained efficiently.

Reinforcement Learning: Theory and Algorithms

Adaptive Reward-Free Exploration

This work proves that RF-UCRL needs O((SAH^4/ε^2) log(1/δ)) episodes to output, with probability 1 − δ, an ε-approximation of the optimal policy for any reward function, and empirically compares it to oracle strategies using a generative model.

Fitted Q-iteration in continuous action-space MDPs

A rigorous analysis is provided of a variant of fitted Q-iteration in which greedy action selection is replaced by searching for a policy in a restricted set of candidate policies that maximizes the average action values, yielding the first finite-time bound for value-function-based algorithms in continuous state and action problems.
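
For orientation, a generic fitted Q-iteration sketch on a batch of transitions is shown below; it performs a plain greedy max over a finite candidate-action set with an off-the-shelf regressor, whereas the paper's variant searches a restricted policy class in continuous action spaces. The regressor choice, the dataset layout, and the absence of terminal-state handling are simplifying assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(dataset, candidate_actions, n_iters=20, gamma=0.99):
    """Generic fitted Q-iteration on a fixed batch of transitions (sketch).

    dataset: list of (state, action, reward, next_state); states are 1-D arrays,
             actions are scalars appended to the state to form regression inputs.
    """
    X = np.array([np.append(s, a) for s, a, _, _ in dataset])
    rewards = np.array([r for _, _, r, _ in dataset])
    next_states = [s2 for _, _, _, s2 in dataset]
    q = None
    for _ in range(n_iters):
        if q is None:
            targets = rewards                         # first iterate: one-step rewards
        else:
            # Greedy backup over the finite candidate-action set at the next state.
            next_q = np.array([
                max(q.predict(np.append(s2, a).reshape(1, -1))[0]
                    for a in candidate_actions)
                for s2 in next_states
            ])
            targets = rewards + gamma * next_q
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q  # q.predict([np.append(state, action)]) approximates Q(state, action)
```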

Reinforcement Learning: An Introduction

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

2020) is not sharp for the finite-horizon stationary setting, as it requires s-absorbing MDPs with an H-dimensional cover (which has size ≈ e^H and is not optimal)

Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

This work analyzes the Adaptive Pessimistic Value Iteration (APVI) algorithm and derives a suboptimality upper bound that nearly matches O(∑_{h=1}^H ∑_{s,a} d^{π*}_h(s,a)·√(Var_{P_{s,a}}(V^*_{h+1}+r_h)/n)).

On the Optimality of Batch Policy Optimization Algorithms

This work introduces a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, enabling a general analysis, and introduces a new weighted-minimax criterion that accounts for the inherent difficulty of optimal value prediction.
...