Corpus ID: 229156235

Offline Policy Selection under Uncertainty

@inproceedings{Yang2022OfflinePS,
  title={Offline Policy Selection under Uncertainty},
  author={Mengjiao Yang and Bo Dai and Ofir Nachum and G. Tucker and Dale Schuurmans},
  booktitle={AISTATS},
  year={2022}
}
The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their policy values or high-confidence intervals, access to the full distribution over one's belief of the policy value enables more flexible selection… 
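As an illustration of the flexibility the abstract alludes to, the sketch below ranks candidate policies by the probability that their value exceeds a baseline, a criterion that needs the full belief distribution rather than a point estimate. It is a toy illustration, not the paper's estimator: the policy names, the baseline, and the assumption that belief samples (e.g., bootstrap replicates of an off-policy estimate) are already available are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def rank_policies(value_samples, baseline=0.0):
    """Rank candidate policies from samples of a belief distribution over
    their values (e.g., bootstrap replicates of an off-policy estimator
    computed on the fixed experience dataset)."""
    # Score each policy by the probability of beating the baseline,
    # one of many downstream criteria the belief distribution enables.
    scores = {name: float(np.mean(samples > baseline))
              for name, samples in value_samples.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy belief distributions over the values of three hypothetical policies.
beliefs = {
    "pi_a": rng.normal(1.0, 0.5, size=1000),  # decent mean, moderate spread
    "pi_b": rng.normal(1.2, 2.0, size=1000),  # higher mean, very uncertain
    "pi_c": rng.normal(0.9, 0.1, size=1000),  # lower mean, nearly certain
}
print(rank_policies(beliefs, baseline=0.8))
```

A ranking by point estimates would always prefer pi_b here; under the probability-of-improvement criterion the nearly certain pi_c comes out ahead, which is the kind of selection flexibility the abstract describes.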

Citations

Universal Off-Policy Evaluation
TLDR
This paper takes the first steps towards a universal off-policy estimator (UnO), one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution, and discusses UnO's applicability in various settings.
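As a rough illustration of estimating a return-distribution parameter other than the mean (not UnO's actual estimator or its bounds), the sketch below computes a quantile from a weighted empirical CDF of logged returns, assuming per-trajectory importance ratios are already available:

```python
import numpy as np

def weighted_return_quantile(returns, traj_weights, q=0.5):
    """Quantile of the target policy's return distribution, estimated from a
    weighted empirical CDF of behavior-policy returns, where each weight is
    a per-trajectory importance ratio (assumed to be given)."""
    order = np.argsort(returns)
    r = np.asarray(returns, dtype=float)[order]
    w = np.asarray(traj_weights, dtype=float)[order]
    cdf = np.cumsum(w) / np.sum(w)                 # weighted empirical CDF
    return float(r[np.searchsorted(cdf, q)])

# Example: median return under the target policy from five logged trajectories.
print(weighted_return_quantile([1.0, 3.0, 2.0, 5.0, 4.0],
                               [0.5, 2.0, 1.0, 0.2, 1.3], q=0.5))
```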
Active Offline Policy Selection
TLDR
This paper introduces active offline policy selection — a novel sequential decision approach that combines logged data with online interaction to identify the best policy.
Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds
TLDR
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question, and develops a practical algorithm through a primal-dual optimization-based approach that leverages the kernel Bellman loss of Feng et al. (2019).
Model Selection in Batch Policy Optimization
TLDR
This work identifies three sources of error that any model selection algorithm should optimally trade off in order to be competitive, and shows that relaxing any one of the three error sources enables the design of algorithms achieving near-oracle inequalities for the remaining two.
Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters
TLDR
It is demonstrated that while some very efficient variants also outperform current state-of-the-art methods, they do not match the performance and robustness of MSG with deep ensembles; the work further investigates whether efficient approximations can be similarly effective.
Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning
TLDR
This paper designs hyperparameter-free algorithms for policy selection based on BVFT, a recent theoretical advance in value-function selection, and demonstrates their effectiveness in discrete-action benchmarks such as Atari.
Pessimistic Model Selection for Offline Deep Reinforcement Learning
TLDR
This work proposes a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee, which features a provably effective framework for finding the best policy among a set of candidate models.
LobsDICE: Offline Imitation Learning from Observation via Stationary Distribution Correction Estimation
TLDR
This paper presents LobsDICE, an offline imitation-from-observation (IfO) algorithm that learns to imitate the expert policy via optimization in the space of stationary distributions; it solves a single convex minimization problem that minimizes the divergence between the state-transition distributions induced by the expert and the agent policy.
A Minimalist Approach to Offline Reinforcement Learning
TLDR
It is shown that the performance of state-of-the-art RL algorithms can be matched by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data; the resulting algorithm is a baseline that is simple to implement and tune.
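A minimal PyTorch sketch of the idea in this summary: a policy loss that maximizes the critic's value while adding a behavior cloning term toward the dataset actions. The function names, the value of alpha, and the scale normalization are illustrative assumptions, not a faithful reproduction of the paper's algorithm.

```python
import torch

def bc_regularized_policy_loss(actor, critic, states, actions, alpha=2.5):
    """Maximize Q at the actor's actions while regularizing toward the
    logged dataset actions with a behavior cloning (BC) term."""
    pi = actor(states)
    q = critic(states, pi)
    lam = alpha / q.abs().mean().detach()   # keep the Q term on a comparable scale
    bc = ((pi - actions) ** 2).mean()       # behavior cloning toward dataset actions
    return -lam * q.mean() + bc

# Toy usage with a linear actor and a stand-in critic.
def toy_critic(s, a):
    return torch.cat([s, a], dim=-1).sum(dim=-1, keepdim=True)

actor = torch.nn.Linear(4, 2)
states, actions = torch.randn(8, 4), torch.randn(8, 2)
bc_regularized_policy_loss(actor, toy_critic, states, actions).backward()
```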
Provably Efficient Representation Learning in Low-rank Markov Decision Processes
TLDR
A provably efficient algorithm called ReLEX is proposed that simultaneously learns the representation and performs exploration, and that is strictly better in terms of sample efficiency if the function class of representations enjoys a certain mild “coverage” property over the whole state-action space.
…

References

Showing 1-10 of 74 references
Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
TLDR
Two bootstrapping off-policy evaluation methods are proposed which use learned MDP transition models to estimate lower confidence bounds on policy performance with limited data.
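A minimal sketch of the percentile-bootstrap step behind such lower confidence bounds; how the per-trajectory value estimates are produced (the paper uses learned MDP transition models) is assumed and left outside the sketch.

```python
import numpy as np

def bootstrap_lower_bound(per_trajectory_estimates, alpha=0.05, n_boot=2000, seed=0):
    """Percentile-bootstrap lower confidence bound on a policy's value.
    The per-trajectory estimates are assumed to come from elsewhere, e.g.
    from rolling the target policy out in a learned transition model."""
    rng = np.random.default_rng(seed)
    est = np.asarray(per_trajectory_estimates, dtype=float)
    boot_means = np.array([rng.choice(est, size=est.size, replace=True).mean()
                           for _ in range(n_boot)])
    return float(np.quantile(boot_means, alpha))   # approximate (1 - alpha) lower bound

# Toy usage with simulated per-trajectory estimates.
print(bootstrap_lower_bound(np.random.default_rng(1).normal(1.0, 0.3, size=50)))
```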
Doubly Robust Policy Evaluation and Learning
TLDR
It is proved that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies, and is expected to become common practice.
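For concreteness, a simplified sketch of the doubly robust estimator for a deterministic target policy in contextual bandits; the function names are placeholders, and the paper's treatment also covers stochastic policies and policy learning.

```python
import numpy as np

def doubly_robust_value(logged, target_policy, behavior_prob, q_hat):
    """Doubly robust value estimate from logged contextual-bandit data.

    logged:         iterable of (context, action, reward) tuples
    target_policy:  context -> action the evaluated policy would take
    behavior_prob:  (context, action) -> logging policy's probability of `action`
    q_hat:          (context, action) -> estimated reward ("direct method" model)
    """
    estimates = []
    for x, a, r in logged:
        a_pi = target_policy(x)
        dr = q_hat(x, a_pi)                              # model-based term
        if a == a_pi:                                    # importance-weighted correction
            dr += (r - q_hat(x, a)) / behavior_prob(x, a)
        estimates.append(dr)
    return float(np.mean(estimates))

# Toy check with a uniform logging policy over two actions.
data = [(0, 0, 1.0), (0, 1, 0.0), (1, 1, 1.0)]
print(doubly_robust_value(data, target_policy=lambda x: x,
                          behavior_prob=lambda x, a: 0.5,
                          q_hat=lambda x, a: 0.5))
```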
Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
TLDR
A new off-policy estimation method that applies importance sampling directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators is proposed.
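A minimal sketch of evaluating with importance weights on the stationary state distribution rather than on whole trajectories; the state-distribution ratio is assumed to be given here, whereas estimating it is the core contribution of the paper.

```python
import numpy as np

def stationary_ratio_value(samples, state_ratio, pi_prob, mu_prob):
    """Off-policy value estimate with importance weights on the stationary
    state distribution, so the weight does not explode with the horizon.

    samples:      iterable of (state, action, reward) tuples drawn under
                  the behavior policy
    state_ratio:  s -> d_pi(s) / d_mu(s), the stationary state-distribution
                  ratio (assumed given in this sketch)
    pi_prob, mu_prob: (s, a) -> action probability under target / behavior
    """
    weights, rewards = [], []
    for s, a, r in samples:
        weights.append(state_ratio(s) * pi_prob(s, a) / mu_prob(s, a))
        rewards.append(r)
    w = np.asarray(weights)
    r = np.asarray(rewards)
    return float(np.sum(w * r) / np.sum(w))   # self-normalized average reward
```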
Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting
TLDR
A new method is proposed to compute a lower bound, at a desired coverage level, on the value of an arbitrary target policy given logged contextual-bandit data; it is built around the so-called self-normalized importance weighting (SN) estimator.
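For reference, a minimal sketch of the self-normalized importance weighting (SN) point estimator itself; the paper's high-confidence lower bound built around it is not reproduced here.

```python
import numpy as np

def snis_value(rewards, pi_probs, mu_probs):
    """Self-normalized importance sampling estimate of a target policy's
    value in contextual bandits: importance-weighted rewards divided by the
    sum of the weights, which keeps the estimate within the reward range
    and typically reduces variance relative to plain importance sampling."""
    w = np.asarray(pi_probs, dtype=float) / np.asarray(mu_probs, dtype=float)
    r = np.asarray(rewards, dtype=float)
    return float(np.sum(w * r) / np.sum(w))

# Toy usage: logged rewards with target/behavior action probabilities.
print(snis_value(rewards=[1.0, 0.0, 1.0],
                 pi_probs=[0.9, 0.1, 0.5],
                 mu_probs=[0.5, 0.5, 0.5]))
```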
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
TLDR
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.
Off-Policy Evaluation via Off-Policy Classification
TLDR
This paper proposes an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function, and experimentally shows that this metric outperforms baselines on a number of tasks.
CoinDICE: Off-Policy Confidence Interval Estimation
TLDR
This work proposes CoinDICE, a novel and efficient algorithm for computing confidence intervals in high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, and proves the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes.
Hyperparameter Selection for Offline Reinforcement Learning
TLDR
This work focuses on offline hyperparameter selection, i.e. methods for choosing the best policy from a set of many policies trained using different hyperparameters, given only logged data, and shows that offline RL algorithms are not robust to hyperparameter choices.
Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning
TLDR
It is proved that with a simple modification to the MIS estimator, one can asymptotically attain the Cramer-Rao lower bound, provided that the action space is finite.
Accountable Off-Policy Evaluation With Kernel Bellman Statistics
TLDR
A new variational framework is proposed which reduces the problem of calculating tight confidence bounds in OPE into an optimization problem on a feasible set that catches the true state-action value function with high probability.
…