Corpus ID: 229156235

Offline Policy Selection under Uncertainty

Mengjiao Yang, Bo Dai, Ofir Nachum, G. Tucker, Dale Schuurmans
The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their policy values or high-confidence intervals, access to the full distribution over one's belief of the policy value enables more flexible selection… 
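The abstract's distinction can be sketched in code: given Monte Carlo samples from one's belief distribution over each policy's value, selection by point estimate and selection by probability-of-being-best are two different criteria that need not agree. A minimal illustrative sketch (the function names and resampling scheme are this sketch's own, not the paper's method):

```python
import numpy as np

def select_by_point_estimate(value_samples):
    """Pick the policy whose value-belief samples have the highest mean."""
    return int(np.argmax([s.mean() for s in value_samples]))

def select_by_prob_best(value_samples, n_draws=10_000, rng=None):
    """Pick the policy most likely to be the best: jointly resample one
    value draw per policy and count how often each policy wins."""
    rng = rng or np.random.default_rng(0)
    draws = np.stack([rng.choice(s, size=n_draws) for s in value_samples])
    wins = np.bincount(draws.argmax(axis=0), minlength=len(value_samples))
    return int(np.argmax(wins))
```

With access to the full belief distribution, the second criterion can diverge from the first when a policy has a high mean but a heavy lower tail.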


Universal Off-Policy Evaluation

This paper takes the first steps towards a universal off-policy estimator (UnO)—one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution—and discusses UnO's applicability in various settings.

Active Offline Policy Selection

This paper introduces active offline policy selection — a novel sequential decision approach that combines logged data with online interaction to identify the best policy.

Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds

This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question, and develops a practical algorithm through a primal-dual optimization-based approach that leverages the kernel Bellman loss of Feng et al. (2019).

Model Selection in Batch Policy Optimization

This work identifies three sources of error that any model selection algorithm should optimally trade-off in order to be competitive and shows that relaxing any one of the three error sources enables the design of algorithms achieving near-oracle inequalities for the remaining two.

Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters

This work investigates whether efficient approximations to deep ensembles can be similarly effective, and demonstrates that while some very efficient variants outperform current state-of-the-art methods, they do not match the performance and robustness of MSG with deep ensembles.

Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning

This paper designs hyperparameter-free algorithms for policy selection based on BVFT, a recent theoretical advance in value-function selection, and demonstrates their effectiveness in discrete-action benchmarks such as Atari.

LobsDICE: Offline Imitation Learning from Observation via Stationary Distribution Correction Estimation

This paper presents LobsDICE, an offline imitation-from-observation (IfO) algorithm that learns to imitate the expert policy via optimization in the space of stationary distributions, solving a single convex minimization problem that minimizes the divergence between the state-transition distributions induced by the expert and the agent policy.

Pessimistic Model Selection for Offline Deep Reinforcement Learning

A pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee is proposed, providing a provably effective framework for finding the best policy among a set of candidate models.

A Minimalist Approach to Offline Reinforcement Learning

It is shown that the performance of state-of-the-art offline RL algorithms can be matched by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data; the resulting algorithm is a baseline that is simple to implement and tune.
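The modification described above—a behavior-cloning term added to the actor objective, with the Q-term rescaled to a comparable magnitude—can be sketched as follows (a sketch in the style of TD3+BC; the function name is this sketch's own, and `alpha=2.5` is the weighting reported as the paper's default):

```python
import numpy as np

def bc_regularized_actor_loss(q_values, pi_actions, data_actions, alpha=2.5):
    """Actor loss: maximize Q (normalized by its average magnitude so the
    two terms are on a comparable scale) while staying close to the
    dataset actions via a behavior-cloning MSE term."""
    lam = alpha / (np.abs(q_values).mean() + 1e-8)   # normalize Q scale
    bc = ((pi_actions - data_actions) ** 2).mean()   # behavior cloning
    return float(-lam * q_values.mean() + bc)
```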

Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations

A new IL algorithm is designed, built upon behavioral cloning (BC), in which an additional discriminator is introduced to distinguish expert from non-expert data and the discriminator's outputs serve as the weights of the BC loss.

Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation

This work identifies conditions under which statistical bootstrapping in this setting is guaranteed to yield correct confidence intervals and evaluates the proposed method and shows that it can yield accurate confidence intervals in a variety of conditions.
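The basic construction behind such bootstrapped intervals can be sketched as a percentile bootstrap over per-episode OPE estimates (an illustrative sketch, not the paper's exact procedure or conditions):

```python
import numpy as np

def bootstrap_ci(per_episode_estimates, alpha=0.05, n_boot=2000, seed=0):
    """Percentile-bootstrap confidence interval for a policy-value
    estimate: resample the per-episode estimates with replacement and
    take empirical quantiles of the resampled means."""
    rng = np.random.default_rng(seed)
    x = np.asarray(per_episode_estimates, dtype=float)
    means = rng.choice(x, size=(n_boot, len(x)), replace=True).mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

The paper's contribution lies in identifying when such resampling actually yields valid coverage, which this naive sketch does not address.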

Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation

Two bootstrapping off-policy evaluation methods which use learned MDP transition models in order to estimate lower confidence bounds on policy performance with limited data are proposed.

Eligibility Traces for Off-Policy Policy Evaluation

This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility-trace algorithm (a Monte Carlo method) was previously known, and analyzes and compares it with four new eligibility-trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
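The importance-sampling connection emphasized above can be illustrated with the ordinary per-trajectory estimator (a standard textbook construction, not code from the paper):

```python
def trajectory_is_estimate(trajectories, pi_target, pi_behavior):
    """Per-trajectory importance-sampling estimate of the target policy's
    value: weight each trajectory's return by the product of per-step
    probability ratios pi_target / pi_behavior.  Each trajectory is a
    list of (state, action, reward) tuples; the policy arguments are
    functions mapping (state, action) -> probability."""
    total = 0.0
    for traj in trajectories:
        rho, ret = 1.0, 0.0
        for s, a, r in traj:
            rho *= pi_target(s, a) / pi_behavior(s, a)
            ret += r
        total += rho * ret
    return total / len(trajectories)
```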

Doubly Robust Policy Evaluation and Learning

It is proved that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies, and is expected to become common practice.
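In the contextual-bandit setting, the doubly robust estimator combines a model-based baseline with an importance-weighted correction on the logged action; it is unbiased if either the propensities or the reward model is correct. A minimal sketch (function and argument names are this sketch's own):

```python
def doubly_robust_value(logs, n_actions, pi_target, pi_behavior, reward_model):
    """Doubly robust value estimate from logged (context, action, reward)
    tuples: model-based baseline under the target policy, plus an
    importance-weighted correction of the model's error on the logged
    action."""
    est = 0.0
    for x, a, r in logs:
        w = pi_target(x, a) / pi_behavior(x, a)
        baseline = sum(pi_target(x, b) * reward_model(x, b)
                       for b in range(n_actions))
        est += baseline + w * (r - reward_model(x, a))
    return est / len(logs)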

Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

A new off-policy estimation method that applies importance sampling directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators is proposed.

Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting

A new method is proposed to compute, at a desired coverage level, a lower bound on the value of an arbitrary target policy given logged contextual-bandit data; it is built around the so-called self-normalized importance weighting (SN) estimator.
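The SN estimator itself is simple to state: divide the importance-weighted reward sum by the sum of the weights rather than by the sample count, which keeps the estimate inside the observed reward range. A sketch of the point estimator only (the paper's contribution is the lower bound built around it, which is not reproduced here):

```python
import numpy as np

def snis_value(rewards, weights):
    """Self-normalized importance-weighting (SN) estimate of a target
    policy's value from logged bandit data.  `weights` are per-sample
    ratios pi_target(a|x) / pi_behavior(a|x)."""
    w = np.asarray(weights, dtype=float)
    r = np.asarray(rewards, dtype=float)
    return float((w * r).sum() / w.sum())
```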

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.

Off-Policy Evaluation via Off-Policy Classification

This paper proposes an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function, and experimentally shows that this metric outperforms baselines on a number of tasks.

CoinDICE: Off-Policy Confidence Interval Estimation

This work proposes CoinDICE, a novel and efficient algorithm for computing confidence intervals in high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, and proves the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes.

Hyperparameter Selection for Offline Reinforcement Learning

This work focuses on offline hyperparameter selection, i.e. methods for choosing the best policy from a set of many policies trained using different hyperparameters, given only logged data, and shows that offline RL algorithms are not robust to hyperparameter choices.