Corpus ID: 225040328

CoinDICE: Off-Policy Confidence Interval Estimation

@article{Dai2020CoinDICEOC,
  title={CoinDICE: Off-Policy Confidence Interval Estimation},
  author={Bo Dai and Ofir Nachum and Yinlam Chow and Lihong Li and Csaba Szepesvari and Dale Schuurmans},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.11652}
}
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the $Q$-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to… 
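As a schematic illustration of the construction sketched in the abstract (not the paper's exact formulation; the estimating functions $\phi$, the $f$-divergence $D_f$, and the threshold $\xi$ below are generic placeholders), the policy value can be expressed through the standard $Q$-LP, and a generalized empirical likelihood confidence set reweights the empirical distribution subject to the induced estimating-equation constraints:

$$\rho(\pi) \;=\; \min_{q}\ (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(\cdot\mid s_0)}\bigl[q(s_0,a_0)\bigr] \quad \text{s.t.}\quad q(s,a)\;\ge\; r(s,a)+\gamma\,\mathbb{E}_{s'\!,\,a'\sim\pi}\bigl[q(s',a')\bigr]\ \ \forall\,(s,a),$$

$$C_n \;=\; \Bigl\{\,\rho(q)\;:\;\exists\, w\in\Delta_n,\ \ D_f\!\bigl(w\,\big\|\,\tfrac{1}{n}\mathbf{1}\bigr)\le \tfrac{\xi}{n},\ \ \textstyle\sum_{i=1}^{n} w_i\,\phi(x_i;q)=0 \Bigr\},$$

where $x_i=(s_i,a_i,r_i,s_i')$ are observed transitions, $\Delta_n$ is the probability simplex over the $n$ samples, and $\xi$ is chosen (e.g., as a $\chi^2$ quantile) to achieve the desired coverage level.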

Citations

Deeply-Debiased Off-Policy Interval Estimation
TLDR
A novel deeply-debiasing procedure is proposed to construct an efficient, robust, and flexible CI on a target policy’s value that quantifies the uncertainty of the point estimate.
Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds
TLDR
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question, and develops a practical algorithm through a primal-dual optimization-based approach, which leverages the kernel Bellman loss of Feng et al. (2019).
Bootstrapping Fitted Q-Evaluation for Off-Policy Inference
TLDR
This paper proposes a bootstrapping FQE method for inferring the distribution of the policy evaluation error and shows that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference.
Off-policy Confidence Sequences
TLDR
This work develops confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting and provides algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency.
Offline Policy Selection under Uncertainty
TLDR
It is shown how the belief distribution estimated by BayesDICE may be used to rank policies with respect to any arbitrary downstream policy selection metric, and it is empirically demonstrated that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower bound value estimates.
Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning
TLDR
This work presents the first comprehensive empirical analysis of a broad suite of OPE methods, offers a summarized set of guidelines for effectively using OPE in practice, and suggests directions for future research.
OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation
TLDR
This paper presents an offline RL algorithm, OptiDICE, that directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline RL algorithms.
Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders
TLDR
It is shown how, given only a latent variable model for states and actions, policy value can be identified from off-policy data, and optimal balancing can be combined with such learned ratios to obtain policy value while avoiding direct modeling of reward functions.
On the Optimality of Batch Policy Optimization Algorithms
TLDR
This work introduces a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis, and introduces a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.

References

Showing 1-10 of 93 references
Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
TLDR
Two bootstrapping off-policy evaluation methods are proposed which use learned MDP transition models to estimate lower confidence bounds on policy performance with limited data, in both continuous and discrete state spaces.
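As a generic illustration of the model-bootstrap idea (a sketch, not necessarily this paper's exact procedure; $\hat M^{(b)}$ and $B$ are placeholder notation), one fits a transition model to each bootstrap resample of the data, evaluates the target policy in each fitted model, and reports an empirical lower quantile:

$$\hat V^{(b)} \;=\; V_{\pi}\bigl(\hat M^{(b)}\bigr),\ \ b=1,\dots,B, \qquad \hat V_{\mathrm{lo}} \;=\; \alpha\text{-quantile of }\{\hat V^{(1)},\dots,\hat V^{(B)}\},$$

where $\hat M^{(b)}$ is the MDP model estimated from the $b$-th resample and $\alpha$ is the target error level.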
Toward Minimax Off-policy Value Estimation
TLDR
It is shown that while the so-called regression estimator is asymptotically optimal, for small sample sizes it may perform suboptimally compared to an ideal oracle up to a multiplicative factor that depends on the number of actions.
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
TLDR
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.
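For context, a commonly cited form of the DualDICE objective (sketched from the general literature; notation may differ from the original) minimizes over an auxiliary function $\nu$:

$$\min_{\nu}\ \tfrac{1}{2}\,\mathbb{E}_{(s,a)\sim d^{D}}\Bigl[\bigl(\nu(s,a)-\gamma\,\mathbb{E}_{s'\!,\,a'\sim\pi}[\nu(s',a')]\bigr)^{2}\Bigr]\;-\;(1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi}\bigl[\nu(s_0,a_0)\bigr],$$

with the stationary distribution correction recovered as the Bellman residual of the optimizer, $w_{\pi/D}(s,a)=\nu^{*}(s,a)-\gamma\,\mathbb{E}_{s'\!,\,a'\sim\pi}[\nu^{*}(s',a')]$; no behavior-policy probabilities appear, which is what makes the method behavior-agnostic.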
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
TLDR
This work extends the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators.
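A standard form of the resulting step-wise doubly robust estimator for a length-$H$ trajectory (a sketch based on the general OPE literature; notation may differ from this reference) is the backward recursion

$$\hat V_{\mathrm{DR}}^{H+1-t} \;=\; \hat V(s_t)\;+\;\rho_t\Bigl(r_t+\gamma\,\hat V_{\mathrm{DR}}^{H-t}-\hat Q(s_t,a_t)\Bigr),\qquad \rho_t=\frac{\pi(a_t\mid s_t)}{\mu(a_t\mid s_t)},\qquad \hat V_{\mathrm{DR}}^{0}=0,$$

where $\hat Q$ and $\hat V$ are approximate value functions of the target policy $\pi$ and $\mu$ is the behavior policy; the final estimate averages $\hat V_{\mathrm{DR}}^{H}$ over trajectories.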
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
TLDR
A new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy, based on an extension of the doubly robust estimator and a new way to mix between model-based estimates and importance-sampling-based estimates.
Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
TLDR
A new off-policy estimation method that applies importance sampling directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators is proposed.
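The core estimator there can be sketched (in self-normalized form, with notation possibly differing from the original) as importance weighting each transition by the stationary state-distribution ratio times the single-step action ratio:

$$\hat R_{\pi}\;=\;\frac{\sum_{i=1}^{n} w(s_i)\,\frac{\pi(a_i\mid s_i)}{\pi_0(a_i\mid s_i)}\,r_i}{\sum_{i=1}^{n} w(s_i)\,\frac{\pi(a_i\mid s_i)}{\pi_0(a_i\mid s_i)}},\qquad w(s)=\frac{d_{\pi}(s)}{d_{\pi_0}(s)},$$

where $d_{\pi}$ and $d_{\pi_0}$ are the stationary state distributions under the target and behavior policies; because the weights do not multiply over time steps, the variance no longer grows exponentially with the horizon.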
Doubly Robust Off-policy Evaluation for Reinforcement Learning
TLDR
This work extends the so-called doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and has low variance, and as a point estimator, it outperforms the most popular importance-sampling estimator and its variants in most occasions.
Minimax Confidence Interval for Off-Policy Evaluation and Policy Optimization
TLDR
This paper unifies minimax methods for off-policy evaluation using value-functions and marginalized importance weights into a single confidence interval (CI) that comes with a special type of double robustness: when either the value-function or importance weight class is well-specified, the CI is valid and its length quantifies the misspecification of the other class.
Safe Policy Improvement with Baseline Bootstrapping
TLDR
This paper adopts the safe policy improvement (SPI) approach, inspired by the knows-what-it-knows paradigm, and develops two computationally efficient bootstrapping algorithms, one value-based and one policy-based, both accompanied by theoretical SPI bounds.
Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes
TLDR
This work considers for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless, and develops a new estimator based on cross-fold estimation of $Q$-functions and marginalized density ratios, termed double reinforcement learning (DRL).
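The resulting estimator combines a fitted $Q$-function with marginalized density ratios in the standard doubly robust form (a generic sketch; cross-fitting details and the paper's exact notation are omitted):

$$\hat\rho_{\mathrm{DRL}}\;=\;\frac{1}{n}\sum_{i=1}^{n}\Bigl[(1-\gamma)\,\mathbb{E}_{a\sim\pi(\cdot\mid s_{0,i})}\bigl[\hat q(s_{0,i},a)\bigr]\;+\;\hat w(s_i,a_i)\Bigl(r_i+\gamma\,\mathbb{E}_{a'\sim\pi(\cdot\mid s_i')}\bigl[\hat q(s_i',a')\bigr]-\hat q(s_i,a_i)\Bigr)\Bigr],$$

where $\hat w$ estimates the density ratio $d_{\pi}(s,a)/d^{D}(s,a)$ and $\hat q$ the $Q$-function; the estimate remains consistent if either nuisance estimate is correct.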