Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process

  Chengchun Shi, Jin Zhu, Ye Shen, Shikai Luo, Hong Zhu, Rui Song
This paper is concerned with constructing a confidence interval for a target policy’s value offline, based on pre-collected observational data, in infinite-horizon settings. Most existing works assume that no unmeasured variables confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and the technology industry. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system…


Off-Policy Evaluation for Episodic Partially Observable Markov Decision Processes under Non-Parametric Models

A non-parametric identification result is developed for estimating the policy value via a sequence of so-called V-bridge functions with the help of time-dependent proxy variables for off-policy evaluation in POMDPs with continuous states.



Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning

A new estimator based on Double Reinforcement Learning leverages this structure for OPE, simultaneously using estimated stationary density ratios and $q$-functions; it remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either one is estimated consistently.
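The doubly robust form behind this estimator can be sketched in a few lines: a plug-in term from an estimated $q$-function at initial states, plus a correction term that reweights temporal-difference residuals by an estimated stationary density ratio. The numpy sketch below is purely illustrative — the toy 5-state MDP, the synthetic `q` and `w` values, and the uniform target policy are all assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 1000

# Synthetic stand-ins for the estimator's ingredients (toy 5-state, 2-action MDP):
q  = rng.normal(size=(5, 2))          # estimated q-function
w  = rng.uniform(0.5, 1.5, size=5)    # estimated stationary density ratio
pi = np.full((5, 2), 0.5)             # target policy (uniform over 2 actions)

# logged transitions (s, a, r, s') and draws s0 from the initial-state distribution
s  = rng.integers(0, 5, size=n)
a  = rng.integers(0, 2, size=n)
r  = rng.normal(size=n)
s2 = rng.integers(0, 5, size=n)
s0 = rng.integers(0, 5, size=n)

# Doubly robust estimate of the normalized discounted value (1-gamma)*E[sum gamma^t r_t]:
# plug-in term from q at initial states + density-ratio-weighted TD residual correction.
v_next = (pi[s2] * q[s2]).sum(axis=1)                 # E_{a' ~ pi} q(s', a')
correction = w[s] * (r + gamma * v_next - q[s, a])    # TD residual, reweighted by w
v_dr = (1 - gamma) * (pi[s0] * q[s0]).sum(axis=1).mean() + correction.mean()
```

The double robustness comes from the correction term: if `w` is correct, the TD residuals debias an inaccurate `q`, and if `q` is correct, the residuals are mean-zero regardless of `w`.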

A Theoretical Analysis of Deep Q-Learning

This work makes the first attempt to theoretically understand the deep Q-network (DQN) algorithm from both algorithmic and statistical perspectives, and proposes the Minimax-DQN algorithm for two-player zero-sum Markov games.
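At its core, DQN combines temporal-difference updates with a periodically synced, frozen target copy of the Q-function. A minimal tabular sketch (a hypothetical toy chain MDP with uniform exploration — not the deep network or the analysis in the paper) shows the target-network mechanic:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma, lr = 4, 2, 0.9, 0.1

# hypothetical toy chain MDP: action 1 moves right, action 0 stays;
# reward 1 whenever the next state is the rightmost state
def step(s, a):
    s2 = min(s + a, n_states - 1)
    return s2, float(s2 == n_states - 1)

Q = np.zeros((n_states, n_actions))
Q_target = Q.copy()                    # frozen target copy, synced periodically

for t in range(5000):
    s = rng.integers(n_states)         # uniform exploration over states ...
    a = rng.integers(n_actions)        # ... and actions, for simplicity
    s2, rwd = step(s, a)
    # DQN-style update: bootstrap from the frozen target, not the live Q
    td_target = rwd + gamma * Q_target[s2].max()
    Q[s, a] += lr * (td_target - Q[s, a])
    if t % 100 == 0:
        Q_target = Q.copy()            # periodic target-network sync

greedy = Q.argmax(axis=1)              # learned greedy policy
```

In this toy chain, the greedy policy learns to move right in every state left of the rewarding one; the frozen `Q_target` is what stabilizes the bootstrap in the full DQN algorithm.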

Causal Inference Under Unmeasured Confounding With Negative Controls: A Minimax Learning Approach

This paper tackles the primary challenge to causal inference using negative controls, the identification and estimation of these bridge functions, and provides a new identification strategy that avoids both uniqueness and completeness conditions.

Batch Policy Learning in Average Reward Markov Decision Processes

This work proposes a doubly robust estimator for the average reward of a batch policy learning problem in the infinite horizon Markov Decision Process and develops an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class.

Off-Policy Estimation of Long-Term Average Outcomes With Applications to Mobile Health

The measure of performance is the average of proximal outcomes over a long time period, should the particular mHealth policy be followed; an estimator and confidence intervals are provided.

Efficient and Adaptive Estimation for Semiparametric Models

Contents: Introduction; Asymptotic Inference for (Finite-Dimensional) Parametric Models; Information Bounds for Euclidean Parameters in Infinite-Dimensional Models; Euclidean Parameters: Further Examples; …

…Given the estimators for pm and pa, we can adopt the general coupled estimation framework to jointly learn Q and ψ0 from the observed data…

  • 2020

Dynamic Treatment Regimes: Statistical Methods for Precision Medicine

  • Ying-Qi Zhao
  • Economics
    Journal of the American Statistical Association
  • 2022
Dynamic treatment regimes (DTRs) are used for managing chronic disease, and fit nicely into the larger paradigm of precision medicine. There is an increasing focus on methodology for dynamic…

A Minimax Learning Approach to Off-Policy Evaluation in Partially Observable Markov Decision Processes

This work proposes novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy’s value and the observed data distribution, and proposes minimax estimation methods for learning these bridge functions.
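When the bridge function and the critic are restricted to linear classes, the minimax criterion collapses to a GMM/2SLS-style closed form. The toy simulation below is a hedged sketch of that special case — the linear-Gaussian proxy setup, the variable names, and the quadratic critic penalty are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# hypothetical linear-Gaussian proxy setup: U is an unmeasured confounder,
# Z is a treatment-side proxy (the critic's input), W an outcome-side proxy
U = rng.normal(size=n)
Z = U + rng.normal(scale=0.5, size=n)
W = U + rng.normal(scale=0.5, size=n)
Y = 2.0 * U + rng.normal(scale=0.5, size=n)

# linear function classes: bridge b(W) = beta' phi(W), critic f(Z) = alpha' psi(Z)
phi = np.c_[np.ones(n), W]
psi = np.c_[np.ones(n), Z]

# for linear classes, the sample version of the minimax criterion
#   min_b max_f E[(Y - b(W)) f(Z)] - E[f(Z)^2] / 4
# has the familiar GMM / 2SLS closed form:
G = psi.T @ psi / n                    # critic Gram matrix E[psi psi']
C = psi.T @ phi / n                    # cross-moment E[psi phi']
d = psi.T @ Y / n                      # E[psi Y]
beta = np.linalg.solve(C.T @ np.linalg.solve(G, C),
                       C.T @ np.linalg.solve(G, d))
```

In this simulation the bridge should recover b(W) ≈ 2W, since E[Y | Z] = 2 E[U | Z] = 2 E[W | Z], so the conditional moment E[Y - b(W) | Z] = 0 is satisfied by slope 2.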

A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes

We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on…