Doubly Robust Policy Evaluation and Optimization

@article{Dudk2015DoublyRP,
  title={Doubly Robust Policy Evaluation and Optimization},
  author={Miroslav Dud{\'i}k and D. Erhan and John Langford and Lihong Li},
  journal={ArXiv},
  year={2015},
  volume={abs/1503.02834}
}
We study sequential decision making in environments where rewards are only partially observed, but can be modeled as a function of the observed context and the action chosen by the decision maker. This setting, known as contextual bandits, encompasses a wide variety of applications such as health care, content recommendation and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the… 
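For context, the estimator the title refers to combines a regression model of the reward (the direct method) with an inverse-propensity correction. A minimal NumPy sketch of that doubly robust value estimate follows; the names (target_policy, reward_hat, propensities) are illustrative stand-ins for components the logged data set would have to supply, not identifiers from the paper.

import numpy as np

def dr_value_estimate(contexts, actions, rewards, propensities,
                      target_policy, reward_hat):
    # Doubly robust estimate of a target policy's value from logged
    # contextual-bandit data: contexts x_i, logged actions a_i, rewards r_i,
    # and logging probabilities p(a_i | x_i).
    n = len(rewards)
    values = np.empty(n)
    for i in range(n):
        pi_action = target_policy(contexts[i])
        # Direct-method part: reward-model prediction under the target action.
        estimate = reward_hat(contexts[i], pi_action)
        # Importance-weighted correction, applied only when the logged action
        # matches the action the target policy would have taken.
        if pi_action == actions[i]:
            estimate += (rewards[i] - reward_hat(contexts[i], actions[i])) / propensities[i]
        values[i] = estimate
    return float(values.mean())

The estimate remains unbiased if either the propensity scores or the reward model is accurate, which is the "doubly robust" property the paper builds on for both evaluation and optimization.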

Citations

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

TLDR
The SWITCH estimator is proposed, which can use an existing reward model to achieve a better bias-variance tradeoff than IPS and DR; an upper bound on its MSE is proved, and its benefits are demonstrated empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
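The switching idea summarized above can be sketched in a few lines: importance weighting is used where the importance weight is small, and the reward model takes over where the weight (and hence the variance) is large. This is only a rough illustration with hypothetical names and a hand-picked threshold tau, not the authors' exact estimator or its tuning rule.

import numpy as np

def switch_estimate(rewards, importance_weights, model_values, tau):
    # rewards:            observed rewards r_i under the logging policy
    # importance_weights: w_i = pi(a_i | x_i) / mu(a_i | x_i)
    # model_values:       reward-model estimate of the target policy's value
    #                     at each context, e.g. sum_a pi(a | x_i) * r_hat(x_i, a)
    # tau:                switching threshold (a tuning parameter)
    use_weighting = importance_weights <= tau
    per_sample = np.where(use_weighting,
                          importance_weights * rewards,  # importance-weighted term
                          model_values)                  # model-based fallback
    return float(per_sample.mean())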

Stateful Offline Contextual Policy Evaluation and Learning

TLDR
In simulations, it is shown that the advantages of doubly-robust estimation in the single time-step setting, via unbiased and lower-variance estimation, can directly translate to improved out-of-sample policy performance.

Off-Policy Evaluation with Policy-Dependent Optimization Response

The intersection of causal inference and machine learning for decision-making is rapidly expanding, but the default decision criterion remains an average of individual causal outcomes across a…

Confounding-Robust Policy Improvement

TLDR
It is demonstrated that hidden confounding can hinder existing policy learning approaches and lead to unwarranted harm, while the robust approach guarantees safety and focuses on well-evidenced improvement, a necessity for making personalized treatment policies learned from observational data reliable in practice.

Minimax-Optimal Policy Learning Under Unobserved Confounding

TLDR
It is demonstrated that hidden confounding can hinder existing policy-learning approaches and lead to unwarranted harm, while the robust approach guarantees safety and focuses on well-evidenced improvement, a necessity for making personalized treatment policies learned from observational data reliable in practice.

Off-Policy Risk Assessment in Contextual Bandits

TLDR
This paper proposes Off-Policy Risk Assessment (OPRA), a framework that first estimates a target policy’s CDF and then generates plugin estimates for any collection of Lipschitz risks, providing finite sample guarantees that hold simultaneously over the entire class.
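The plug-in recipe the TLDR describes can be illustrated with an importance-weighted CDF estimate followed by a functional of that CDF; the grid, the weights, and the choice of the mean as the example risk are illustrative, not OPRA's exact construction or its finite-sample guarantees.

import numpy as np

def weighted_cdf(rewards, importance_weights, grid):
    # Importance-weighted estimate of the target policy's reward CDF,
    # F(t) = P(r <= t), evaluated at each threshold t in `grid`.
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(importance_weights, dtype=float)
    return np.array([np.mean(w * (r <= t)) for t in grid])

def plugin_mean(grid, cdf_values):
    # Example plug-in functional: recover an estimate of E[r] from the CDF
    # via a Riemann-Stieltjes sum over the grid. Other Lipschitz risks
    # (e.g. CVaR) can be plugged in from the same CDF estimate.
    increments = np.diff(np.concatenate(([0.0], np.asarray(cdf_values))))
    return float(np.sum(np.asarray(grid) * increments))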

Bayesian Sensitivity Analysis for Offline Policy Evaluation

TLDR
A flexible Bayesian approach is developed to gauge the sensitivity of predicted policy outcomes to unmeasured confounders, and its efficacy is demonstrated on a large dataset of judicial actions, in which one must decide whether defendants awaiting trial should be required to pay bail or can be released without payment.

Learning When-to-Treat Policies

TLDR
An “advantage doubly robust” estimator is developed for learning dynamic treatment rules from observational data under the assumption of sequential ignorability; it is practical for policy optimization and does not need any structural assumptions.

A Spectral Method for Off-Policy Evaluation in Contextual Bandits under Distribution Shift

TLDR
An intent shift model is proposed that introduces an intent variable to capture the distributional shift in context and reward; a consistent spectral estimator for the reweighting factor is developed together with a finite-sample analysis, and an MSE bound on the performance of this estimator is provided.

Offline Policy Comparison under Limited Historical Agent-Environment Interactions

TLDR
The Limited Data Estimator (LDE) is presented as a simple method for evaluating and comparing policies from a small number of interactions with the environment and is shown to be statistically reliable on policy comparison tasks under mild assumptions on the distribution of the historical data.
...

References

Showing 1-10 of 54 references

Doubly Robust Policy Evaluation and Learning

TLDR
It is proved that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies, and is expected to become common practice.

Eligibility Traces for Off-Policy Policy Evaluation

TLDR
This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
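The importance-sampling building block those eligibility-trace methods relate to can be sketched as follows; the trajectory format and the target_prob/behavior_prob callables are assumptions for illustration, and this is the plain ordinary importance-sampling estimator rather than any of the trace algorithms analyzed in the paper.

import numpy as np

def is_return_estimate(trajectories, target_prob, behavior_prob, gamma=1.0):
    # Ordinary importance-sampling estimate of a target policy's expected
    # return from trajectories collected under a behavior policy.
    # Each trajectory is a list of (state, action, reward) tuples;
    # target_prob(s, a) and behavior_prob(s, a) give action probabilities.
    estimates = []
    for trajectory in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward in trajectory:
            weight *= target_prob(state, action) / behavior_prob(state, action)
            ret += discount * reward
            discount *= gamma
        estimates.append(weight * ret)
    return float(np.mean(estimates))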

Exploration scavenging

TLDR
Theoretical results hold only when the exploration policy chooses ads independently of side information, an assumption that is typically violated by commercial systems; it is shown how clever uses of the theory nonetheless provide non-trivial and realistic applications.

Dynamic Regime Marginal Structural Mean Models for Estimation of Optimal Dynamic Treatment Regimes, Part I: Main Content

TLDR
This article describes an approach to estimating the optimal dynamic treatment regime among a set of enforceable regimes, composed of regimes defined by simple rules based on a subset of past information, and discusses locally efficient, doubly robust estimation of the model parameters and of the index of the optimal treatment regime in the set.

Sample-efficient Nonstationary Policy Evaluation for Contextual Bandits

We present and prove properties of a new offline policy evaluator for an exploration learning setting which is superior to previous evaluators. In particular, it simultaneously and correctly…

Better Algorithms for Benign Bandits

TLDR
A new algorithm is proposed for the bandit linear optimization problem that obtains a regret bound of O(√Q), where Q is the total variation in the cost functions, showing that it is possible to incur much less regret in a slowly changing environment even in the bandit setting.

A Robust Method for Estimating Optimal Treatment Regimes

TLDR
A doubly robust augmented inverse probability weighted (AIPW) estimator is used to find the optimal regime within a class of regimes, by selecting the regime that optimizes an estimator of the overall population mean outcome.
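A minimal sketch of that recipe for a binary treatment, assuming a pre-fit propensity model and outcome models mu0/mu1, and searching over a hypothetical class of single-feature threshold rules; the names and the rule class are illustrative, not the paper's.

import numpy as np

def aipw_value(treat_rule, x, a, y, propensity, mu0, mu1):
    # AIPW estimate of the mean outcome if treatment were assigned by
    # treat_rule(x). Inputs: covariates x (n, d), binary treatments a (n,),
    # outcomes y (n,), estimated P(a=1 | x), and outcome-model predictions
    # mu0, mu1 under control and treatment.
    d = treat_rule(x).astype(float)              # rule's recommendation per subject
    mu_d = d * mu1 + (1 - d) * mu0               # model prediction under the rule
    p_d = d * propensity + (1 - d) * (1 - propensity)
    follows_rule = (a == d).astype(float)        # observed treatment agrees with rule
    return float(np.mean(mu_d + follows_rule * (y - mu_d) / p_d))

def best_threshold_rule(thresholds, x, a, y, propensity, mu0, mu1, feature=0):
    # Pick the cutoff c maximizing the AIPW value of the rule
    # "treat if x[:, feature] > c" over a candidate grid of thresholds.
    return max(thresholds,
               key=lambda c: aipw_value(lambda xx: xx[:, feature] > c,
                                        x, a, y, propensity, mu0, mu1))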

Estimation of Regression Coefficients When Some Regressors are not Always Observed

In applied problems it is common to specify a model for the conditional mean of a response given a set of regressors. A subset of the regressors may be missing for some study subjects either…

Near-Optimal Reinforcement Learning in Polynomial Time

TLDR
New algorithms for reinforcement learning are presented and it is proved that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes.
...