Corpus ID: 227228975

Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies

@article{Lai2020OptimalMW,
  title={Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies},
  author={Jinlin Lai and Lixin Zou and Jiaxing Song},
  journal={ArXiv},
  year={2020},
  volume={abs/2011.14359}
}
Off-policy evaluation is a key component of reinforcement learning: it evaluates a target policy using offline data collected from behavior policies. It is a crucial step towards safe reinforcement learning and has been used in advertising, recommender systems, and many other applications. In these applications, the offline data is sometimes collected from multiple behavior policies. Previous works treat data from different behavior policies equally. Nevertheless, some behavior policies are…
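The abstract is truncated above. As a hedged illustration of the setting it describes (not the paper's estimator; the paper derives optimal mixture weights, while this sketch only shows where such weights enter), the code below combines per-behavior-policy importance-sampling estimates with mixture weights. All function names, the data layout, and the weighting choices are assumptions made for this example.

```python
import numpy as np

def per_decision_is(trajectories, gamma=0.99):
    """Per-decision importance-sampling estimate of the target policy's value.

    `trajectories` is a list of trajectories; each trajectory is a list of
    (pi_prob, mu_prob, reward) tuples, where pi_prob and mu_prob are the
    target- and behavior-policy probabilities of the logged action.
    """
    estimates = []
    for traj in trajectories:
        rho, ret = 1.0, 0.0
        for t, (pi_p, mu_p, r) in enumerate(traj):
            rho *= pi_p / mu_p              # cumulative importance ratio
            ret += (gamma ** t) * rho * r
        estimates.append(ret)
    estimates = np.asarray(estimates, dtype=float)
    # Return the value estimate and the variance of that estimate.
    return estimates.mean(), estimates.var() / len(estimates)

def mixture_ope(datasets, weights=None, gamma=0.99):
    """Combine per-behavior-policy estimates with mixture weights.

    `datasets[k]` holds the trajectories logged by behavior policy k.  With
    weights=None every behavior policy gets equal weight (the baseline the
    abstract criticizes); explicit weights let noisier behavior policies be
    down-weighted.  The paper itself derives optimal weights.
    """
    means, _variances = zip(*(per_decision_is(d, gamma) for d in datasets))
    if weights is None:
        weights = np.ones(len(datasets))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return float(np.dot(weights, means))
```

For example, `mixture_ope([data_mu1, data_mu2])` pools two behavior policies' estimates equally, while passing explicit `weights` (e.g. inverse-variance weights computed from the per-policy estimates) changes how much each behavior policy contributes to the final value estimate.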


References

Showing 1-10 of 31 references
More Efficient Off-Policy Evaluation through Regularized Targeted Learning
TLDR
A novel doubly-robust estimator is introduced for the OPE problem in RL, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature; empirically, this estimator uniformly wins over existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification.
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
TLDR
This work extends the doubly robust estimator for bandits to sequential decision-making problems, yielding an estimator that gets the best of both worlds: it is guaranteed to be unbiased and can have much lower variance than the popular importance-sampling estimators.
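As a rough sketch (notation mine, not quoted from the paper), the sequential doubly robust estimate is commonly written as a backward recursion over the horizon, with model estimates $\hat Q$ and $\hat V$ serving as a baseline that the importance weights correct:

$$
\hat V_{\mathrm{DR}}^{(t)} \;=\; \hat V(s_t) \;+\; \rho_t\Bigl(r_t + \gamma\,\hat V_{\mathrm{DR}}^{(t+1)} - \hat Q(s_t, a_t)\Bigr),
\qquad
\rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)},
$$

with $\hat V_{\mathrm{DR}}^{(H+1)} = 0$; the estimate of the target policy's value is $\hat V_{\mathrm{DR}}^{(1)}$ averaged over logged trajectories.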
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
TLDR
A new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy, based on an extension of the doubly robust estimator and a new way to mix between model-based estimates and importance-sampling-based estimates.
Infinite-horizon Off-Policy Policy Evaluation with Multiple Behavior Policies
TLDR
The estimated mixture policy (EMP) is proposed, a novel class of partially policy-agnostic methods for accurately estimating quantities generated by multiple behavior policies; experiments show that the algorithm offers significantly improved accuracy compared to state-of-the-art methods.
Off-policy evaluation for slate recommendation
TLDR
A new practical estimator that uses logged data to estimate a policy's performance and is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance.
Doubly robust off-policy evaluation with shrinkage
We propose a new framework for designing estimators for off-policy evaluation in contextual bandits. Our approach is based on the asymptotically optimal doubly robust estimator, but we shrink the…
Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
TLDR
A marginalized importance sampling (MIS) estimator is proposed that recursively estimates the state marginal distribution for the target policy at every step; its analysis is believed to give the first OPE estimation error bound with a polynomial dependence on the RL horizon $H$.
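As a hedged sketch (notation mine), an MIS estimator of this kind replaces the cumulative importance ratio of ordinary importance sampling with a ratio of estimated state marginals times a one-step action ratio, roughly:

$$
\hat v^{\pi} \;=\; \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{H}
\frac{\hat d_t^{\pi}(s_t^{(i)})}{\hat d_t^{\mu}(s_t^{(i)})}\,
\frac{\pi(a_t^{(i)} \mid s_t^{(i)})}{\mu(a_t^{(i)} \mid s_t^{(i)})}\,
r_t^{(i)},
$$

where $\hat d_t^{\pi}$ and $\hat d_t^{\mu}$ denote estimated state marginal distributions at step $t$ under the target and behavior policies.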
CAB: Continuous Adaptive Blending for Policy Evaluation and Learning
TLDR
This analysis identifies a new counterfactual estimator – called Continuous Adaptive Blending (CAB) – which enjoys many advantageous theoretical and practical properties and can have less variance than Doubly Robust and IPS estimators.
Optimal and Adaptive Off-policy Evaluation in Contextual Bandits
TLDR
The SWITCH estimator is proposed, which can use an existing reward model to achieve a better bias-variance tradeoff than IPS and DR; an upper bound on its MSE is proven and its benefits are demonstrated empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
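A minimal, simplified sketch of the switching idea (my own simplification, not the published estimator, which also sums the model term over all actions whose importance weight exceeds the threshold):

```python
import numpy as np

def switch_style_estimate(rho, rewards, model_values, tau):
    """Switch between importance weighting and a reward model per sample:
    keep the importance-weighted reward where the weight rho is small,
    and fall back to the model's value estimate where rho exceeds tau."""
    rho = np.asarray(rho, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    model_values = np.asarray(model_values, dtype=float)
    use_weighted = rho <= tau
    return float(np.mean(np.where(use_weighted, rho * rewards, model_values)))
```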
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
TLDR
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.