Corpus ID: 235435868

Control Variates for Slate Off-Policy Evaluation

@inproceedings{Vlassis2021ControlVF,
  title={Control Variates for Slate Off-Policy Evaluation},
  author={Nikos A. Vlassis and Ashok Chandrashekar and Fernando Amat Gil and Nathan Kallus},
  booktitle={NeurIPS},
  year={2021}
}
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions, often termed slates. The problem is common to recommender systems and user-interface optimization, and it is particularly challenging because of the combinatorially-sized action space. Swaminathan et al. (2017) have proposed the pseudoinverse (PI) estimator under the assumption that the conditional mean rewards are additive in actions. Using control variates, we consider a large… 
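To make the setting concrete, below is a minimal NumPy sketch of the pseudoinverse (PI) estimator together with a simple control-variate correction. It assumes that both the logging and target policies factor independently across the K slate slots, in which case the PI weight reduces to the sum of per-slot importance weights minus (K - 1); the least-squares choice of control-variate coefficients is purely illustrative and is not claimed to match the estimator family developed in the paper.

import numpy as np

def pi_estimate_with_control_variates(rewards, slot_weights, use_cv=True):
    """Off-policy value estimate for a slate policy from logged data.

    rewards      : shape (n,)   observed slate-level rewards r_i
    slot_weights : shape (n, K) per-slot importance weights
                   w_{ik} = pi_k(a_{ik} | x_i) / mu_k(a_{ik} | x_i),
                   assuming both policies factor across the K slots.
    """
    n, K = slot_weights.shape

    # Pseudoinverse (PI) estimator: with factored policies the PI weight
    # collapses to (sum of per-slot weights) - K + 1.
    pi_weights = slot_weights.sum(axis=1) - K + 1
    target = rewards * pi_weights
    if not use_cv:
        return target.mean()

    # Control variates: each per-slot weight has expectation 1 under the
    # logging policy, so the centered weights (w_{ik} - 1) are mean-zero
    # and can be subtracted with fitted coefficients to reduce variance
    # without introducing asymptotic bias.
    cv = slot_weights - 1.0
    beta, *_ = np.linalg.lstsq(cv - cv.mean(axis=0),
                               target - target.mean(), rcond=None)
    return (target - cv @ beta).mean()

In practice the coefficients would be estimated more carefully (e.g., with cross-fitting); the sketch is only meant to show where the variance reduction comes from.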

Off-Policy Evaluation for Large Action Spaces via Embeddings

A new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space is proposed, and the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.

Safe Optimal Design with Applications in Off-Policy Learning

This work proposes a safe optimal logging policy for the case when no side information about the actions' expected rewards is available, improves upon this design by incorporating such side information, and extends both approaches to a large number of actions under a linear reward model.

Safe Data Collection for Offline and Online Policy Learning

The Safe Phased-Elimination (SafePE) algorithm is developed, which achieves the optimal regret bound with only a logarithmic number of policy updates and is also applicable to the safe online learning setting.

References

Showing 1-10 of 48 references.

Off-policy evaluation for slate recommendation

A new practical estimator is proposed that uses logged data to estimate a policy's performance and is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance.

More Efficient Off-Policy Evaluation through Regularized Targeted Learning

A novel doubly robust estimator for the OPE problem in RL is introduced, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature; empirically, it uniformly outperforms existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification.

Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions

This work proposes a new counterfactual estimator that allows for sequential interactions in the rewards, achieving lower variance while remaining asymptotically unbiased, and shows that it outperforms existing methods in terms of bias and data efficiency on the sequential track recommendation problem.

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

The SWITCH estimator is proposed, which can use an existing reward model to achieve a better bias-variance tradeoff than IPS and DR; an upper bound on its MSE is proved, and its benefits are demonstrated empirically on a diverse collection of datasets, often outperforming prior work by orders of magnitude.

Learning from eXtreme Bandit Feedback

This paper introduces a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime and employs this estimator in a novel algorithmic procedure, named Policy Optimization for eXtreme Models (POXM), for learning from bandit feedback on XMC tasks.

Doubly robust off-policy evaluation with shrinkage

We propose a new framework for designing estimators for off-policy evaluation in contextual bandits. Our approach is based on the asymptotically optimal doubly robust estimator, but we shrink the importance weights to obtain a better bias-variance tradeoff.

Top-K Off-Policy Correction for a REINFORCE Recommender System

This work presents a general recipe for addressing biases in a production top-K recommender system at YouTube, built with a policy-gradient-based algorithm, i.e., REINFORCE, and proposes a novel top-K off-policy correction to account for the policy recommending multiple items at a time.

Counterfactual Risk Minimization: Learning from Logged Bandit Feedback

This work develops a learning principle and an efficient algorithm for batch learning from logged bandit feedback and shows how CRM can be used to derive a new learning method, called Policy Optimizer for Exponential Models (POEM), for learning stochastic linear rules for structured output prediction.

On the Design of Estimators for Bandit Off-Policy Evaluation

A framework for designing estimators for bandit off-policy evaluation is described, and a simple design for contextual bandits is described that gives rise to an estimator shown to perform well on multi-class cost-sensitive classification datasets.

Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

This work proposes new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS.