Recommendations as Treatments: Debiasing Learning and Evaluation
- Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, N. Chandak, T. Joachims
- Computer Science, ICML
- 17 February 2016
This paper provides a principled approach to handling selection biases by adapting models and estimation techniques from causal inference. This yields unbiased performance estimators despite biased data, as well as a matrix factorization method with substantially improved prediction performance on real-world data.
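The core idea behind such unbiased estimators is inverse-propensity scoring (IPS): each observed rating's loss is reweighted by the inverse of its probability of being observed, so that the self-selected sample behaves, in expectation, like a complete one. A minimal sketch (function and argument names are illustrative, not the paper's API):

```python
import numpy as np

def ips_error_estimate(pred_loss, observed, propensity):
    """IPS estimate of the average loss over ALL user-item pairs,
    computed from only the observed (self-selected) entries.

    pred_loss  : per-pair loss delta(Y, Yhat), shape (N,)
    observed   : 0/1 indicator of which pairs were actually rated, shape (N,)
    propensity : P(pair is observed), shape (N,)
    """
    n = len(pred_loss)
    # Each observed loss is weighted by 1/propensity; unobserved pairs
    # contribute zero, so rarely-observed pairs count more when they do appear.
    return np.sum(observed * pred_loss / propensity) / n
```

With uniform propensities the estimator reduces to the naive average; with non-uniform observation, the reweighting corrects for over-represented pairs.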
Unbiased Learning-to-Rank with Biased Feedback
A counterfactual inference framework is presented that provides the theoretical basis for unbiased LTR via Empirical Risk Minimization despite biased data, and a Propensity-Weighted Ranking SVM is derived for discriminative learning from implicit feedback, with click models taking the role of the propensity estimator.
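The propensity weighting in this setting applies at the level of clicks: each clicked document contributes its rank under the candidate ranker, inverse-weighted by the click model's estimate of how likely the user was to examine that position in the logged ranking. A minimal per-click sketch of the weighting (not the full Ranking SVM; names are illustrative):

```python
import numpy as np

def ipw_rank_risk(clicked_ranks, propensities):
    """Propensity-weighted empirical risk for LTR from click data.

    clicked_ranks : rank of each clicked document under the new ranker
    propensities  : estimated examination probability of each click
                    under the logging ranker (from a click model)
    """
    r = np.asarray(clicked_ranks, dtype=float)
    p = np.asarray(propensities, dtype=float)
    # Clicks at positions users rarely examine get upweighted, which
    # de-biases the position bias of the logged rankings.
    return np.mean(r / p)
```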
The Self-Normalized Estimator for Counterfactual Learning
This paper identifies a severe problem with the counterfactual risk estimator typically used in batch learning from logged bandit feedback (BLBF), and proposes the use of an alternative estimator that…
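The contrast between the standard IPS estimator and its self-normalized variant can be sketched directly. The vanilla estimator divides by the sample size, which rewards policies that drive all importance weights toward zero ("propensity overfitting"); the self-normalized estimator divides by the sum of the weights instead:

```python
import numpy as np

def ips(rewards, new_prob, log_prob):
    """Vanilla IPS value estimate of a new policy from logged data."""
    w = new_prob / log_prob
    return np.mean(w * rewards)

def snips(rewards, new_prob, log_prob):
    """Self-normalized IPS: normalizing by the mean importance weight
    removes the incentive to shrink all weights toward zero."""
    w = new_prob / log_prob
    return np.sum(w * rewards) / np.sum(w)
```

If every logged reward equals the same constant, SNIPS recovers that constant exactly no matter how small the weights are, while vanilla IPS can report an arbitrarily deflated value.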
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
This work develops a learning principle and an efficient algorithm for batch learning from logged bandit feedback, and shows how CRM can be used to derive a new learning method, Policy Optimizer for Exponential Models (POEM), for learning stochastic linear rules for structured output prediction.
Batch learning from logged bandit feedback through counterfactual risk minimization
The empirical results show that the CRM objective implemented in POEM provides improved robustness and generalization performance compared to the state-of-the-art, and a decomposition of the POEM objective that enables efficient stochastic gradient optimization is presented.
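The CRM principle the two entries above describe amounts to penalizing the IPS objective by its empirical variance, so that policies whose counterfactual estimate is high but unreliable are avoided. A minimal sketch (the clipping bound and regularization weight `lam` are illustrative hyperparameters, not the paper's exact values):

```python
import numpy as np

def crm_objective(rewards, new_prob, log_prob, lam=0.5, clip=100.0):
    """Counterfactual Risk Minimization objective (to be maximized):
    clipped-IPS value minus a variance penalty, as in the POEM line of work.
    """
    w = np.clip(new_prob / log_prob, 0.0, clip)  # weight clipping
    vals = w * rewards
    n = len(vals)
    # Subtract an empirical-standard-error term: high-variance estimates
    # are distrusted even when their mean looks good.
    return np.mean(vals) - lam * np.sqrt(np.var(vals, ddof=1) / n)
```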
Off-policy evaluation for slate recommendation
A new practical estimator is presented that uses logged data to estimate a policy's performance; it is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance.
Deep Learning with Logged Bandit Feedback
BanditNet, a Counterfactual Risk Minimization (CRM) approach for training deep networks using an equivariant empirical risk estimator with variance regularization, is proposed, and it is shown how the resulting objective can be decomposed in a way that allows training via Stochastic Gradient Descent (SGD).
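The equivariance trick can be sketched simply: translate every logged loss by a baseline before importance weighting. BanditNet searches over this baseline (a Lagrangian term) so that minimizing the translated IPS objective tracks the self-normalized estimator while remaining decomposable into per-sample terms suitable for SGD. A sketch with the baseline as a fixed argument (names are illustrative):

```python
import numpy as np

def translated_ips_objective(losses, new_prob, log_prob, baseline):
    """Baseline-translated IPS training objective from logged bandit
    feedback. With baseline = 0 this is plain IPS; a well-chosen baseline
    removes IPS's sensitivity to additive shifts of the loss."""
    w = new_prob / log_prob
    # Each summand depends on one logged sample only, so this decomposes
    # over minibatches for SGD.
    return np.mean(w * (losses - baseline))
```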
Provably Good Batch Reinforcement Learning Without Great Exploration
It is shown that a small, more conservative modification to the Bellman optimality and evaluation backups can yield much stronger guarantees on the performance of the output policy. In certain settings, the resulting methods find the approximately best policy within the state-action space explored by the batch data, without requiring a priori concentrability assumptions.
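The flavor of such a conservative backup can be sketched on a tabular MDP: state-action pairs that the batch data barely covers are pinned to a pessimistic floor value rather than trusted to an extrapolated estimate. The count threshold and the floor below are illustrative stand-ins for the paper's more careful construction:

```python
import numpy as np

def conservative_backup(q, counts, rewards, next_state, gamma=0.9, min_count=2):
    """One conservative Bellman optimality backup over a tabular MDP.

    q, counts, rewards : (S, A) arrays of values, visit counts, rewards
    next_state         : (S, A) integer array of deterministic successors
                         (a simplification for the sketch)
    """
    # Pessimistic floor: the worst possible discounted return.
    v_floor = rewards.min() / (1.0 - gamma)
    # Standard optimality backup: r(s,a) + gamma * max_a' q(s', a').
    backed_up = rewards + gamma * q[next_state].max(axis=-1)
    # Under-visited pairs get the floor instead of the backed-up value.
    return np.where(counts >= min_count, backed_up, v_floor)
```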
Off-Policy Policy Gradient with State Distribution Correction
- Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill
- Political Science, Economics, UAI
- 17 April 2019
This work builds on recent progress in estimating the ratio of the state distributions under the behavior and evaluation policies, and presents an off-policy policy gradient optimization technique that accounts for this mismatch in distributions.
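The correction can be sketched as an extra per-sample weight: the usual off-policy gradient reweights by the action-propensity ratio pi/mu, while the corrected estimator additionally multiplies by the state-distribution ratio d_pi/d_mu (estimated separately). Setting that ratio to 1 recovers the common, biased off-policy gradient. All names are illustrative:

```python
import numpy as np

def corrected_pg(grads_logpi, rewards, d_ratio, a_ratio):
    """Off-policy policy-gradient estimate with state-distribution
    correction.

    grads_logpi : (N, D) per-sample score function grad log pi(a|s)
    rewards     : (N,) logged rewards (or returns)
    d_ratio     : (N,) estimated d_pi(s) / d_mu(s) per sample
    a_ratio     : (N,) pi(a|s) / mu(a|s) per sample
    Returns the (D,) gradient estimate.
    """
    w = d_ratio * a_ratio
    return np.mean(w[:, None] * rewards[:, None] * grads_logpi, axis=0)
```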
Large-scale Validation of Counterfactual Learning Methods: A Test-Bed
- Damien Lefortier, Adith Swaminathan, Xiaotao Gu, T. Joachims, M. de Rijke
- Computer Science, ArXiv
- 1 December 2016
The results show experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale real-world data set.