Off-policy Bandits with Deficient Support

Noveen Sachdeva, Yi-Hsun Su, Thorsten Joachims
Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
Learning effective contextual-bandit policies from past actions of a deployed system is highly desirable in many settings (e.g. voice assistants, recommendation, search), since it enables the reuse of large amounts of log data. State-of-the-art methods for such off-policy learning, however, are based on inverse propensity score (IPS) weighting. A key theoretical requirement of IPS weighting is that the policy that logged the data has "full support", which typically translates into requiring non… 
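
To make the support requirement concrete, here is a minimal Python sketch; it is not taken from the paper. It simulates a context-free bandit with three actions and compares the IPS estimate under a full-support logging policy with the estimate under a logging policy that never plays one action. The reward means, both logging policies, the target policy, and the helper names (`simulate_logs`, `ips_estimate`) are all illustrative assumptions.

```python
import random


def simulate_logs(logging_policy, mean_reward, n, rng):
    """Draw n logged (action, reward, propensity) triples from the logging policy."""
    actions = list(range(len(logging_policy)))
    logs = []
    for _ in range(n):
        # Actions with zero logging probability are never drawn.
        a = rng.choices(actions, weights=logging_policy)[0]
        # Bernoulli reward with an assumed (hypothetical) mean per action.
        r = 1.0 if rng.random() < mean_reward[a] else 0.0
        logs.append((a, r, logging_policy[a]))
    return logs


def ips_estimate(logs, target_policy):
    """Inverse propensity score estimate of the target policy's expected reward."""
    total = 0.0
    for action, reward, propensity in logs:
        total += target_policy[action] / propensity * reward
    return total / len(logs)


rng = random.Random(0)

# Hypothetical problem instance (not from the paper).
mean_reward = [0.2, 0.5, 0.9]
target_policy = [0.2, 0.3, 0.5]
full_support = [0.4, 0.4, 0.2]       # every action has non-zero probability
deficient_support = [0.5, 0.5, 0.0]  # action 2 is never logged

# True value of the target policy: 0.2*0.2 + 0.3*0.5 + 0.5*0.9 = 0.64
true_value = sum(p * r for p, r in zip(target_policy, mean_reward))

est_full = ips_estimate(
    simulate_logs(full_support, mean_reward, 200_000, rng), target_policy
)
est_deficient = ips_estimate(
    simulate_logs(deficient_support, mean_reward, 200_000, rng), target_policy
)

print(f"true value:               {true_value:.3f}")
print(f"IPS, full support:        {est_full:.3f}")
print(f"IPS, deficient support:   {est_deficient:.3f}")
```

Under these assumptions the full-support estimate concentrates around the true value of 0.64, while the deficient-support estimate concentrates around 0.19: the contribution of the unlogged action (0.5 × 0.9 = 0.45) is silently dropped, and collecting more data does not remove that bias.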

Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support

This work proposes a novel approach that uses a hybrid of offline learning with online exploration that determines an optimal policy with theoretical guarantees using the minimal number of online explorations.

Off-Policy Evaluation for Large Action Spaces via Embeddings

A new OPE estimator is proposed that leverages marginalized importance weights when action embeddings provide structure in the action space; the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.

Pessimistic Reward Models for Off-Policy Learning in Recommendation

This work proposes and validates a general pessimistic reward modelling approach for off-policy learning in recommendation, shows how it alleviates a well-known decision-making phenomenon known as the Optimiser’s Curse, and draws parallels with existing work on pessimistic policy learning.

Counterfactual Learning with General Data-generating Policies

An OPE method is developed for a class of both full-support and deficient-support logging policies in contextual-bandit settings; the class includes deterministic bandit algorithms as well as deterministic decision-making based on supervised and unsupervised learning.

Counterfactual Evaluation and Learning for Interactive Systems: Foundations, Implementations, and Recent Advances

The fundamentals of OPE/OPL are introduced, theoretical and empirical comparisons of conventional methods are provided, and emerging practical challenges such as how to handle large action spaces, distributional shift, and hyper-parameter tuning are covered.

Counterfactual Learning and Evaluation for Recommender Systems: Foundations, Implementations, and Recent Advances

The fundamentals of OPE/OPL are introduced, theoretical and empirical comparisons of conventional methods are provided, and emerging practical challenges such as how to take into account combinatorial actions, distributional shift, fairness of exposure, and two-sided market structures are covered.

Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy

This paper derives an asymptotically normal estimator of the value of an evaluation policy from a martingale difference sequence for the dependent samples and simultaneously solves the deficient support problem.

Off-Policy Actor-critic for Recommender Systems

The key designs in setting up an off-policy actor-critic agent for production recommender systems are shared, and offline and live experiments demonstrate that the new framework outperforms the baseline and improves long-term user experience.

Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits

This paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem, and explores multiple approaches to computing MVAL policies efficiently, finding that they can be substantially more effective in decreasing the variance of an estimator than naïve approaches.

Local Policy Improvement for Recommender Systems

It is argued that this local policy improvement paradigm is particularly well suited for recommender systems, given that in practice the previously-deployed policy is typically of reasonably high quality, and furthermore it tends to be re-trained frequently and gets continuously updated.



Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

The SWITCH estimator is proposed, which can use an existing reward model to achieve a better bias-variance tradeoff than IPS and DR. An upper bound on its MSE is proved, and its benefits are demonstrated empirically on a diverse collection of data sets, where it often outperforms prior work by orders of magnitude.

Doubly Robust Policy Evaluation and Learning

It is proved that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies, and is expected to become common practice.

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

A practical algorithm, bootstrapping error accumulation reduction (BEAR), is proposed and it is demonstrated that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.

CAB: Continuous Adaptive Blending for Policy Evaluation and Learning

This paper presents and analyzes a family of counterfactual estimators which subsumes most estimators proposed to date and identifies a new estimator – called Continuous Adaptive Blending (CAB) – which enjoys many advantageous theoretical and practical properties.

Safe Policy Improvement with Soft Baseline Bootstrapping

This work improves the SPI with Baseline Bootstrapping algorithm (SPIBB) by allowing the policy search over a wider set of policies; it adopts a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty.

Learning from Logged Implicit Exploration Data

We provide a sound and consistent foundation for the use of nonrandom exploration data in "contextual bandit" or "partially labeled" settings where only the value of a chosen action is learned. The

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

This work extends the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators.

The Self-Normalized Estimator for Counterfactual Learning

This paper identifies a severe problem of the counterfactual risk estimator typically used in batch learning from logged bandit feedback (BLBF), and proposes the use of an alternative estimator that

Deep Learning with Logged Bandit Feedback

A Counterfactual Risk Minimization (CRM) approach for training deep networks using an equivariant empirical risk estimator with variance regularization, BanditNet, is proposed and it is shown how the resulting objective can be decomposed in a way that allows Stochastic Gradient Descent (SGD) training.

Safe Policy Improvement with Baseline Bootstrapping

This paper adopts the safe policy improvement (SPI) approach, inspired by the knows-what-it-knows paradigms, and develops two computationally efficient bootstrapping algorithms, a value-based and a policy-based, both accompanied with theoretical SPI bounds.