Corpus ID: 220496532

Contextual Bandit with Missing Rewards

Djallel Bouneffouf, Sohini Upadhyay, Yasaman Khazaeni
We consider a novel variant of the contextual bandit problem (i.e., the multi-armed bandit with side information, or context, available to a decision-maker) in which the reward associated with each context-based decision may not always be observed ("missing rewards"). This new problem is motivated by certain online settings, including clinical trial and ad recommendation applications. In order to address the missing-rewards setting, we propose to combine the standard contextual bandit approach with…
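The abstract is truncated, so the exact combination is not stated here; a minimal sketch of the general idea, assuming a LinUCB-style contextual learner whose per-arm statistics are updated only when the reward is observed, with a context-free running mean kept as a fallback signal (the names `p_missing`, `alpha`, and the fallback scheme are illustrative assumptions, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, T = 5, 3, 2000
theta_true = rng.normal(size=(n_arms, d))  # hidden linear reward parameters

# Per-arm LinUCB statistics (ridge regression) plus a context-free
# fallback mean, both updated only when the reward is actually observed.
A = np.stack([np.eye(d) for _ in range(n_arms)])  # d x d Gram matrices
b = np.zeros((n_arms, d))
counts = np.ones(n_arms)
means = np.zeros(n_arms)
alpha, p_missing = 1.0, 0.3

for t in range(T):
    x = rng.normal(size=d)
    ucb = np.empty(n_arms)
    for a in range(n_arms):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]
        ucb[a] = theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)
    arm = int(np.argmax(ucb))
    reward = theta_true[arm] @ x + rng.normal(scale=0.1)
    if rng.random() > p_missing:  # reward observed this round
        A[arm] += np.outer(x, x)
        b[arm] += reward * x
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    # when the reward is missing, no statistics are updated; `means`
    # remains available as a context-free multi-armed-bandit fallback
```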


Some performance considerations when using multi-armed bandit algorithms in the presence of missing data
This work investigates the impact of missing data on several bandit algorithms via a simulation study assuming the rewards are missing at random, and illustrates that the problem of missing responses can be alleviated using a simple mean-imputation approach.
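Mean imputation here simply means substituting an arm's running mean for a reward that was never observed. A minimal sketch on an epsilon-greedy Bernoulli bandit (the arm means, `eps`, and `p_missing` are illustrative values, not taken from the study):

```python
import random

random.seed(1)
true_means = [0.2, 0.5, 0.8]   # hidden Bernoulli success probabilities
n_arms, T, eps, p_missing = 3, 5000, 0.1, 0.4

counts = [0] * n_arms
means = [0.0] * n_arms

for t in range(T):
    # epsilon-greedy arm choice over the (possibly imputed) means
    if random.random() < eps:
        arm = random.randrange(n_arms)
    else:
        arm = max(range(n_arms), key=lambda a: means[a])
    if random.random() >= p_missing:
        reward = 1.0 if random.random() < true_means[arm] else 0.0
    else:
        reward = means[arm]  # mean imputation for the missing reward
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```

Because the imputed value equals the arm's current estimate, a missing round leaves the estimate unchanged while still letting the algorithm proceed.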
Spectral Clustering using Eigenspectrum Shape Based Nystrom Sampling
A scalable Nyström-based clustering algorithm with a new sampling procedure, Centroid Minimum Sum of Squared Similarities (CMS3), and a heuristic for when to use it, which yields competitive low-rank approximations on test datasets compared to other state-of-the-art methods.
Etat de l'art sur l'application des bandits multi-bras
A comprehensive review of the main recent developments across multiple real-world applications of bandits, identifying important current trends and providing new perspectives on the future of this thriving field.


Online learning with Corrupted context: Corrupted Contextual Bandits
This work proposes to combine the standard contextual bandit approach with a classical multi-armed bandit mechanism to address the corrupted-context setting where the context used at each decision may be corrupted ("useless context").
Context Attentive Bandits: Contextual Bandit with Restricted Context
This work adapts the standard multi-armed bandit algorithm known as Thompson Sampling to take advantage of the restricted-context setting, and proposes two novel algorithms, Thompson Sampling with Restricted Context (TSRC) and Windowed Thompson Sampling with Restricted Context (WTSRC), for handling stationary and nonstationary environments, respectively.
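The building block both variants adapt is plain Thompson Sampling, which draws one sample per arm from a Beta posterior and plays the argmax; the restricted-context feature selection is not shown. A minimal sketch on a Bernoulli bandit (arm means and horizon are illustrative values):

```python
import random

random.seed(0)
true_means = [0.3, 0.6, 0.9]   # hidden Bernoulli success probabilities
n_arms, T = 3, 2000

# Beta(1, 1) prior per arm; alpha/beta count successes/failures
alpha = [1] * n_arms
beta = [1] * n_arms

for t in range(T):
    # sample one plausible mean per arm from its posterior, play the best
    samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
    arm = max(range(n_arms), key=lambda a: samples[a])
    reward = 1 if random.random() < true_means[arm] else 0
    alpha[arm] += reward
    beta[arm] += 1 - reward
```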
Contextual Bandit with Adaptive Feature Extraction
The approach starts with an off-line pre-training on unlabeled history of contexts, followed by an online selection and adaptation of encoders, which selects the most appropriate encoding function to extract a feature vector which becomes an input for a contextual bandit.
Hyper-parameter Tuning for the Contextual Bandit
Two algorithms that use a bandit to tune the exploration parameter of a contextual bandit algorithm are presented, which the authors hope is a first step toward the automation of multi-armed bandit algorithms.
A Neural Networks Committee for the Contextual Bandit Problem
A new contextual bandit algorithm, NeuralBandit, which requires no stationarity assumptions on contexts and rewards, is presented, and two variants based on a multi-expert approach are proposed to choose the parameters of multi-layer perceptrons online.
Using Contextual Bandits with Behavioral Constraints for Constrained Online Movie Recommendation
This work details a novel online system, based on an extension of the contextual bandit framework, that learns a set of behavioral constraints by observation and uses these constraints to guide decisions in an online setting while remaining reactive to reward feedback.
A Survey on Practical Applications of Multi-Armed and Contextual Bandits
A taxonomy of common MAB-based applications is introduced and the state of the art for each of those domains is summarized, identifying important current trends and providing new perspectives on the future of this exciting and fast-growing field.
Contextual Bandit for Active Learning: Active Thompson Sampling
A sequential algorithm named Active Thompson Sampling (ATS) is proposed, which, in each round, assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for this sample point label.
Contextual Bandits with Linear Payoff Functions
An O(√(Td ln³(KT ln(T)/δ))) regret bound is proved that holds with probability 1 − δ for the simplest known upper confidence bound algorithm for this problem.
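To get a feel for the scale of this bound, it can be evaluated numerically; the helper below and its inputs (T, d, K, δ) are purely illustrative, with the unknown constant factor taken as 1:

```python
import math

def linucb_regret_bound(T, d, K, delta):
    # O(sqrt(T * d * ln^3(K * T * ln(T) / delta))), constant factor = 1
    return math.sqrt(T * d * math.log(K * T * math.log(T) / delta) ** 3)

# example: a 5-arm problem with 10-dimensional contexts over 10,000 rounds
print(linucb_regret_bound(T=10_000, d=10, K=5, delta=0.05))
```

The √T dependence means the bound grows sublinearly in the horizon, so per-round regret vanishes as T grows.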