A Neural Networks Committee for the Contextual Bandit Problem

@inproceedings{Allesiardo2014ANN,
  title={A Neural Networks Committee for the Contextual Bandit Problem},
  author={Robin Allesiardo and Rapha{\"e}l F{\'e}raud and Djallel Bouneffouf},
  booktitle={ICONIP},
  year={2014}
}
This paper presents a new contextual bandit algorithm, NeuralBandit, which requires no stationarity assumptions on contexts or rewards. [...] Key Method: Two variants, based on a multi-expert approach, are proposed to select the parameters of multi-layer perceptrons online. The proposed algorithms are successfully tested on a large dataset with and without stationarity of rewards.
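To make the committee idea concrete, here is a minimal numpy sketch of the scheme the abstract describes: several small multi-layer perceptrons with different hyper-parameters each predict per-arm rewards, and an exponential-weights rule selects among them online. The layer sizes, learning rates, and the simplified (non-importance-weighted) expert update are illustrative assumptions, not the paper's exact NeuralBandit variants.

import numpy as np

rng = np.random.default_rng(0)

class TinyMLP:
    """One-hidden-layer regressor trained by plain online SGD (illustrative)."""
    def __init__(self, d, hidden, lr):
        self.W1 = rng.normal(0.0, 0.1, (hidden, d))
        self.w2 = rng.normal(0.0, 0.1, hidden)
        self.lr = lr

    def predict(self, x):
        self.h = np.tanh(self.W1 @ x)   # cache activations for the update
        return float(self.w2 @ self.h)

    def update(self, x, target):
        err = self.predict(x) - target              # squared-loss gradient
        grad_h = err * self.w2 * (1.0 - self.h ** 2)
        self.w2 -= self.lr * err * self.h
        self.W1 -= self.lr * np.outer(grad_h, x)

class NeuralBanditCommittee:
    """Committee of per-arm MLPs; exponential weights choose the expert online."""
    def __init__(self, d, n_arms, configs, eta=0.1):
        # one expert = one (hidden size, learning rate) configuration
        self.experts = [[TinyMLP(d, h, lr) for _ in range(n_arms)]
                        for h, lr in configs]
        self.log_w = np.zeros(len(configs))
        self.eta = eta

    def act(self, x):
        p = np.exp(self.log_w - self.log_w.max())
        p /= p.sum()
        self.k = rng.choice(len(self.experts), p=p)  # sample an expert
        preds = [net.predict(x) for net in self.experts[self.k]]
        return int(np.argmax(preds))

    def learn(self, x, arm, reward):
        for nets in self.experts:       # every expert trains on the feedback
            nets[arm].update(x, reward)
        # credit the sampled expert (simplified, non-importance-weighted)
        self.log_w[self.k] += self.eta * reward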
Contextual Bandit with Missing Rewards
TLDR: Unlike standard contextual bandit methods, this work leverages clustering to estimate missing rewards, and is thus able to learn from every incoming event, even those with missing rewards.
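Per the summary above, the key step is imputing a reward when none is observed. A minimal sketch, assuming a k-nearest-neighbour average over past (context, reward) pairs as a stand-in for the paper's clustering-based estimate (the function name and memory format are hypothetical):

import numpy as np

def impute_reward(x, memory, k=5):
    """Estimate a missing reward from the k most similar observed contexts.
    memory is a list of (context, reward) pairs seen so far (hypothetical)."""
    if not memory:
        return 0.0                       # no history yet: neutral estimate
    X = np.array([c for c, _ in memory])
    r = np.array([v for _, v in memory])
    dist = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return float(r[nearest].mean())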
Adaptive Representation Selection in Contextual Bandit.
TLDR: An approach for improving the performance of contextual bandits in this setting is proposed, via adaptive, dynamic representation learning that combines offline pre-training on an unlabeled history of contexts with online selection and modification of embedding functions.
Hyper-parameter Tuning for the Contextual Bandit
TLDR: Two algorithms are presented that use a bandit to find the optimal exploration rate of a contextual bandit algorithm, which the authors hope is a first step toward the automation of multi-armed bandit algorithms.
Online Semi-Supervised Learning with Bandit Feedback
TLDR: This work formulates a new problem at the intersection of semi-supervised learning and contextual bandits, motivated by applications including clinical trials and ad recommendations, and takes the best of both approaches to develop a multi-GCN-embedded contextual bandit.
Online learning with Corrupted context: Corrupted Contextual Bandits
TLDR: This work proposes to combine the standard contextual bandit approach with a classical multi-armed bandit mechanism to address the corrupted-context setting, where the context observed at each decision may be corrupted ("useless context").
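One hedged way to realize the combination described above: run a contextual policy and a context-free multi-armed bandit side by side and let exponential weights decide which one acts, so that when contexts are useless the context-free statistics dominate. The act/learn interface and the weight update here are illustrative assumptions, not the paper's exact algorithm.

import numpy as np

rng = np.random.default_rng(1)

class ContextualOrNot:
    """Hedge between a contextual policy and a context-free bandit.
    Both sub-policies are assumed to expose act(x) and learn(x, arm, reward)."""
    def __init__(self, contextual_policy, mab_policy, eta=0.1):
        self.policies = [contextual_policy, mab_policy]
        self.log_w = np.zeros(2)
        self.eta = eta

    def act(self, x):
        p = np.exp(self.log_w - self.log_w.max())
        p /= p.sum()
        self.k = rng.choice(2, p=p)      # sample which policy acts
        return self.policies[self.k].act(x)

    def learn(self, x, arm, reward):
        for pol in self.policies:        # both policies see the feedback
            pol.learn(x, arm, reward)
        self.log_w[self.k] += self.eta * reward  # credit the acting policy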
Neural Contextual Bandits with Upper Confidence Bound-Based Exploration
TLDR: The NeuralUCB algorithm is proposed, which leverages the representation power of deep neural networks and uses a neural-network-based random feature mapping to construct an upper confidence bound (UCB) on the reward for efficient exploration.
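A deliberately simplified stand-in for the mechanism the summary describes: fix a random first layer as the feature mapping and compute a ridge-regression UCB on top of it. The real NeuralUCB builds its confidence bound from the gradients of a trained network; the fixed random ReLU features, the per-arm design matrices, and the width parameter beta below are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(2)

class RandomFeatureUCB:
    """UCB on fixed random ReLU features (a simplified NeuralUCB stand-in)."""
    def __init__(self, d, n_arms, m=64, lam=1.0, beta=1.0):
        self.W = rng.normal(0.0, 1.0 / np.sqrt(d), (m, d))  # random layer
        self.A = [lam * np.eye(m) for _ in range(n_arms)]   # design matrices
        self.b = [np.zeros(m) for _ in range(n_arms)]
        self.beta = beta

    def features(self, x):
        return np.maximum(self.W @ x, 0.0)                  # ReLU features

    def act(self, x):
        phi = self.features(x)
        scores = []
        for A, b in zip(self.A, self.b):
            Ainv = np.linalg.inv(A)
            mean = (Ainv @ b) @ phi
            bonus = self.beta * np.sqrt(phi @ Ainv @ phi)   # exploration term
            scores.append(mean + bonus)
        return int(np.argmax(scores))

    def learn(self, x, arm, reward):
        phi = self.features(x)
        self.A[arm] += np.outer(phi, phi)
        self.b[arm] += reward * phi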
Double-Linear Thompson Sampling for Context-Attentive Bandits
TLDR: An online learning framework is analyzed, motivated by practical applications where, due to observation costs, only a small subset of a potentially large number of context variables can be observed at each iteration, and a novel algorithm, called Context-Attentive Thompson Sampling (CATS), is derived.
Context Attentive Bandits: Contextual Bandit with Restricted Context
TLDR: This work adapts the standard multi-armed bandit algorithm known as Thompson Sampling to take advantage of the restricted-context setting, and proposes two novel algorithms, called Thompson Sampling with Restricted Context (TSRC) and Windows Thompson Sampling with Restricted Context (WTSRC), for handling stationary and nonstationary environments, respectively.
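A sketch of the restricted-context step: Thompson-sample a relevance score per feature, observe only the top-budget features, and feed the masked context to any contextual bandit. The Beta posteriors and the masking convention below are illustrative assumptions, not the TSRC update itself.

import numpy as np

rng = np.random.default_rng(3)

def choose_features(alpha, beta, budget):
    """Thompson-sample one Beta posterior per feature, keep the top `budget`."""
    scores = rng.beta(alpha, beta)
    return np.argsort(scores)[-budget:]

# Hypothetical usage: observe only the sampled features, zero out the rest.
d, budget = 10, 3
alpha, beta = np.ones(d), np.ones(d)    # one Beta(alpha_i, beta_i) per feature
observed = choose_features(alpha, beta, budget)
x_full = rng.normal(size=d)             # the full context (mostly unobservable)
x_masked = np.zeros(d)
x_masked[observed] = x_full[observed]   # input for any contextual bandit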
Contextual Bandit with Adaptive Feature Extraction
TLDR: The approach starts with offline pre-training on an unlabeled history of contexts, followed by online selection and adaptation of encoders; it selects the most appropriate encoding function to extract a feature vector that becomes the input to a contextual bandit.
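The online selection step described above could look like the following sketch, where each pre-trained encoder keeps a Beta posterior over how often its features led to a reward and Thompson sampling picks one per round. The encoders and the selection criterion here are hypothetical stand-ins; the paper's criterion may differ.

import numpy as np

rng = np.random.default_rng(4)

def pick_encoder(alpha, beta):
    """Thompson-sample over candidate encoders via per-encoder Beta posteriors."""
    return int(np.argmax(rng.beta(alpha, beta)))

# Hypothetical usage with two stand-in embedding functions.
encoders = [lambda x: x, lambda x: np.tanh(x)]
alpha, beta = np.ones(2), np.ones(2)
i = pick_encoder(alpha, beta)
z = encoders[i](np.ones(4))     # feature vector handed to the contextual bandit
# after observing reward r in {0, 1}: alpha[i] += r; beta[i] += 1 - r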

References

Showing 1-10 of 27 references
Efficient bandit algorithms for online multiclass prediction
TLDR: The Banditron is able to learn in a multiclass classification setting with "bandit" feedback, which only reveals whether or not the prediction made by the algorithm was correct (but does not necessarily reveal the true label).
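The Banditron update is compact enough to sketch directly. Below, correct(label) is a hypothetical oracle that returns only whether a guessed label is right, matching the bandit feedback described above; gamma is the exploration rate.

import numpy as np

rng = np.random.default_rng(5)

class Banditron:
    """Multiclass perceptron learning from correct/incorrect feedback only."""
    def __init__(self, d, n_classes, gamma=0.05):
        self.W = np.zeros((n_classes, d))
        self.gamma = gamma

    def step(self, x, correct):
        K = self.W.shape[0]
        y_hat = int(np.argmax(self.W @ x))      # greedy prediction
        p = np.full(K, self.gamma / K)          # epsilon-greedy exploration
        p[y_hat] += 1.0 - self.gamma
        y_tilde = rng.choice(K, p=p)            # the label actually played
        U = np.zeros_like(self.W)
        if correct(y_tilde):                    # only correctness is revealed
            U[y_tilde] = x / p[y_tilde]         # importance-weighted update
        U[y_hat] -= x
        self.W += U
        return y_tilde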
Efficient Optimal Learning for Contextual Bandits
TLDR: This work provides the first efficient algorithm with optimal regret; it uses a cost-sensitive classification learner as an oracle and has a running time of polylog(N), where N is the number of classification rules among which the oracle might choose.
Thompson Sampling for Contextual Bandits with Linear Payoffs
TLDR: A generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, where the contexts are provided by an adaptive adversary, is designed and analyzed.
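For reference, a standard Gaussian-posterior sketch of Thompson Sampling with linear payoffs, with one disjoint model per arm. The prior scale v and the per-arm parameterization are conventional choices for illustration, not necessarily the paper's exact setup.

import numpy as np

rng = np.random.default_rng(6)

class LinearTS:
    """Thompson Sampling with a Gaussian posterior over linear payoffs."""
    def __init__(self, d, n_arms, lam=1.0, v=0.5):
        self.B = [lam * np.eye(d) for _ in range(n_arms)]   # posterior precision
        self.f = [np.zeros(d) for _ in range(n_arms)]
        self.v = v                                          # posterior scale

    def act(self, x):
        scores = []
        for B, f in zip(self.B, self.f):
            Binv = np.linalg.inv(B)
            theta = rng.multivariate_normal(Binv @ f, self.v ** 2 * Binv)
            scores.append(theta @ x)            # score with the sampled model
        return int(np.argmax(scores))

    def learn(self, x, arm, reward):
        self.B[arm] += np.outer(x, x)
        self.f[arm] += reward * x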
Contextual Bandits with Linear Payoff Functions
TLDR: An $O\big(\sqrt{Td\,\ln^3(KT\ln(T)/\delta)}\big)$ regret bound is proved that holds with probability $1-\delta$ for the simplest known upper confidence bound algorithm for this problem.
Mortal Multi-Armed Bandits
TLDR: A new variant of the k-armed bandit problem, in which arms have a (stochastic) lifetime after which they expire, is studied, motivated by e-commerce applications, and an optimal algorithm for the state-aware (deterministic reward function) case is presented.
A contextual-bandit approach to personalized news article recommendation
TLDR: This work models personalized recommendation of news articles as a contextual bandit problem: a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks.
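The article-selection loop described above is essentially LinUCB; a minimal sketch with disjoint per-arm models follows (alpha is the exploration width; the disjoint-model simplification is an assumption, and the paper also develops a hybrid variant).

import numpy as np

class LinUCB:
    """Disjoint-model LinUCB: ridge regression per arm plus a UCB bonus."""
    def __init__(self, d, n_arms, alpha=1.0):
        self.A = [np.eye(d) for _ in range(n_arms)]     # per-arm design matrix
        self.b = [np.zeros(d) for _ in range(n_arms)]
        self.alpha = alpha

    def act(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            Ainv = np.linalg.inv(A)
            theta = Ainv @ b                            # ridge estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ Ainv @ x))
        return int(np.argmax(scores))

    def learn(self, x, arm, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x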
The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
TLDR: An algorithm for multi-armed bandits with observable side information that requires no knowledge of a time horizon; the regret incurred by Epoch-Greedy is controlled by a sample-complexity bound for a hypothesis class.
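A sketch of the epoch structure: spend one uniformly random exploration pull, refit a reward model on the exploration data, then exploit greedily for a number of steps that grows with the epoch index. The pull(t, arm) callback, the per-arm least-squares model, and the linear epoch schedule are illustrative assumptions; the paper derives its schedule from a sample-complexity bound.

import numpy as np

rng = np.random.default_rng(7)

def epoch_greedy(contexts, pull, n_arms, horizon):
    """Epoch-Greedy sketch. contexts: (horizon, d) array; pull(t, arm) plays
    an arm at step t and returns its reward (hypothetical callback)."""
    X, A, R = [], [], []                        # exploration dataset
    theta = np.zeros((n_arms, contexts.shape[1]))
    t, epoch = 0, 1
    while t < horizon:
        # explore: one uniform pull, record the labeled example
        a = rng.integers(n_arms)
        X.append(contexts[t]); A.append(a); R.append(pull(t, a))
        t += 1
        # refit a per-arm least-squares reward model on exploration data
        for k in range(n_arms):
            idx = [i for i, ai in enumerate(A) if ai == k]
            if idx:
                Xk = np.array([X[i] for i in idx])
                rk = np.array([R[i] for i in idx])
                theta[k] = np.linalg.lstsq(Xk, rk, rcond=None)[0]
        # exploit greedily; the run length grows with the epoch index
        for _ in range(min(epoch, horizon - t)):
            pull(t, int(np.argmax(theta @ contexts[t])))
            t += 1
        epoch += 1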
Playing Atari with Deep Reinforcement Learning
TLDR: This work presents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning; it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
On-line learning for very large data sets
TLDR: This paper reconsiders convergence speed in terms of how fast a learning algorithm optimizes the test error, and shows the superiority of well-designed stochastic learning algorithms.
Regret bounds for sleeping experts and bandits
TLDR: This work compares algorithms against the payoff obtained by the best ordering of the actions, a natural benchmark for this type of problem, and gives algorithms achieving information-theoretically optimal regret bounds with respect to the best-ordering benchmark.