Corpus ID: 194725539

Prise de décision contextuelle en bande organisée : Quand les bandits font un brainstorming (Contextual decision-making as an organised gang: when bandits brainstorm)

Authors: Robin Allesiardo, Raphaël Féraud, Djallel Bouneffouf
In this article, we propose a new contextual bandit algorithm, NeuralBandit, which makes no stationarity assumption about either the contexts or the rewards. The proposed algorithm uses several multilayer perceptrons, each learning the probability that an action, given the context, yields a reward. To tune the parameters of these multilayer perceptrons online, and in particular their architectures, we propose to use a multi-expert approach. …
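The core construction in the abstract, one reward-probability model per action with greedy selection over the models' predictions, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the single hidden layer, tanh/sigmoid activations, SGD step size, and epsilon-greedy selection are all assumptions, and the multi-expert tuning of the architectures is omitted.

```python
import numpy as np

class ActionMLP:
    """One-hidden-layer MLP estimating P(reward | context) for a single action.
    Layer size, learning rate and activations are illustrative assumptions."""
    def __init__(self, dim, hidden=16, lr=0.05, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.1, (hidden, dim))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, hidden)
        self.b2 = 0.0
        self.lr = lr

    def predict(self, x):
        h = np.tanh(self.W1 @ x + self.b1)
        return 1.0 / (1.0 + np.exp(-(self.w2 @ h + self.b2)))

    def update(self, x, reward):
        # One SGD step on the cross-entropy between prediction and reward.
        h = np.tanh(self.W1 @ x + self.b1)
        p = 1.0 / (1.0 + np.exp(-(self.w2 @ h + self.b2)))
        g = p - reward                      # d(loss)/d(logit)
        gh = g * self.w2 * (1.0 - h ** 2)   # backprop through tanh
        self.w2 -= self.lr * g * h
        self.b2 -= self.lr * g
        self.W1 -= self.lr * np.outer(gh, x)
        self.b1 -= self.lr * gh

class NeuralBanditSketch:
    """Epsilon-greedy contextual bandit over one MLP per action."""
    def __init__(self, n_actions, dim, epsilon=0.1, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.models = [ActionMLP(dim, rng=self.rng) for _ in range(n_actions)]
        self.epsilon = epsilon

    def choose(self, x):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.models)))
        return int(np.argmax([m.predict(x) for m in self.models]))

    def learn(self, x, action, reward):
        # Only the model of the played action sees the (bandit) feedback.
        self.models[action].update(x, reward)
```

Because each model is retrained on every new observation, nothing in the sketch assumes that contexts or rewards are drawn from a fixed distribution, which is the non-stationary setting the abstract targets.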

Figures from this paper

État de l'art sur l'application des bandits multi-bras (A survey of applications of multi-armed bandits)
A comprehensive review of the main recent developments across multiple real-world applications of bandits, identifying important current trends and providing new perspectives on the future of this rapidly growing field.
Finite-time analysis of the multi-armed bandit problem with known trend
  • Djallel Bouneffouf
  • Computer Science
    2016 IEEE Congress on Evolutionary Computation (CEC)
  • 2016
By adapting standard multi-armed bandit algorithms, this work studies the regret upper bounds of three algorithms: the first two assume a stochastic model, and the third is based on a Bayesian approach.


Efficient bandit algorithms for online multiclass prediction
The Banditron can learn in a multiclass classification setting from "bandit" feedback, which only reveals whether the prediction made by the algorithm was correct (but does not necessarily reveal the true label).
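The Banditron update can be sketched in a few lines: a multiclass perceptron that explores uniformly with probability gamma and applies an importance-weighted, unbiased update built only from the correct/incorrect signal. The value of gamma and the synthetic evaluation are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def banditron(contexts, labels, n_classes, gamma=0.2, rng=None):
    """Banditron sketch: multiclass perceptron under bandit feedback."""
    rng = rng or np.random.default_rng(0)
    W = np.zeros((n_classes, contexts.shape[1]))
    correct = 0
    for x, y in zip(contexts, labels):
        yhat = int(np.argmax(W @ x))
        # Mostly exploit yhat, explore uniformly with probability gamma.
        P = np.full(n_classes, gamma / n_classes)
        P[yhat] += 1.0 - gamma
        ytilde = int(rng.choice(n_classes, p=P))
        feedback = (ytilde == y)           # only "correct or not" is observed
        correct += feedback
        # Importance-weighted update: unbiased estimate of the full-information
        # perceptron update, using only the bandit feedback.
        U = np.zeros_like(W)
        if feedback:
            U[ytilde] += x / P[ytilde]
        U[yhat] -= x
        W += U
    return W, correct / len(labels)
```

The division by P[ytilde] makes the expected update equal to the full-information perceptron update, which is the key trick that lets the algorithm learn without ever seeing the true label.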
A contextual-bandit approach to personalized news article recommendation
This work models personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks.
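The algorithm proposed in that work, LinUCB, maintains a ridge regression per action and adds an upper-confidence bonus to each prediction. A minimal sketch of the disjoint-model variant follows; the value of alpha, the explicit matrix inverse, and the synthetic reward function are simplifying assumptions.

```python
import numpy as np

def linucb(contexts, reward_fn, n_actions, alpha=1.0):
    """LinUCB sketch (disjoint models): ridge regression + confidence bonus."""
    d = contexts.shape[1]
    A = [np.eye(d) for _ in range(n_actions)]   # regularized Gram matrices
    b = [np.zeros(d) for _ in range(n_actions)]
    total = 0.0
    for x in contexts:
        scores = []
        for a in range(n_actions):
            Ainv = np.linalg.inv(A[a])
            theta = Ainv @ b[a]                  # ridge estimate for action a
            bonus = alpha * np.sqrt(x @ Ainv @ x)  # optimism in uncertainty
            scores.append(theta @ x + bonus)
        a = int(np.argmax(scores))
        r = reward_fn(x, a)                      # observe click / reward
        A[a] += np.outer(x, x)
        b[a] += r * x
        total += r
    return total
```

In the news setting of the paper, x would hold user and article features and the reward would be a click indicator; here both are abstracted behind reward_fn.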
Thompson Sampling for Contextual Bandits with Linear Payoffs
A generalization of the Thompson Sampling algorithm is designed and analyzed for the stochastic contextual multi-armed bandit problem with linear payoff functions, in the setting where contexts are provided by an adaptive adversary.
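Thompson Sampling with linear payoffs can be sketched as: sample a parameter vector from a Gaussian posterior, then play the action whose sampled score is highest. Note the paper analyses a single shared parameter with per-arm feature vectors; the independent per-action posterior below is a common simplification, and the exploration scale v is an assumption.

```python
import numpy as np

def lin_ts(contexts, reward_fn, n_actions, v=0.5, rng=None):
    """Thompson Sampling sketch for linear payoffs (per-action posteriors)."""
    rng = rng or np.random.default_rng(0)
    d = contexts.shape[1]
    B = [np.eye(d) for _ in range(n_actions)]   # posterior precision matrices
    f = [np.zeros(d) for _ in range(n_actions)]
    total = 0.0
    for x in contexts:
        scores = []
        for a in range(n_actions):
            Binv = np.linalg.inv(B[a])
            mu = Binv @ f[a]                    # posterior mean
            # Randomized exploration: sample theta instead of adding a bonus.
            theta = rng.multivariate_normal(mu, v * v * Binv)
            scores.append(theta @ x)
        a = int(np.argmax(scores))
        r = reward_fn(x, a)
        B[a] += np.outer(x, x)
        f[a] += r * x
        total += r
    return total
```

The posterior covariance shrinks as an action accumulates observations, so the sampled scores concentrate on the ridge estimate and exploration fades automatically.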
Finite-time Analysis of the Multiarmed Bandit Problem
This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
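The simple, efficient policy analysed there is UCB1: play each arm once, then always play the arm maximising its empirical mean plus sqrt(2 ln t / n). A minimal sketch, with Bernoulli rewards assumed for the simulation:

```python
import numpy as np

def ucb1(arm_means, horizon, rng=None):
    """UCB1 sketch: empirical mean plus a sqrt(2 ln t / n) confidence term."""
    rng = rng or np.random.default_rng(0)
    k = len(arm_means)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    total = 0.0
    for t in range(horizon):
        if t < k:
            a = t                                   # play each arm once first
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        r = float(rng.random() < arm_means[a])      # Bernoulli reward (assumed)
        counts[a] += 1
        sums[a] += r
        total += r
    return total, counts
```

The confidence term shrinks as an arm is played, so suboptimal arms are played only O(ln t) times, which is the logarithmic regret the paper proves.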
Regret bounds for sleeping experts and bandits
This work compares algorithms against the payoff obtained by the best ordering of the actions, which is a natural benchmark for this type of problem and gives algorithms achieving information-theoretically optimal regret bounds with respect to the best-ordering benchmark.
The Nonstochastic Multiarmed Bandit Problem
A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs.
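The algorithm from that line of work is EXP3: exponential weights over importance-weighted reward estimates, requiring no stochastic assumption on the payoff sequence. A sketch, where gamma and the per-step weight renormalisation are implementation choices:

```python
import numpy as np

def exp3(reward_fn, n_arms, horizon, gamma=0.1, rng=None):
    """EXP3 sketch: exponential weights with importance-weighted estimates."""
    rng = rng or np.random.default_rng(0)
    w = np.ones(n_arms)
    total = 0.0
    for t in range(horizon):
        # Mix the weight distribution with uniform exploration.
        p = (1.0 - gamma) * w / w.sum() + gamma / n_arms
        a = int(rng.choice(n_arms, p=p))
        r = reward_fn(t, a)                 # adversary may pick any r in [0, 1]
        total += r
        xhat = r / p[a]                     # unbiased estimate of arm a's reward
        w[a] *= np.exp(gamma * xhat / n_arms)
        w /= w.max()                        # renormalise to avoid overflow
    return total
```

Dividing the observed reward by the probability of having played the arm keeps the estimate unbiased, so the weights track every arm's cumulative payoff even though only one arm is observed per round.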
PAC-Bayesian Analysis of Contextual Bandits
The analysis allows providing the algorithm with a large amount of side information, lets the algorithm decide which side information is relevant for the task, and penalizes the algorithm only for the side information it actually uses.
A stochastic bandit algorithm for scratch games
An upper confidence bound algorithm adapted to this setting is proposed, and the bound on the expected number of plays of a suboptimal arm is shown to be lower than that of the UCB1 policy.
Feature Selection as a One-Player Game
This paper formalizes Feature Selection as a Reinforcement Learning problem, leading to a provably optimal though intractable selection policy, and presents an approximation thereof, based on a one-player game approach and relying on the Monte-Carlo tree search UCT (Upper Confidence Tree) proposed by Kocsis and Szepesvári (2006).