Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence
A performance bound is proved for the two versions of the UGapE algorithm, showing that the two problems are characterized by the same notion of complexity.
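The gap-based index at the heart of a UGapE-style selection rule can be sketched as follows; the exploration rate `beta` and the constants are illustrative choices in the fixed-confidence flavor, not the paper's exact tuning.

```python
import numpy as np

def ugape_pick(means, counts, t, delta=0.1):
    """One arm-selection step of a UGapE-style gap index (sketch only).

    means, counts: empirical means and pull counts per arm.
    The confidence width beta is a simple illustrative choice.
    """
    K = len(means)
    beta = np.sqrt(np.log(K * t**2 / delta) / (2 * np.maximum(counts, 1)))
    U, L = means + beta, means - beta
    # Gap index: how far arm k's lower bound sits below the best other upper bound.
    B = np.array([max(U[j] for j in range(K) if j != k) - L[k] for k in range(K)])
    best = int(np.argmin(B))  # current candidate for the best arm
    # Strongest challenger: the other arm with the highest upper bound.
    challenger = int(np.argmax(np.where(np.arange(K) == best, -np.inf, U)))
    # Pull whichever of the two is more uncertain.
    return best if beta[best] >= beta[challenger] else challenger
```

With equal counts the rule pulls the candidate best arm, since all confidence widths tie.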
Best-Arm Identification in Linear Bandits
The importance of exploiting the global linear structure to improve the estimate of the reward of near-optimal arms is shown and the connection to the G-optimality criterion used in optimal experimental design is pointed out.
Linear Thompson Sampling Revisited
Thompson sampling can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach.
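The oversampled-posterior view above can be sketched in a toy linear bandit; the arm features, noise level, and the sqrt(d) covariance scaling below are illustrative assumptions, not the paper's exact constants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear bandit: arm features and unknown parameter (toy values).
arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
theta_true = np.array([0.5, 1.0])

d = arms.shape[1]
A = np.eye(d)    # regularized design matrix
b = np.zeros(d)  # running sum of feature * reward

for t in range(200):
    theta_hat = np.linalg.solve(A, b)
    # Sample a perturbed parameter; inflating the covariance by sqrt(d) is what
    # gives a fixed probability of being optimistic in the randomized view.
    theta_tilde = rng.multivariate_normal(theta_hat, np.sqrt(d) * np.linalg.inv(A))
    i = int(np.argmax(arms @ theta_tilde))
    x = arms[i]
    reward = x @ theta_true + 0.1 * rng.standard_normal()
    A += np.outer(x, x)
    b += reward * x

theta_final = np.linalg.solve(A, b)
```

The randomization over `theta_tilde` replaces the explicit confidence ellipsoid a UCB-like method would maximize over.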
Risk-Aversion in Multi-armed Bandits
This paper introduces a novel setting based on the principle of risk-aversion, in which the objective is to compete against the arm with the best risk-return trade-off, which proves to be more difficult than the standard multi-armed bandit setting.
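A minimal sketch of a mean-variance criterion capturing this risk-return trade-off (the `rho` weighting and the toy arms are illustrative; this is not the paper's full algorithm):

```python
import numpy as np

def mean_variance(samples, rho=1.0):
    """Empirical mean-variance objective: hat_sigma^2 - rho * hat_mu.

    Lower is better; rho trades expected return against risk.
    """
    s = np.asarray(samples, dtype=float)
    return s.var() - rho * s.mean()

# A higher-mean but high-variance arm can lose to a steadier one.
risky = [0.0, 2.0, 0.0, 2.0]    # mean 1.0, variance 1.0
steady = [0.8, 0.8, 0.8, 0.8]   # mean 0.8, variance 0.0

best = min([risky, steady], key=mean_variance)
```

Under this criterion the steady arm wins despite its lower mean, which is exactly why the risk-averse problem differs from standard regret minimization.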
Learning Near Optimal Policies with Low Inherent Bellman Error
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of LSTD.
Finite-sample analysis of least-squares policy iteration
A performance bound is reported for the widely used least-squares policy iteration (LSPI) algorithm based on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function.
Analysis of a Classification-based Policy Iteration Algorithm
The analysis reveals a tradeoff between the estimation and approximation errors in this classification-based policy iteration setting and studies the consistency of the method when there exists a sequence of policy spaces with increasing capacity.
Online Stochastic Optimization under Correlated Bandit Feedback
The high-confidence tree (HCT) algorithm is introduced, a novel anytime X-armed bandit algorithm, and regret bounds matching the performance of state-of-the-art algorithms in terms of the dependency on the number of steps and the near-optimality dimension are derived.
Upper-Confidence-Bound Algorithms for Active Learning in Multi-Armed Bandits
This paper describes two strategies based on pulling the arms proportionally to an upper bound on their variance, derives regret bounds for these strategies, and shows that the performance of these allocation strategies depends not only on the variances of the arms but also on the full shape of their distributions.
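The variance-proportional allocation idea can be sketched as follows; the confidence-bonus constant and the toy Gaussian arms are illustrative assumptions, not the paper's exact strategies.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical arms with different noise levels (toy values).
stds = np.array([0.1, 0.5, 1.0])
K = len(stds)
samples = [list(rng.normal(0.0, s, size=2)) for s in stds]  # 2 initial pulls each

budget = 300
for t in range(budget - 2 * K):
    index = []
    for k in range(K):
        x = np.array(samples[k])
        n = len(x)
        # Upper confidence bound on the arm's variance; the constant 5.0 is
        # an illustrative choice, not a tuned exploration rate.
        bonus = 5.0 * np.sqrt(np.log(budget) / n)
        index.append((x.var(ddof=1) + bonus) / n)
    k = int(np.argmax(index))
    samples[k].append(rng.normal(0.0, stds[k]))

counts = np.array([len(s) for s in samples])
```

Dividing the variance UCB by the pull count steers the budget toward noisier arms, which need more samples to reach the same estimation accuracy.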
LSTD with Random Projections
This work provides a thorough theoretical analysis of the least-squares temporal difference learning algorithm when a space of low dimension is generated with a random projection from a high-dimensional space and derives performance bounds for the resulting algorithm.
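A minimal sketch of LSTD run on randomly projected features, assuming a toy fixed-policy chain and illustrative dimensions (the projection scaling and MDP are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy fixed-policy chain MDP with high-dimensional features (illustrative sizes).
n_states, D, d, gamma = 50, 200, 10, 0.9
phi_hi = rng.standard_normal((n_states, D))      # high-dimensional features
proj = rng.standard_normal((D, d)) / np.sqrt(d)  # random Gaussian projection
phi = phi_hi @ proj                              # low-dimensional features

# Deterministic policy: move right with wraparound; reward 1 at the last state.
nxt = (np.arange(n_states) + 1) % n_states
rewards = np.zeros(n_states)
rewards[-1] = 1.0

# LSTD in the projected space: solve A w = b with
# A = Phi^T (Phi - gamma * Phi'), b = Phi^T r  (on-policy, uniform weighting).
A = phi.T @ (phi - gamma * phi[nxt])
b = phi.T @ rewards
w = np.linalg.solve(A, b)
values = phi @ w
```

Everything downstream of the projection is ordinary LSTD; the analysis in the paper quantifies how much value-function accuracy the projection step sacrifices.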