Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
TLDR
This paper introduces a new framework for theoretically measuring the performance of reinforcement learning algorithms, called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework.
The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits
TLDR
We show that no algorithm based on optimism or Thompson sampling will ever achieve the optimal rate, and indeed, can be arbitrarily far from optimal, even in very simple cases.
PAC Bounds for Discounted MDPs
TLDR
We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes with unknown transitions, but known rewards.
Causal Bandits: Learning Good Interventions via Causal Inference
TLDR
We propose a new algorithm that exploits the causal feedback and prove a bound on its simple regret that is strictly better (in all quantities) than algorithms that do not use the additional causal information.
Conservative Bandits
TLDR
We study a novel multi-armed bandit problem that models the challenge faced by a company wishing to explore new strategies to maximise revenue whilst simultaneously maintaining its revenue above a fixed baseline, uniformly over time.
Behaviour Suite for Reinforcement Learning
TLDR
This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short, a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives.
Learning with Good Feature Representations in Bandits and in RL with a Generative Model
TLDR
We use the Kiefer-Wolfowitz theorem to prove that even if a learner is given linear features in $\mathbb{R}^d$ that approximate the rewards in a bandit with a uniform error of $\epsilon$, then searching for an action that is optimal up to $O(\epsilon)$ requires examining essentially all actions.
Model Selection in Contextual Stochastic Bandit Problems
TLDR
We propose a novel and generic smoothing transformation for stochastic bandit algorithms that permits us to obtain $O(\sqrt{T})$ regret guarantees for a wide class of base algorithms when run alongside our master algorithm.
TopRank: A practical algorithm for online stochastic ranking
TLDR
Online learning to rank is a sequential decision-making problem where in each round the learning agent chooses a list of items and receives feedback in the form of clicks.
Adaptive Exploration in Linear Contextual Bandit
TLDR
We propose an optimisation-based algorithm that is asymptotically optimal, computationally efficient and empirically well-behaved in finite-time regimes.