This paper introduces a new framework for theoretically measuring the performance of such algorithms, called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework.

We show that no algorithm based on optimism or Thompson sampling will ever achieve the optimal rate, and indeed, can be arbitrarily far from optimal, even in very simple cases.

We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes with unknown transitions, but known rewards.

We propose a new algorithm that exploits the causal feedback and prove a bound on its simple regret that is strictly better (in all quantities) than algorithms that do not use the additional causal information.

We study a novel multi-armed bandit problem that models the challenge faced by a company wishing to explore new strategies to maximize revenue whilst simultaneously maintaining its revenue above a fixed baseline, uniformly over time.

This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short, a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives.

We use the Kiefer-Wolfowitz theorem to prove that even if a learner is given linear features in $\mathbb R^d$ that approximate the rewards in a bandit with a uniform error of $\epsilon$, then searching for an action that is optimal up to $O(\epsilon)$ requires examining essentially all actions.

We propose a novel and generic smoothing transformation for stochastic bandit algorithms that permits us to obtain $O(\sqrt{T})$ regret guarantees for a wide class of base algorithms when working along with our master.

Online learning to rank is a sequential decision-making problem where in each round the learning agent chooses a list of items and receives feedback in the form of clicks.

We propose an optimisation-based algorithm that is asymptotically optimal, computationally efficient and empirically well-behaved in finite-time regimes.