# Using upper confidence bounds for online learning

@article{Auer2000UsingUC,
title={Using upper confidence bounds for online learning},
author={Peter Auer},
journal={Proceedings 41st Annual Symposium on Foundations of Computer Science},
year={2000},
pages={270--279}
}
• P. Auer
• Published 2000
• Computer Science
• Proceedings 41st Annual Symposium on Foundations of Computer Science
We show how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation/exploration trade-off. Our technique for designing and analyzing algorithms for such situations is very general and can be applied when an algorithm has to make exploitation-versus-exploration decisions based on uncertain information provided by a random process. We consider two models with such an exploitation/exploration trade-off. For the…
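To make the idea of acting on upper confidence bounds concrete, here is a minimal sketch. The abstract above is truncated and does not spell out the paper's two models or its algorithms, so this is not the paper's method: it uses the widely known UCB1 index rule for the plain stochastic multi-armed bandit as a stand-in. The function name `ucb1`, the Bernoulli reward model, and the parameters `arm_means`, `horizon`, and `seed` are all assumptions made for illustration.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Simulate the UCB1 index rule on Bernoulli arms (illustrative stand-in).

    arm_means: hypothetical success probability of each arm (hidden from the learner).
    horizon:   number of rounds to play.
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # pulls per arm
    sums = [0.0] * n_arms   # total observed reward per arm

    def index(i, t):
        # Empirical mean plus a confidence width that shrinks as arm i is sampled.
        return sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: pull every arm once
        else:
            # Optimism under uncertainty: play the arm with the largest upper bound.
            arm = max(range(n_arms), key=lambda i: index(i, t))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    return counts


if __name__ == "__main__":
    print(ucb1([0.1, 0.9], horizon=2000, seed=1))
```

An arm is chosen either because its empirical mean is high (exploitation) or because it has been sampled rarely and its confidence width is still large (exploration), so a single index handles both sides of the trade-off.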
41 Citations

Using Confidence Bounds for Exploitation-Exploration Trade-offs
• P. Auer
• Mathematics, Computer Science
• J. Mach. Learn. Res.
• 2002
It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off, improving the regret from $O(T^{3/4})$ to $O(T^{1/2})$.
Improvements to Online Learning Algorithms with Applications to Binary Search Trees
In this work we are motivated by the question: “How to automatically adapt to, or learn, structure in the past inputs of an algorithm?” This question might arise from the need to decrease the amount…
Computational Learning Theory
• Computer Science
• Lecture Notes in Computer Science
• 2002
It is shown that the lower bound proof for the sample complexity of agnostic learning with respect to squared loss has a gap, and that one can obtain “fast” sample complexity bounds for nonconvex F for “most” target conditional expectations.
Strategic Exploration in Reinforcement Learning - New Algorithms and Learning Guarantees
This thesis provides new algorithms and theory that enable good performance with respect to existing theoretical frameworks for evaluating RL algorithms (specifically, probably approximately correct) and introduces new, stronger evaluation criteria that may be of particular interest as RL is applied to more real-world problems.
Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
• Computer Science, Mathematics
• NIPS
• 2017
A new framework for theoretically measuring the performance of reinforcement learning algorithms called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework, and simultaneously achieves optimal regret and PAC guarantees except for a factor of the horizon.
Adversarial Dueling Bandits
• Computer Science, Mathematics
• ICML
• 2021
The problem of regret minimization in Adversarial Dueling Bandits is introduced, and an algorithm is presented whose $T$-round regret compared to the Borda-winner from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching $\Omega(K/\Delta^2)$ lower bound.
A Study on Overfitting in Deep Reinforcement Learning
• Computer Science, Mathematics
• ArXiv
• 2018
This paper conducts a systematic study of standard RL agents and finds that they could overfit in various ways and calls for more principled and careful evaluation protocols in RL.
CoinDICE: Off-Policy Confidence Interval Estimation
• Computer Science, Mathematics
• NeurIPS
• 2020
This work proposes CoinDICE, a novel and efficient algorithm for computing confidence intervals in high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, and proves the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes.
Smarter Sampling in Model-Based Bayesian Reinforcement Learning
• Computer Science
• ECML/PKDD
• 2010
This work proposes a principled method for determining the number of models to sample, based on the parameters of the posterior distribution over models, and establishes bounds on the error in the value function between a random model sample and the mean model from the posterior distributions.
Reinforcement Learning with Immediate Rewards and Linear Hypotheses
• Mathematics, Computer Science
• Algorithmica
• 2003
For two cases, one in which a continuous-valued reward is given by applying the unknown linear function, and another in which the probability of receiving the larger of binary-valued rewards is obtained, lower bounds are provided that show that the rate of convergence is nearly optimal.

#### References

SAMPLE MEAN BASED INDEX POLICIES WITH O(log n) REGRET FOR THE MULTI-ARMED BANDIT PROBLEM
We consider a non-Bayesian infinite horizon version of the multi-armed bandit problem with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on…
Associative Reinforcement Learning using Linear Probabilistic Concepts
• Computer Science
• ICML
• 1999
The analysis shows that the worst-case (expected) regret for the methods is almost optimal: the upper bounds grow with the number $m$ of trials and the number $n$ of alternatives like $O(m^{3/4} n^{1/2})$ and $O(m^{4/5} n^{2/5})$, and the lower bound is…
Gambling in a rigged casino: The adversarial multi-armed bandit problem
• Computer Science, Mathematics
• Proceedings of IEEE 36th Annual Foundations of Computer Science
• 1995
A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs is given.
Tracking the Best Expert
• Mathematics, Computer Science
• ICML
• 1995
The generalization allows the sequence to be partitioned into segments, and the goal is to bound the additional loss of the algorithm over the sum of the losses of the best experts of each segment. This models situations in which the examples change and different experts are best for certain segments of the sequence of examples.
SOME ASPECTS OF THE SEQUENTIAL DESIGN OF EXPERIMENTS
Until recently, statistical theory has been restricted to the design and analysis of sampling experiments in which the size and composition of the samples are completely determined…
Probability Inequalities for sums of Bounded Random Variables
Upper bounds are derived for the probability that the sum S of n independent random variables exceeds its mean ES by a positive number nt. It is assumed that the range of each summand of S…
Tracking the best disjunction
• Mathematics, Computer Science
• Proceedings of IEEE 36th Annual Foundations of Computer Science
• 1995
An algorithm that predicts nearly as well as the best disjunction schedule for an arbitrary sequence of examples and combines the tracking capability with existing applications of Winnow to enhance these applications to the shifting case.