# Using upper confidence bounds for online learning

@inproceedings{Auer2000UsingUC, title={Using upper confidence bounds for online learning}, author={Peter Auer}, booktitle={Proceedings 41st Annual Symposium on Foundations of Computer Science}, year={2000}, pages={270-279} }

We show how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation/exploration trade-off. Our technique for designing and analyzing algorithms for such situations is very general and can be applied when an algorithm has to make exploitation-versus-exploration decisions based on uncertain information provided by a random process. We consider two models with such an exploitation/exploration trade-off. For the…
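The abstract describes index policies that add a confidence radius to each arm's empirical mean. A minimal sketch of this idea, in the style of the UCB1 rule (pull the arm maximizing mean plus sqrt(2 ln t / n_i)); the function names and Bernoulli reward model here are illustrative, not from the paper:

```python
import math
import random

def ucb1(reward_fns, horizon, seed=0):
    """Play a K-armed bandit for `horizon` rounds with a UCB1-style
    index: pull the arm maximizing mean_i + sqrt(2 * ln(t) / n_i)."""
    rng = random.Random(seed)
    k = len(reward_fns)
    counts = [0] * k        # n_i: number of times arm i was pulled
    means = [0.0] * k       # empirical mean reward of arm i
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1     # initialization: pull each arm once
        else:
            arm = max(range(k),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = reward_fns[arm](rng)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # running mean update
        total += r
    return total, counts

# Two Bernoulli arms with success probabilities 0.3 and 0.7;
# the confidence bound steers almost all pulls to the better arm.
total, counts = ucb1([lambda rng: float(rng.random() < 0.3),
                      lambda rng: float(rng.random() < 0.7)], horizon=2000)
```

The confidence radius shrinks like $\sqrt{\ln t / n_i}$, so under-sampled arms keep getting explored while well-sampled suboptimal arms are abandoned, which is exactly the exploitation/exploration mechanism the abstract refers to.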


#### 41 Citations

Using Confidence Bounds for Exploitation-Exploration Trade-offs

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2002

It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off, improving the regret from $O(T^{3/4})$ to $O(T^{1/2})$.

Improvements to Online Learning Algorithms with Applications to Binary Search Trees

- Computer Science
- 2008

In this work we are motivated by the question: "How to automatically adapt to, or learn, structure in the past inputs of an algorithm?" This question might arise from the need to decrease the amount…

Computational Learning Theory

- Computer Science
- Lecture Notes in Computer Science
- 2002

It is shown that the lower bound proof for the sample complexity of agnostic learning with respect to squared loss has a gap, and it is shown one can obtain "fast" sample complexity bounds for nonconvex F for "most" target conditional expectations.

Strategic Exploration in Reinforcement Learning - New Algorithms and Learning Guarantees

- Computer Science
- 2020

This thesis provides new algorithms and theory that enable good performance with respect to existing theoretical frameworks for evaluating RL algorithms (specifically, probably approximately correct), and introduces new, stronger evaluation criteria that may be of particular interest as RL is applied to more real-world problems.

Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

- Computer Science, Mathematics
- NIPS
- 2017

A new framework for theoretically measuring the performance of reinforcement learning algorithms, called Uniform-PAC, is introduced; it is a strengthening of the classical Probably Approximately Correct (PAC) framework, and simultaneously achieves optimal regret and PAC guarantees except for a factor of the horizon.

Adversarial Dueling Bandits

- Computer Science, Mathematics
- ICML
- 2021

The problem of regret minimization in Adversarial Dueling Bandits is introduced, and an algorithm whose $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching $\Omega(K/\Delta^2)$ lower bound.

A Study on Overfitting in Deep Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2018

This paper conducts a systematic study of standard RL agents, finds that they can overfit in various ways, and calls for more principled and careful evaluation protocols in RL.

CoinDICE: Off-Policy Confidence Interval Estimation

- Computer Science, Mathematics
- NeurIPS
- 2020

This work proposes CoinDICE, a novel and efficient algorithm for computing confidence intervals in high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, and proves the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes.

Smarter Sampling in Model-Based Bayesian Reinforcement Learning

- Computer Science
- ECML/PKDD
- 2010

This work proposes a principled method for determining the number of models to sample, based on the parameters of the posterior distribution over models, and establishes bounds on the error in the value function between a random model sample and the mean model from the posterior distribution.

Reinforcement Learning with Immediate Rewards and Linear Hypotheses

- Mathematics, Computer Science
- Algorithmica
- 2003

For two cases, one in which a continuous-valued reward is given by applying the unknown linear function, and another in which the probability of receiving the larger of binary-valued rewards is so obtained, lower bounds are provided showing that the rate of convergence is nearly optimal.

#### References


Sample Mean Based Index Policies with O(log n) Regret for the Multi-Armed Bandit Problem

- Mathematics
- 1995

We consider a non-Bayesian infinite horizon version of the multi-armed bandit problem with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on…

Associative Reinforcement Learning using Linear Probabilistic Concepts

- Computer Science
- ICML
- 1999

The analysis shows that the worst-case (expected) regret for the methods is almost optimal: the upper bounds grow with the number $m$ of trials and the number $n$ of alternatives like $O(m^{3/4} n^{1/2})$ and $O(m^{4/5} n^{2/5})$, and the lower bound is…

Gambling in a rigged casino: The adversarial multi-armed bandit problem

- Computer Science, Mathematics
- Proceedings of IEEE 36th Annual Foundations of Computer Science
- 1995

A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs is given.

Tracking the Best Expert

- Mathematics, Computer Science
- ICML
- 1995

The generalization allows the sequence to be partitioned into segments, and the goal is to bound the additional loss of the algorithm over the sum of the losses of the best experts of each segment; this models situations in which the examples change and different experts are best for certain segments of the sequence of examples.

SOME ASPECTS OF THE SEQUENTIAL DESIGN OF EXPERIMENTS

- 2007

1. Introduction. Until recently, statistical theory has been restricted to the design and analysis of sampling experiments in which the size and composition of the samples are completely determined…

Probability Inequalities for Sums of Bounded Random Variables

- Mathematics
- 1963

Abstract Upper bounds are derived for the probability that the sum S of n independent random variables exceeds its mean ES by a positive number nt. It is assumed that the range of each summand of S…
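The bound summarized above is Hoeffding's inequality; in the special case where each independent summand takes values in $[0,1]$, it reads:

$$
\Pr\!\left[S - \mathbb{E}S \ge nt\right] \le e^{-2nt^{2}}, \qquad S = \sum_{i=1}^{n} X_i,\; X_i \in [0,1].
$$

This is the concentration inequality from which the confidence bounds in Auer's analysis are built.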

Tracking the best disjunction

- Mathematics, Computer Science
- Proceedings of IEEE 36th Annual Foundations of Computer Science
- 1995

An algorithm is given that predicts nearly as well as the best disjunction schedule for an arbitrary sequence of examples; its tracking capability is combined with existing applications of Winnow to extend those applications to the shifting case.