Corpus ID: 55856

High-Probability Regret Bounds for Bandit Online Linear Optimization

@inproceedings{Bartlett2008HighProbabilityRB,
  title={High-Probability Regret Bounds for Bandit Online Linear Optimization},
  author={Peter L. Bartlett and Varsha Dani and Thomas P. Hayes and Sham M. Kakade and Alexander Rakhlin and Ambuj Tewari},
  booktitle={COLT},
  year={2008}
}
We present a modification of the algorithm of Dani et al. [8] for the online linear optimization problem in the bandit setting, which with high probability has regret at most O*(√T) against an adaptive adversary. This improves on the previous algorithm [8], whose regret is bounded in expectation against an oblivious adversary. We obtain the same dependence on the dimension (n^{3/2}) as that exhibited by Dani et al. The results of this paper rest firmly on those of [8] and the remarkable…
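To make the guarantee in the abstract concrete, here is a short LaTeX sketch of the regret notion and the claimed bound; the notation (decision set K in R^n, loss vectors l_t, plays x_t) is my own shorthand rather than quoted from the paper.

% Regret against the best fixed decision in hindsight, with only the scalar
% loss \ell_t^{\top} x_t revealed each round (bandit feedback):
\[
  \mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t^{\top} x_t \;-\; \min_{x \in K} \sum_{t=1}^{T} \ell_t^{\top} x .
\]
% The paper's guarantee (high probability, adaptive adversary): for any \delta \in (0,1),
% with probability at least 1-\delta,
\[
  \mathrm{Regret}_T \;\le\; O^{*}\!\bigl( n^{3/2} \sqrt{T} \bigr),
\]
% where O^{*} suppresses logarithmic factors (including the dependence on 1/\delta)
% and n is the dimension of the decision set K.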
The Price of Bandit Information for Online Optimization
This paper presents an algorithm which achieves O*(n^{3/2} √T) regret and presents lower bounds showing that this gap is at least √n, which is conjectured to be the correct order.
Explore no more: Improved high-probability regret bounds for non-stochastic bandits
This work addresses the problem of regret minimization in non-stochastic multi-armed bandit problems, focusing on performance guarantees that hold with high probability, using a simple and intuitive loss-estimation strategy called Implicit exploration (IX) that allows a remarkably clean analysis.
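As a rough illustration of the IX idea described above, here is a minimal Python sketch of the implicit-exploration loss estimate; the function name, signature, and the way it would be combined with an exponential-weights update are illustrative assumptions, not code from the paper.

import numpy as np

def ix_loss_estimate(observed_loss, chosen_arm, probs, gamma):
    # Implicit eXploration (IX): importance-weight the observed loss by
    # (probability + gamma) rather than the probability alone. This biases
    # the estimate slightly downward but keeps its variance under control,
    # which is what enables high-probability (not just expected) regret bounds.
    # probs is assumed to be a NumPy array of arm-selection probabilities.
    estimate = np.zeros_like(probs)
    estimate[chosen_arm] = observed_loss / (probs[chosen_arm] + gamma)
    return estimate

Feeding such estimates into a standard exponential-weights update, in place of the usual unbiased estimates, is (as I understand it) the core of the IX analysis.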
An Optimal Algorithm for Linear Bandits
Interestingly, these results show that bandit linear optimization with expert advice in d dimensions is no more difficult (in terms of the achievable regret) than the online d-armed bandit problem with expert advice (where EXP4 is optimal).
Explore no more: Simple and tight high-probability bounds for non-stochastic bandits
This paper shows that it is possible to prove high-probability regret bounds without this undesirable exploration component, and relies on a simple and intuitive loss-estimation strategy called Implicit eXploration (IX) that allows a very clean analysis that leads to the best known constant factors in the bounds.
Efficient Bandit Convex Optimization: Beyond Linear Losses
The algorithm is a bandit version of the classical regularized Newton's method, estimating gradients and Hessians of the loss functions from single-point feedback; it is the first efficiently implementable algorithm for bandit convex optimization with quadratic losses that achieves optimal regret guarantees.
Beating the adaptive bandit with high probability
We provide a principled way of proving Õ(√T) high-probability guarantees for partial-information (bandit) problems over arbitrary convex decision sets. First, we prove a regret guarantee for the…
Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits
It is shown that follow-the-regularized-leader with the entropic barrier and suitable loss estimators has regret against an adaptive adversary of at most O(d^2 √(T log(T))) and can be implemented in polynomial time, which improves on the best known bound for an efficient algorithm.
Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs
The approach uses standard unbiased estimators and relies on a simple increasing learning-rate schedule, together with logarithmically homogeneous self-concordant barriers and a strengthened Freedman's inequality, to obtain high-probability regret bounds for online learning with bandit feedback against an adaptive adversary.
Optimal Algorithms for Online Convex Optimization with Multi-Point Bandit Feedback
The multi-point bandit setting, in which the player can query each loss function at multiple points, is introduced, and regret bounds that closely resemble bounds for the full information case are proved.
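To illustrate the kind of estimator that multi-point feedback enables, here is a hedged Python sketch of a two-point gradient estimate; the function name and signature are my own illustration, not taken from the paper.

import numpy as np

def two_point_gradient_estimate(f, x, delta, rng=None):
    # Query the loss at x + delta*u and x - delta*u for a random unit vector u,
    # and combine the two values into a gradient estimate. With two (or more)
    # queries per round, the variance of this estimate no longer blows up as
    # delta -> 0, which is what allows regret close to the full-information rate.
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return (d / (2.0 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u

Plugging such estimates into projected online gradient descent is, as far as I understand, the standard route to these multi-point regret bounds.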
…

References

Showing 1–10 of 25 references
The Price of Bandit Information for Online Optimization
This paper presents an algorithm which achieves O*(n^{3/2} √T) regret and presents lower bounds showing that this gap is at least √n, which is conjectured to be the correct order.
Robbing the bandit: less regret in online geometric optimization against an adaptive adversary
It is proved that, for a large class of full-information online optimization problems, the optimal regret against an adaptive adversary is the same as against a non-adaptive adversary.
Stochastic Linear Optimization under Bandit Feedback
A nearly complete characterization of the stochastic linear optimization problem under bandit feedback, a generalization of the classical stochastic k-armed bandit problem, is given in terms of both upper and lower bounds on the regret, and two variants of an algorithm based on the idea of “upper confidence bounds” are presented.
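To illustrate the "upper confidence bounds" idea in the linear setting, here is a minimal LinUCB-style Python sketch; it is my own simplification under assumed names (arms, A, b, beta), not the algorithm analyzed in the paper.

import numpy as np

def ucb_linear_select(arms, A, b, beta):
    # arms: list of feature vectors x in R^d; A = lambda*I + sum of x_s x_s^T
    # and b = sum of r_s x_s accumulated from past rounds; beta scales the
    # confidence width.
    theta_hat = np.linalg.solve(A, b)          # regularized least-squares estimate
    A_inv = np.linalg.inv(A)
    # Optimism: score each arm by its estimated reward plus an exploration bonus
    # proportional to the uncertainty of the estimate in that direction.
    scores = [x @ theta_hat + beta * np.sqrt(x @ A_inv @ x) for x in arms]
    return int(np.argmax(scores))

After observing the reward r of the chosen arm x, one would update A += np.outer(x, x) and b += r * x before the next round.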
Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization
This work introduces an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal O*(√T) regret and presents a novel connection between online learning and interior point methods.
The Nonstochastic Multiarmed Bandit Problem
A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs.
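The algorithm introduced in that paper is Exp3; below is a deliberately simplified, loss-based Python sketch (the original is stated in terms of gains and mixes in explicit uniform exploration), included only to show the importance-weighting mechanism.

import numpy as np

def exp3(n_arms, horizon, get_loss, eta, seed=0):
    # Exponential weights over arms, driven by importance-weighted loss
    # estimates built from bandit feedback (only the chosen arm's loss is seen).
    rng = np.random.default_rng(seed)
    weights = np.ones(n_arms)
    for t in range(horizon):
        probs = weights / weights.sum()
        arm = rng.choice(n_arms, p=probs)
        loss = get_loss(t, arm)            # adversary reveals only this entry
        estimate = loss / probs[arm]       # unbiased estimate of the full loss
        weights[arm] *= np.exp(-eta * estimate)
    return weights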
Online decision problems with large strategy sets
The theory of generalized multi-armed bandit problems is extended by supplying non-trivial algorithms and lower bounds for cases in which the strategy set is much larger (exponential or infinite) and the cost function class is structured, e.g. by constraining the cost functions to be linear or convex.
Adaptive routing with end-to-end feedback: distributed learning and geometric approaches
A second algorithm for online shortest paths is presented, which solves the shortest-path problem using a chain of online decision oracles, one at each node of the graph; this approach has several advantages over the online linear optimization approach.
The On-Line Shortest Path Problem Under Partial Monitoring
The on-line shortest path problem is considered under various models of partial monitoring, and a version of the multi-armed bandit setting for shortest path is discussed, where the decision maker learns only the total weight of the chosen path but not the weights of the individual edges on the path.
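The connection to bandit linear optimization is that a path can be encoded as a 0/1 vector over edges, so its total weight is a linear function of that vector and the learner observes only that scalar. The tiny Python sketch below uses a made-up graph and edge names purely for illustration.

import numpy as np

# Hypothetical tiny graph: index edges 0..4 and encode a path as a 0/1 vector.
edge_index = {("s", "a"): 0, ("a", "t"): 1, ("s", "b"): 2, ("b", "t"): 3, ("s", "t"): 4}

def path_to_vector(path_edges, n_edges=len(edge_index)):
    # Indicator vector of the edges used by the path.
    x = np.zeros(n_edges)
    for e in path_edges:
        x[edge_index[e]] = 1.0
    return x

weights = np.array([1.0, 2.0, 2.0, 0.5, 4.0])   # per-edge weights chosen by the adversary
x = path_to_vector([("s", "b"), ("b", "t")])
total = weights @ x                              # bandit feedback: only this scalar is observed
# total == 2.5: the decision maker learns the path's total weight, not the individual edge weights.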
Improved Risk Tail Bounds for On-Line Algorithms
Tight bounds are derived on the risk of models in the ensemble generated by incremental training of an arbitrary learning algorithm, based on uniform convergence arguments, improving on previous bounds published by the same authors.
Provably competitive adaptive routing
The routing protocol presented in this work attempts to provide throughput-competitive route selection against an adaptive adversary; a proof of the convergence time of the algorithm is presented, along with preliminary simulation results.
…