• Corpus ID: 49904719

# An Optimal Algorithm for Stochastic and Adversarial Bandits

@article{Zimmert2018AnOA,
title={An Optimal Algorithm for Stochastic and Adversarial Bandits},
author={Julian Zimmert and Yevgeny Seldin},
journal={ArXiv},
year={2018},
volume={abs/1807.07623}
}
• Published 19 July 2018
• Computer Science
• ArXiv
We provide an algorithm that achieves the optimal (up to constants) finite time regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The result provides a negative answer to the open problem of whether extra price has to be paid for the lack of information about the adversariality/stochasticity of the environment. We provide a complete characterization of online mirror descent algorithms based on Tsallis entropy and show that the…
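The abstract refers to online mirror descent regularized by Tsallis entropy. As a rough illustration only, the following is a minimal sketch of one such scheme with Tsallis parameter 1/2 and importance-weighted loss estimates. The closed form `4 / (eta * (L_i - x))**2` for the weights, the learning rate `2 / sqrt(t)`, the bisection solver, and all function names are assumptions made for this sketch and are not taken from the truncated abstract above.

```python
import math
import random


def tsallis_weights(cum_losses, eta, iters=60):
    """Return probabilities w_i proportional to 4 / (eta * (L_i - x))**2.

    The normalizer x < min(L) is found by bisection so that the
    unnormalized weights sum to (approximately) one.
    """
    k = len(cum_losses)
    smallest = min(cum_losses)
    lo = smallest - 2.0 * math.sqrt(k) / eta  # weight sum is <= 1 here
    hi = smallest - 2.0 / eta                 # weight sum is >= 1 here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(4.0 / (eta * (l - mid)) ** 2 for l in cum_losses)
        if total > 1.0:
            hi = mid
        else:
            lo = mid
    x = 0.5 * (lo + hi)
    w = [4.0 / (eta * (l - x)) ** 2 for l in cum_losses]
    s = sum(w)
    return [wi / s for wi in w]  # remove residual bisection error


def run_bandit(loss_fn, k, horizon, seed=0):
    """Play `horizon` rounds on `k` arms; return the pull count per arm.

    `loss_fn(t, arm, rng)` is a hypothetical callback returning a loss
    in [0, 1] for the arm played in round t.
    """
    rng = random.Random(seed)
    cum_losses = [0.0] * k  # importance-weighted cumulative loss estimates
    pulls = [0] * k
    for t in range(1, horizon + 1):
        eta = 2.0 / math.sqrt(t)  # assumed anytime learning-rate schedule
        w = tsallis_weights(cum_losses, eta)
        arm = rng.choices(range(k), weights=w)[0]
        loss = loss_fn(t, arm, rng)
        cum_losses[arm] += loss / w[arm]  # unbiased estimate for the played arm
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    means = [0.2, 0.5, 0.5]  # illustrative Bernoulli loss means
    bernoulli = lambda t, arm, rng: 1.0 if rng.random() < means[arm] else 0.0
    print(run_bandit(bernoulli, k=3, horizon=2000))
```

The same loop runs unchanged whether `loss_fn` draws stochastic losses or replays an adversarial sequence, which is the sense in which no knowledge of the regime or horizon is needed.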
164 Citations


• Computer Science
ICML
• 2019
This work develops the first general semi-bandit algorithm that simultaneously achieves $O(\log T)$ regret for stochastic environments and $O(\sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of rounds $T$.
An algorithm for combinatorial semi-bandits with a hybrid regret bound that includes a best-of-three-worlds guarantee and multiple data-dependent regret bounds is proposed, which implies that the algorithm will perform better as long as the environment is "easy" in terms of certain metrics.
• Computer Science
ICML
• 2021
This work develops linear bandit algorithms that automatically adapt to different environments and additionally enjoy minimax-optimal regret in completely adversarial environments, the first result of this kind to the authors' knowledge.
• Computer Science
ArXiv
• 2019
A first-order bound is proved for a modified variant of the INF strategy by Audibert and Bubeck [2009], without sacrificing worst case optimality or modifying the loss estimators.
• Computer Science
UAI
• 2019
A first-order bound is proved for a modified variant of the INF strategy by Audibert and Bubeck [2009], without sacrificing worst case optimality or modifying the loss estimators.
• Computer Science
ICML
• 2021
An algorithm for stochastic and adversarial multiarmed bandits with switching costs, where the algorithm pays a price λ every time it switches the arm being played, based on adaptation of the Tsallis-INF algorithm.
• Computer Science
COLT 2020
• 2020
A new algorithm is derived using regularization by Tsallis entropy; it achieves best-of-both-worlds guarantees and the minimax-optimal $O(\sqrt{KT})$ regret bound, slightly improving on the result of Avner et al.
A new algorithm is provided with a new hybrid regret bound that implies logarithmic regret in the stochastic regime and multiple data-dependent regret bounds in the adversarial regime, including bounds dependent on cumulative loss, total variation, and loss-sequence path-length.
• Computer Science
ArXiv
• 2022
This paper proposes a new algorithm based on the principle of optimism in the face of uncertainty that achieves the near-optimal regret for both corrupted and uncorrupted cases simultaneously and shows that for both known C and unknown C cases, the algorithm with proper choice of hyperparameter achieves a regret that nearly matches the lower bounds.
• Computer Science
AAAI
• 2020
This paper investigates the attack model where an adversary attacks with a certain probability at each round, and its attack value can be arbitrary and unbounded if it attacks, and provides a high probability guarantee of O(log T) regret with respect to random rewards and random occurrence of attacks.

## References

SHOWING 1-10 OF 34 REFERENCES

• Computer Science
ICML
• 2019
This work develops the first general semi-bandit algorithm that simultaneously achieves $O(\log T)$ regret for stochastic environments and $O(\sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of rounds $T$.
• Computer Science
ICML
• 2014
The algorithm is based on augmentation of the EXP3 algorithm with a new control lever in the form of exploration parameters that are tailored individually for each arm, and retains "logarithmic" regret guarantee in the stochastic regime even when some observations are contaminated by an adversary.
• Computer Science
COLT
• 2012
SAO (Stochastic and Adversarial Optimal) combines the $O(\sqrt{n})$ worst-case regret of Exp3 for adversarial rewards and the (poly)logarithmic regret of UCB1 for stochastic rewards.
• Computer Science
COLT
• 2018
The main idea of the algorithm is to apply the optimism and adaptivity techniques to the well-known Online Mirror Descent framework with a special log-barrier regularizer to come up with appropriate optimistic predictions and correction terms in this framework.
• Computer Science, Mathematics
COLT 2009
• 2009
This work fills in a long open gap in the characterization of the minimax rate for the multi-armed bandit problem and proposes a new family of randomized algorithms based on an implicit normalization, as well as a new analysis.
• Computer Science
COLT
• 2018
A lower bound is given that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards, and a simple parameter-free algorithm is designed whose probability of error is shown to match the lower bound in stochastic problems and which is also robust to adversarial rewards.
• Computer Science
COLT
• 2017
A new strategy for gap estimation in randomized algorithms for multiarmed bandits is presented and combined with the EXP3++ algorithm of Seldin and Slivkins (2014) to reduce the dependence of regret on the time horizon and eliminate an additive factor.
• Computer Science
NIPS
• 2015
A novel family of algorithms with minimax optimal regret guarantees is defined using the notion of convex smoothing, and it is shown that a wide class of perturbation methods achieve a near-optimal regret as low as $O(\sqrt{NT \log N})$, as long as the perturbation distribution has a bounded hazard function.
• Computer Science
STOC
• 2018
We introduce a new model of stochastic bandits with adversarial corruptions which aims to capture settings where most of the input follows a stochastic pattern but some fraction of it can be
• Computer Science
ArXiv
• 2018
It is proved that a geometric doubling trick can be used to conserve (minimax) bounds in $R_T = O(\sqrt{T})$ but cannot conserve (distribution-dependent) bounds in $R_T = O(\log T)$, and insights are given as to why exponential doubling tricks may be better, as they conserve bounds in $R_T = O(\log T)$ and are close to conserving bounds in $R_T = O(\sqrt{T})$.
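
The doubling trick mentioned in the last entry restarts a horizon-dependent algorithm on segments of growing length so that it can be run without knowing the total number of rounds. A minimal sketch of the geometric schedule follows; the function name `geometric_segments` and its parameters `first_len` and `base` are hypothetical choices made for this illustration, not taken from the referenced paper.

```python
def geometric_segments(total_rounds, first_len=1, base=2):
    """Split rounds 1..total_rounds into consecutive (start, end) segments.

    Planned segment lengths are first_len, first_len * base, and so on;
    the final segment is truncated at total_rounds.
    """
    segments = []
    start = 1
    length = first_len
    while start <= total_rounds:
        end = min(start + length - 1, total_rounds)
        segments.append((start, end))
        start = end + 1
        length *= base  # geometric growth of the guessed horizon
    return segments


if __name__ == "__main__":
    # A fresh copy of the horizon-dependent algorithm would be started
    # at the beginning of each segment, tuned for that segment's length.
    print(geometric_segments(10))  # [(1, 1), (2, 3), (4, 7), (8, 10)]
```

With `base=2` the number of restarts grows only logarithmically in the total number of rounds.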