Leveraging Initial Hints for Free in Stochastic Linear Bandits

@inproceedings{Cutkosky2022LeveragingIH,
  title={Leveraging Initial Hints for Free in Stochastic Linear Bandits},
  author={Ashok Cutkosky and Christoph Dann and Abhimanyu Das and Qiuyi Zhang},
  booktitle={ALT},
  year={2022}
}
We study optimization with bandit feedback when the learner is provided additional prior knowledge in the form of an initial hint of the optimal action. We present a novel algorithm for stochastic linear bandits that uses this hint to improve its regret to Õ(√T) when the hint is accurate, while maintaining a minimax-optimal Õ(d√T) regret independent of the quality of the hint. Furthermore, we provide a Pareto frontier of tight tradeoffs between best-case and worst-case regret…
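Since the abstract only states regret guarantees, the sketch below illustrates one generic way to get such a best-case/worst-case tradeoff: regret balancing between a base strategy that always plays the hinted action (claiming √n regret) and a LinUCB-style base (claiming d√n regret). This is not the paper's algorithm; the names (LinUCBBase, run), the putative bounds, the elimination rule, and all constants are assumptions made for illustration only.

```python
# Hedged sketch of a hint/no-hint tradeoff via regret balancing between two bases.
# Not the paper's method; all names, constants, and the balancing rule are assumed.
import numpy as np

class LinUCBBase:
    """Optimistic ridge-regression (LinUCB-style) learner over a finite action set."""
    def __init__(self, actions, lam=1.0, beta=2.0):
        self.actions, self.beta = actions, beta
        d = actions.shape[1]
        self.V, self.b = lam * np.eye(d), np.zeros(d)

    def select(self):
        Vinv = np.linalg.inv(self.V)
        theta_hat = Vinv @ self.b
        widths = self.beta * np.sqrt(
            np.einsum("ij,jk,ik->i", self.actions, Vinv, self.actions))
        return int(np.argmax(self.actions @ theta_hat + widths))

    def update(self, arm, reward):
        x = self.actions[arm]
        self.V += np.outer(x, x)
        self.b += reward * x

def run(actions, theta_star, hint, T=3000, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    linucb = LinUCBBase(actions)
    d = actions.shape[1]
    # Putative per-base regret bounds (up to constants): the hint base claims
    # O(sqrt(n)), the LinUCB base claims O(d * sqrt(n)).
    bounds = [lambda n: np.sqrt(n + 1.0), lambda n: d * np.sqrt(n + 1.0)]
    plays, rewards = np.zeros(2), np.zeros(2)
    active = [True, True]
    best = np.max(actions @ theta_star)
    regret = 0.0
    for _ in range(T):
        # Balancing: give the round to the active base whose putative regret
        # spent so far is smallest.
        i = min((b for b in range(2) if active[b]),
                key=lambda b: bounds[b](plays[b]))
        arm = hint if i == 0 else linucb.select()
        r = float(actions[arm] @ theta_star + noise * rng.normal())
        linucb.update(arm, r)          # the LinUCB base learns from every round
        plays[i] += 1
        rewards[i] += r
        regret += best - actions[arm] @ theta_star
        # Elimination: drop a base whose average reward, even after adding its
        # promised regret rate and a confidence term, trails the other base.
        if plays.min() > 0:
            avg = rewards / plays
            conf = np.sqrt(np.log(T) / plays)
            for b in range(2):
                o = 1 - b
                if active[b] and avg[b] + bounds[b](plays[b]) / plays[b] + conf[b] < avg[o] - conf[o]:
                    active[b] = False
        if not any(active):            # safety net: keep the worst-case-safe base
            active = [False, True]
    return regret

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, K = 5, 50
    actions = rng.normal(size=(K, d))
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)
    good_hint = int(np.argmax(actions @ theta_star))
    bad_hint = int(np.argmin(actions @ theta_star))
    print("regret with accurate hint:", round(run(actions, theta_star, good_hint), 1))
    print("regret with bad hint:     ", round(run(actions, theta_star, bad_hint), 1))
```

In this toy setup, an accurate hint keeps nearly all rounds on the hinted action, so the exploration cost stays small; a misleading hint is eventually eliminated by the reward comparison and the learner falls back to LinUCB-style behavior. The actual paper proves tight tradeoffs that a heuristic like this does not guarantee.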
