# Linear bandits with limited adaptivity and learning distributional optimal design

@article{Ruan2021LinearBW, title={Linear bandits with limited adaptivity and learning distributional optimal design}, author={Yufei Ruan and Jiaqi Yang and Yuanshuo Zhou}, journal={Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing}, year={2021} }

Motivated by practical needs such as large-scale learning, we study the impact of adaptivity constraints on linear contextual bandits, a central problem in online learning and decision making. We consider two popular limited adaptivity models in the literature: batch learning and rare policy switches. We show that, when the context vectors are adversarially chosen in d-dimensional linear contextual bandits, the learner needs $O(d \log d \log T)$ policy switches to achieve the minimax-optimal regret, and…

## 17 Citations

Provable Benefits of Representation Learning in Linear Bandits

- Computer Science
- ArXiv
- 2020

A new algorithm is presented which achieves a corresponding regret bound which demonstrates the benefit of representation learning in certain regimes, and an $\Omega(T\sqrt{kN} + \sqrt {dkNT})$ regret lower bound is provided, showing that the algorithm is minimax-optimal up to poly-logarithmic factors.

Double Explore-then-Commit: Asymptotic Optimality and Beyond

- Computer Science
- COLT
- 2021

This paper proposes a double explore-then-commit (DETC) algorithm that has two exploration and exploitation phases, and proves that DETC achieves the asymptotically optimal regret bound and is the first non-fully-sequential algorithm to achieve such asymptotic optimality.

Online Convex Optimization with Continuous Switching Constraint

- Computer Science
- NeurIPS
- 2021

The essential idea is to carefully design an adaptive adversary, who can adjust the loss function according to the cumulative switching cost of the player incurred so far based on the orthogonal technique, and develop a simple gradient-based algorithm which enjoys the minimax optimal regret bound.

Batched Neural Bandits

- Computer Science
- ArXiv
- 2021

This work proposes the BatchNeuralUCB algorithm which combines neural networks with optimism to address the exploration-exploitation tradeoff while keeping the total number of batches limited and proves that it achieves the same regret as the fully sequential version while reducing the number of policy updates considerably.

Provably Efficient Reinforcement Learning with Linear Function Approximation

- Computer Science
- COLT
- 2020

This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{O}(\sqrt{d^3H^3T})$ regret, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps, and that this bound is independent of the number of states and actions.

Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits

- Computer Science
- ArXiv
- 2021

A lower bound theorem is proved that surprisingly shows the optimality of the authors' two-phase regret upper bound (up to logarithmic factors) in the full range of the problem parameters, therefore establishing the exact batch-regret tradeoff.

Design of Experiments for Stochastic Contextual Linear Bandits

- Computer Science
- NeurIPS
- 2021

This work designs a single stochastic policy to collect a good dataset from which a near-optimal policy can be extracted and presents a theoretical analysis as well as numerical experiments on both synthetic and real-world datasets.

Encrypted Linear Contextual Bandit

- Computer Science
- AISTATS
- 2022

This paper introduces a privacy-preserving bandit framework based on homomorphic encryption which allows computations using encrypted data and shows that despite the complexity of the setting, it is possible to solve linear contextual bandits over encrypted data with a regret bound in any linear contextual bandit problem, while keeping data encrypted.

PAC Top-k Identification under SST in Limited Rounds

- Computer Science
- 2022

It is shown that Ω(nk) comparisons are necessary for PAC top-k identification under SST even with unbounded adaptivity, establishing that this problem is strictly harder under SST than it is for the noisy comparison model.

Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality

- Computer Science
- ArXiv
- 2022

The fundamental limit in achieving deployment efficiency is revealed by establishing information-theoretic lower bounds, and algorithms that achieve the optimal deployment efficiency are provided.

## References

Sequential Batch Learning in Finite-Action Linear Contextual Bandits

- Computer Science
- ArXiv
- 2020

This work establishes a regret lower bound and provides an algorithm, whose regret upper bound nearly matches the lower bound, that provides a near-complete characterization of sequential decision making in linear contextual bandits when batch constraints are present.

Bandits with switching costs: $T^{2/3}$ regret

- Computer Science
- STOC
- 2014

It is proved that the player's T-round minimax regret in this setting is $\tilde{\Theta}(T^{2/3})$, thereby closing a fundamental gap in the understanding of learning with bandit feedback and resolving several other open problems in online learning.

Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits

- Computer Science
- COLT
- 2019

A Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches the lower bound up to iterated logarithmic factors is introduced, revealing a regret scaling quite different from classical multi-armed bandits in which no logarithmic $T$ term is present in minimax regret.

Provably Efficient Q-Learning with Low Switching Cost

- Computer Science
- NeurIPS
- 2019

The main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDP that achieves sublinear regret whose local switching cost in K episodes is $O(H^3SA\log K)$, and a lower bound of $\Omega(HSA)$ on the local switching costs for any no-regret algorithm.

Online Learning with Switching Costs and Other Adaptive Adversaries

- Computer Science, Mathematics
- NIPS
- 2013

It is shown that with switching costs, the attainable rate with bandit feedback is $\Theta(T^{2/3})$, and it is proved that switching costs are easier to control than bounded memory adversaries.

Stochastic Linear Optimization under Bandit Feedback

- Computer Science, Mathematics
- COLT
- 2008

A nearly complete characterization of the classical stochastic k-armed bandit problem in terms of both upper and lower bounds for the regret is given, and two variants of an algorithm based on the idea of “upper confidence bounds” are presented.

Linearly Parameterized Bandits

- Computer Science, Mathematics
- Math. Oper. Res.
- 2010

It is proved that the regret and Bayes risk are of order $\Theta(r\sqrt{T})$, by establishing a lower bound for an arbitrary policy and showing that a matching upper bound is obtained through a policy that alternates between exploration and exploitation phases.

Contextual Bandits with Linear Payoff Functions

- Computer Science
- AISTATS
- 2011

An $O\left(\sqrt{Td \ln(KT \ln(T)/\delta)}\right)$ regret bound is proved that holds with probability $1-\delta$ for the simplest known upper confidence bound algorithm for this problem.

Minimax Bounds on Stochastic Batched Convex Optimization

- Computer Science
- COLT
- 2018

Lower and upper bounds on the performance of such batched convex optimization algorithms in zeroth and first-order settings for Lipschitz convex and smooth strongly convex functions are provided.

Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition

- Computer Science
- NeurIPS
- 2020

A model-free algorithm UCB-Advantage is proposed and it is proved that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play.