Linear bandits with limited adaptivity and learning distributional optimal design

@inproceedings{Ruan2021LinearBW,
  title={Linear bandits with limited adaptivity and learning distributional optimal design},
  author={Yufei Ruan and Jiaqi Yang and Yuan Zhou},
  booktitle={Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing},
  year={2021}
}
Motivated by practical needs such as large-scale learning, we study the impact of adaptivity constraints on linear contextual bandits, a central problem in online learning and decision making. We consider two popular limited-adaptivity models from the literature: batch learning and rare policy switches. We show that, when the context vectors are adversarially chosen in d-dimensional linear contextual bandits, the learner needs O(d log d log T) policy switches to achieve the minimax-optimal regret, and…
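
The rare-policy-switch regime can be illustrated with the determinant-doubling trick that is standard in the low-switching-cost literature: keep playing the current policy and recompute it only once the design matrix's determinant has doubled since the last switch, which (for bounded contexts) caps the number of switches at O(d log T). The sketch below is a minimal illustration of that trick, not the algorithm of Ruan et al.; the environment interface (`get_arms`, `pull`) and the parameters `lam` and `beta` are assumptions for illustration.

```python
import numpy as np

def rare_switch_linucb(env, d, T, lam=1.0, beta=1.0):
    """LinUCB variant that re-solves its policy only when det(V) doubles.

    A minimal sketch of the determinant-doubling trick (not the paper's
    algorithm). `env.get_arms()` is assumed to return a list of d-dim
    context vectors and `env.pull(x)` a scalar reward; `beta` is a fixed
    exploration coefficient rather than a calibrated confidence radius.
    """
    V = lam * np.eye(d)            # regularized design matrix
    b = np.zeros(d)                # sum of reward-weighted contexts
    V_at_switch = V.copy()         # design matrix at the last policy switch
    V_inv, theta = np.linalg.inv(V), np.zeros(d)
    switches = 0

    for t in range(T):
        # Switch policy only when the determinant has doubled since the
        # last switch; this bounds the number of switches by O(d log T).
        if np.linalg.det(V) > 2 * np.linalg.det(V_at_switch):
            V_inv = np.linalg.inv(V)
            theta = V_inv @ b
            V_at_switch = V.copy()
            switches += 1

        arms = env.get_arms()      # possibly adversarially chosen contexts
        # Optimistic score: estimated reward plus an exploration bonus.
        ucb = [x @ theta + beta * np.sqrt(x @ V_inv @ x) for x in arms]
        x = arms[int(np.argmax(ucb))]
        r = env.pull(x)

        V += np.outer(x, x)        # statistics accumulate every round,
        b += r * x                 # even while the policy stays frozen

    return theta, switches
```

The key point the sketch conveys is that data keeps accumulating every round while the deployed policy changes only O(d log T) times, which is the kind of budget the paper's adversarial-context lower bound is compared against.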

Citations

Provable Benefits of Representation Learning in Linear Bandits
TLDR
A new algorithm is presented that achieves a regret bound demonstrating the benefit of representation learning in certain regimes, and an $\Omega(T\sqrt{kN} + \sqrt{dkNT})$ regret lower bound is provided, showing that the algorithm is minimax-optimal up to poly-logarithmic factors.
Double Explore-then-Commit: Asymptotic Optimality and Beyond
TLDR
This paper proposes a double explore-then-commit (DETC) algorithm with two exploration and exploitation phases, and proves that DETC achieves the asymptotically optimal regret bound and is the first non-fully-sequential algorithm to achieve such asymptotic optimality.
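For contrast with DETC's two-phase design, the classic single-phase explore-then-commit template that it refines can be sketched as follows; `pull` is an assumed reward oracle and the per-arm exploration budget `m` is a free parameter, both introduced here for illustration.

```python
import numpy as np

def explore_then_commit(pull, K, T, m):
    """Classic single-phase explore-then-commit for a K-armed bandit --
    the template that DETC extends with a second explore/exploit round.
    `pull(a)` is an assumed oracle returning a stochastic reward for arm
    `a`; `m` is the per-arm exploration budget (requires K * m <= T).
    """
    # Exploration phase: pull every arm m times and record empirical means.
    means = np.array([np.mean([pull(a) for _ in range(m)]) for a in range(K)])
    # Commit phase: play the empirically best arm for the remaining rounds.
    best = int(np.argmax(means))
    rewards = [pull(best) for _ in range(T - K * m)]
    return best, np.sum(rewards)
```

Tuning m appropriately gives this template its familiar $O(T^{2/3})$-type worst-case regret; the TLDR above concerns removing the asymptotic suboptimality of such a one-shot commitment.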
Online Convex Optimization with Continuous Switching Constraint
TLDR
The essential idea is to carefully design an adaptive adversary that adjusts the loss function according to the player's cumulative switching cost incurred so far, based on an orthogonal technique, and to develop a simple gradient-based algorithm that enjoys the minimax-optimal regret bound.
Batched Neural Bandits
TLDR
This work proposes the BatchNeuralUCB algorithm, which combines neural networks with optimism to address the exploration-exploitation tradeoff while keeping the total number of batches limited, and proves that it achieves the same regret as the fully sequential version while considerably reducing the number of policy updates.
Provably Efficient Reinforcement Learning with Linear Function Approximation
TLDR
This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{O}(\sqrt{d^3H^3T})$ regret, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.
Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits
TLDR
A lower bound theorem is proved that, surprisingly, shows the optimality of the authors' two-phase regret upper bound (up to logarithmic factors) over the full range of problem parameters, thereby establishing the exact batch-regret tradeoff.
Design of Experiments for Stochastic Contextual Linear Bandits
TLDR
This work designs a single stochastic policy to collect a good dataset from which a near-optimal policy can be extracted and presents a theoretical analysis as well as numerical experiments on both synthetic and real-world datasets.
Encrypted Linear Contextual Bandit
TLDR
This paper introduces a privacy-preserving bandit framework based on homomorphic encryption, which allows computations over encrypted data, and shows that despite the complexity of the setting it is possible to solve any linear contextual bandit problem over encrypted data with a sublinear regret bound, while keeping the data encrypted.
PAC Top-k Identification under SST in Limited Rounds
TLDR
It is shown that Ω(nk) comparisons are necessary for PAC top-k identification under SST even with unbounded adaptivity, establishing that this problem is strictly harder under SST than it is in the noisy comparison model.
Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality
TLDR
The fundamental limit in achieving deployment efficiency is revealed by establishing information-theoretic lower bounds, and algorithms that achieve the optimal deployment efficiency are provided.

References

Showing 1-10 of 57 references
Sequential Batch Learning in Finite-Action Linear Contextual Bandits
TLDR
This work establishes a regret lower bound and provides an algorithm whose regret upper bound nearly matches it, giving a near-complete characterization of sequential decision making in linear contextual bandits when batch constraints are present.
Bandits with switching costs: T^{2/3} regret
TLDR
It is proved that the player's T-round minimax regret in this setting is $\widetilde{\Theta}(T^{2/3})$, thereby closing a fundamental gap in the understanding of learning with bandit feedback and resolving several other open problems in online learning.
Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits
TLDR
A Variable-Confidence-Level (VCL) SupLinUCB algorithm is introduced whose regret matches the lower bound up to iterated-logarithmic factors, revealing a regret scaling quite different from classical multi-armed bandits, in which no logarithmic $T$ term is present in the minimax regret.
Provably Efficient Q-Learning with Low Switching Cost
TLDR
The main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDPs that achieves sublinear regret with a local switching cost of $O(H^3SA\log K)$ over K episodes; a lower bound of $\Omega(HSA)$ on the local switching cost of any no-regret algorithm is also proved.
Online Learning with Switching Costs and Other Adaptive Adversaries
TLDR
It is shown that with switching costs, the attainable rate with bandit feedback is $\Theta(T^{2/3})$, and it is proved that switching costs are easier to control than bounded-memory adversaries.
Stochastic Linear Optimization under Bandit Feedback
TLDR
A nearly complete characterization of the classical stochastic k-armed bandit problem in terms of both upper and lower bounds for the regret is given, and two variants of an algorithm based on the idea of “upper confidence bounds” are presented.
Linearly Parameterized Bandits
TLDR
It is proved that the regret and Bayes risk are of order $\Theta(r\sqrt{T})$, by establishing a lower bound for an arbitrary policy and showing that a matching upper bound is obtained through a policy that alternates between exploration and exploitation phases.
Contextual Bandits with Linear Payoff Functions
TLDR
An $O(\sqrt{Td\ln^3(KT\ln(T)/\delta)})$ regret bound is proved that holds with probability $1-\delta$ for the simplest known upper confidence bound algorithm for this problem.
Minimax Bounds on Stochastic Batched Convex Optimization
TLDR
Lower and upper bounds on the performance of such batched convex optimization algorithms in zeroth and first-order settings for Lipschitz convex and smooth strongly convex functions are provided.
Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition
TLDR
A model-free algorithm UCB-Advantage is proposed and it is proved that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play.