• Corpus ID: 90237357

# Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits

    @article{Li2019NearlyMR,
      title={Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits},
      author={Yingkai Li and Yining Wang and Yuanshuo Zhou},
      journal={ArXiv},
      year={2019},
      volume={abs/1904.00242}
    }
• Published 30 March 2019
• Computer Science
• ArXiv
We study the linear contextual bandit problem with finite action sets. When the problem dimension is $d$, the time horizon is $T$, and there are $n \leq 2^{d/2}$ candidate actions per time period, we (1) show that the minimax expected regret is $\Omega(\sqrt{dT (\log T) (\log n)})$ for every algorithm, and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches the lower bound up to iterated logarithmic factors. Our algorithmic result saves two $\sqrt{\log T}$ …

## Tables from this paper

## Citations

• Computer Science, ArXiv, 2023: It is shown that the classical LinUCB algorithm, designed for the realizable case, is automatically robust against gap-adjusted misspecification, achieving a near-optimal $\sqrt{T}$ regret for problems where the best-known regret is almost linear in the time horizon $T$.
• Computer Science, ArXiv, 2022: This work proposes a novel contextual bandit algorithm for generalized linear rewards with an $\tilde{O}(\sqrt{\kappa^{-1} \phi T})$ regret over $T$ rounds, where $\phi$ is the minimum eigenvalue of the covariance of contexts and $\kappa$ is a lower bound on the variance of rewards.
• Computer Science, AISTATS, 2021: A regret upper bound of $O(\sqrt{d^2 T \log T}) \times \mathrm{poly}(\log\log T)$ is proved, where $d$ is the domain dimension and $T$ is the time horizon.
• Computer Science, ArXiv, 2021: Novel analyses that significantly improve existing regret bounds are presented, relying critically on a peeling-based regret analysis that leverages the elliptical potential "count" lemma.
• Computer Science, ArXiv, 2022: The first variance-aware regret guarantee for sparse linear bandits is presented, and two recent algorithms are taken as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second is more efficient.
• Computer Science, ArXiv, 2023: A variance-adaptive algorithm for linear mixture MDPs is proposed, which achieves a problem-dependent, horizon-free regret bound that gracefully reduces to nearly constant regret for deterministic MDPs.
• Computer Science, 2022: The theoretical analysis demonstrates that, for Bernoulli multi-armed bandits, EBUCB can achieve the optimal regret order if the inference error measured by two different $\alpha$-divergences is less than a constant, regardless of how large that constant is.
• Computer Science, ArXiv, 2020: Improved fixed-design confidence bounds for the linear logistic model are proposed by leveraging the self-concordance of the logistic loss, inspired by Faury et al. (2020), improving upon previous state-of-the-art performance guarantees.
• Computer Science, Mathematics, ArXiv, 2021: This work shows how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes, and obtains the first regret bound that scales only logarithmically with the horizon $H$ in reinforcement learning with linear function approximation, exponentially improving existing results.
• Computer Science, COLT, 2021: A new Bernstein-type concentration inequality for self-normalized martingales is proposed for linear bandit problems with bounded noise, together with UCRL-VTR, a new computationally efficient algorithm with linear function approximation for the aforementioned linear mixture MDPs in the episodic undiscounted setting.
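The paper and many of the citing works above build on LinUCB-style upper confidence bounds for linear contextual bandits. As background, here is a minimal LinUCB sketch, not the paper's VCL SupLinUCB; the ridge regularizer, the exploration weight `alpha`, and the `contexts`/`rewards_fn` simulation interface are illustrative assumptions:

```python
import numpy as np

def linucb(contexts, rewards_fn, T, d, alpha=1.0):
    """Minimal LinUCB sketch: ridge-regression estimate of the reward
    parameter plus an exploration bonus per candidate action."""
    A = np.eye(d)        # regularized Gram matrix of played features
    b = np.zeros(d)      # running sum of reward-weighted features
    total_reward = 0.0
    for t in range(T):
        X = contexts(t)                      # (n, d) features this round
        theta_hat = np.linalg.solve(A, b)    # ridge estimate
        A_inv = np.linalg.inv(A)
        # UCB score = estimated reward + alpha * sqrt(x^T A^{-1} x)
        bonus = np.sqrt(np.sum((X @ A_inv) * X, axis=1))
        a = int(np.argmax(X @ theta_hat + alpha * bonus))
        r = rewards_fn(t, X[a])
        A += np.outer(X[a], X[a])
        b += r * X[a]
        total_reward += r
    return total_reward
```

The exploration bonus shrinks in directions the algorithm has already played often, which is the mechanism the SupLinUCB-style analyses above refine.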

## References

SHOWING 1-10 OF 39 REFERENCES

• Computer Science, COLT, 2018: This work develops several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. problems with sophisticated statistical tests so as to dynamically adapt to changes in distribution.
• Computer Science, Period. Math. Hung., 2010: For this modified UCB algorithm, an improved bound on the regret with respect to the optimal reward is given for $K$-armed bandits after $T$ trials.
• Computer Science, ArXiv, 2018: This self-contained contribution simultaneously presents state-of-the-art techniques for regret minimization in bandit models and an elementary construction of non-asymptotic confidence bounds based on the empirical likelihood method for bounded distributions.
• Computer Science, Mathematics, COLT, 2017: The gap-entropy conjecture is made: for any Gaussian Best-$1$-Arm instance with gaps of the form $2^{-k}$, any $\delta$-correct monotone algorithm requires $\Omega\left(H(I)\cdot\left(\ln\delta^{-1} + \mathsf{Ent}(I)\right)\right)$ samples in expectation.
• Computer Science, Mathematics, COLT, 2008: A nearly complete characterization of the classical stochastic $k$-armed bandit problem in terms of both upper and lower bounds on the regret is given, and two variants of an algorithm based on the idea of "upper confidence bounds" are presented.
• Computer Science, Mathematics, J. Mach. Learn. Res., 2003: This work considers the multi-armed bandit problem under the PAC ("probably approximately correct") model, and generalizes the lower bound to a Bayesian setting and to the case where the statistics of the arms are known but the identities of the arms are not.
• Computer Science, AISTATS, 2011: An $O(\sqrt{Td \ln^3(KT \ln(T)/\delta)})$ regret bound is proved that holds with probability $1-\delta$ for the simplest known upper confidence bound algorithm for this problem.
• Computer Science, ICML, 2013: Two novel, parameter-free algorithms for identifying the best arm are presented, in two different settings: given a target confidence and given a target budget of arm pulls; upper bounds whose gap from the lower bound is only doubly-logarithmic in the problem parameters are proved.
• This paper introduces the first strategy for stochastic bandits with unit-variance Gaussian noise that is simultaneously minimax optimal up to constant factors, asymptotically optimal, and never …
• Economics, Computer Science, Found. Trends Mach. Learn., 2012: The focus is on two extreme cases in which the analysis of regret is particularly simple and elegant: independent and identically distributed payoffs and adversarial payoffs.
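Several of the $K$-armed bandit references above refine the classical upper-confidence-bound idea. A textbook-style UCB1 sketch of that idea, not the algorithm of any single reference; the exploration constant 2 and the Bernoulli reward model are illustrative assumptions:

```python
import math
import random

def ucb1(means, T, rng):
    """UCB1 sketch for a K-armed Bernoulli bandit: pull each arm once,
    then pick the arm maximizing empirical mean + sqrt(2 ln t / pulls)."""
    K = len(means)
    counts = [0] * K     # pulls per arm
    sums = [0.0] * K     # cumulative reward per arm
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1    # initialization: one pull per arm
        else:
            a = max(range(K),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += r
        total += r
    return total, counts
```

The bonus term forces every arm to be sampled at a logarithmic rate, which is what yields the logarithmic problem-dependent regret bounds discussed in the $K$-armed references.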