Corpus ID: 219708289

Q-learning with Logarithmic Regret

@inproceedings{Yang2021QlearningWL,
  title={Q-learning with Logarithmic Regret},
  author={Kunhe Yang and Lin F. Yang and Simon Shaolei Du},
  booktitle={AISTATS},
  year={2021}
}
This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve a logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap in the optimal $Q$-function. We prove that the optimistic $Q$-learning studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\mathrm{gap}_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the…
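For a concrete picture of the algorithm being analyzed, here is a minimal sketch of optimistic Q-learning with Hoeffding-style UCB bonuses in the spirit of Jin et al. (2018), run on a toy randomly generated tabular episodic MDP. The environment, the bonus constant c, and all hyperparameters below are illustrative assumptions, not taken from the paper.

import numpy as np

# Minimal sketch of optimistic Q-learning with Hoeffding-style UCB bonuses
# (in the spirit of Jin et al. 2018). The toy environment, constants, and
# reward structure are assumptions for illustration only.

S, A, H, K = 5, 2, 3, 2000          # states, actions, horizon, episodes
c = 1.0                             # bonus constant (tunable assumption)
iota = np.log(S * A * H * K)        # logarithmic factor

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel P[s, a, s']
R = rng.uniform(size=(S, A))                 # random mean rewards in [0, 1]

Q = np.full((H, S, A), float(H))             # optimistic initialization Q <- H
N = np.zeros((H, S, A), dtype=int)           # visit counts

total_reward = 0.0
for k in range(K):
    s = 0                                    # fixed initial state
    for h in range(H):
        a = int(np.argmax(Q[h, s]))          # act greedily w.r.t. optimistic Q
        r = R[s, a]
        s_next = rng.choice(S, p=P[s, a])
        total_reward += r

        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)            # learning rate from Jin et al. 2018
        bonus = c * np.sqrt(H**3 * iota / t) # Hoeffding-style exploration bonus
        v_next = min(H, Q[h + 1, s_next].max()) if h + 1 < H else 0.0
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
        s = s_next

print(f"average per-episode return after {K} episodes: {total_reward / K:.3f}")

The optimistic initialization and the shrinking bonus are what drive exploration here; the paper's contribution is showing that, once every state-action pair's sub-optimality gap is bounded below by gap_min, the regret of this style of algorithm grows only logarithmically in T.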
Citations

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes
TLDR
A new Bernstein-type concentration inequality for self-normalized martingales with bounded noise is proposed for linear bandit problems, together with a new, computationally efficient algorithm with linear function approximation, UCRL-VTR, for linear mixture MDPs in the episodic undiscounted setting.
An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap
TLDR
This work's hardness result shows that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed in addition to having a linearly realizable optimal Q-function, which implies an exponential separation between the online RL setting and the generative model setting.
Minimax Sample Complexity for Turn-based Stochastic Game
TLDR
This work proves that the plug-in solver approach, probably the most natural reinforcement learning algorithm, achieves minimax sample complexity for turn-based stochastic games (TBSGs) by utilizing a 'simulator' that allows sampling from an arbitrary state-action pair.
Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning
  • Gen Li, Laixi Shi, Yuxin Chen, Yuantao Gu, Yuejie Chi
  • Computer Science, Mathematics
  • 2021
TLDR
A novel model-free algorithm with space complexity O(SAH) is proposed; it achieves near-optimal regret as soon as the sample size exceeds the order of SA·poly(H), improving upon any prior memory-efficient algorithm that is asymptotically regret-optimal by at least a factor of SA.
Gap-Dependent Unsupervised Exploration for Reinforcement Learning
TLDR
An efficient algorithm is provided that takes only Õ(1/ε · (HSA/ρ + HSA)) episodes of exploration, and is able to obtain an ε-optimal policy for a post-revealed reward with sub-optimality gap at least ρ, obtaining a nearly quadratic saving in terms of ε.
Logarithmic Regret for Reinforcement Learning with Linear Function Approximation
TLDR
It is shown that logarithmic regret is attainable under two recently proposed linear MDP assumptions provided that there exists a positive sub-optimality gap for the optimal action-value function.
Provably Efficient Multi-Task Reinforcement Learning with Model Transfer
  • Chicheng Zhang, Zhi Wang
  • Computer Science
  • arXiv
  • 2021
TLDR
A heterogeneous multi-player RL problem, in which a group of players concurrently face similar but not necessarily identical MDPs, is formulated with a goal of improving their collective performance through inter-player information sharing.
A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs
TLDR
A novel asymptotic problem-dependent lower bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs) is derived and shows that, in certain "simple" MDPs, the lower bound is considerably smaller than in the general case and does not scale with the minimum action gap at all.
Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning
TLDR
It is shown that optimistic algorithms cannot achieve the information-theoretic lower bounds even in deterministic MDPs unless there is a unique optimal policy, and tighter upper regret bounds for optimistic algorithms are proved.
Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap
TLDR
A new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs), Adaptive Multi-step Bootstrap (AMB), is proposed; it enjoys a stronger gap-dependent regret bound, and the upper bound is complemented with a lower bound showing that the dependency on |Z_mul|/∆_min is unavoidable for any consistent algorithm.

References

Showing 1-10 of 50 references
Minimax Regret Bounds for Reinforcement Learning
We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of…
Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity
TLDR
The open problem on agnostic $Q$-learning proposed in [Wen and Van Roy, NIPS 2013] is settled, and the upper bound suggests that the sample complexity of $\widetilde{\Theta}\left(\rho/\sqrt{\mathrm{dim}_E}\right)$ is tight even in the agnostic setting.
Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
TLDR
It is shown that the sample complexity of exploration of the proposed Q-learning algorithm is bounded by $\tilde{O}\left(\frac{SA}{\epsilon^2(1-\gamma)^7}\right)$, which improves the previously best known result.
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model
TLDR
To the best of the authors' knowledge, this work provides the first minimax-optimal guarantee in a generative model that accommodates the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically impossible).
Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition
TLDR
A model-free algorithm UCB-Advantage is proposed and it is proved that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play.
Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning
TLDR
The upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs, which have a time-horizon dependency of at least $H^3$.
Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model
TLDR
The method is extended to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model, and matches the sample complexity lower bounds proved in Azar et al. (2013) up to logarithmic factors.
Variance Reduced Value Iteration and Faster Algorithms for Solving Markov Decision Processes
TLDR
This paper is one of the few instances of using sampling to obtain a linearly convergent linear programming algorithm, and it is hoped that the analysis may be useful more broadly.
Is Q-learning Provably Efficient?
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…
Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds
TLDR
An algorithm for finite-horizon discrete MDPs and an associated analysis are presented that both yield state-of-the-art worst-case regret bounds in the dominant terms and yield substantially tighter bounds if the RL environment has small environmental norm, which is a function of the variance of the next-state value functions.