Logarithmic Regret for Reinforcement Learning with Linear Function Approximation
@article{He2020LogarithmicRF, title={Logarithmic Regret for Reinforcement Learning with Linear Function Approximation}, author={Jiafan He and Dongruo Zhou and Quanquan Gu}, journal={ArXiv}, year={2020}, volume={abs/2011.11566} }
Reinforcement learning (RL) with linear function approximation has received increasing attention recently. However, existing work has focused on obtaining $\sqrt{T}$-type regret bounds, where $T$ is the number of steps. In this paper, we show that logarithmic regret is attainable under two recently proposed linear MDP assumptions, provided that there exists a positive sub-optimality gap for the optimal action-value function. Specifically, under the linear MDP assumption (Jin et al. 2019), the LSVI…
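To make the gap condition in the abstract concrete, here is a minimal sketch in standard notation (not quoted from the paper): writing $Q_h^*$ and $V_h^*$ for the optimal action-value and value functions at step $h$, the sub-optimality gap of action $a$ in state $s$ is $\mathrm{gap}_h(s,a) = V_h^*(s) - Q_h^*(s,a)$, and the assumption is that the minimal positive gap $\mathrm{gap}_{\min} = \min\{\mathrm{gap}_h(s,a) : \mathrm{gap}_h(s,a) > 0\}$ is bounded away from zero. Under such an assumption, a gap-dependent bound has the general shape $\mathrm{Regret}(T) = \tilde{O}\big(\mathrm{poly}(d,H)\,\log T/\mathrm{gap}_{\min}\big)$, i.e., logarithmic in $T$ rather than $\sqrt{T}$; the exact polynomial dependence on the feature dimension $d$ and horizon $H$ is what the paper derives.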
59 Citations
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes
- Computer Science, COLT
- 2021
A new Bernstein-type concentration inequality for self-normalized martingales, tailored to linear bandit problems with bounded noise, is proposed, along with a new, computationally efficient algorithm with linear function approximation, named UCRL-VTR, for linear mixture MDPs in the episodic undiscounted setting.
Provably Efficient Reinforcement Learning with Linear Function Approximation
- Computer Science, COLT
- 2020
This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{O}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of the feature space, $H$ is the length of each episode, and $T$ is the total number of steps; the bound is independent of the number of states and actions.
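For context, a minimal sketch of the optimistic LSVI update referred to here (LSVI-UCB), in standard notation rather than quoted from the paper: with feature map $\phi$, ridge parameter $\lambda$, and bonus multiplier $\beta$, at episode $k$ and step $h$ one forms $\Lambda_h^k = \lambda I + \sum_{\tau<k}\phi(s_h^\tau,a_h^\tau)\phi(s_h^\tau,a_h^\tau)^\top$ and $w_h^k = (\Lambda_h^k)^{-1}\sum_{\tau<k}\phi(s_h^\tau,a_h^\tau)\big[r_h(s_h^\tau,a_h^\tau) + \max_{a} Q_{h+1}^k(s_{h+1}^\tau,a)\big]$, then acts greedily with respect to the truncated optimistic estimate $Q_h^k(s,a) = \min\big\{\langle w_h^k, \phi(s,a)\rangle + \beta\sqrt{\phi(s,a)^\top(\Lambda_h^k)^{-1}\phi(s,a)},\ H\big\}$; the choice of $\beta$ is what drives the stated regret.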
Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap
- Computer Science, COLT
- 2021
A new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB), which enjoys a stronger gap-dependent regret bound, and complements its upper bound with a lower bound showing the dependency on $|Z_{\mathrm{mul}}|/\Delta_{\min}$ is unavoidable for any consistent algorithm.
Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP
- Computer Science, Mathematics, ArXiv
- 2021
This work shows how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs) and obtains the first regret bound that only scales logarithmically with H in the reinforcement learning with linear function approximation setting, thus exponentially improving existing results.
Gap-Dependent Bounds for Two-Player Markov Games
- Computer Science, AISTATS
- 2022
Analyzes the cumulative regret of the Nash Q-learning algorithm on 2-player turn-based stochastic Markov games (2-TBSG) and establishes gap-dependent logarithmic upper bounds in the episodic tabular setting.
Provably Efficient Representation Learning in Low-rank Markov Decision Processes
- Computer Science, ArXiv
- 2021
A provably efficient algorithm called ReLEX is proposed that simultaneously learns the representation and performs exploration, and is strictly better in terms of sample efficiency if the function class of representations enjoys a certain mild “coverage” property over the whole state-action space.
On the Interplay Between Misspecification and Sub-optimality Gap in Linear Contextual Bandits
- Computer Science, ArXiv
- 2023
An algorithm is proposed based on a novel data selection scheme that only selects the contextual vectors with large uncertainty for online regression, and it is shown to enjoy the same gap-dependent regret bound as in the well-specified setting up to logarithmic factors.
Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency
- Computer Science, ArXiv
- 2023
A variance-adaptive algorithm for linear mixture MDPs is proposed, which achieves a problem-dependent horizon-free regret bound that can gracefully reduce to a nearly constant regret for deterministic MDPs.
On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation
- Computer Science, Mathematics, ArXiv
- 2022
Cascaded Gaps: Towards Logarithmic Regret for Risk-Sensitive Reinforcement Learning
- Computer Science, ICML
- 2022
Based on cascaded gaps, non-asymptotic and logarithmic regret bounds for two model-free algorithms under episodic Markov decision processes are derived and it is shown that, in appropriate settings, these bounds feature exponential improvement over existing ones that are independent of gaps.
References
SHOWING 1-10 OF 45 REFERENCES
Q-learning with Logarithmic Regret
- Computer Science, AISTATS
- 2021
This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve a logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a…
Minimax Regret Bounds for Reinforcement Learning
- Computer Science, ICML
- 2017
We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of…
Learning Near Optimal Policies with Low Inherent Bellman Error
- Computer Science, ICML
- 2020
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to…
Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
- Computer Science, ICLR
- 2020
It is shown that the sample complexity of exploration of the proposed Q-learning algorithm is bounded by $\tilde{O}({\frac{SA}{\epsilon^2(1-\gamma)^7}})$, which improves the previously best known result.
Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition
- Computer Science, NeurIPS
- 2020
A model-free algorithm UCB-Advantage is proposed and it is proved that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play.
Frequentist Regret Bounds for Randomized Least-Squares Value Iteration
- Computer Science, AISTATS
- 2020
It is proved that the frequentist regret of RLSVI is upper-bounded by $\widetilde O(d^2 H^2 \sqrt{T})$, where $d$ is the feature dimension, $H$ is the horizon, and $T$ is the total number of steps.
Model-Based Reinforcement Learning with Value-Targeted Regression
- Computer Science, L4DC
- 2020
This paper proposes a model-based RL algorithm based on the optimism principle, and derives a regret bound that is independent of the total number of states or actions and is close to a lower bound of $\Omega(\sqrt{HdT})$.
Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping
- Computer Science, ICML
- 2021
This paper proposes a novel algorithm that makes use of the feature mapping and obtains the first polynomial regret bound for this setting, and the result suggests that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity
- Computer Science, ArXiv
- 2020
The open problem on agnostic $Q$-learning proposed in [Wen and Van Roy, NIPS 2013] is settled, and the upper bound suggests that the sample complexity of $\widetilde{\Theta}\left(\rho/\sqrt{\mathrm{dim}_E}\right)$ is tight even in the agnostic setting.
Provably Efficient Reinforcement Learning with Linear Function Approximation
- Computer Science, COLT
- 2020
This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{O}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of the feature space, $H$ is the length of each episode, and $T$ is the total number of steps; the bound is independent of the number of states and actions.