Corpus ID: 231749632

A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

@article{Chen2021ALT,
  title={A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants},
  author={Zaiwei Chen and Siva Theja Maguluri and S. Shakkottai and Karthikeyan Shanmugam},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.01567}
}
This paper develops a unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous reinforcement learning (RL) algorithms. We do this by first reformulating the RL algorithms as Markovian Stochastic Approximation (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this result, we establish finite-sample mean-square convergence…
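As a rough illustration of the fixed-point view described in the abstract (not an implementation from the paper), tabular asynchronous Q-learning updates only the (state, action) pair visited by a single Markovian trajectory, nudging the estimate toward the fixed point of the Bellman optimality operator. The environment interface env.reset()/env.step(), the ε-greedy behavior policy, and the constant stepsize are assumptions made for this sketch.

```python
import numpy as np

def async_q_learning(env, num_states, num_actions, gamma=0.99,
                     alpha=0.1, num_steps=10_000, seed=0):
    """Tabular asynchronous Q-learning driven by a single Markovian trajectory.

    Each iteration updates only the visited (state, action) pair, i.e., a
    Markovian stochastic-approximation step toward the fixed point Q* of the
    Bellman optimality operator. Environment interface is assumed:
    env.reset() -> state, env.step(action) -> (next_state, reward, done).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        # Epsilon-greedy behavior policy (assumed for illustration).
        if rng.random() < 0.1:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # Noisy fixed-point update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```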

Citations

A Discrete-Time Switching System Analysis of Q-learning
A novel control-theoretic framework is developed to analyze the non-asymptotic convergence of Q-learning, and a new finite-time error bound is derived for asynchronous Q-learning with a constant stepsize.
Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm
This paper shows that an off-policy variant of the natural actor-critic (NAC) algorithm based on importance sampling converges to a globally optimal policy with a sample complexity of O(ε⁻³ log(1/ε)) under an appropriate choice of stepsizes.
Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators
Finite-sample bounds are derived for general off-policy TD-like stochastic approximation algorithms that solve for the fixed point of a generalized Bellman operator.
Finite-Sample Analysis of Off-Policy Natural Actor-Critic with Linear Function Approximation
A novel variant of the off-policy natural actor-critic algorithm with linear function approximation is developed, and a sample complexity of O(ε⁻³) is established, improving on all previously known convergence bounds for such algorithms.
Finite-Time Error Analysis of Asynchronous Q-Learning with Discrete-Time Switching System Models
It is proved that asynchronous Q-learning with a constant step-size can be naturally formulated as a discrete-time stochastic switched linear system, whose iterates are over- and under-estimated by the trajectories of two dynamical systems.
Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning
This work sharpens the sample complexity of synchronous Q-learning to the order of |S||A| / ((1−γ)⁴ε²) (up to a logarithmic factor) for any 0 < ε < 1, yielding an order-wise improvement in 1/(1−γ).
Concentration of Contractive Stochastic Approximation and Reinforcement Learning
Using a martingale concentration inequality, concentration bounds ‘from time n₀ on’ are derived for stochastic approximation algorithms with contractive maps and both martingale difference and Markov…
Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis
Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous…

References

Showing 1-10 of 66 references
Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning
We consider the dynamics of a linear stochastic approximation algorithm driven by Markovian noise, and derive finite-time bounds on the moments of the error, i.e., the deviation of the output of the…
Finite-Time Analysis for Double Q-learning
This paper provides the first non-asymptotic (i.e., finite-time) analysis for double Q-learning and develops novel techniques to derive finite-time bounds on the difference between two inter-connected stochastic processes, which is new to the literature on stochastic approximation.
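For context, the "two inter-connected stochastic processes" in double Q-learning are the coupled estimators Q^A and Q^B, each updated with a target evaluated by the other. Below is a minimal tabular sketch of the algorithm; the environment interface and the ε-greedy behavior policy are illustrative assumptions, not part of the cited analysis.

```python
import numpy as np

def double_q_learning(env, num_states, num_actions, gamma=0.99,
                      alpha=0.1, num_steps=10_000, seed=0):
    """Tabular double Q-learning with two coupled estimators QA and QB.

    At each step one estimator is chosen at random; it selects the greedy
    next action while the *other* estimator evaluates it, which reduces the
    overestimation bias of standard Q-learning. Environment interface is
    assumed: env.reset() -> state, env.step(action) -> (next_state, reward, done).
    """
    rng = np.random.default_rng(seed)
    QA = np.zeros((num_states, num_actions))
    QB = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        # Epsilon-greedy on the sum of the two estimates (assumed behavior policy).
        if rng.random() < 0.1:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(QA[s] + QB[s]))
        s_next, r, done = env.step(a)
        if rng.random() < 0.5:
            # Update QA using QB to evaluate QA's greedy action.
            a_star = int(np.argmax(QA[s_next]))
            target = r + (0.0 if done else gamma * QB[s_next, a_star])
            QA[s, a] += alpha * (target - QA[s, a])
        else:
            # Symmetric update of QB using QA as the evaluator.
            b_star = int(np.argmax(QB[s_next]))
            target = r + (0.0 if done else gamma * QA[s_next, b_star])
            QB[s, a] += alpha * (target - QB[s, a])
        s = env.reset() if done else s_next
    return QA, QB
```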
Finite-Sample Analysis of Contractive Stochastic Approximation Using Smooth Convex Envelopes
This paper considers a stochastic approximation (SA) algorithm involving a contraction mapping with respect to an arbitrary norm, establishes finite-sample error bounds under different stepsizes, and uses the result to obtain the first known convergence rate of the V-trace algorithm for off-policy TD-learning.
Stochastic approximation with cone-contractive operators: Sharp 𝓁∞-bounds for Q-learning
These results show that, relative to model-based Q-iteration, the ℓ∞-based sample complexity of Q-learning is suboptimal in terms of the discount factor γ, and it is shown via simulation that the dependence of the bounds cannot be improved in a worst-case sense.
Error bounds for constant step-size Q-learning
We provide a bound on the first moment of the error in the Q-function estimate resulting from fixed step-size algorithms applied to finite state-space, discounted-reward Markov decision problems.
Finite-Time Analysis of Asynchronous Stochastic Approximation and Q-Learning
A general asynchronous stochastic approximation scheme featuring a weighted infinity-norm contractive operator is considered, and a bound on its finite-time convergence rate on a single trajectory is proved.
On the Convergence of Stochastic Iterative Dynamic Programming Algorithms
A rigorous proof of convergence of DP-based learning algorithms is provided by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem, which establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.
A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation
This work provides a simple and explicit finite-time analysis of temporal difference learning with linear function approximation. The analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature.
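The iteration analyzed there is TD(0) with a linear value estimate V_θ(s) = φ(s)ᵀθ, updated along the semi-gradient of the TD error. A minimal sketch follows, assuming a feature map `features(s)` and an environment that steps under the policy being evaluated; both are illustrative assumptions, not the cited paper's code.

```python
import numpy as np

def td0_linear(env, features, dim, gamma=0.99, alpha=0.05, num_steps=10_000):
    """TD(0) with linear function approximation V(s) ~= features(s) @ theta.

    theta moves along the semi-gradient direction
    (r + gamma * V(s') - V(s)) * features(s), following a single trajectory.
    Assumed interface: env.reset() -> state,
    env.step() -> (next_state, reward, done), where the environment samples
    actions from the policy being evaluated.
    """
    theta = np.zeros(dim)
    s = env.reset()
    for _ in range(num_steps):
        s_next, r, done = env.step()
        phi, phi_next = features(s), features(s_next)
        # TD error: bootstrapped target minus current linear value estimate.
        td_error = r + (0.0 if done else gamma * (phi_next @ theta)) - phi @ theta
        theta += alpha * td_error * phi
        s = env.reset() if done else s_next
    return theta
```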
Least-Squares Policy Iteration: Bias-Variance Trade-off in Control Problems
This work introduces a new approximate version of λ-Policy Iteration, a method that generalizes Value Iteration and Policy Iteration with a parameter λ ∈ (0,1), and shows empirically, on a simple chain problem and on the game of Tetris, that the λ parameter acts as a bias-variance trade-off that may improve the convergence and the performance of the resulting policy.
A Theoretical Analysis of Deep Q-Learning
This work makes the first attempt to theoretically understand the deep Q-network (DQN) algorithm from both algorithmic and statistical perspectives, and proposes the Minimax-DQN algorithm for two-player zero-sum Markov games.