# A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

@article{Chen2021ALT, title={A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants}, author={Zaiwei Chen and Siva Theja Maguluri and S. Shakkottai and Karthikeyan Shanmugam}, journal={ArXiv}, year={2021}, volume={abs/2102.01567} }

This paper develops a unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous reinforcement learning (RL) algorithms. We do this by first reformulating the RL algorithms as Markovian stochastic approximation (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this result, we establish finite-sample mean-square convergence bounds for the asynchronous RL algorithms.
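The Markovian SA reformulation described above can be sketched as follows (generic notation, assumed for illustration; the paper's own symbols may differ):

```latex
% Goal: solve the fixed-point equation \bar{F}(x) = x, where
% \bar{F}(x) = \mathbb{E}_{Y \sim \mu}[F(x, Y)] and \mu is the stationary
% distribution of an underlying Markov chain \{Y_k\}.
% The Markovian SA iteration with stepsize \alpha_k is
x_{k+1} = x_k + \alpha_k \bigl( F(x_k, Y_k) - x_k \bigr).
```

Asynchronous Q-learning fits this template with $F$ the empirical Bellman optimality operator applied at the currently visited state-action pair; a Lyapunov analysis of this recursion then bounds $\mathbb{E}\bigl[\|x_k - x^*\|^2\bigr]$, where $x^*$ is the fixed point.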

#### 8 Citations

A Discrete-Time Switching System Analysis of Q-learning

- Mathematics, Computer Science
- 2021

A novel control-theoretic framework is developed to analyze the non-asymptotic convergence of Q-learning, yielding a new finite-time error bound for asynchronous Q-learning when a constant stepsize is used.

Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm

- Computer Science, Mathematics
- ICML
- 2021

This paper shows that an off-policy variant of the natural actor-critic (NAC) algorithm based on importance sampling converges to a globally optimal policy with a sample complexity of O(ε⁻³ log(1/ε)) under an appropriate choice of stepsizes.

Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators

- Computer Science, Mathematics
- ArXiv
- 2021

Finite-sample bounds are derived for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator.

Finite-Sample Analysis of Off-Policy Natural Actor-Critic with Linear Function Approximation

- Computer Science, Mathematics
- ArXiv
- 2021

A novel variant of the off-policy natural actor-critic algorithm with linear function approximation is developed, and a sample complexity of O(ε⁻³) is established, outperforming all previously known convergence bounds for such algorithms.

Finite-Time Error Analysis of Asynchronous Q-Learning with Discrete-Time Switching System Models

- Computer Science
- ArXiv
- 2021

It is proved that asynchronous Q-learning with a constant step-size can be naturally formulated as a discrete-time stochastic switched linear system, whose iterates are over- and under-estimated by trajectories of two dynamical systems.
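For concreteness, the asynchronous Q-learning update analyzed in works like the one above can be sketched as follows (a minimal illustration with assumed constant stepsize `alpha` and discount `gamma`, not any paper's exact implementation):

```python
import numpy as np

def async_q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One asynchronous Q-learning update: only the visited (s, a) entry changes."""
    target = r + gamma * np.max(Q[s_next])   # empirical Bellman optimality target
    Q[s, a] += alpha * (target - Q[s, a])    # constant-stepsize SA update
    return Q

# Toy 2-state, 2-action example: a single transition (s=0, a=1, r=1.0, s'=1)
Q = np.zeros((2, 2))
Q = async_q_learning_step(Q, s=0, a=1, r=1.0, s_next=1)
# Only Q[0, 1] is updated: 0 + 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```

Because only the currently visited entry is updated, the iteration is asynchronous; with a constant stepsize the Q-table evolves as a stochastic switched linear system, which is the viewpoint this line of work exploits.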

Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning

- Mathematics, Computer Science
- ICML
- 2021

This work sharpens the sample complexity of synchronous Q-learning to the order of |S||A| / ((1−γ)⁴ε²) (up to a logarithmic factor) for any 0 < ε < 1, an order-wise improvement in 1/(1−γ).

Concentration of Contractive Stochastic Approximation and Reinforcement Learning

- Computer Science, Engineering
- ArXiv
- 2021

Using a martingale concentration inequality, concentration bounds "from time n0 on" are derived for stochastic approximation algorithms with contractive maps under both martingale-difference and Markov noise.

Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

- 2021

Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous…

#### References

Showing 1–10 of 66 references

Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning

- Computer Science, Mathematics
- COLT
- 2019

We consider the dynamics of a linear stochastic approximation algorithm driven by Markovian noise, and derive finite-time bounds on the moments of the error, i.e., deviation of the output of the…

Finite-Time Analysis for Double Q-learning

- Computer Science, Mathematics
- NeurIPS
- 2020

This paper provides the first non-asymptotic (i.e., finite-time) analysis of double Q-learning and develops novel techniques to derive finite-time bounds on the difference between two interconnected stochastic processes, which is new to the stochastic approximation literature.

Finite-Sample Analysis of Contractive Stochastic Approximation Using Smooth Convex Envelopes

- Computer Science, Mathematics
- NeurIPS
- 2020

This paper considers an SA involving a contraction mapping with respect to an arbitrary norm, establishes finite-sample error bounds under different stepsizes, and uses them to derive the first known convergence rate of the V-trace algorithm for off-policy TD-learning.

Stochastic approximation with cone-contractive operators: Sharp 𝓁∞-bounds for Q-learning

- Mathematics, Computer Science
- ArXiv
- 2019

These results show that, relative to model-based Q-iteration, the ℓ∞-based sample complexity of Q-learning is suboptimal in terms of the discount factor γ, and simulations show that the dependence of the bounds cannot be improved in a worst-case sense.

Error bounds for constant step-size Q-learning

- Mathematics, Computer Science
- Syst. Control. Lett.
- 2012

We provide a bound on the first moment of the error in the Q-function estimate resulting from fixed step-size algorithms applied to finite state-space, discounted reward Markov decision problems…

Finite-Time Analysis of Asynchronous Stochastic Approximation and Q-Learning

- Mathematics, Computer Science
- COLT
- 2020

A general asynchronous stochastic approximation scheme featuring a weighted infinity-norm contractive operator is considered, and a bound on its finite-time convergence rate on a single trajectory is proved.

On the Convergence of Stochastic Iterative Dynamic Programming Algorithms

- Mathematics, Computer Science
- Neural Computation
- 1994

A rigorous proof of convergence of DP-based learning algorithms is provided by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem, which establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.

A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation

- Computer Science, Mathematics
- COLT
- 2018

This work provides a simple and explicit finite-time analysis of temporal difference learning with linear function approximation; the analysis mirrors standard techniques for analyzing stochastic gradient descent and therefore inherits the simplicity and elegance of that literature.

Least-Squares Policy Iteration: Bias-Variance Trade-off in Control Problems

- Mathematics, Computer Science
- ICML
- 2010

This work introduces a new approximate version of λ-Policy Iteration, a method that generalizes Value Iteration and Policy Iteration with a parameter λ ∈ (0,1), and shows empirically on a simple chain problem and on the Tetris game that the λ parameter acts as a bias-variance trade-off that may improve the convergence and performance of the obtained policy.

A Theoretical Analysis of Deep Q-Learning

- Computer Science, Mathematics
- L4DC
- 2020

This work makes the first attempt to theoretically understand the deep Q-network (DQN) algorithm from both algorithmic and statistical perspectives, and proposes the Minimax-DQN algorithm for two-player zero-sum Markov games.