Corpus ID: 246441963

Single Time-scale Actor-critic Method to Solve the Linear Quadratic Regulator with Convergence Guarantees

@article{Zhou2022SingleTA,
  title={Single Time-scale Actor-critic Method to Solve the Linear Quadratic Regulator with Convergence Guarantees},
  author={Mo Zhou and Jianfeng Lu},
  journal={ArXiv},
  year={2022},
  volume={abs/2202.00048}
}
We propose a single time-scale actor-critic algorithm to solve the linear quadratic regulator (LQR) problem. A least-squares temporal difference (LSTD) method is applied to the critic, and a natural policy gradient method is used for the actor. We give a proof of convergence with sample complexity O(ε^{-1} log(ε^{-1})^2). The method in the proof is applicable to general single time-scale bilevel optimization problems. We also numerically validate our theoretical convergence results.
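To make the structure of such an algorithm concrete, here is a minimal sketch assuming a discounted, discrete-time LQR: an LSTD-Q critic fits the quadratic Q-function of the current gain K from one rollout, and the actor then takes a natural-policy-gradient step. The system matrices A, B, Q, R and the constants gamma, sigma, eta, T are illustrative choices, and the update schedule is not the authors' exact single time-scale scheme.

```python
# Hedged sketch: LSTD-Q critic + natural policy gradient actor for a small LQR.
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 1                                    # state / input dimensions (illustrative)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(n), 0.1 * np.eye(m)
gamma, sigma, eta, T = 0.95, 0.1, 0.05, 5000   # discount, exploration, step size, rollout length

def feat(x, u):
    """Quadratic features: vectorized outer product of z = [x; u]."""
    z = np.concatenate([x, u])
    return np.outer(z, z).ravel()

K = np.zeros((m, n))                           # initial stabilizing gain
for it in range(30):
    # --- critic: batch LSTD-Q along one rollout under u = -Kx + exploration noise
    x = np.zeros(n)
    M = np.zeros(((n + m) ** 2, (n + m) ** 2))
    b = np.zeros((n + m) ** 2)
    for t in range(T):
        u = -K @ x + sigma * rng.standard_normal(m)
        c = x @ Q @ x + u @ R @ u              # stage cost
        x_next = A @ x + B @ u + 0.01 * rng.standard_normal(n)
        u_next = -K @ x_next                   # on-policy action at next state
        phi = feat(x, u)
        M += np.outer(phi, phi - gamma * feat(x_next, u_next))
        b += phi * c
        x = x_next
    theta = np.linalg.lstsq(M, b, rcond=None)[0]
    Theta = theta.reshape(n + m, n + m)
    Theta = 0.5 * (Theta + Theta.T)            # symmetrize the fitted quadratic form
    Theta_uu, Theta_ux = Theta[n:, n:], Theta[n:, :n]
    # --- actor: natural policy gradient step  K <- K - 2*eta*(Theta_uu K - Theta_ux)
    K = K - 2 * eta * (Theta_uu @ K - Theta_ux)
    print(f"iter {it:2d}  K = {np.round(K, 3).tolist()}")
```

The natural-gradient direction is read off the fitted Q-function blocks (Theta_uu K - Theta_ux), which avoids estimating the state covariance separately; this is one standard way to implement the actor step, not necessarily the paper's exact construction.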


Global Convergence of Two-timescale Actor-Critic for Solving Linear Quadratic Regulator

Actor-critic (AC) reinforcement learning algorithms have been the powerhouse behind many challenging applications. Nevertheless, their convergence is fragile in general.

References

Showing 1-10 of 36 references

On the Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost

TLDR
It is proved that actor-critic finds a globally optimal pair of actor and critic at a linear rate of convergence; this may serve as a preliminary step towards a complete theoretical understanding of bilevel optimization with nonconvex subproblems, which is NP-hard in the worst case and is often solved using heuristics.
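For reference, the ergodic-cost LQR can be written as a bilevel problem in which the actor optimizes the gain K while the inner (critic) problem is policy evaluation for the current K. The notation below is the standard one, not necessarily this reference's exact formulation.

```latex
% Ergodic-cost LQR as a bilevel problem (illustrative notation).
\begin{aligned}
\min_{K}\; J(K) &= \lim_{T\to\infty}\frac{1}{T}\,
  \mathbb{E}\Big[\sum_{t=0}^{T-1} x_t^\top Q x_t + u_t^\top R u_t\Big],
  \qquad x_{t+1} = A x_t + B u_t + w_t,\quad u_t = -K x_t,\\
\text{s.t.}\;\; P_K &\ \text{solves}\ \
  P_K = Q + K^\top R K + (A - BK)^\top P_K (A - BK)
  \quad\text{(policy evaluation / Lyapunov equation).}
\end{aligned}
```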

Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy

TLDR
Under the broader scope of policy optimization with nonlinear function approximation, it is proved for the first time that actor-critic with a deep neural network finds the globally optimal policy at a sublinear rate.

A Finite Time Analysis of Two Time-Scale Actor Critic Methods

TLDR
This work provides a non-asymptotic analysis of two-time-scale actor-critic methods in a non-i.i.d. setting and proves that the actor-critic method is guaranteed to find a first-order stationary point.
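A generic two-time-scale TD actor-critic update has the form below (standard notation, not this reference's exact algorithm): the critic moves on the fast time scale with step size beta_t, the actor on the slow one with step size alpha_t, and alpha_t/beta_t -> 0.

```latex
% Generic two-time-scale TD actor-critic updates (illustrative).
\begin{aligned}
\delta_t &= r_t + \gamma\, V_{w_t}(s_{t+1}) - V_{w_t}(s_t) && \text{(TD error)}\\
w_{t+1} &= w_t + \beta_t\, \delta_t\, \nabla_w V_{w_t}(s_t) && \text{(critic, fast)}\\
\theta_{t+1} &= \theta_t + \alpha_t\, \delta_t\, \nabla_\theta \log \pi_{\theta_t}(a_t \mid s_t) && \text{(actor, slow)}
\end{aligned}
```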

Convergence and Sample Complexity of Gradient Methods for the Model-Free Linear–Quadratic Regulator Problem

TLDR
This article establishes exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and shows that a similar result holds for the gradient descent method that arises from the forward Euler discretization of the corresponding ODE.
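In standard continuous-time LQR notation, the gradient flow over stabilizing gains and its forward-Euler discretization take the form below; this is a sketch using the usual Lyapunov solutions P_K, X_K and noise covariance Omega, not necessarily the article's exact notation.

```latex
% Gradient flow over stabilizing gains and its forward-Euler discretization (sketch).
\dot{K}(s) = -\nabla J\big(K(s)\big),
\qquad
K_{k+1} = K_k - \alpha\, \nabla J(K_k),
\qquad
\nabla J(K) = 2\,\big(R K - B^\top P_K\big)\, X_K, \\
(A - BK)^\top P_K + P_K (A - BK) + Q + K^\top R K = 0,
\qquad
(A - BK)\, X_K + X_K (A - BK)^\top + \Omega = 0 .
```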

A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning

TLDR
The main results recover the best-known convergence rates for the general policy optimization problem and show how they can be used to derive a state-of-the-art rate for online linear-quadratic regulator (LQR) controllers.

Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator

TLDR
This work bridges the gap by showing that (model-free) policy gradient methods converge globally to the optimal solution and are efficient (polynomially so in the relevant problem-dependent quantities) with regard to their sample and computational complexities.
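A common ingredient of such model-free analyses is a zeroth-order gradient estimate: perturb the gain K on a sphere of radius r, estimate the cost of each perturbed gain from a rollout, and average. The sketch below illustrates this idea; the rollout model, the helper names simulate_cost and zeroth_order_grad, and all constants are illustrative assumptions rather than the cited algorithm verbatim.

```python
# Hedged sketch of a zeroth-order (model-free) policy-gradient estimator for LQR.
import numpy as np

rng = np.random.default_rng(1)
n, m = 2, 1
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(n), 0.1 * np.eye(m)

def simulate_cost(K, T=200, sigma_x0=1.0):
    """Finite-horizon Monte-Carlo estimate of the LQR cost under u = -Kx from a random start."""
    x = sigma_x0 * rng.standard_normal(n)
    total = 0.0
    for _ in range(T):
        u = -K @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return total

def zeroth_order_grad(K, r=0.05, num_samples=100):
    """grad J(K) ~ (d / (N r^2)) * sum_i J(K + U_i) U_i with U_i uniform on the radius-r sphere."""
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(num_samples):
        U = rng.standard_normal(K.shape)
        U *= r / np.linalg.norm(U)             # project onto the Frobenius sphere of radius r
        grad += simulate_cost(K + U) * U
    return (d / (num_samples * r ** 2)) * grad

K = np.zeros((m, n))
for it in range(20):
    K = K - 1e-3 * zeroth_order_grad(K)        # plain gradient descent on the estimated J(K)
    print(f"iter {it:2d}  cost ~ {simulate_cost(K):.2f}")
```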

Natural Actor-Critic

A Single-Timescale Method for Stochastic Bilevel Optimization

TLDR
This paper develops a new optimization method, named STABLE, for a class of stochastic bilevel problems; it is the first to achieve the same order of sample complexity as SGD for single-level stochastic optimization.
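The generic stochastic bilevel problem targeted by single-time-scale methods of this type can be written as follows (illustrative notation; f is the outer objective, g the inner one).

```latex
% Generic stochastic bilevel optimization problem (illustrative notation).
\min_{x}\; F(x) := f\big(x,\, y^{*}(x)\big)
\qquad \text{s.t.} \qquad
y^{*}(x) \in \arg\min_{y}\; g(x, y), \\
f(x,y) = \mathbb{E}_{\xi}\big[f(x,y;\xi)\big],
\qquad
g(x,y) = \mathbb{E}_{\zeta}\big[g(x,y;\zeta)\big].
```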

An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods

TLDR
This paper revisits and improves the convergence of policy gradient (PG), natural PG (NPG) methods, and their variance-reduced variants under general smooth policy parametrizations, and proposes SRVR-NPG, which incorporates variance reduction into the NPG update.
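For context, the vanilla policy gradient and natural policy gradient updates under a smooth parametrization pi_theta take the standard form below (the step size eta is illustrative).

```latex
% Policy gradient vs. natural policy gradient (standard form).
\theta_{k+1} = \theta_k + \eta\, \nabla_\theta J(\theta_k)
\qquad \text{vs.} \qquad
\theta_{k+1} = \theta_k + \eta\, F(\theta_k)^{-1}\, \nabla_\theta J(\theta_k), \\
F(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\!\Big[
  \nabla_\theta \log \pi_\theta(a \mid s)\,
  \nabla_\theta \log \pi_\theta(a \mid s)^{\top}\Big].
```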

Linear Least-Squares algorithms for temporal difference learning

TLDR
Two new temporal difference algorithms based on the theory of linear least-squares function approximation, LS TD and RLS TD, are introduced, and probability-one convergence is proved when RLS TD is used with a function approximator that is linear in the adjustable parameters.
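A minimal sketch of batch LSTD for linear value-function approximation: solve A w = b with A = sum_t phi(s_t)(phi(s_t) - gamma*phi(s_{t+1}))^T and b = sum_t phi(s_t) r_t. The 5-state random-walk chain and the one-hot features are illustrative choices, not this reference's experiments.

```python
# Hedged sketch of batch LSTD on a small random-walk chain.
import numpy as np

rng = np.random.default_rng(2)
n_states, gamma = 5, 0.9
features = np.eye(n_states)                    # tabular (one-hot) features

def step(s):
    """Random walk: move left/right with equal probability, reward 1 on reaching the right end."""
    s_next = min(max(s + rng.choice([-1, 1]), 0), n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

A = np.zeros((n_states, n_states))
b = np.zeros(n_states)
s = 2
for _ in range(20000):
    s_next, r = step(s)
    phi, phi_next = features[s], features[s_next]
    A += np.outer(phi, phi - gamma * phi_next)
    b += phi * r
    s = s_next

w = np.linalg.solve(A, b)                      # LSTD estimate of the value of each state
print(np.round(w, 3))
```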