# Analyzing the Variance of Policy Gradient Estimators for the Linear-Quadratic Regulator

@article{Preiss2019AnalyzingTV, title={Analyzing the Variance of Policy Gradient Estimators for the Linear-Quadratic Regulator}, author={James A. Preiss and S{\'e}bastien M. R. Arnold and Chengdong Wei and M. Kloft}, journal={ArXiv}, year={2019}, volume={abs/1910.01249} }

We study the variance of the REINFORCE policy gradient estimator in environments with continuous state and action spaces, linear dynamics, quadratic cost, and Gaussian noise. These simple environments allow us to derive bounds on the estimator variance in terms of the environment and noise parameters. We compare the predictions of our bounds to the empirical variance in simulation experiments.
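The kind of experiment the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' code: the scalar system, the parameter values, and the function name `reinforce_gradients` are all assumptions chosen for brevity. It draws many independent rollouts of a Gaussian linear policy on a scalar linear-quadratic problem and reports the sample mean and variance of the single-rollout REINFORCE estimate of the gradient with respect to the feedback gain `k`.

```python
import numpy as np

def reinforce_gradients(k, n_rollouts, horizon=10, a=0.9, b=1.0, q=1.0, r=0.1,
                        policy_std=0.5, noise_std=0.1, x0_std=1.0, seed=0):
    """Return one REINFORCE gradient estimate (w.r.t. gain k) per rollout.

    Hypothetical scalar setup: x' = a*x + b*u + w, cost = q*x^2 + r*u^2,
    policy u ~ N(k*x, policy_std^2). All defaults are illustrative.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, x0_std, size=n_rollouts)   # random initial states
    total_cost = np.zeros(n_rollouts)
    score = np.zeros(n_rollouts)                   # sum_t d/dk log pi(u_t | x_t)
    for _ in range(horizon):
        eps = rng.normal(0.0, policy_std, size=n_rollouts)
        u = k * x + eps                            # sample from the Gaussian policy
        total_cost += q * x**2 + r * u**2
        score += eps * x / policy_std**2           # (u - k*x) * x / sigma^2
        w = rng.normal(0.0, noise_std, size=n_rollouts)
        x = a * x + b * u + w                      # linear dynamics, Gaussian noise
    return total_cost * score                      # estimator without a baseline

grads = reinforce_gradients(k=-0.5, n_rollouts=20000)
print("mean gradient estimate:", grads.mean())
print("empirical variance:", grads.var(ddof=1))
```

Sweeping `policy_std`, `noise_std`, or `horizon` and plotting the resulting variance is one way to compare an analytical bound against simulation.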

## 5 Citations

### How are policy gradient methods affected by the limits of control?

- Mathematics
- ArXiv
- 2022

We study stochastic policy gradient methods from the perspective of control-theoretic limitations. Our main result is that ill-conditioned linear systems in the sense of Doyle inevitably lead to…

### Robust Reinforcement Learning: A Case Study in Linear Quadratic Regulation

- Mathematics
- AAAI
- 2021

The benchmark problem of discrete-time linear quadratic regulation (LQR) is revisited and it is shown that policy iteration for LQR is inherently robust to small errors and enjoys local input-to-state stability.
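For readers unfamiliar with the algorithm this entry refers to, exact policy iteration for discrete-time LQR can be sketched as below. This is a generic textbook-style illustration rather than code from the cited paper; the system matrices, the initial gain, and the function names are assumptions. Each iteration evaluates the current gain by solving a discrete Lyapunov equation and then improves the gain greedily. The starting gain must be stabilizing.

```python
import numpy as np

def solve_discrete_lyapunov(a_cl, q_k):
    """Solve P = a_cl^T P a_cl + q_k by vectorization (fine for small systems)."""
    n = a_cl.shape[0]
    lhs = np.eye(n * n) - np.kron(a_cl.T, a_cl.T)
    return np.linalg.solve(lhs, q_k.reshape(-1)).reshape(n, n)

def policy_iteration(a, b, q, r, k0, n_iters=20):
    """Iterate evaluation and improvement for the control law u = -K x."""
    k = k0
    for _ in range(n_iters):
        a_cl = a - b @ k                                      # closed-loop dynamics
        p = solve_discrete_lyapunov(a_cl, q + k.T @ r @ k)    # policy evaluation
        k = np.linalg.solve(r + b.T @ p @ b, b.T @ p @ a)     # policy improvement
    return k, p

# Illustrative double-integrator example (values are assumptions).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
K0 = np.array([[1.0, 2.0]])   # stabilizing: closed-loop eigenvalues are 0.9, 0.9

K, P = policy_iteration(A, B, Q, R, K0)
print("gain K:", K)
```

At convergence, `P` satisfies the discrete algebraic Riccati equation, which gives a simple numerical check on the result.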

### Decentralized Policy Gradient Method for Mean-Field Linear Quadratic Regulator with Global Convergence

- Computer Science
- 2020

This paper presents the first decentralized policy gradient method (MF-DPGM) for mean-field multi-agent reinforcement learning, where a large team of exchangeable agents communicates via a connected network, and gives a rigorous proof of the global convergence rate of MF-DPGM.

### Combining Model-Based and Model-Free Methods for Nonlinear Control: A Provably Convergent Policy Gradient Approach

- Computer Science
- ArXiv
- 2020

This paper considers a dynamical system with both linear and non-linear components and develops a novel approach that uses the linear model to define a warm start for a model-free policy gradient method. It shows that this hybrid approach outperforms the model-based controller while avoiding the convergence issues associated with model-free approaches.

### Exploiting Linear Models for Model-Free Nonlinear Control: A Provably Convergent Policy Gradient Approach

- Computer Science, Mathematics
- 2021 60th IEEE Conference on Decision and Control (CDC)
- 2021

This paper considers a dynamical system with both linear and non-linear components and uses the linear model to define a warm start for a model-free policy gradient method. It derives sufficient conditions on the non-linear component under which this approach is guaranteed to converge to the (nearly) globally optimal controller.

## References

### Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

- Computer Science
- J. Mach. Learn. Res.
- 2004

This paper considers variance reduction methods that were developed for Monte Carlo estimates of integrals, and gives bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system.

### Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems

- Mathematics, Computer Science
- AISTATS
- 2019

This work characterizes the convergence rate of a canonical stochastic, two-point, derivative-free method for linear-quadratic systems in which the initial state of the system is drawn at random, and shows that for problems with effective dimension $D$, such a method converges to an $\epsilon$-approximate solution within $\widetilde{\mathcal{O}}(D/\epsilon)$ steps.

### The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint

- Mathematics
- COLT
- 2019

This work shows that, for policy evaluation, a simple model-based plugin method requires asymptotically fewer samples than the classical least-squares temporal difference (LSTD) estimator to reach the same quality of solution; the sample complexity gap between the two methods can be at least a factor of the state dimension.

### On the Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost

- Computer Science
- ArXiv
- 2019

It is proved that actor-critic finds a globally optimal pair of actor and critic at a linear rate of convergence, which may serve as a preliminary step towards a complete theoretical understanding of bilevel optimization with nonconvex subproblems, which is NP-hard in the worst case and is often solved using heuristics.

### Trust Region Policy Optimization

- Computer Science
- ICML
- 2015

This paper describes a method for optimizing control policies with guaranteed monotonic improvement. Making several approximations to the theoretically justified scheme yields a practical algorithm called Trust Region Policy Optimization (TRPO).

### Global Convergence of Policy Gradient Methods for Linearized Control Problems

- Computer Science
- ICML 2018
- 2018

This work bridges the gap, showing that (model-free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem-dependent quantities) with regard to their sample and computational complexities.

### Proximal Policy Optimization Algorithms

- Computer Science
- ArXiv
- 2017

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective…

### A Tour of Reinforcement Learning: The View from Continuous Control

- Computer Science
- Annual Review of Control, Robotics, and Autonomous Systems
- 2019

This article surveys reinforcement learning from the perspective of optimization and control, with a focus on continuous control applications. It reviews the general formulation, terminology, and…

### Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

- Computer Science
- ArXiv
- 2018

This article will discuss how a generalization of the reinforcement learning or optimal control problem, which is sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and variational inference in the case of stochastic dynamics.

### Linear Optimal Control Systems

- Mathematics
- 1972

An excellent introduction to feedback control system design, this book offers a theoretical approach that captures the essential issues and can be applied to a wide range of practical problems.