# Model-Free Linear Quadratic Control via Reduction to Expert Prediction

@inproceedings{AbbasiYadkori2019ModelFreeLQ, title={Model-Free Linear Quadratic Control via Reduction to Expert Prediction}, author={Yasin Abbasi-Yadkori and Nevena Lazic and Csaba Szepesvari}, booktitle={AISTATS}, year={2019} }

Model-free approaches for reinforcement learning (RL) and continuous control find policies based only on past states and rewards, without fitting a model of the system dynamics. They are appealing as they are general purpose and easy to implement; however, they also come with fewer theoretical guarantees than model-based RL. In this work, we present a new model-free algorithm for controlling linear quadratic (LQ) systems, and show that its regret scales as $O(T^{\xi+2/3})$ for any small $\xi>0$…

#### 62 Citations

Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator

- Computer Science, Mathematics
- NeurIPS
- 2019

A simple adaptive procedure based on $\varepsilon$-greedy exploration is constructed; it relies on approximate policy iteration (PI) as a subroutine and obtains a regret bound that improves upon a recent result of Abbasi-Yadkori et al.

Learning the model-free linear quadratic regulator via random search

- Mathematics, Computer Science
- L4DC
- 2020

This paper examines the standard infinite-horizon linear quadratic regulator problem for continuous-time systems with unknown state-space parameters and provides theoretical bounds on the convergence rate and sample complexity of a random search method.

The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint

- Mathematics, Computer Science
- COLT
- 2019

This work shows that for policy evaluation, a simple model-based plugin method requires asymptotically fewer samples than the classical least-squares temporal difference (LSTD) estimator to reach the same quality of solution; the sample complexity gap between the two methods can be at least a factor of the state dimension.

Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with √T Regret

- Mathematics, Computer Science
- ICML
- 2021

This work presents the first model-free algorithm that achieves regret guarantees similar to the $\sqrt{T}$ regret of model-based methods; it relies on an efficient policy gradient scheme and a novel, tighter analysis of the cost of exploration in policy space in this setting.

Regret Bound of Adaptive Control in Linear Quadratic Gaussian (LQG) Systems

- Computer Science, Mathematics
- ArXiv
- 2020

A regret upper bound of $O(\sqrt{T})$ is proved for adaptive control of linear quadratic Gaussian (LQG) systems, where $T$ is the time horizon of the problem.

Using Reinforcement Learning for Model-free Linear Quadratic Control with Process and Measurement Noises

- Computer Science, Mathematics
- 2019 IEEE 58th Conference on Decision and Control (CDC)
- 2019

A completely model-free reinforcement learning algorithm is proposed to solve the LQ problem, in which each policy is greedy with respect to all previous value functions; the algorithm is proved to produce stable policies provided that the estimation errors remain small.

Average-reward model-free reinforcement learning: a systematic review and literature mapping

- Computer Science
- ArXiv
- 2020

An updated review of work in model-free reinforcement learning is provided and it is extended to cover policy-iteration and function approximation methods (in addition to the value-iterated and tabular counterparts) to identify and discuss opportunities for future work.

Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems

- Computer Science, Mathematics
- AISTATS
- 2019

This work characterizes the convergence rate of a canonical stochastic, two-point, derivative-free method for linear-quadratic systems in which the initial state of the system is drawn at random, and shows that for problems with effective dimension $D$, such a method converges to an $\epsilon$-approximate solution within $\widetilde{\mathcal{O}}(D/\epsilon)$ steps.

Convergence Guarantees of Policy Optimization Methods for Markovian Jump Linear Systems

- Computer Science, Mathematics
- 2020 American Control Conference (ACC)
- 2020

This work proves that the Gauss-Newton method and the natural policy gradient method converge to the optimal state feedback controller for MJLS at a linear rate if initialized at a controller which stabilizes the closed-loop dynamics in the mean square sense.

Continuous Control with Contexts, Provably

- Computer Science, Mathematics
- ArXiv
- 2019

This paper studies how to build a decoder for the fundamental continuous control task, the linear quadratic regulator (LQR), which can model a wide range of real-world physical environments. It presents a simple algorithm that uses an upper confidence bound (UCB) to refine the estimate of the decoder and balance the exploration-exploitation trade-off.

#### References

Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

- Computer Science, Mathematics
- NIPS
- 2012

This work presents an adaptive control scheme that achieves a regret bound of ${O}(p \sqrt{T})$, apart from logarithmic factors, and has prominent applications in the emerging area of computational advertising.

Thompson Sampling for Linear-Quadratic Control Problems

- Mathematics, Computer Science
- AISTATS
- 2017

The regret of Thompson sampling in the frequentist setting is analyzed, resulting in an overall regret of $O(T^{2/3})$, which is significantly worse than the regret achieved by the optimism-in-face-of-uncertainty algorithm in LQ control problems.

Global Convergence of Policy Gradient Methods for Linearized Control Problems

- Mathematics, Computer Science
- ICML 2018
- 2018

This work bridges the gap, showing that (model-free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem-dependent quantities) with regard to their sample and computational complexities.

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

- Mathematics, Computer Science
- Machine Learning
- 2007

A finite-sample, high-probability bound is found on the performance of the computed policy; it depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept, the approximation power of the function set, and the controllability properties of the MDP.

Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path

- Computer Science
- COLT
- 2006

Batch reinforcement learning problems in continuous-space, expected total discounted-reward Markovian Decision Problems are considered, where the training data consists of a single sample path (trajectory) of some behaviour policy.

Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator

- Computer Science, Mathematics
- ICML
- 2018

This work gives the first finite-time analysis of the number of samples needed to estimate the value function for a fixed static state-feedback policy to within $\varepsilon$-relative error.

Regret Bounds for the Adaptive Control of Linear Quadratic Systems

- Mathematics, Computer Science
- COLT
- 2011

The construction of the confidence set is based on recent results from online least-squares estimation and leads to an improved worst-case regret bound for the proposed algorithm; this is the first time that a regret bound is derived for the LQ control problem.

PAC adaptive control of linear systems

- Computer Science
- COLT '97
- 1997

This work proposes a learning algorithm for a special case of reinforcement learning where the environment can be described by a linear system and shows that the control law produced by the algorithm has a value close to that of an optimal policy relative to the magnitude of the initial state of the system.

Fast rates for online learning in Linearly Solvable Markov Decision Processes

- Computer Science, Mathematics
- COLT
- 2017

The smoothness of the control cost enables the simple algorithm of following the leader to achieve a regret of order $\log^2 T$ after $T$ rounds, vastly improving on the best known regret bound of order $T^{3/4}$ for this setting.

Bayesian Optimal Control of Smoothly Parameterized Systems

- Computer Science, Mathematics
- UAI
- 2015

A lazy version of the so-called posterior sampling method, which goes back to Thompson and Strens, is proposed; it allows for a single algorithm and a single analysis for a wide range of problems, such as finite MDPs or linear quadratic regulation.