# Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation

@article{Abeille2020EfficientOE,
  title   = {Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation},
  author  = {Marc Abeille and Alessandro Lazaric},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2007.06482}
}

We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting. Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ and cast it into a constrained *extended* LQR problem, where an additional control variable implicitly selects the system dynamics within a confidence interval. We then move to the corresponding Lagrangian formulation, for which we prove…
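The optimistic approach above builds on the standard LQR subroutine: given an estimate of the dynamics $(A, B)$, solve a Riccati equation for the optimal linear feedback. A minimal sketch of that certainty-equivalent building block (not the paper's OFU-LQ algorithm itself; the example system and the fixed-point solver are illustrative assumptions):

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    """Solve the discrete algebraic Riccati equation by fixed-point
    iteration and return the feedback gain K (control u_t = -K x_t)
    together with the cost-to-go matrix P."""
    P = Q.copy()
    for _ in range(iters):
        BtP = B.T @ P
        # K = (R + B'PB)^{-1} B'PA
        K = np.linalg.solve(R + BtP @ B, BtP @ A)
        # Riccati recursion: P = Q + A'P(A - BK)
        P = Q + A.T @ P @ (A - B @ K)
    return K, P

# Illustrative 2D system (a discretized double integrator).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)
K, P = lqr_gain(A, B, Q, R)
# The closed-loop matrix A - BK should be stable (spectral radius < 1).
rho = max(abs(np.linalg.eigvals(A - B @ K)))
```

In optimistic schemes, this solve is applied not to a single point estimate but to dynamics chosen favorably within a confidence set, which is the optimization the paper relaxes via its Lagrangian formulation.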

## 20 Citations

### Task-Optimal Exploration in Linear Dynamical Systems

- Computer Science · ICML
- 2021

This work studies task-guided exploration and determines precisely what an agent must learn about its environment in order to complete a particular task, establishing that certainty-equivalence decision making is instance- and task-optimal.

### Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning

- Computer Science · NeurIPS
- 2020

This paper proposes a practical optimistic-exploration algorithm, which enlarges the input space with hallucinated inputs that can exert as much control as the epistemic uncertainty in the model affords, and shows how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms and different probabilistic models.

### A modified Thompson sampling-based learning algorithm for unknown linear systems

- Computer Science · 2022 IEEE 61st Conference on Decision and Control (CDC)
- 2022

This work revisits the Thompson sampling-based learning algorithm for controlling an unknown linear system with quadratic cost and shows that a careful choice of Tmin allows it to recover the regret bound under a milder technical condition about the closed loop system.

### Thompson Sampling Achieves $\tilde{O}(\sqrt{T})$ Regret in Linear Quadratic Control

- Computer Science · COLT
- 2022

It is shown that TS achieves order-optimal regret in adaptive control of multidimensional stabilizable LQRs by carefully prescribing an early exploration strategy and a policy update rule, thereby solving the open problem posed in Abeille and Lazaric (2018).

### Thompson-Sampling Based Reinforcement Learning for Networked Control of Unknown Linear Systems

- Computer Science · 2022 IEEE 61st Conference on Decision and Control (CDC)
- 2022

These are the first results to generalize regret bounds of LQG systems to packet-drop networked control models.

### Reinforcement Learning with Fast Stabilization in Linear Dynamical Systems

- Computer Science · AISTATS
- 2022

This work proposes an algorithm that certifies fast stabilization of the underlying system by effectively exploring the environment, combining a sophisticated RL exploration policy with an isotropic exploration strategy to achieve fast stabilization and improved regret.

### Thompson Sampling Achieves Õ(√T) Regret in Linear Quadratic Control

- Computer Science · ArXiv
- 2022

It is shown that TS achieves order-optimal regret in adaptive control of multidimensional stabilizable LQRs, thereby solving the open problem posed in Abeille and Lazaric (2018), and a novel lower bound is developed on the probability that TS provides an optimistic sample.

### Certainty Equivalent Quadratic Control for Markov Jump Systems

- Mathematics · 2022 American Control Conference (ACC)
- 2022

Robustness aspects of certainty-equivalent model-based optimal control for Markov jump systems (MJS) with quadratic cost are investigated, given that the uncertainty in the system matrices and in the Markov transition matrix is bounded by $\epsilon$ and $\eta$, respectively.

### Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems

- Computer Science · ArXiv
- 2022

This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems and establishes that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems.

### On Uninformative Optimal Policies in Adaptive LQR with Unknown B-Matrix

- Computer Science, Mathematics · L4DC
- 2021

Local asymptotic minimax regret lower bounds for adaptive Linear Quadratic Regulators are presented, and it is shown that if the parametrization induces an uninformative optimal policy, logarithmic regret is impossible and the rate is at least of square-root order in the time horizon.

## References

Showing 1–10 of 18 references.

### Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems

- Computer Science, Mathematics · ICML
- 2018

A novel bound on the regret due to policy switches is obtained, which holds for LQ systems of any dimensionality and allows updating the parameters and the policy at each step, thus overcoming previous limitations due to lazy updates.

### Naive Exploration is Optimal for Online LQR

- Computer Science, Mathematics · ICML
- 2020

New upper and lower bounds are proved demonstrating that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.

### Certainty Equivalent Control of LQR is Efficient

- Computer Science · ArXiv
- 2019

The results show that certainty equivalent control with $\varepsilon$-greedy exploration achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret in the adaptive LQR setting, yielding the first guarantee of a computationally tractable algorithm that achieves nearly optimal regret for adaptive LQR.

### Regret Bounds for the Adaptive Control of Linear Quadratic Systems

- Computer Science, Mathematics · COLT
- 2011

The construction of the confidence set is based on recent results from online least-squares estimation and leads to an improved worst-case regret bound for the proposed algorithm; this is the first time that a regret bound has been derived for the LQ control problem.

### Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator

- Computer Science, Mathematics · NeurIPS
- 2018

This work presents the first provably polynomial time algorithm that provides high probability guarantees of sub-linear regret on this problem of adaptive control of the Linear Quadratic Regulator, where an unknown linear system is controlled subject to quadratic costs.

### Finite Time Analysis of Optimal Adaptive Policies for Linear-Quadratic Systems

- Computer Science, Mathematics · ArXiv
- 2017

Finite time high probability regret bounds that are optimal up to logarithmic factors are established and high probability guarantees for a stabilization algorithm based on random linear feedbacks are provided.

### Input Perturbations for Adaptive Regulation and Learning

- Computer Science · ArXiv
- 2018

It is shown that perturbed Greedy guarantees non-asymptotic regret bounds of (nearly) square-root magnitude w.r.t. time, and high probability bounds on both the regret and the learning accuracy under arbitrary input perturbations are established.

### Near-optimal Regret Bounds for Reinforcement Learning

- Computer Science · J. Mach. Learn. Res.
- 2008

This work presents a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: An MDP has diameter D if for any pair of states s,s' there is a policy which moves from s to s' in at most D steps.

### Control of unknown linear systems with Thompson sampling

- Mathematics · 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
- 2017

It is shown under some conditions on the prior distribution that the expected (Bayesian) regret of TSDE accumulated up to time T is bounded by Õ(√T).

### Online Least Squares Estimation with Self-Normalized Processes: An Application to Bandit Problems

- Computer Science, Mathematics · ArXiv
- 2011

The regret bound of the Upper Confidence Bound algorithm of Auer et al. (2002) is improved, showing that its regret is, with high probability, a problem-dependent constant, and new, tighter confidence sets for the least squares estimate are constructed.