Corpus ID: 220496101

Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation

by Marc Abeille and Alessandro Lazaric
We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting. Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ and cast it into a constrained extended LQR problem, where an additional control variable implicitly selects the system dynamics within a confidence interval. We then move to the corresponding Lagrangian formulation, for which we prove…
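The relaxation described in the abstract can be sketched schematically. This is an illustrative reconstruction, not the paper's exact formulation: the confidence set $\mathcal{C}_t$, constraint function $g_t$, and radius $\beta_t$ are placeholder symbols.

```latex
% Optimistic (OFU) optimization over a confidence set (schematic):
\[
  \min_{\theta \in \mathcal{C}_t}\ \min_{\pi}\ J_\theta(\pi),
  \qquad
  \mathcal{C}_t = \{\theta : g_t(\theta) \le \beta_t\}.
\]
% Lagrangian relaxation of the constrained extended LQR and its dual:
\[
  \mathcal{L}(\pi, \theta; \lambda)
    = J_\theta(\pi) + \lambda\,\bigl(g_t(\theta) - \beta_t\bigr),
  \qquad
  \max_{\lambda \ge 0}\ \min_{\pi,\,\theta}\ \mathcal{L}(\pi, \theta; \lambda).
\]
```

Here the inner minimization over $(\pi, \theta)$ is the "extended" LQR problem in which $\theta$ acts as an additional control variable selecting the dynamics.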


Task-Optimal Exploration in Linear Dynamical Systems

This work studies task-guided exploration, determining precisely what an agent must learn about its environment in order to complete a particular task, and establishes that certainty-equivalence decision making is instance- and task-optimal.

Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning

This paper proposes a practical optimistic-exploration algorithm, which enlarges the input space with hallucinated inputs that can exert as much control as the epistemic uncertainty in the model affords, and shows how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms and different probabilistic models.

A modified Thompson sampling-based learning algorithm for unknown linear systems

This work revisits the Thompson sampling-based learning algorithm for controlling an unknown linear system with quadratic cost and shows that a careful choice of Tmin allows it to recover the regret bound under a milder technical condition about the closed loop system.

Thompson Sampling Achieves $\tilde{O}(\sqrt{T})$ Regret in Linear Quadratic Control

It is shown that TS achieves order-optimal regret in adaptive control of multidimensional stabilizable LQRs by carefully prescribing an early exploration strategy and a policy update rule, and by developing a novel lower bound on the probability that TS provides an optimistic sample, thereby solving the open problem posed in Abeille and Lazaric (2018).

Thompson-Sampling Based Reinforcement Learning for Networked Control of Unknown Linear Systems

These are the first results to generalize regret bounds of LQG systems to packet-drop networked control models.

Reinforcement Learning with Fast Stabilization in Linear Dynamical Systems

This work proposes an algorithm that certifies fast stabilization of the underlying system by effectively exploring the environment: a sophisticated RL exploration policy is combined with an isotropic exploration strategy to achieve fast stabilization and improved regret.

Certainty Equivalent Quadratic Control for Markov Jump Systems

Robustness aspects of certainty-equivalent model-based optimal control for MJS with a quadratic cost function are investigated, given that the uncertainty in the system matrices and in the Markov transition matrix is bounded (the latter by η).

Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems

This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems and establishes that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfies these conditions, thus providing a √T lower bound that is also valid for partially observable systems.

On Uninformative Optimal Policies in Adaptive LQR with Unknown B-Matrix

Local asymptotic minimax regret lower bounds for adaptive Linear Quadratic Regulators are presented, and it is shown that if the parametrization induces an uninformative optimal policy, logarithmic regret is impossible and the rate is at least of order square root in the time horizon.

Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems

A novel bound on the regret due to policy switches is obtained, which holds for LQ systems of any dimensionality and it allows updating the parameters and the policy at each step, thus overcoming previous limitations due to lazy updates.

Naive Exploration is Optimal for Online LQR

New upper and lower bounds are proved demonstrating that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.

Certainty Equivalent Control of LQR is Efficient

The results show that certainty equivalent control with $\varepsilon$-greedy exploration achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret in the adaptive LQR setting, yielding the first guarantee of a computationally tractable algorithm that achieves nearly optimal regret for adaptive LQR.
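The certainty-equivalence recipe summarized above can be sketched for a scalar system: fit (a, b) by least squares, solve the Riccati equation for the estimated model, and act with the resulting gain plus exploration noise. This is a minimal toy illustration, not the paper's algorithm; all constants (noise scales, horizon, initial estimates) are made up.

```python
import numpy as np

def dare_scalar(a, b, q, r, iters=500):
    """Solve the scalar discrete-time Riccati equation by fixed-point iteration."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return p

def ce_gain(a, b, q=1.0, r=1.0):
    """Certainty-equivalent LQR gain for the estimated (a, b)."""
    p = dare_scalar(a, b, q, r)
    return a * b * p / (r + b * b * p)

rng = np.random.default_rng(0)
a_true, b_true = 0.9, 0.5          # unknown to the learner
x, eps = 0.0, 0.2                  # state; exploration-noise scale
Z, Y = [], []                      # regressors (x, u) and targets x'
a_hat, b_hat = 0.5, 1.0            # rough initial guesses

for t in range(2000):
    # Certainty-equivalent control plus epsilon-style Gaussian exploration.
    u = -ce_gain(a_hat, b_hat) * x + eps * rng.standard_normal()
    x_next = a_true * x + b_true * u + 0.1 * rng.standard_normal()
    Z.append([x, u]); Y.append(x_next)
    if t >= 10:                    # refit (a, b) by least squares
        (a_hat, b_hat), *_ = np.linalg.lstsq(np.array(Z), np.array(Y), rcond=None)
    x = x_next

print(a_hat, b_hat)  # estimates should be close to (a_true, b_true)
```

The exploration noise keeps the regressors persistently excited, which is what lets the least-squares estimates converge while the controller simultaneously exploits the current model.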

Regret Bounds for the Adaptive Control of Linear Quadratic Systems

The construction of the confidence set is based on recent results from online least-squares estimation and leads to an improved worst-case regret bound for the proposed algorithm; this is the first time that a regret bound has been derived for the LQ control problem.

Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator

This work presents the first provably polynomial time algorithm that provides high probability guarantees of sub-linear regret on this problem of adaptive control of the Linear Quadratic Regulator, where an unknown linear system is controlled subject to quadratic costs.

Finite Time Analysis of Optimal Adaptive Policies for Linear-Quadratic Systems

Finite time high probability regret bounds that are optimal up to logarithmic factors are established and high probability guarantees for a stabilization algorithm based on random linear feedbacks are provided.

Input Perturbations for Adaptive Regulation and Learning

It is shown that perturbed Greedy guarantees non-asymptotic regret bounds of (nearly) square-root magnitude w.r.t. time, and high probability bounds on both the regret and the learning accuracy under arbitrary input perturbations are established.

Near-optimal Regret Bounds for Reinforcement Learning

This work presents a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: an MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (in expectation).
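The diameter quoted above can be made concrete in the deterministic special case, where the minimal expected travel time between two states is just a shortest path. A minimal sketch on a toy transition structure (not from the paper):

```python
from collections import deque

def diameter(num_states, transitions):
    """Diameter of a deterministic MDP: the max over ordered state pairs
    (s, s') of the fewest steps needed to reach s' from s, choosing
    actions optimally. `transitions[s]` lists the states reachable
    from s in one step (one entry per action)."""
    worst = 0
    for s in range(num_states):
        dist = {s: 0}
        queue = deque([s])
        while queue:                     # BFS = shortest paths, unit costs
            v = queue.popleft()
            for w in transitions[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        worst = max(worst, max(dist[t] for t in range(num_states)))
    return worst

# 4-state ring: from each state the two actions are "step forward" or "stay".
ring = {0: [1, 0], 1: [2, 1], 2: [3, 2], 3: [0, 3]}
print(diameter(4, ring))  # 3: the farthest ordered pair is 3 steps apart
```

For stochastic MDPs the definition replaces path length with the minimal expected hitting time, so BFS no longer suffices, but the intuition is the same.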

Control of unknown linear systems with Thompson sampling

It is shown under some conditions on the prior distribution that the expected (Bayesian) regret of TSDE accumulated up to time T is bounded by Õ(√T).

Online Least Squares Estimation with Self-Normalized Processes: An Application to Bandit Problems

The regret bound of the Upper Confidence Bound algorithm of Auer et al. (2002) is improved, showing that with high probability its regret is bounded by a problem-dependent constant, and new, tighter confidence sets for the least-squares estimate are constructed.