Low-complexity Learning of Linear Quadratic Regulators from Noisy Data

This paper considers the Linear Quadratic Regulator problem for linear systems with unknown dynamics, a central problem in data-driven control and reinforcement learning. We propose a method that uses data to directly return a controller without estimating a model of the system. Sufficient conditions are given under which this method returns a stabilizing controller with guaranteed relative error when the data used to design the controller are affected by noise. This method has low complexity, as it only requires a finite number of samples of the system response to a sufficiently exciting input, and it can be efficiently implemented as a semi-definite program. Further, the method does not require assumptions on the noise statistics, and the relative error scales nicely with the noise magnitude.


Introduction
Control theory is witnessing a renewed and growing interest in data-driven (data-based) control. This terminology refers to all those cases where the dynamics of the system are unknown and the control law must be designed using data alone. This can be done either by identifying a model of the system from data and then using the model for control design, or by directly designing the control law, bypassing the system identification (ID) step. Methods in the first category are usually called indirect (sequential system ID and control design), while methods in the second category are usually called direct or model-free.
The interest in data-driven control has several motivations. As systems become more complex, first-principle models may be difficult to obtain or may be too complex for control design. Fully automated (end-to-end) procedures may also facilitate the online tuning or re-design of controllers, which is needed in all those applications where the system to be controlled or the environment is subject to changes that are difficult to predict. Dozens of publications on data-driven control have appeared in the last few years. We mention works on predictive control [Alpago, Lygeros, and Dörfler, 2020, Coulson, Lygeros, and Dörfler, 2019, Salvador, Muñoz de la Peña, Alamo, and Bemporad, 2018], optimal control [Gonçalves da Silva, Bazanella, Lorenzini, and Campestrini, 2019, Baggio, Katewa, and Pasqualetti, 2019, Recht, 2019, De Persis and Tesi, 2019], and robust and nonlinear control [De Persis and Tesi, 2020, Berberich, Koch, Scherer, and Allgöwer, 2019, Wabersich and Zeilinger, 2018, Dai and Sznaier, 2018, Novara, Formentin, Savaresi, and Milanese, 2016]. This list is by no means exhaustive. We refer the interested reader to [Hou and Wang, 2013] for a survey of earlier contributions.

The Linear Quadratic Regulator problem
This paper considers the infinite-horizon Linear Quadratic Regulator (LQR) problem for linear time-invariant systems, which is one of the most studied problems in the control literature. Besides its practical relevance, this problem is a prime example of the challenges encountered in data-driven control. Specifically, we consider the problem of computing the solution to the LQR problem from a finite set of (noisy) data collected from the system.
Early data-driven methods for LQR can be traced back to the theory of adaptive control systems, and include the popular self-tuning regulators [Åström and Wittenmark, 1989] and policy iteration schemes [Bradtke, Ydstie, and Barto, 1994]. While the specific techniques are different, the common idea is to study the convergence of an adaptive control law to the optimal one as time goes to infinity. Starting from [Fiechter, 1997], a tremendous effort has been made towards establishing non-asymptotic properties of data-driven methods. This term refers to all those methods that aim at providing closed-loop stability and performance guarantees using only a finite number of data points. The interest in non-asymptotic properties is both theoretical and practical. Non-asymptotic properties help to derive performance guarantees of iterative (online) methods [Fazel, Ge, Kakade, and Mesbahi, 2018], and are at the basis of non-iterative (offline) methods [Recht, 2019, De Persis and Tesi, 2019]. It turns out that non-asymptotic properties are very difficult to derive if one departs from the assumption that data are noise-free. Most of the works dealing with noisy data are of indirect type and come from the area of reinforcement learning (RL). The common approach is to learn a model of the system along with non-asymptotic probabilistic bounds on the estimation error [Campi and Weyer, 2002], and then design or update the control law depending on the specific method adopted (non-iterative or iterative). Among iterative methods we mention [Cohen, Koren, and Mansour, 2019, Abbasi-Yadkori, Lazić, and Szepesvári, 2018], where the latter is one of the few model-free methods that have appeared in the literature. Among non-iterative methods we mention [Mania, Tu, and Recht, 2019, Dean, Mania, Matni, Recht, and Tu, 2019].

Our contribution
Our contribution is a new approach to designing LQ controllers from noisy data with guaranteed performance. The method builds on the framework introduced in [De Persis and Tesi, 2020] and has the following features: (i) Low complexity. The proposed method requires a finite (pre-computable) number of data points obtained from a single system trajectory or from multiple trajectories, and it can be implemented as a convex program. (ii) Stability and performance guarantees. As long as the noise satisfies suitable inequalities, our method returns a stabilizing controller with a quantitative relative error (the gap between the computed solution and the unknown optimal controller), and this error scales nicely with the noise magnitude. (iii) No assumptions on the statistical properties of the noise. We do not make assumptions regarding the noise statistics, such as the noise being a martingale or white.
As in [Recht, 2019, Mania et al., 2019, Dean et al., 2019], we focus on non-iterative methods, which do not require an initial stabilizing controller, as is instead typically assumed in iterative methods [Cohen et al., 2019, Abbasi-Yadkori et al., 2018]. The main difference with respect to [Recht, 2019, Mania et al., 2019, Dean et al., 2019] is that our method is direct and assumes no noise model.
The advantage of not relying on noise statistics is twofold.
Although the solution to the LQR problem can be interpreted as the one minimizing the variance of the system's output in response to white noise, experimental data need not comply with such a setting, and may show correlation and dependence (dependence breaks the i.i.d. assumption used in [Mania et al., 2019, Dean et al., 2019]). Our method is free from this issue, while it also allows for simple noise-reduction strategies for random noise. Not relying on noise statistics also enables us to directly extend the analysis to the stabilization of nonlinear systems around an equilibrium since, around an equilibrium point, a nonlinear system can be expressed via its first-order approximation plus a remainder which acts as a noise source. We will elaborate on these points in the paper.
Our method is direct (model-free). There is currently a lively debate regarding the effectiveness of direct methods versus indirect ones [Tu and Recht, 2019]. Here, we will not enter this debate, partly because existing methods for LQR give probabilistic results while our method is non-probabilistic, and because the question of what is the "best" approach remains a question about the priors. A strength of our method (and of direct methods in general) is a parsimonious use of such priors, which allows us to cope with situations where the noise has no convenient statistics. In such situations, indirect methods (at least those proposed for LQR) are instead much more difficult to pursue, since the ID step relies strongly on such statistics [Mania et al., 2019, Dean et al., 2019]. In a similar vein, direct methods can be applied directly in settings, such as nonlinear or time-varying dynamics, where system ID is typically more involved [Wabersich and Zeilinger, 2018, Dai and Sznaier, 2018, De Persis and Tesi, 2020, Guo, De Persis, and Tesi, 2020]. Finally, by skipping the ID step, direct methods are often much handier than indirect ones. For instance, regarding the LQR problem, our method does not require any bootstrap procedure or reset of the system's state, as needed in [Mania et al., 2019, Dean et al., 2019].

Outline of the paper
Our method rests on a fundamental result by Willems and co-authors [Willems, Rapisarda, Markovsky, and De Moor, 2005], recalled in Section 2. Roughly, this result states that a (noise-free) system trajectory generated by a persistently exciting input is a data-based non-parametric model of the system. We exploit this result to develop our model-free method. In Section 3, we formulate the LQR problem as an H2 problem [Scherer and Weiland, 2019] and derive a data-based solution based on convex programming for the ideal case of noise-free data (Theorem 1). The main results are given in Sections 4 and 5. The first one (Theorem 2) provides stability properties and error bounds for the baseline solution in the case of noisy data. Two variants of the baseline solution are discussed in Theorems 3 and 4. These variants guarantee more tolerance to noise at the cost of possibly weaker performance bounds. This matches what has been observed in indirect methods [Mania et al., 2019] for noise-robust solutions. Practical aspects and extensions are discussed in Section 6, including nonlinear systems and de-noising strategies. Section 7.1 provides numerical simulations, while Section 8 gives concluding remarks.

Notation and auxiliary facts
Given a signal z : N → R^σ and two integers k and r with r ≥ k, we define z_[k,r] := {z(k), z(k + 1), . . . , z(r)}. Given a signal z and a positive integer T, we define the matrix Z_{i,T} := [z(i) z(i + 1) · · · z(i + T − 1)]. As we will always consider data coming from experiments of length T, we write Z_i instead of Z_{i,T} for brevity.
Consider a linear time-invariant system

x(k + 1) = Ax(k) + Bu(k)    (1)

where x ∈ R^n is the state and u ∈ R^m is the control input, and suppose that we have access to T-long data sequences u_[0,T−1] and x_[0,T−1] of system (1). Throughout the paper, the condition

rank W0 = n + m,  where  W0 := [U0; X0],    (2)

plays an important role. Condition (2) guarantees that any T-long input-state trajectory of the system can be expressed as a linear combination of the columns of W0, meaning that W0 encodes all the information regarding the dynamics of the system. A fundamental property established in [Willems et al., 2005] is that one can guarantee (2) when the input is sufficiently exciting. This property has very much in common with the notion of active exploration [Fiechter, 1997] used in many reinforcement learning methods.
Definition 1 [Willems et al., 2005] A signal z_[0,T−1] ∈ R^σ is said to be persistently exciting of order s ∈ N if the block-Hankel matrix

[Z_{0,T−s+1}; Z_{1,T−s+1}; · · · ; Z_{s−1,T−s+1}]

has full rank σs.
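Both the Hankel-matrix rank test of Definition 1 and the rank condition (2) are easy to check numerically. The following sketch is one way to do so with NumPy; the third-order system, the seed, and the experiment length are illustrative choices, not taken from the paper.

```python
import numpy as np

def hankel(z, s):
    """Depth-s block-Hankel matrix of a sigma x T signal z:
    block row i contains z(i), z(i+1), ..., z(i + T - s)."""
    sigma, T = z.shape
    cols = T - s + 1
    return np.vstack([z[:, i:i + cols] for i in range(s)])

def is_persistently_exciting(u, s):
    """u: m x T data matrix; PE of order s <=> Hankel matrix has full rank m*s."""
    return np.linalg.matrix_rank(hankel(u, s)) == u.shape[0] * s

rng = np.random.default_rng(0)
n, m, T = 3, 1, 20
A = np.diag([0.9, 0.5, -0.4])
B = np.ones((n, m))

# collect a T-long noise-free trajectory under a random (exciting) input
U0 = rng.standard_normal((m, T))
X = np.zeros((n, T + 1))
for k in range(T):
    X[:, k + 1] = A @ X[:, k] + B @ U0[:, k]
X0 = X[:, :T]

assert is_persistently_exciting(U0, n + 1)   # PE of order n+1
W0 = np.vstack([U0, X0])
assert np.linalg.matrix_rank(W0) == n + m    # condition (2)
```

A constant input, by contrast, fails the test already at order 2, which is consistent with the need for "sufficiently exciting" inputs.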
Problem definition and data-driven formulation

In this section, we introduce the problem of interest and our baseline direct (model-free) data-driven method, which rests on condition (2) above.

The Linear Quadratic Regulator problem
Consider a linear time-invariant system

x(k + 1) = Ax(k) + Bu(k) + d(k),  z(k) = [Wx^{1/2} x(k); Wu^{1/2} u(k)]    (4)

where x ∈ R^n is the state, u ∈ R^m is the control input, d is a disturbance term, and z is a performance signal of interest; (A, B) is controllable; Wx ⪰ 0 and Wu ≻ 0 are weighting matrices with (Wx, A) observable. In the sequel, to simplify the notation, we set Wx = Wu = I, although all the results easily extend to the general case.
We consider the problem of designing a state-feedback controller u = Kx that renders A + BK Hurwitz and minimizes the H2-norm of the transfer function T(K) : d → z of the closed-loop system

x(k + 1) = (A + BK)x(k) + d(k),  z(k) = [x(k); Kx(k)]    (5)

where [Chen and Francis, 1995, Section 4.4]

‖T(K)‖2 := ( (1/2π) ∫_{−π}^{π} trace( T̂(e^{jθ})^∗ T̂(e^{jθ}) ) dθ )^{1/2}

with T̂ the transfer matrix of (5). In particular [Chen and Francis, 1995, Section 4.4], when A + BK is Hurwitz,

‖T(K)‖2^2 = trace(P) + trace(KPK^⊤)    (6)

where P is the controllability Gramian of the closed-loop system (5), which is the unique solution to

(A + BK)P(A + BK)^⊤ − P + I = 0.    (7)

This corresponds in the time domain to the 2-norm of the output z when impulses are applied to the input channels, and can be interpreted as the mean-square deviation of z when d is a white process with unit covariance, which is the classic stochastic LQR formulation. Here, we view the LQR problem as an H2-norm minimization problem, as our method is based on the minimization of (6).
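The Gramian-based expression of the H2 norm is straightforward to evaluate numerically. The sketch below solves the closed-loop Lyapunov equation by its convergent series and returns trace(P) + trace(K P K^T); the system matrices and the gain K are illustrative, and the truncated-series solver assumes the closed loop is stable.

```python
import numpy as np

def dlyap(Acl, W, iters=2000):
    """Solve P = Acl P Acl^T + W via the convergent series
    P = sum_k Acl^k W (Acl^T)^k (valid when Acl is stable)."""
    P = W.copy()
    M = Acl.copy()
    for _ in range(iters):
        P = P + M @ W @ M.T
        M = M @ Acl
    return P

def h2_norm_sq(A, B, K):
    """||T(K)||_2^2 = trace(P) + trace(K P K^T), with P the closed-loop
    controllability Gramian: (A+BK) P (A+BK)^T - P + I = 0."""
    Acl = A + B @ K
    assert max(abs(np.linalg.eigvals(Acl))) < 1, "A + BK must be stable"
    P = dlyap(Acl, np.eye(A.shape[0]))
    return np.trace(P) + np.trace(K @ P @ K.T)

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[-5.0, -6.0]])   # some stabilizing gain
print(h2_norm_sq(A, B, K))
```

In practice one would use a dedicated Lyapunov solver; the series sum is used here only to keep the sketch self-contained.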
It is known [Chen and Francis, 1995, Section 6.4] that the state-feedback controller which minimizes the H2-norm of T(K) is unique and can be computed as

K_opt = −(I + B^⊤XB)^{−1} B^⊤XA    (8)

where X is the unique positive definite solution to the classic discrete-time algebraic Riccati equation (DARE)

X = A^⊤XA − A^⊤XB(I + B^⊤XB)^{−1}B^⊤XA + I.

We are interested in computing K_opt when a model of the system is not available, and we only have access to a T-long stream of (noisy) data u_[0,T−1] and x_[0,T−1] collected during some experiment on system (4). By noisy we mean that the data collected from (4) might have been generated with nonzero disturbance d. In particular, we aim at establishing properties of the data-driven solution with respect to the one that we can compute under exact model knowledge.
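For reference, the model-based benchmark K_opt can be obtained by iterating the Riccati recursion to a fixed point (with Wx = Wu = I). The sketch below does this with NumPy; the double-integrator-like system and the iteration count are illustrative choices.

```python
import numpy as np

def lqr_via_dare(A, B, iters=500):
    """Iterate X <- A^T X A - A^T X B (I + B^T X B)^{-1} B^T X A + I
    towards the stabilizing DARE solution, then return
    K_opt = -(I + B^T X B)^{-1} B^T X A."""
    n, m = A.shape[0], B.shape[1]
    X = np.eye(n)
    for _ in range(iters):
        G = np.linalg.solve(np.eye(m) + B.T @ X @ B, B.T @ X @ A)
        X = A.T @ X @ A - A.T @ X @ B @ G + np.eye(n)
    K = -np.linalg.solve(np.eye(m) + B.T @ X @ B, B.T @ X @ A)
    return K, X

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K_opt, X = lqr_via_dare(A, B)
# the optimal gain places the closed-loop spectrum inside the unit circle
assert max(abs(np.linalg.eigvals(A + B @ K_opt))) < 1
```

Library routines (e.g. a dedicated discrete ARE solver) would normally replace the fixed-point loop; the loop is kept here to mirror the DARE above term by term.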

A data-driven SDP formulation
The problem of finding K_opt can be equivalently formulated as a semi-definite program (SDP) in the variables (γ, K, P, L), which minimizes γ subject to linear matrix inequality constraints; we refer to this program as (9). This formulation is the natural discrete-time counterpart of the formulation proposed in [Feron, Balakrishnan, Boyd, and El Ghaoui, 1992] for continuous-time systems. We will not discuss the properties associated with (9). Rather, we will discuss the properties associated with an equivalent data-based version of (9).
Consider system (4) along with data sequences d_[0,T−1], u_[0,T−1] and x_[0,T] resulting from an experiment of length T. Define the corresponding matrices D0, U0, X0 and X1, which satisfy the relation

X1 = AX0 + BU0 + D0.    (10)

It turns out that the controller K_opt can be parametrized directly in terms of the data matrices D0, U0, X0 and X1. Specifically, under condition (2) the controller K_opt can be expressed as

K_opt = U0 Qo Po^{−1}    (11)

where the tuple (γo, Qo, Po, Lo) is any optimal solution to the SDP (12), which only depends on data.
The idea behind this formulation is that, under condition (2), any feedback interconnection A + BK can be rewritten in a form which does not involve the matrices A and B. In fact, under condition (2), for any K there exists a matrix G that solves the system of equations

[K; I] = [U0; X0] G.    (13)

This implies

A + BK = [B A] [K; I] = [B A] [U0; X0] G = (X1 − D0) G.    (14)

Thus the formulation (12) coincides with the one in (9) with Q = GP. In particular, K = U0 QP^{−1} and X0 Q = P provide an equivalent characterization of the two constraints in (13).
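The reparametrization above can be verified numerically on noise-free data: for any gain K, a matrix G solving (13) exists whenever condition (2) holds, and the closed loop is then recovered from data alone as X1 G. The sketch below checks this with NumPy; the system, the gain, and the experiment length are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 2, 1, 15
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

# noise-free experiment: X1 = A X0 + B U0
U0 = rng.standard_normal((m, T))
X = np.zeros((n, T + 1))
X[:, 0] = rng.standard_normal(n)
for k in range(T):
    X[:, k + 1] = A @ X[:, k] + B @ U0[:, k]
X0, X1 = X[:, :T], X[:, 1:]

W0 = np.vstack([U0, X0])
assert np.linalg.matrix_rank(W0) == n + m        # condition (2)

K = np.array([[-5.0, -6.0]])                     # any gain of interest
# G solving [K; I] = [U0; X0] G (least squares; exact under condition (2))
G = np.linalg.lstsq(W0, np.vstack([K, np.eye(n)]), rcond=None)[0]
# the closed loop is recovered purely from data: A + BK = X1 G
assert np.allclose(A + B @ K, X1 @ G, atol=1e-6)
```

Note that A and B appear only to generate the data and to check the identity; the matrix X1 G itself is computed from data alone.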
Formulation (12) first appeared in [De Persis and Tesi, 2020] under the assumption that the collected data are noise-free, that is with D0 = 0, in which case K_opt can be directly computed from data. Here, we revisit this result, providing some additional properties related to this formulation.
Theorem 1 Suppose that condition (2) holds. Then problem (12) is feasible. Further, any optimal solution (γo, Qo, Po, Lo) is such that K_opt = U0 Qo Po^{−1} and ‖T(K_opt)‖2^2 = trace(Po) + trace(Lo).

The proof of Theorem 1 relies on two auxiliary results.
Proof of Theorem 1. Since K_opt is a stabilising controller, Lemma 3 ensures the existence of a tuple (γ, P, Q, L) which is feasible for (12). We now consider the second part of the statement. By Lemma 2, the controller K := U0 Qo Po^{−1} is stabilising and satisfies ‖T(K)‖2^2 ≤ trace(Po) + trace(Lo). Moreover, since K_opt is stabilising, by Lemma 3 there exists a tuple (γ, P, Q, L) feasible for (12) such that K_opt = U0 QP^{−1} and ‖T(K_opt)‖2^2 = trace(P) + trace(L). As a final step, note that by definition (γo, Qo, Po, Lo) is an optimal solution to (12). Thus trace(Po) + trace(Lo) ≤ trace(P) + trace(L). In turn, this implies ‖T(K)‖2 ≤ ‖T(K_opt)‖2. However, since K_opt is the controller minimising the H2-norm of the system, we must have ‖T(K)‖2 = ‖T(K_opt)‖2, so that K = K_opt as the optimal controller is unique.
As emerges from the proof of Lemma 3, (12) admits infinitely many optimal solutions in Q, of the type Qo + Q∼, where Q∼ is any matrix in the right kernel of W0.

Data-driven solution with noisy data
From the previous analysis, when data are noise-free, K_opt can be computed directly using (12). When D0 ≠ 0, (12) cannot be solved unless we know D0. In this section, we provide a first solution for the case when D0 is nonzero and is not measured. This solution offers a quantitative relative error (the gap between the computed solution and K_opt) without making assumptions regarding the noise statistics.
A simple variant of (12) which can be computed from data alone and does not involve knowledge of D0 consists in disregarding the noise term; we denote the resulting program by (15). If a solution is found, then the corresponding controller is computed as K = U0 QP^{−1}. Three main issues arise: 1. A solution need not exist. 2. Even if a solution is found, the corresponding controller K need not be stabilising. 3. Even if a solution is found and K is stabilising, the performance achieved by K might still substantially differ from the performance achieved by K_opt.
In the sequel, we will focus on items 2 and 3 above. Item 1 is implicitly addressed in the analysis. Suppose that a solution (γ, Q, P, L) to (15) is found, and denote by K = U0 QP^{−1} the corresponding controller. With this notation in place, we aim at establishing the chain of relations (16), which holds for some real constants η1, η2 ≥ 1. Note in particular that the first inequality in (16) ensures that K is stabilising.

Stability and performance analysis
We will focus on the two inequalities in (16), as the equality follows from Theorem 1. Consider the first inequality. The idea is to find conditions under which there exists a constant η1 such that η1(γ, Q, P, L) is a feasible solution to (12). Then the inequality follows from Lemma 2. For brevity, we introduce some additional notation. Define M := QP^{−1}Q^⊤ together with the matrices Θ and Ψ as in (17). With this notation, the first constraint in (15) reads Θ + I ⪯ 0, while the first constraint in (12) reads Θ + Ψ + I ⪯ 0. In the sequel, it is understood that all the solutions of interest inherit the same notation. In particular, we will use M, Θ and Ψ to denote the matrices corresponding to (γ, Q, P, L), and Mo, Θo and Ψo to denote the matrices corresponding to (γo, Qo, Po, Lo).

Lemma 4 Suppose that (15) is feasible. Let (γ, Q, P, L) be any optimal solution, let K = U0 QP^{−1}, and let η1 ≥ 1 be such that condition (18) holds. Then K stabilises system (4) and the bound (19) holds.

Proof. The idea is to show that although (γ, Q, P, L) need not be feasible for (12), under condition (18) a feasible solution to (12) is given by η1(γ, Q, P, L).
The second inequality in (16) is similar to the first one. The point is to find conditions under which we can associate to K_opt some tuple η2(γo, Qo, Po, Lo) feasible for (15). In this case, the second inequality immediately follows from the fact that (γ, Q, P, L) is optimal for (15).
Lemma 5 Suppose that condition (2) is satisfied, and let η2 ≥ 1 be such that condition (20) holds. Then problem (15) is feasible. Moreover, any optimal solution (γ, Q, P, L) satisfies the bound (21).

Proof. By Theorem 1, condition (2) ensures that problem (12) is feasible; let (γo, Qo, Po, Lo) be any optimal solution to (12). As before, the idea is to show that although (γo, Qo, Po, Lo) need not be feasible for (15), under condition (20) a feasible solution to (15) is given by η2(γo, Qo, Po, Lo), where the required inequality follows from η2(Θo + Ψo + I) ⪯ 0 and (20). Hence η2(γo, Qo, Po, Lo) satisfies the first constraint of (15). Since (γo, Qo, Po, Lo) is feasible for (12), the scaled tuple satisfies by construction also all the other constraints of (15). Hence the claim follows, since (γ, Q, P, L) is optimal and since the cost associated with η2(γo, Qo, Po, Lo) can be bounded so as to establish (21).
We summarize our findings in the following result.
Theorem 2 Let U0, X0 and X1 be data generated from an experiment on system (4), possibly with nonzero disturbance matrix D0. Consider problem (15).

Preliminary discussion
The bound in (ii) of the above theorem defines the relative error with respect to the optimal solution. In our setting, this bound holds with no prior assumptions on the noise statistics, for instance the noise being a martingale or white. We note in particular that this error scales nicely with η1 and η2, and converges to zero as D0 goes to zero since, in this case, both η1 and η2 converge to one.
Conditions (18) and (20) play different roles. The first one ensures that any solution to problem (15) returns a stabilizing controller. This condition can be checked from data alone whenever some prior information on d is available. Instead, condition (20) makes it possible to explicitly quantify the performance gap between the solution and K_opt. Different from (18), condition (20) cannot be checked from data, as it depends on the (unknown) optimal controller K_opt. In the next section, we will nonetheless discuss an interesting fact related to (20), namely that this condition is actually easier to satisfy in practice than (18). We postpone a discussion of this point to Section 5 and first consider a variant of (15) with the goal of rendering (18) easier to fulfil.

Noise robustness through soft constraints
Condition (18) may be difficult to satisfy unless d has very small magnitude. In fact, in order to satisfy (18) one needs Ψ ≺ I. However, there is no constraint on the magnitude of M in (15), with the consequence that even a small level of noise may generate non-stabilizing controllers. This observation hints at modifying (15) by adding a constraint on the magnitude of M = QP^{−1}Q^⊤; the resulting SDP is (23). Compared with (15), we now search for solutions that lead to matrices M = QP^{−1}Q^⊤ having small trace, equivalently such that QP^{−1/2} has small singular values. A soft constraint favours small values of M while preserving all the logical steps of the baseline solution.
We proceed as before. Assume that a solution (γ, Q, P, L, V) to (23) is found, and let K = U0 QP^{−1} be the corresponding controller. Let (γo, Qo, Po, Lo) be any optimal solution to (12) with K_opt = U0 Qo Po^{−1}. We aim at establishing the chain of relations (24), for some real constants η1, η2 ≥ 1.

Stability and performance analysis
The first inequality follows as in Lemma 4.
Lemma 6 Suppose that (23) is feasible. Let (γ, Q, P, L, V) be any optimal solution and let K = U0 QP^{−1}. Let η1 ≥ 1 be a constant. If condition (18) is satisfied, then K stabilises system (4) with the corresponding performance bound.

Proof. The proof is analogous to the one of Lemma 4 and is therefore omitted. (Note in particular that the constraint in (23) that involves V does not appear in (12).) We also have a natural counterpart of Lemma 5, which establishes the second inequality in (24).
Lemma 7 Suppose that condition (2) is satisfied, and let η2 be as in (20).

Proof. The proof is essentially analogous to the proof of Lemma 5. By proceeding as in Lemma 5, it is immediate to verify that η2(γo, Qo, Po, Lo), complemented with a suitable choice of Vo, is feasible for (23), where the constraint in (23) involving V is satisfied by that choice, which establishes (27).
By combining these two lemmas the following result can be stated.
Theorem 3 Let U0, X0 and X1 be data generated from an experiment on system (4), possibly with nonzero disturbance matrix D0. Consider problem (23). Then: (i) Under condition (18), any optimal solution (γ, Q, P, L) to (23) is such that the controller K = U0 QP^{−1} stabilizes (4) with guaranteed performance, with η1 as in (18). (ii) If, in addition to (18), conditions (2) and (20) also hold, then (23) is feasible and the controller K satisfies a performance bound with η2 as in (20) and an extra term η3. Compared to the baseline solution, the error bound worsens due to the extra term η3. This matches what is observed in indirect methods [Mania et al., 2019] for noise-robust solutions. As we discuss in Section 6, the conservatism can nonetheless be kept to moderate values.
Remark 8 (Implementation of (23)) Problem (23) can be given the equivalent form of an SDP, namely (28). Similar considerations apply to (15).

Alternative based on the S-procedure
The key to overcoming the lack of knowledge about D0 in problem (12) is to completely disregard such a term so as to obtain the implementable form (15), from which a robust version with soft constraints on M, namely (23), is eventually derived. An alternative to this approach is to explicitly impose conditions on D0 with the prospect of obtaining a potentially more robust version of (12) and (23). We discuss advantages and drawbacks of this approach.
As in [De Persis and Tesi, 2020], we initially consider a condition on D0 of the form

D0 D0^⊤ ⪯ μ^2 RR^⊤    (29)

where R is a full-row rank matrix and μ is a positive real.
In the remainder of this subsection, we revisit this condition with the purpose of providing a variant of (23). We let the constraint (29) be regular, assuming that there exists a vector x such that x^⊤ D0 D0^⊤ x < μ^2 x^⊤ RR^⊤ x. This condition is only introduced to motivate the formulation of the more robust variant of (23) and will not be required in the main result of the section.
Under (29), by the S-procedure [Yakubovich, Leonov, and Gelig, 2004, Section 2.1.2], a necessary and sufficient condition can be given to ensure the robust stabilizability condition in (12) for all D0 satisfying (29). To get rid of the dependence on the unknown matrix D0, instead of the equivalent condition above, one can consider a sufficient condition holding for some τ ≥ 0, with M as in (17), and include this stronger variation in the following alternative to problem (15), denoted (31), where the role of the factor η1 ≥ 1 will be explained later. We have previously discussed the benefits of including a soft constraint on M = QP^{−1}Q^⊤. We would like to have these benefits also in the robust variant of the SDP problem that we are studying here. We recall that the idea consists of bounding M via a matrix decision variable V, whose trace is made as small as possible in the optimization process. This is achieved by introducing the new constraint V − QP^{−1}Q^⊤ ⪰ 0 and by modifying the existing constraint trace(P) + trace(L) ≤ γ into trace(P) + trace(L) + trace(V) ≤ γ.
Let us now observe that a constraint on M is already present in the first constraint of (31), in the form M − τI ⪯ 0. Thus, it is natural to modify block (2, 2) in that constraint as M − V ⪯ 0. However, the matrix −τI in M − τI ⪯ 0 derived from imposing condition (29), and the latter also motivated the introduction of the term μ^2 τ RR^⊤ in block (1, 1) of the first constraint of (31), which must be changed accordingly. These arguments lead to the following modification of (31), denoted (32). We now establish the main properties of (32).
Theorem 4 Let U0, X0 and X1 be data generated from an experiment on system (4), possibly with nonzero disturbance matrix D0. Assume that problem (32) is feasible with η1 ≥ 1. Let (γ, Q, P, L, V) be any optimal solution to (32), let K = U0 QP^{−1}, and suppose that condition (33) holds; then K stabilises system (4) with guaranteed performance.

Proof. Similarly to the proof of Lemma 4, the idea is to consider η1(γ, Q, P, L) and show that it satisfies the first constraint in (12). In fact, by multiplying the first constraint in (32) by the vector x^⊤ [I D0] on the left and by its transpose on the right, it is straightforward to see that (32) implies a quadratic bound which, in view of (33), shows that η1(γ, Q, P, L) satisfies the first constraint in (12). Similarly to what has been noticed in the proof of Lemma 6, by construction and thanks to the condition η1 ≥ 1, the scaled solution satisfies all the other constraints in (12), and the thesis follows from Lemma 2.
This alternative formulation relies on arguments typical of the S-procedure for robust control, where here the noise is regarded as the source of uncertainty. Compared with (23), formulation (32) has the advantage of explicitly addressing robust stabilization by incorporating η1 directly into the design. We see that this parameter embodies a tradeoff: the smaller its value, the higher the chance of satisfying the first constraint in (32), hence of having a stabilizer robust to noise, but the coarser the upper bound on the H2-norm of the closed-loop system. On the other hand, the drawback of this formulation is that it is not clear at the moment how to establish the analogue of Lemma 7, and therefore of Theorem 3, which would give us a quantitative bound on the error function. Overall, the expectation is that (32) provides increased robustness to noise compared with (23), at the expense of decreased performance in terms of optimality, similarly to what we have when comparing (23) to (15). This expectation is confirmed by the Monte Carlo simulations that we perform in Section 7.1.
Remark 9 (Implementation of (32)) Since X0 Q = P, the constraints in (32) can be rewritten in linear matrix inequality form, which gives rise to an SDP analogous to (28). Problem (32) can then be implemented through a line search on η1, as we will illustrate in the numerical simulations.
Stability and performance verification, nonlinear systems and de-noising

We devote this section to discussing some practical aspects of the proposed method, as well as possible extensions of the previous analysis.

Stability and performance verification
Sections 4 and 5 give stability and performance properties of our data-driven approach to the LQR problem. We now discuss how to infer these properties by only looking at the data.

Stability and H 2 -norm bounds
Inferring stability and performance of K inevitably requires some prior assumptions on the quality of the data, hence on the noise. Our method makes a parsimonious use of such priors (i.e., no noise statistics are needed), much in the spirit of robust control design [Scherer and Weiland, 2019].
To ensure stability with an H2-norm bound, Theorems 2 and 3 require the fulfilment of condition (18), which involves a noise-dependent matrix. If one knows that ‖D0‖ ≤ δ for some δ > 0, then condition (18) is satisfied whenever (35) holds for some η1 ≥ 1, a condition which can be checked from data alone.
A bound on ‖D0‖ can be obtained from prior information on the noise magnitude, and is representative of the noise energy content. Condition (35) has the same features as the original condition (18), in the sense that it becomes easier to satisfy for small values of M.
Similar considerations apply to Theorem 4 where, under the bound ‖D0‖ ≤ δ, condition (33) is satisfied whenever (36) holds, which can be checked from data alone. We note that for both (35) and (36) there is an evident tradeoff between priors and conservativeness: the larger the value of δ, the higher the chance that the assumption on the noise is satisfied, but the lower the chance that (35) and (36) hold. We also note that approaches other than the one discussed can be used. In particular, if D0 is known to belong to some compact set D, then one can check (18) through a finite set of linear matrix inequalities computed at the vertices of a convex embedding of D, as done in [Bisoffi, De Persis, and Tesi, 2019], albeit this approach is computationally demanding.

Bounds on the relative error
To achieve bounds on the relative error, conditions (2) and (20) are also used. Condition (2) states that the data are sufficiently rich in content. As noted in [van Waarde, Eising, Trentelman, and Camlibel, 2020], this condition is not restrictive for the LQR problem. In fact, in the noiseless case, this condition is necessary for reconstructing A and B from data (hence for any model-based solution and indirect data-driven method) and is also generically necessary for any direct data-driven method. Condition (2) can be checked from data. In the noise-free case it can be enforced at the experiment stage (Lemma 1).
In the noisy case, (2) may or may not hold depending on the noise level (although it always remains checkable from data). Nonetheless, it is simple to see that (2) continues to hold in the presence of noise when d is sufficiently small. In fact, by linearity, the state data can be decomposed as X0 = X0^u + X0^d, where X0^u and X0^d represent the state data generated by u and d, respectively, so that W0 decomposes accordingly. Since by Lemma 1 the matrix involving U0 and X0^u has full row rank, then also W0 has full row rank whenever d has sufficiently low magnitude.
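This robustness of the rank condition is easy to observe numerically: on a stable illustrative system (matrices and seed chosen here for the sketch, not taken from the paper), condition (2) holds without noise and keeps holding when a small disturbance corrupts the state data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, T = 2, 1, 15
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])

def collect(noise_scale):
    """Run a T-step experiment with a disturbance of the given magnitude
    and return the data matrix W0 = [U0; X0]."""
    U0 = rng.standard_normal((m, T))
    X = np.zeros((n, T + 1))
    for k in range(T):
        d = noise_scale * rng.standard_normal(n)
        X[:, k + 1] = A @ X[:, k] + B @ U0[:, k] + d
    return np.vstack([U0, X[:, :T]])

# condition (2) holds without noise and survives a small disturbance
assert np.linalg.matrix_rank(collect(0.0)) == n + m
assert np.linalg.matrix_rank(collect(0.01)) == n + m
```

For large noise levels the rank itself rarely drops (noise tends to increase rank), but the data matrix may no longer reflect the noise-free behaviour; the quantitative conditions (18) and (20) account for this.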
We next focus on condition (20). The structure of (20) is analogous to that of (18), with the difference that it involves the matrix Mo := Qo Po^{−1} Qo^⊤ in place of M. Like (18), condition (20) is automatically satisfied for D0 = 0, in which case η2 = 1. Different from (18), (20) cannot be checked from data, as it depends on the (unknown) optimal controller K_opt via Mo. Nonetheless, an interesting fact related to (20) is that this condition is actually easier to satisfy than (18). This indicates in particular that the robust solution in Theorem 3 does not introduce much conservatism with respect to the baseline solution in Theorem 2. We now elaborate on this point.
In both Theorems 2 and 3, the performance gap between K and K_opt holds for any optimal solution (γ_o, Q_o, P_o, L_o) to problem (12), and this is possible since, by Theorem 1, all optimal solutions achieve the same cost. We now derive a particular (optimal) solution, the derivation being analogous to the one in Lemma 3. Let P_o ≻ 0 be the unique solution to (A + BK_opt)P_o(A + BK_opt)^⊤ − P_o + I = 0. In particular, P_o is the controllability Gramian of the closed-loop system. Define the remaining variables of the tuple (γ_o, Q_o, P_o, L_o) accordingly. This particular solution is optimal as it achieves the same cost as any other optimal solution (Theorem 1). The special feature of this solution is that G_o is the minimum-norm least-squares solution to (13) with K = K_opt, and so (20) turns out to be satisfied more easily than (18), since the matrix M appearing in (18) is not necessarily associated with any minimum-norm solution. We note that the norm of Q_o (thus of M_o) decreases as the norm of W_0 increases, which happens for instance when the number T of collected data points increases. This implies in particular that V_o approaches 0 as the norm of W_0 increases. In turn, this means that the formulation (23) does not introduce much conservatism with respect to the formulation (15), since the two performance bounds become close when the norm of W_0 increases.
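The minimum-norm construction above can be illustrated numerically. This is a sketch under the assumption (consistent with the data-driven parametrization of De Persis and Tesi) that (13) reads [U_0; X_0] G = [K; I]; its minimum-norm least-squares solution is then obtained via the pseudoinverse, and with noiseless data X_1 G reproduces the closed-loop matrix A + BK:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 3, 1, 15
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
K = rng.standard_normal((m, n))          # any gain, for illustration
U0 = rng.standard_normal((m, T))
X = np.zeros((n, T + 1)); X[:, 0] = rng.standard_normal(n)
for t in range(T):                        # noiseless data from (4)
    X[:, t + 1] = A @ X[:, t] + B @ U0[:, t]
X0, X1 = X[:, :T], X[:, 1:]
W0 = np.vstack([U0, X0])

# Minimum-norm least-squares solution of [U0; X0] G = [K; I]
G = np.linalg.pinv(W0) @ np.vstack([K, np.eye(n)])
print(np.allclose(X1 @ G, A + B @ K))    # True when W0 has full row rank
```

Indeed, X_1 G = (AX_0 + BU_0)G = A(X_0 G) + B(U_0 G) = A + BK whenever W_0 has full row rank, which is why the data-based parametrization reproduces the closed loop exactly in the noiseless case.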

Nonlinear systems
The previous analysis extends to the problem of finding the LQR law for a nonlinear system around an equilibrium using data collected from the nonlinear system. In fact, around an equilibrium a nonlinear system can be expressed via its first-order approximation plus a remainder, which acts as a process disturbance for the linearized dynamics.
Consider a smooth nonlinear system x^+ = f(x, u) + ξ, where ξ is a process disturbance, and let (x̄, ū) be a known equilibrium pair, that is, such that x̄ = f(x̄, ū). Thus, we can rewrite the dynamics in the deviation variables δx := x − x̄ and δu := u − ū, with d := ξ + r, where r accounts for higher-order terms and goes to zero faster than δx and δu; namely, r = R(δx, δu)[δx^⊤ δu^⊤]^⊤, where R(δx, δu) is a matrix of smooth functions with the property that R(δx, δu) goes to zero as [δx^⊤ δu^⊤]^⊤ goes to zero. Now, if the pair (A, B) defining the linearized system is stabilizable, then a controller K rendering A + BK stable also exponentially stabilizes the equilibrium (x̄, ū) for the original nonlinear system. Thus, the analysis in Theorem 3 carries over directly to this case (similar conclusions apply to Theorem 4).
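In symbols, the decomposition above can be written as follows (a sketch using standard Jacobian notation; the exact display in the original may differ):

```latex
x^+ = f(x,u) + \xi, \qquad f(\bar x,\bar u) = \bar x,
\]
\[
\delta x^+ \;=\; \underbrace{\frac{\partial f}{\partial x}(\bar x,\bar u)}_{A}\,\delta x
\;+\; \underbrace{\frac{\partial f}{\partial u}(\bar x,\bar u)}_{B}\,\delta u \;+\; d,
\qquad d := \xi + r,
\]
\[
r = R(\delta x,\delta u)\begin{bmatrix}\delta x\\ \delta u\end{bmatrix},
\qquad R(\delta x,\delta u)\to 0 \ \text{ as } \
\begin{bmatrix}\delta x\\ \delta u\end{bmatrix}\to 0.
```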
Corollary 1 Consider a nonlinear system as in (38), along with a known equilibrium pair (x̄, ū), and let K_opt be the optimal LQR controller of the system linearized around (x̄, ū).

De-noising through averaging
Several de-noising strategies can be adopted when the noise features are known, popular methods being Singular Spectrum Analysis, the Cadzow algorithm, and structured low-rank approximation [Chu, Funderlic, and Plemmons, 2003, Golyandina, Nekrutkin, and Zhigljavsky, 2001]. Here, we discuss a simple de-noising strategy based on averaging of ensembles [Wang and Uhlenbeck, 1945].
Roughly, the idea is that for signals affected by random noise the components due to noise can be filtered out by averaging several signal "cycles". This can be done by considering a single trajectory of length T* and cutting it into N pieces of length T (single-trajectory ensemble), or by taking N measurements of length T (multiple-trajectory ensemble). We now elaborate on this idea considering the case of multiple-trajectory ensembles.
Given N matrices S^(n), n = 1, . . ., N, let S̄ := (1/N) Σ_{n=1}^{N} S^(n) denote their average. For a given N, consider the dynamics of (4) over a generic experiment (cycle) n, with n = 1, . . ., N; thus x^(n), u^(n), and d^(n) are the state, input, and disturbance signals associated with experiment n. By linearity, if we collect T samples in each experiment, the resulting averaged data matrices satisfy the relation X̄_1 = AX̄_0 + BŪ_0 + D̄_0. Hence, the average signals still provide a valid input-output system trajectory, meaning that all previous results apply to this case without any modification.
For random noise, however, using (43) can be advantageous with respect to using (10), that is, one single experiment.
To see this, consider the case of N (repeated) experiments carried out with persistently exciting input signals u^(n) = u for all n = 1, . . ., N and arbitrary initial states, and suppose that the noise realizations d^(n) are i.i.d. with zero mean and covariance matrix σ²I. Under these conditions (43) holds with Ū_0 = U_0, i.e., X̄_1 = AX̄_0 + BU_0 + D̄_0. This ensures that the average trajectory arises from a persistently exciting input, which is needed for having (2) fulfilled (the average of persistently exciting signals need not result in a persistently exciting signal). With this approach, (35) and (36) (thus (18) and (33)) become easier to satisfy. In fact, D̄_0D̄_0^⊤ ≈ (σ²T/N)I, where the accuracy of the approximation increases with T (the relation being exact in terms of expectation). Hence, the noise term shows an approximate reduction by a factor of N (indeed, this is nothing but a consequence of the fact that averaging N i.i.d. realizations reduces the variance by a factor of N). This procedure is illustrated in the numerical simulations which follow.
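The variance-reduction effect of averaging is easy to verify numerically. The following sketch (illustrative, with assumed dimensions) compares the norm of a single noise matrix D_0 with the norm of the average D̄_0 over N i.i.d. realizations; the entrywise variance drops by 1/N, so the norm shrinks roughly by 1/√N:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, N, sigma = 3, 20, 100, 0.5

# One noise matrix from a single experiment
D_single = sigma * rng.standard_normal((n, T))

# Average of N i.i.d. noise matrices (multiple-trajectory ensemble)
D_bar = np.mean(sigma * rng.standard_normal((N, n, T)), axis=0)

# The averaged noise is much smaller: ||D_bar|| ~ ||D_single|| / sqrt(N)
print(np.linalg.norm(D_bar) < np.linalg.norm(D_single))
```

With N = 100, the Frobenius norm of the averaged noise is roughly one tenth of that of a single realization, which is what makes the data-dependent conditions easier to satisfy after averaging.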

Monte Carlo simulations
In this section, we support our theoretical findings through simulations on linear and nonlinear systems.

Random linear systems
We consider 100 systems as in (4) with n = 3 and m = 1, under three types of noise: white Gaussian noise (WGN), constant bias, and sinusoidal disturbances. In all cases, we also consider different levels of noise. For every type (and level) of noise we test (23) and (32) on all the systems. Numerical simulations have been carried out in Matlab. For each experiment, we choose the entries of the matrices A and B and of the initial state from a normal distribution with zero mean and unit variance, abbreviated by N(0, 1) (command randn). For each experiment, the controller was designed using T = 20 samples generated by applying an input signal u ∼ N(0, 1) (by Lemma 1, condition (2) requires a minimum of 7 samples).
WGN has been generated by taking d ∼ N(0, σ²I), where σ represents the standard deviation. We varied σ considering different scenarios of the signal-to-noise ratio (SNR), computed (command snr) by comparing the variables Bu (signal) and d (noise). This SNR measures how much noise enters the system relative to the intended input signal. Constant bias was chosen by applying to each input channel a value κ taken from a uniform distribution on (−κ̄, κ̄), for a given κ̄ > 0. Finally, the sinusoidal disturbance was chosen by applying to each input channel a signal κ sin(k), with κ given as above.
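The experiment setup can be sketched as follows (an illustrative Python analogue of the Matlab randn/snr workflow described above; variable names and the SNR formula in dB are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, T, sigma = 3, 1, 20, 0.1
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, m))
U0 = rng.standard_normal((m, T))               # input u ~ N(0, 1)
D0 = sigma * rng.standard_normal((n, T))       # WGN disturbance
X = np.zeros((n, T + 1)); X[:, 0] = rng.standard_normal(n)
for t in range(T):
    X[:, t + 1] = A @ X[:, t] + B @ U0[:, t] + D0[:, t]
X0, X1 = X[:, :T], X[:, 1:]                    # data matrices for design

# SNR (dB) comparing the signal Bu with the noise d, as in the text
signal = B @ U0
snr_db = 10 * np.log10(np.sum(signal**2) / np.sum(D0**2))
print(round(snr_db, 1))
```

For σ = 0.1 and unit-variance inputs this typically yields an SNR well above 0 dB, matching the "reasonable SNR" regime discussed below.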
We denote by S the percentage of times we get a stabilizing controller. We also compute the performance gap between the controller found via (23) and (32) and the optimal one. Specifically, for each type (and level) of noise, we let K^(k) and K^(k)_opt, with k = 1, . . ., 100, denote the controller found via (23) or (32) and the optimal one for the k-th experiment, and let E_k denote the relative performance error. We denote by M the median of E_k over all the experiments that return a stabilizing controller. Each type (and level) of noise was tested with the same set of plant matrices and inputs. Finally, we denote by V the percentage of times we infer stability via (35) and (36) assuming some prior knowledge on d. For WGN, we selected δ in (35) and (36) by taking σ̂ = 1.5σ (a 50% overestimate of σ) and by setting δ = √T σ̂ (cf. (44)). For constant and sinusoidal disturbances, we consider a worst-case estimate D̂_0 = κ̄1_{n×T}, where 1_{n×T} is the n × T matrix of all ones, yielding δ = √(nT) κ̄. These values of δ give a correct over-approximation of the norm of D_0 in all the experiments. These values of δ are also used to implement (32). Specifically, we implemented (32) by first computing the smallest µ² such that δ²I ⪯ µ²RR^⊤ with the choice R = X_1, and then by performing a line search on η_1. The choice R = X_1 has robust stability interpretations [De Persis and Tesi, 2020, Section V] and proved effective in the simulations.
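The relative performance error E_k can be computed as sketched below. This is an illustrative implementation assuming the standard discrete-time LQR cost with weights Q_x = I and R = I (the paper's exact weights are not restated here); the cost of a stabilizing gain K is trace(P), with P solving the closed-loop Lyapunov equation:

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

def lqr_gain(A, B, Qx, R):
    """Optimal LQR gain K_opt such that A + B K_opt is stable."""
    P = solve_discrete_are(A, B, Qx, R)
    return -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def lqr_cost(A, B, K, Qx, R):
    """Cost trace(P) with P solving
    (A+BK)' P (A+BK) - P + Qx + K'RK = 0 (K must be stabilizing)."""
    Acl = A + B @ K
    P = solve_discrete_lyapunov(Acl.T, Qx + K.T @ R @ K)
    return np.trace(P)

rng = np.random.default_rng(4)
n, m = 3, 1
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
Qx, R = np.eye(n), np.eye(m)
K_opt = lqr_gain(A, B, Qx, R)
K = K_opt + 0.01 * rng.standard_normal((m, n))     # e.g. a data-driven gain
E = (lqr_cost(A, B, K, Qx, R) - lqr_cost(A, B, K_opt, Qx, R)) \
    / lqr_cost(A, B, K_opt, Qx, R)
print(E >= 0)  # K_opt is optimal, so the relative error is nonnegative
```

In the simulations, K would be the controller returned by (23) or (32) rather than the illustrative perturbation used here.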
For solving (23) and (32) we used CVX [Grant and Boyd, 2014]. The results are reported in Table 1, and can be summarized as follows:
1. In all experiments, both methods (23) and (32) perform well for reasonable values of the SNR (≥ 25 dB) as well as for low-medium SNR values in the range (10, 20) dB. The method (23) performs better in terms of relative error but is slightly less robust, in line with the discussion in Section 5.2. For very low SNR (≤ 5 dB) the performance of both methods drops. We note (not reported in Table 1) that both methods settle to S = 76% for SNR ≤ −5 dB regardless of σ. This happens since 76% of the systems are open-loop stable, and K = 0 is feasible for both methods when U_0, X_0, and X_1 have full row rank. In this case, both methods select Q such that U_0Q = 0, X_1Q = 0, and X_0Q = I, leading to P = I and K = 0.
2. For both (23) and (32), robustness to noise can be further enhanced by adding a weight α > 1 to the term trace(V), so as to favour robustness over accuracy relative to K_opt (cf. Section 5). For instance, for WGN with σ = 0.1 the program (23) achieves S = 96% with α = 10, but at the expense of a reduced performance M = 0.0380. Using a weight α > 1 can be beneficial also for stability inference (quantity V), since a smaller V renders (35) and (36) easier to fulfil. For instance, under the same conditions as above, V increases from 11% to 46%.
3. For WGN, robustness can also be increased by averaging trajectories from multiple experiments (cf. Section 6.3). This is advantageous with respect to adding a penalty on trace(V) because no performance losses are introduced. To emphasize this point, the last three rows of Table 1 report the results with N = 100 repeated experiments for each system, although N = 10 suffices to get S = 96% with median relative error M = 0.0034 for σ = 0.1, and S = 90% with M = 0.0296 for σ = 0.5.
4. With stable dynamics, increasing T is usually beneficial for performance. From a theoretical viewpoint, this is due to the fact that increasing T reduces the term trace(V_o), and thus the relative error (cf. Section 6.1.2). With unstable dynamics this advantage is offset by the fact that the effect of the noise is amplified, which renders stability more difficult to achieve. In fact, we observed that decreasing T actually increases S in almost all scenarios, since stabilization of the unstable systems becomes easier (for instance, with T = 10 we obtain S = 82% for WGN with σ = 0.5).
We have also tested our methods on the Laplacian system considered in [Dean et al., 2019]. With (23), under the same setting (input and noise in N(0, 1)), an average of N = 10 trajectories of length T = 20 is sufficient to get S = 100% with M = 0.6569 over 100 experiments made by randomly changing the input and noise patterns. To further decrease M one needs to increase T (and N). In this case, increasing T does not cause issues since the dynamics are only mildly unstable and the input signals have zero mean.

Nonlinear inverted pendulum
Consider the Euler discretization of an inverted pendulum.
The system is as in (38), where ∆ is the sampling time, m is the mass, ℓ is the distance from the base to the center of mass of the balanced body, µ is the coefficient of rotational friction, and g is the acceleration due to gravity. The states x_1, x_2 are the angular position and velocity, respectively, and u is the applied torque. The system has an unstable equilibrium at (x̄, ū) = (0, 0), corresponding to the pendulum upright position, so that δx = x and δu = u. We assume the parameters ∆ = 0.01, m = ℓ = 1, µ = 0.01, and g = 9.8.
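The linearization around the upright equilibrium can be written down explicitly. This is a sketch assuming the Euler-discretized pendulum model x_1^+ = x_1 + ∆x_2, x_2^+ = x_2 + ∆((g/ℓ)sin x_1 − (µ/(mℓ²))x_2 + (1/(mℓ²))u) (a standard form consistent with the parameters above; the paper's exact matrices are not restated here):

```python
import numpy as np

# Parameters from the text: sampling time, mass, length, friction, gravity
Dt, m, l, mu, g = 0.01, 1.0, 1.0, 0.01, 9.8

# Jacobians of the discretized dynamics at the upright equilibrium (0, 0)
A = np.array([[1.0,         Dt],
              [Dt * g / l,  1.0 - Dt * mu / (m * l**2)]])
B = np.array([[0.0],
              [Dt / (m * l**2)]])

# The open-loop linearization is unstable (spectral radius > 1),
# consistent with the upright position being an unstable equilibrium.
print(max(abs(np.linalg.eigvals(A))) > 1)  # True
```

Data collected from the nonlinear pendulum near this equilibrium then play the role of noisy data for (A, B), with the linearization remainder acting as the disturbance d.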
We made 100 experiments by considering initial conditions drawn from N(0, 0.1), corresponding to an initial displacement from the equilibrium of about ±10°, and u ∼ N(0, 1). The results are in line with the previous ones. In particular, when ξ = 0 (the only disturbance source is the nonlinearity) we obtain S = 100% with M = 0.0356 using (23) with trajectories of length T = 20. We also considered the case of WGN affecting the velocity dynamics, i.e., with u replaced by u + ξ, where ξ ∼ N(0, σ). In this case, we obtain S = 100% for σ ≤ 0.1 (SNR ≥ 20 dB), down to S = 12% for σ = 1 (SNR ≈ 0 dB). Similar results are obtained with (32) and under different settings, that is, with different types of noise and different values of T. Since the equilibrium is unstable, reducing T can be beneficial for values of u and ξ that steer the system far from the equilibrium (for instance, using T = 10 we get S = 36% for WGN with σ = 1). As for linear systems, at the expense of reduced performance, robustness can be enhanced by adding a weight α > 1 to the term trace(V) (for instance, setting α = 10 we obtain S = 64% for WGN with σ = 1).

Concluding remarks
The design of (optimal) controllers from noisy data is a very challenging and largely unsolved problem. In this paper we took some steps in this direction for the LQR problem. By resorting to a convex SDP formulation of the LQR problem, we proposed two novel methods that explicitly account for noise through an augmented cost function which favours noise-robust solutions. Both methods provide finite-sample stability guarantees and do not require specific noise models, such as the noise being white.
A great leap forward would come from extending the ideas of this paper to incorporate state and input safety constraints [Wabersich and Zeilinger, 2018].At the moment of writing, we aim at tackling this challenge using concepts and tools from set-invariance control.For stabilization problems with no optimality requirements, recent results have shown that data-based formulations of set-invariance properties can be efficiently cast as linear programs, and they can handle noisy data [Bisoffi et al., 2019].

A Appendix
Proof of Lemma 2. The proof follows the same logical steps as [Scherer and Weiland, 2019, Proposition 3.13], given for the model-based approach; here, we consider a data-based version. Since X_0Q = P and K = U_0QP^{-1}, we have X_1Q = (AX_0 + BU_0 + D_0)Q = (A + BK)P + D_0Q. This implies A + BK = (X_1 − D_0)QP^{-1}. Hence, the first constraint in (12) is equivalent to S ⪯ 0, where S := (A + BK)P(A + BK)^⊤ − P + I (A.1). Hence K is stabilising. As for the second part of the claim, since S is symmetric there exists a matrix Ξ such that S + ΞΞ^⊤ = 0. Thus, P coincides with the controllability Gramian of the extended system.

Proof of Lemma 3. Consider any stabilising controller K and denote by P the controllability Gramian associated with the closed-loop system (5), which solves S = 0 with S as in (A.1). Consider the system of equations (13) in the unknown G, which admits a solution under (2). In particular, pick the minimum-norm least-squares solution G*. This gives the claim. As emerges from the proof, to any stabilizing controller we can associate an infinite number of tuples (γ, Q, P, L) feasible for (12), with Q = G*P + Q∼, where Q∼ is any matrix in the right kernel of W_0.
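The controllability Gramian used in both proofs can be computed numerically from the closed-loop Lyapunov equation. This sketch (with an illustrative stable closed-loop matrix built by normalization) verifies that the solution of A_cl P A_cl^⊤ − P + I = 0 is the positive-definite Gramian:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(5)
n = 3

# Build an assumed-stable closed loop Acl = A + BK by normalizing a
# random matrix so its spectral radius is strictly below one.
M = rng.standard_normal((n, n))
Acl = M / (1.1 * max(abs(np.linalg.eigvals(M))))

# Solve Acl P Acl' - P + I = 0: P is the closed-loop controllability
# Gramian (the P_o appearing in the proof of Lemma 2 / Lemma 3).
P = solve_discrete_lyapunov(Acl, np.eye(n))

residual = Acl @ P @ Acl.T - P + np.eye(n)
print(np.allclose(residual, 0))          # Lyapunov equation satisfied
print(np.all(np.linalg.eigvalsh(P) > 0))  # P is positive definite
```

In the paper this Gramian is computed for A_cl = A + BK_opt, yielding the particular optimal tuple discussed in Section 6.1.2.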

Table 1
Simulation results for 100 random linear systems. For WGN, the last three rows report the simulation results for (23) in the case of repeated experiments. Similar results are obtained for (32). The values of SNR are in dB.