Trade-offs in learning controllers from noisy data

In data-driven control, a central question is how to handle noisy data. In this work, we consider the problem of designing a stabilizing controller for an unknown linear system using only a finite set of noisy data collected from the system. For this problem, many recent works have considered a disturbance model based on energy-type bounds. Here, we consider an alternative more natural model where the disturbance obeys instantaneous bounds. In this case, the existing approaches, which would convert instantaneous bounds into energy-type bounds, can be overly conservative. In contrast, without any conversion step, simple arguments based on the S-procedure lead to a very effective controller design through a convex program. Specifically, the feasible set of the latter design problem is always larger, and the set of system matrices consistent with data is always smaller and decreases significantly with the number of data points. These findings and some computational aspects are examined in a number of numerical examples.


Problem formulation and related work
In the last few years, there has been a renewed growing interest towards data-driven control. Data-driven control provides an alternative tool for control design whenever modeling based on first principles is difficult or impossible [22,21] and it has been employed for robust and optimal control [2,13], predictive control [10,1], and control of nonlinear [23] and time-varying systems [20].
In this paper, we start from a basic control problem, that is, designing a controller by a finite set of data with T samples collected from a system. Specifically, for state x ∈ R n , input u ∈ R m and disturbance d ∈ R n , we apply an input sequence {u(0), . . . , u(T − 1)} to a linear timeinvariant discrete-time system and measure the state sequence {x(0), . . . , x(T )} generated as response as x(i + 1) = A x(i) + B u(i) + d(i), i = 0, . . . , T − 1. (1) The matrices A and B are unknown and we do not have access to disturbance d, so data are noisy. The objective is to use the input and (noisy) state sequence to design a feedback control law u = Kx to solve a stabilization problem, that is, ensuring that A + B K has eigenvalues with magnitude strictly less than 1 (Schur stability).
Albeit basic, this problem poses many challenges. In fact, since d in (1) affects the state sequence, it is impossible to uniquely reconstruct the system matrices A This research is partially supported by a Marie Sk lodowska-Curie COFUND grant, no. 754315 and by NWO, project no. 15472.
Email addresses: {a.bisoffi,c.de.persis}@rug.nl (Andrea Bisoffi, Claudio De Persis), pietro.tesi@unifi.it (Pietro Tesi) and B from the input and state sequences; instead, we have a set of possible matrices that could have generated the state sequence. Accordingly, the problem corresponds to a robust stabilization problem in the face of the uncertainty introduced by d, entailing a set of matrices to stabilize rather than a singleton. In the literature, several works have considered problems of this type, in the very same setting as here [13,3,4,25], or in other settings like linear-quadratic-regulator design [15,27,14], robust set invariance [6], switched and nonlinear systems [11,12,17,7].
In all aforementioned works, the starting point is to define a model for the disturbance. Typical choices on opposite sides of the spectrum are disturbances with known probability distributions or statistics (e.g., d is assumed a white Gaussian noise [15,27,14]) or the so-called unknown-but-bounded models, as we consider in this paper. In unknown-but-bounded models, d may be a deterministic function of time or a stochastic process and the only characteristic assumed to be known is that d is contained in a given region [13,25,3,4,14,6,11,17,7]. In this setting, a natural way to describe the uncertainty is to consider instantaneous constraints for d as where ≥ 0 represents our prior knowledge on d. This model naturally arises in many practical cases (e.g., with process load disturbances), and it has a long history in set-constrained control and set-membership identification [16,5,18]. As an alternative to instantaneous constraints, where d 0 , d 1 , . . . , d T −1 correspond to any disturbance sequence and e ≥ 0 represents our prior knowledge on the disturbance sequence. This model (or similar ones involving quadratic bounds on the possible disturbance sequences) has been considered in many recent works on data-driven control [13,25,3,4]. The reason for considering D e is essentially technical. By using D e , the uncertainty set of system matrices consistent with data is given by a single quadratic matrix inequality and this enables deriving simple stability conditions [13,Thm. 5], or even necessary and sufficient conditions under a mild regularity property on data as in [25,Thm. 14]. From a practical viewpoint, however, this disturbance model is artificial. In fact, D e is typically built starting from D i , i.e., by setting e = T ; see [3,§IV], [25,Ex. 3] and [4,§VI]. As a consequence, the bound e = T in (3) increases with T and this may lead to a set of data-consistent system matrices that grows with T (see Example 1) so that using larger data sets may even be detrimental, which is undesirable and counterintuitive. Among the aforementioned recent works, few considered instantaneous bounds. They were used in [11,12] for data-driven control of switched and polynomial systems, where the problem was reduced to a nonconvex quadratic program and then relaxed to a polynomial optimization problem. Working directly with D i was also pursued in [4, §V] as a secondary development, by using sum-of-squares relaxations.

Paper contribution
In this paper, we investigate pros and cons of using directly an instantaneous bound instead of translating it into an energy bound. We show that working with D i is indeed advantageous since (i) the uncertainty set resulting from D i is never larger than the one resulting from D e with e = T ; (ii) there exist simple design methods that work directly with D i and always return a stabilizing solution whenever the methods based on D e do. We also show that the extra numerical cost due to working with D i is quite modest up to large sets of data. Compared to [4], our method rests on simpler arguments and uses design tools purely based on linear matrix inequalities (instead of polynomial constraint qualification conditions that may be hard to assess). As an auxiliary contribution, we introduce a notion of size for matrix ellipsoids to numerically measure the uncertainty induced by d. This measure confirms that the uncertainty resulting from D i is typically order of magnitudes smaller than that resulting from D e . 1 Throughout the paper, ≺ ( ) and ( ) denote, respectively, negative (semi)definiteness and positive (semi)definiteness for matrices. I is the identity matrix. denotes transpose.

Structure
Section 2 obtains a size notion for matrix ellipsoids. In Section 3, we present the controller design problems for the energy-bound and instantaneous-bound approach, and relate them in terms of feasibility. In Section 4, we further compare the two approaches by examining the corresponding sets of dynamical matrices consistent with data. The numerical investigation in Section 5 completes the comparison, and leads to the conclusions of Section 6.

Preliminaries
In this section, we set up the notation and introduce a notion of size for matrix ellipsoids, which will be used in the numerical simulations of Section 5.

Notation
For a matrix A, |A| denotes its induced 2-norm. Given two matrices A and B, A ⊗ B denotes their Kronecker product. For a matrix A = a 1 · · · a n ∈ R m×n partitioned according to its columns, vec(A) := [a 1 · · · a n ] denotes its vectorization. A property of vectorization is that, for matrices A, X, B and C of compatible dimensions, the matrix equation AXB = C is equivalent to (B ⊗ A) vec(X) = vec(C) [19,Lemma 4.3.1]. For fixed natural numbers m and n, the inverse of the vectorization operator takes as input a vector a = [a 1 · · · a n ] ∈ R mn partitioned in components a 1 , . . . , a n ∈ R m and returns vec −1 m,n (a) := a 1 . . . a n ∈ R m×n .
For matrices A, B and C of compatible dimensions, we abbreviate ABC(AB) to AB·C[ ] , where the dot in the second expression clarifies unambiguously that AB are the terms to be transposed.

A size notion for matrix ellipsoids
For symmetric matrices P ∈ R p×p , Q ∈ R q×q , A ∈ R p×p , C ∈ R q×q and matrices Z c ∈ R p×q , B ∈ R p×q , we term matrix ellipsoid a set in one of the next two forms: The constraints Q 0 and B A −1 B − C 0 ensure that E mat and E mat are not empty or do not reduce to a singleton; the constraints P 0 and A 0 ensure that the matrix ellipsoid is a bounded set. We stress that many sets considered in the sequel (e.g., C in Section 3.1 and I in Section 5.1) have to be expressed in terms of these matrix ellipsoids, and that (5a) and (5b) are natural extensions of the classical ellipsoids in the Euclidean space, cf. Standard computations reformulate E mat as Hence, E mat and E mat are the same set for Let us focus on E mat to define a size. Since Q 0, E mat is equivalently written as where Y Y I is equivalent to |Y| ≤ 1. Then, we specify the measure space in question and the adopted measure. Given natural numbers p and q, vec : R p×q → R pq is a bijection between R p×q and R pq with inverse vec −1 p,q : R pq → R p×q . With P(T ) denoting the power set of a set T , we define the bijection V : P(R p×q ) → P(R pq ) as which is closed, and thus Lebesgue measurable [24, Lemma 1.2.13]. Accordingly, E mat belongs to B and its measure is where it is standard to identify the volume of E vec with its Lebesgue measure (see, e.g., [9, p. 105]). The next lemma determines vol(E vec ).
Lemma 1. Let p and q be given natural numbers, and let Q ∈ R q×q and P ∈ R p×p be symmetric positive definite matrices. Then, where β is a constant that depends only on p and q.
Proof. Let The last term in this expression depends only on p and q and det( In view of (9) and Lemma 1, the set E mat has measure µ(E mat ) = β(det Q) p 2 (det P) q . The constant β is analogous to the proportionality constant in [8, p. 42], which is the volume of the 2-norm unit ball. The precise expression of β is irrelevant for our developments because we will compare only matrix ellipsoids with same p and q. Hence, we disregard β and define the size of E mat simply as Since the sets E mat and E mat are the same when the correspondences in (8) hold, their size is in that case Both terms in this expression reduce to classical volume formulae for Z ∈ R p×1 [8, p. 42].

Data-consistent dynamics and controller design problems
In this section we return to our original problem of designing a stabilizing controller for (1) from a finite data set. We first determine the set of system matrices consistent with the data points when the disturbance model is given by D i or D e , and then formulate the corresponding control design problems. From (2), (3) and e = T as discussed in Section 1, we have For notational convenience, all data points are grouped as and the set of their relevant indices i is I := {0, . . . , T − 1}.

Dynamics consistent with the data
We call consistent with data all the matrices (A, B) that, for the selected input sequence, could have generated the measured state sequence while keeping d in the bound D e in (11b) or D i in (11a), and we characterize the corresponding two sets in this section.
Based on the bound in (11b), the matrices (A, B) consistent with the data points are in i.e., all those matrices for which some disturbance sequence satisfying the bound in (11b) could have generated the measured data. By eliminating D in (12), C rewrites as Analogously, based on the bound in (11a), the matrices (A, B) consistent with a data point i ∈ I are in i.e., all those matrices for which some disturbance d satisfying the bound in (11a) could have generated the measured data point i. By eliminating d in (14), C i rewrites as The expression of C i in (15) mirrors that of C in (13), which was the reason to write dd I in (14) instead of the equivalent but more immediate |d| 2 ≤ . The set of matrices (A, B) consistent with all data points is then When neglecting the constraint A 0 in (6), both the sets C and C i can be expressed in the form (5b) of matrix ellipsoids. Indeed, the inequalities constituting the sets C in (13) and C i in (15) can be equivalently written as where the matrices A j , B j , C j with j = I for C and j = i for C i are defined in (17b), displayed below over two columns. This corresponds to (5b) with Z = A B . We note two facts. The set C is not necessarily bounded. It is bounded if A I 0 or, equivalently, the matrix X0 U0 has full row rank. This condition expresses the property that data are sufficiently rich in content, and is related to the notion of persistency of excitation [26]. On the other hand, condition A i 0 for C i is never satisfied. However, this is irrelevant since the set of interest when dealing with instantaneous bounds is I, formed by the intersection of all C i as in (16). We will prove below in Proposition 3 that I ⊆ C, which guarantees that I is bounded whenever C is.

Controller design methods
We now formulate the control design problems related to the two sets C and I, i.e., to the energy bound D e and to the instantaneous bound D i .
The energy-bound approach solves: where the condition in the second line expresses discretetime asymptotic stability. The instantaneous-bound approach solves: for all (A, B) such that (A, B) ∈ I.
Both the feasibility problems (18) and (19) correspond to robust stabilization in the face of the uncertainty introduced by d, which induces a set of matrices to stabilize rather than a singleton. We stress that we consider here a stabilization problem as a prototypical control problem, but our main conclusions would apply unchanged to other control problems like L 2 gain minimization or H ∞ control. By the lossless matrix S-procedure [25, Thm. 12] and Schur complement, the result [25,Thm. 13] shows that, under a mild regularity condition on data, feasibility of (18) is equivalent to feasibility of find P, Y, β, α where now the decision variables appear linearly. If (20) has a solution, K = Y P −1 is a stabilizing controller for (1). On the other hand, a tractable equivalent reformulation of (19) based on matrix ellipsoids cannot be obtained. Indeed, even in the simplest case where n = m = 1, the set I is an intersection of ellipsoids and finding the ellipsoid with minimum volume containing I is NP-complete [8, p. 44]. Hence, finding such an ellipsoid and applying then the necessary and sufficient conditions in [25,Thm. 12] is impractical. To pursue an alternative approach, we need the next lemma, which is an immediate extension to matrices of the classic lossy S-procedure [8, §2.6.3].
Lemma 2. Let T 0 , . . . , T ∈ R (q+p)×(q+p) be symmetric matrices. If there exist nonnegative scalars τ 1 , . . . , τ such Proof. For an arbitrary Z satisfying [ We use the lossy matrix S-procedure in Lemma 2 and obtain a sufficient condition to guarantee feasibility of (19) in the next proposition.

By Schur complement and change of variable
By Lemma 2, feasibility of this problem implies feasibility of (19), which proves the statement.
As in (20), the decision variables appear linearly in (21) and solving (21) yields the controller gain K = Y P −1 . With respect to (20), solving (21) involves T scalar variables τ i instead of a single one α. Having as many decision variables as the number of data points may be computationally demanding with (very) large data sets. We will further elaborate on this point later in Section 5.3.
The feasibility properties of (20) and (21) are related. In particular, the next proposition shows that solving (21) is as easy or easier than solving (20).
In the next section, we aim at quantifying the gap in terms of feasibility. To this end, we shift the focus from the design problems (18) and (19) to the corresponding sets C and I because (18) and (19) impose the very same stability condition on C and I, respectively. Hence, the smaller C or I is, the larger the feasible set [9, §4.1.1] of the design problems (18) or (19) is. Next section will actually show that I ⊆ C.

Comparison between energy and instantaneous bounds: Theoretical evidence
In this section, we firstly show a monotonicity property of the set I with respect to the number T of data points, and, secondly, a relation between the sets C and I.
Firstly, to illustrate monotonicity properties, we would like to highlight the dependence of C and I on the number T of data points. With some notation abuse, we then write equivalently C or C(T ), and I or I(T ). It is immediate that because I(T +1) contains precisely the same constraints as I(T ) plus an additional one. This is a desirable property because we expect more data to reduce the uncertainty by providing additional information, or, in the worst case, to keep the uncertainty at the same level. On the other hand, we build the next example to show that C(T + 1) ⊆ C(T ) does not hold in general.
which are depicted in Fig. 1. The ellipsoid corresponding to C(T ) does not always shrink with larger T , but more data (such as when T goes from 2 to 3) can actually induce larger bounds in an undesirable way.
Secondly, the next proposition provides a relation between C and I, which corroborates Proposition 2. Proof. We assume that (A, B) belongs to I = T −1 i=0 C i , and show that, then, (A, B) belongs to C. Let us then manipulate the left-hand side of the inequality in (13) as Each term of the sum is positive semidefinite since (A, B) is assumed to belong to T −1 i=0 C i , hence the whole sum is positive semidefinite. This proves that (A, B) ∈ C.
Proposition 3 establishes that the set I is contained in, or at most equal to, the set C. We are going to show now through numerical experiments that the former set actually has size significantly smaller than the latter.

Comparison between energy and instantaneous bounds: Numerical evidence
In this section, we complement the theoretical results in Propositions 2 and 3 with numerical evidence showing the actual relation between I and C, their dependence on T , and the impact of these two aspects on feasibility of the control design problems. For this comparison, we preliminarily obtain in Section 5.1 an over-approximation of I. We note that this over-approximation, although providing insights on this comparison, does not play any role in controller design.

Overapproximation I of the set I
Whereas the size of C can be obtained as in Section 2.2, a measure of I := T −1 i=0 C i is difficult to obtain exactly. In fact, I is an intersection of matrix ellipsoids, and for n = m = 1, even finding the ellipsoid of minimum volume containing I is NP-complete [8, p. 44]. We thus set up a convex optimization problem to obtain a computable over-approximation I of I. Considering I for the comparison with C is relevant because, albeit being an overapproximation, its size is still (significantly) smaller than the size of C in all the subsequent numerical examples.
We assume that the set I is bounded, cf. Section 3.1. Using the approach in [8, §3.7.2] for classical ellipsoids, we find a matrix ellipsoid I that includes all C i 's through the lossy S-procedure in Lemma 2, and we minimize the size of I, defined in Section 2.2. We take I as where we set C = B A −1 B − I (otherwise the representation of I is homogeneous) and require A 0 so that (6) holds. We impose I = T −1 i=0 C i ⊆ I via the lossy matrix S-procedure in Lemma 2, and obtain Since the first inequality is nonlinear in the decision variables B and A, we rewrite it by Schur complement as By (10), the size of I is given by (det A) − n 2 thanks to the adopted normalization. These constraints and objective function result in the optimization problem When n = m = 1, this optimization problem boils down to that in [8,Eq. (3.15)], but (24) also captures the case of generic dimensions n and m, as we need in the following.

A visualizable example
In this section we examine more thoroughly the system of Example 1 and show how the sizes of C and I depend on T and how they compare with each other.
We prolong the data sequences of Example 1 by using as input and disturbance the realizations of random vari-   Fig. 1, is now depicted for some T up to 1000 in Fig. 2. Fig. 2 shows that the energy-based ellipsoids C shrink very slowly as T increases, and stay approximately constant from T = 250 on. The resulting set I is depicted for the same values of T in Fig. 3 together with its overapproximation I, determined as in Section 5.1. Fig. 3 shows that the ellipsoids I shrink very quickly with T and, albeit only over-approximations of I, they are significantly smaller than the ellipsoids C. The actual sets I, which are depicted by shaded areas, shrink even faster to the point that they are barely visible already for T = 250.
For a clearer visualization, Fig. 4 depicts the ratio between the sizes of C and I. I is smaller than C by more than 30 times already with approximately 200 data points.

Example with third order dynamics
In this section, we delve into the considerations of Section 5.2 through a more complex example with matrices A data set is obtained by applying the realization of a Gaussian random variable (zero mean and unit standard deviation) as input, and as disturbance the realization of a random variable distributed uniformly in {d : |d| 2 ≤ } with different values for .
As in the previous example, we compare the sizes of C and of I assuming = 0.1. Their ratio is plotted in Fig. 5 for increasingly larger portions of data from T = 20 up to  T = 1000. The size of I is smaller than C by approximately two orders of magnitude already with 100 data points. We commented after Proposition 1 that the energy-based approach (20) requires T − 1 scalar decision variables less than the formulation in (21). We show numerically that the impact of these extra variables is modest even with a large number of data. The computations times 2 to solve (20) or (21) are depicted in Fig. 6. The controller design in (20) has the appealing feature that the associated computation time barely changes with the number of data points whereas (21) exhibits a dependence on T that results in a computation time that grows linearly with T .
Finally, we examine the joint effect of the disturbance bound and of the length T on both the approaches. For each value of and T , we solve a batch of 100 feasibility problems (20) or (21), count the number n feas of feasible ones in that batch and display the ratio n feas /100 ∈ [0, 1] in Figs. 7-8 through a color code. Yellow areas are the good ones in terms of feasibility. Fig. 7 depicts this ratio for the energy-bound approach and shows that above a threshold of approximately 0.7, the problem (20) has no solutions regardless of the number of data. This behaviour seems consistent with what we observed in Fig. 2 about the fact that the set of consistent matrices is not reduced by larger amounts of data. Fig. 8 depicts the ratio n feas /100 for the instantaneous-bound approach, and shows its appealing feature of being able to withstand increasingly larger disturbance magnitudes as long as an increasingly number of data points is collected to reduce uncertainty. Comparing Figs. 5-6 with Figs. 7-8, we can conclude that the price paid in terms of computation time is negligible compared with the gain in terms of robustness.

Conclusions
In this paper, we compared two different disturbance models for data-driven control design: a model D e considering energy bounds on the entire disturbance sequence and a model D i considering instantaneous bounds on the disturbance. The model D e leads to elegant necessary and sufficient condition for stability, and results in a design program whose number of variables depends only on the dimensions of the system to be controlled and not on the number of data points. On the other hand, the model D i reflects a much more natural way of describing disturbances.
We analyzed pros and cons of working directly with D i , instead of converting it into D e . The analysis shows that, computational considerations aside, it is in fact always preferable to work with D i . First, D i results in a design program that is always feasible whenever the one associated with D e is feasible. Second, numerical evidence shows that the feasibility gap can be extremely large. This latter aspect has been analyzed by introducing a notion of size for the uncertainty set induced by the disturbance. Simulations show that while the uncertainty set associated with D e does not necessarily shrink, the one associated with D i often shrinks very quickly, and is many order of magnitudes smaller. As for computational considerations, working with D i is less advantageous because it results in a design program whose number of variables depends on the number of data points. With the exception of extremely large data sets, however, the price paid in terms of computation time appears negligible compared with the advantages offered in terms of robustness.