Explicit construction of the minimum error variance estimator for stochastic LTI state-space systems

In this short article, we present the derivation of the optimal (minimum error variance) estimator for the case where one part of the output of a stochastic LTI system is not measured but can be predicted from the measured outputs. Similar derivations have appeared before, but not in terms of state-space representations.


Introduction
Realization theory of stochastic linear time-invariant state-space representations (LTI-ss) with exogenous inputs is a mature theory [12,11]. In particular, there is a constructive theory for a minimal stochastic LTI-ss representation of a process y with exogenous input w. The construction uses geometric ideas and is based on the oblique projection of future outputs onto past inputs and outputs. Note that in system identification it is often assumed that the joint process (y, w) has a realization by an autonomous stochastic LTI system driven by white noise. Indeed, if w has a realization by a stochastic LTI-ss representation driven by i.i.d. Gaussian noise, and y has a realization by an LTI-ss representation with exogenous input w and i.i.d. Gaussian noise, then under some mild assumptions (absence of feedback from y to w) the joint process (y, w) is the output of an autonomous stochastic LTI-ss representation. It is then natural to ask how to construct a minimal stochastic LTI-ss realization of y with input w from an LTI-ss realization of the joint process (y, w), instead of computing a realization of y using oblique projections.

In this paper we present an explicit construction of a minimal stochastic LTI-ss representation of y with exogenous input w from an autonomous stochastic LTI-ss representation of the joint process (y, w). The basic idea is as follows: we assume that (y, w) is a stationary, square-integrable, zero-mean, jointly Gaussian stochastic process and that there is no feedback from y to w. We then use the result of [10] stating that there exists a minimal LTI-ss realization of (y, w) whose matrices admit an upper block-triangular form. This allows us to separate out the part of the innovation noise of (y, w) which drives only w, and thus to formulate the construction.

Our motivation for developing an explicit construction of an LTI-ss realization of y with input w from an LTI-ss realization of (y, w) is that this construction turned out to be useful in deriving non-asymptotic error bounds of PAC-Bayesian type [1] for LTI-ss systems [7]. The latter could be a first step towards extending the PAC-Bayesian framework to stochastic state-space representations. More precisely, one of the byproducts of the construction of this paper is a one-to-one relationship between LTI-ss systems which generate (y, w) and optimal linear estimators of future values of y based on past values of w. This relationship is useful in Bayesian learning algorithms, where one needs to define a parameterised set of predictors (a hypothesis class). All prior knowledge or uncertainty about the data-generating system can then easily be mapped to knowledge or uncertainty about the predictor.

The contribution of the paper can also be viewed as follows. We wish to construct an estimator of y(t) given past (s < t) and present (s = t) measurements of w(s). We consider a specific class of relationships between the two processes, namely that they are related by a common stochastic LTI-ss system, i.e., $[y^T(t)\ w^T(t)]^T$ is the output of an LTI-ss system. The problem of finding this estimator can also be thought of as estimating non-measurable quantities of a system from measurable ones.
Related work: As pointed out above, stochastic realization theory with inputs is a mature topic with several publications; see the monographs [12,11,4] and the references therein. However, we have not found in the literature an explicit procedure for constructing a stochastic LTI-ss realization of y in forward innovation form, with input w, from the joint stochastic LTI-ss realization of (y, w). The current note is intended to fill this gap. We further analyse the relationship between y and w under the feedback-free assumption. In [9], the author defines what it means for one process to cause another, a notion similar to feedback. In [2], the authors extend this notion and define weakly and strongly feedback-free processes. As the strong feedback-free condition implies the weak one, we work with the relaxed, weak feedback-free case throughout the paper. In the frequency domain, describing the processes y and w by causal real rational transfer function matrices and analysing them under the feedback-free assumption yields a straightforward construction of an estimator of y given w, see [3] and [8]. In this paper we study this problem in the time domain, using LTI-ss representations.
Outline: This paper is organised as follows. Below we define the notation and terminology used in the paper. In Section 2 we reformulate the state-space system driven by the innovation of $[y^T(t)\ w^T(t)]^T$ into a state-space system which yields a realisation of y, driven by w and by the innovation of the purely non-deterministic part of y. Afterwards, in Section 4.1, given this new realisation, we derive the optimal (in the sense of minimum error variance) estimate of y.
Notation and terminology Let F denote a σ-algebra on the set Ω and P be a probability measure on F. Unless otherwise stated, all probabilistic considerations will be with respect to the probability space (Ω, F, P). In this paragraph, let E denote some Euclidean space. We associate with E the topology generated by the 2-norm $\|\cdot\|_2$ and the Borel σ-algebra generated by the open sets of E. The closure of a set M is denoted cl M. For S ⊆ N and stochastic variables y, z_1, z_2, ... with values in R, we denote by $E(y \mid \{z_i\}_{i \in S})$ the conditional expectation of y with respect to the σ-algebra $\sigma(\{z_i\}_{i \in S})$ generated by the family $\{z_i\}_{i \in S}$. Recall that E(zx) defines an inner product in $L^2(\Omega, F, P)$ and that $E(y \mid \{z_i\}_{i \in S})$ can be interpreted as the orthogonal projection onto the closed subspace $L^2(\Omega, \sigma(\{z_i\}_{i \in S}), P)$, which can also be identified with the closure of the subspace generated by

$$\sum_{i \in S} \alpha_i z_i, \qquad \alpha_i \in \mathbb{R}, \qquad (1)$$

with only a finite number of summands in (1) being nonzero when S = N. Moreover, for a closed subspace H of $L^2(\Omega, F, P)$ and a stochastic variable y with values in E and $E(\|y\|_2^2) < \infty$, we let $E(y \mid H)$ denote the dim(E)-dimensional vector with ith coordinate equal to $E(y_i \mid H)$, with $y_i$ denoting the ith coordinate of y.
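As a concrete illustration of the projection interpretation (a standard fact, not from the original text): for zero-mean, jointly Gaussian scalar variables y and z with $E(z^2) > 0$,

$$E(y \mid z) = \frac{E(yz)}{E(z^2)}\, z, \qquad E\big[\big(y - E(y \mid z)\big)\, z\big] = 0,$$

i.e., the conditional expectation is exactly the orthogonal projection of y onto the one-dimensional subspace spanned by z, and the estimation error is uncorrelated with z.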
There are two closed subspaces of particular importance. Following [12], for a discrete-time stochastic process z(t) with values in E and $E(\|z(t)\|_2^2) < \infty$, we write $H^-_t(z)$ for the closure of the subspace in $L^2(\Omega, F, P)$ generated by the coordinate functions $z_i(s)$ of z(s) for all s < t. That is,

$$H^-_t(z) = \mathrm{cl}\Big\{ \sum_{s < t} \alpha_s^T z(s) \;\Big|\; \alpha_s \in E \Big\}, \qquad (2)$$

with T indicating transpose and only a finite number of summands in (2) being nonzero. In a similar manner we define $H^+_t(z)$ from the coordinate functions $z_i(s)$ for all s ≥ t. Let A, B and C be closed subspaces of $L^2(\Omega, F, P)$. We then say that A and B are orthogonal given C, denoted $A \perp B \mid C$, if

$$E\big[\big(a - E(a \mid C)\big)\big(b - E(b \mid C)\big)\big] = 0$$

for all a ∈ A and b ∈ B.
We use the following notation: $Y = \mathbb{R}^p$, $W = \mathbb{R}^q$, and $Y \sqcup W$ for the disjoint union.

Assumptions
Suppose we want to construct an estimator of the output stochastic process y(t) : Ω → Y given a sequence of measurements as inputs, obtained from the stochastic process w(t) : Ω → W. In order to narrow down and formally describe the estimation problem, we assume that the processes y(t) and w(t) can be represented as outputs of an LTI system in forward innovation form:

Assumption 1 The processes y(t) and w(t) can be generated by a stochastic discrete-time minimal LTI system of the form

$$x(t+1) = A_g x(t) + K_g e_g(t), \qquad \begin{bmatrix} y(t) \\ w(t) \end{bmatrix} = C_g x(t) + e_g(t), \qquad (7)$$

where x and $e_g$ are stationary, square-integrable, zero-mean, and jointly Gaussian stochastic processes. The processes x and $e_g$ are called the state and noise process, respectively. Recall that stationarity and square-integrability imply constant expectation and a covariance matrix that depends only on the time lag (t − s). Furthermore, we require that $A_g$ is stable (all its eigenvalues are inside the open unit circle) and that for any k > 0, $E[e_g(t) e_g^T(t-k)] = 0$ and $E[e_g(t) x^T(t-k)] = 0$, i.e., the stationary Gaussian process $e_g(t)$ is white noise and uncorrelated with x(t − k). We identify the system (7) with the tuple $(A_g, K_g, C_g, I, e_g)$; note that the state process x is uniquely defined by the infinite sum

$$x(t) = \sum_{k=1}^{\infty} A_g^{k-1} K_g\, e_g(t-k).$$

Before we can continue, we have to consider the relationship between y and w. For technical reasons we cannot have feedback from y to w, as w would then be determined by a dynamical relation involving the past of the process y. We thus make the following assumption.

Assumption 2 There is no feedback from y to w, following Definition 17.1.1 from [12], i.e.,

$$H^+_t(w) \perp H^-_t(y) \mid H^-_t(w)$$

holds, i.e., the future of w is conditionally uncorrelated with the past of y, given the past of w.
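To make the setting concrete, here is a minimal simulation sketch of a generator of the form (7); the matrices below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical system matrices for a generator of the form (7);
# the values are illustrative, not taken from the paper.
A_g = np.array([[0.5, 0.1],
                [0.0, 0.3]])          # stable: eigenvalues inside unit circle
K_g = np.array([[0.2, 0.0],
                [0.1, 0.4]])
C_g = np.array([[1.0, 0.0],           # first output row -> y (p = 1)
                [0.0, 1.0]])          # second output row -> w (q = 1)
Q   = np.eye(2)                       # covariance of the innovation e_g

def simulate(T, burn_in=500):
    """Simulate [y(t); w(t)] from the forward innovation form (7)."""
    n = A_g.shape[0]
    x = np.zeros(n)
    ys, ws = [], []
    for t in range(T + burn_in):
        e = rng.multivariate_normal(np.zeros(2), Q)  # white Gaussian noise
        out = C_g @ x + e                            # stacked output [y; w]
        x = A_g @ x + K_g @ e                        # state update
        if t >= burn_in:                             # discard the transient
            ys.append(out[0]); ws.append(out[1])
    return np.array(ys), np.array(ws)

y, w = simulate(1000)
```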
As a passing remark, the no-feedback assumption is equivalent to the weak feedback-free assumption [2] and to Granger non-causality [9]. Thus the no-feedback assumption can be stated as: y does not Granger-cause w.

Result
Under Assumption 2, there exists a similarity transformation T of (7) such that $\bar{A}_g = T A_g T^{-1}$, $\bar{K}_g = T K_g$ and $\bar{C}_g = C_g T^{-1}$ are upper block-triangular; specifically, (7) can be represented as

$$\begin{bmatrix} x_1(t+1) \\ x_2(t+1) \end{bmatrix} = \begin{bmatrix} A_{1,1} & A_{1,2} \\ 0 & A_{2,2} \end{bmatrix} \begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} + \begin{bmatrix} K_{1,1} & K_{1,2} \\ 0 & K_{2,2} \end{bmatrix} \begin{bmatrix} e_1(t) \\ e_2(t) \end{bmatrix}, \qquad \begin{bmatrix} y(t) \\ w(t) \end{bmatrix} = \begin{bmatrix} C_{1,1} & C_{1,2} \\ 0 & C_{2,2} \end{bmatrix} \begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} + \begin{bmatrix} e_1(t) \\ e_2(t) \end{bmatrix}, \qquad (8)$$

where $[e_1^T(t)\ e_2^T(t)]^T = e_g(t)$, and such that $(A_{2,2}, K_{2,2}, C_{2,2}, I, e_2)$ is a minimal Kalman representation of w. The optimal estimate $\hat{y}(t) = E[y(t) \mid H^-_{t+1}(w)]$, in the least-squares sense, is then given as the output of the following LTI system:

$$\hat{x}(t+1) = \bar{A}_g\, \hat{x}(t) + \bar{K}_g \begin{bmatrix} Q_{1,2} Q_{2,2}^{-1} \\ I \end{bmatrix} \big(w(t) - \bar{C}_w\, \hat{x}(t)\big), \qquad \hat{y}(t) = \bar{C}_y\, \hat{x}(t) + Q_{1,2} Q_{2,2}^{-1} \big(w(t) - \bar{C}_w\, \hat{x}(t)\big), \qquad (9)$$

where $\bar{C}_y = [C_{1,1}\ C_{1,2}]$ and $\bar{C}_w = [0\ C_{2,2}]$ denote the two block rows of $\bar{C}_g$, and the covariance $Q = E[e_g(t)\, e_g^T(t)]$ is partitioned according to (8).
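For concreteness, here is a minimal Python sketch of the estimator recursion (9), assuming the block-triangular matrices of (8) and the partitioned covariance Q are already available (the function and variable names are ours, not from the paper):

```python
import numpy as np

def optimal_estimator(A_bar, K_bar, C_bar, Q, p, w_seq):
    """Run the estimator (9): y_hat(t) = E[y(t) | w(t), w(t-1), ...].

    A_bar, K_bar, C_bar, Q : block-triangular system matrices and innovation
        covariance of (8); p is the dimension of y. All values are assumed
        to come from the construction in the text.
    w_seq : array of shape (T, q) with measurements of w.
    """
    n = A_bar.shape[0]
    C_y, C_w = C_bar[:p, :], C_bar[p:, :]      # block rows of C_bar
    Q12, Q22 = Q[:p, p:], Q[p:, p:]
    G = Q12 @ np.linalg.inv(Q22)               # gain of E[e1 | e2] = G e2
    L = K_bar @ np.vstack([G, np.eye(Q22.shape[0])])

    x_hat, y_hat = np.zeros(n), []
    for w_t in w_seq:
        e2 = w_t - C_w @ x_hat                  # innovation of w at time t
        y_hat.append(C_y @ x_hat + G @ e2)      # output equation of (9)
        x_hat = A_bar @ x_hat + L @ e2          # state update of (9)
    return np.array(y_hat)

# Example call with q = 1 measurements (placeholder values):
# y_hat = optimal_estimator(A_bar, K_bar, C_bar, Q, p=1, w_seq=w[:, None])
```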

Derivation
Several results can be deduced from Assumption 2. First, by [12, Proposition 2.4.2], we obtain the following relation between projections: for every $\lambda \in H^+_t(w)$,

$$E[\lambda \mid H^-_t(y) \vee H^-_t(w)] = E[\lambda \mid H^-_t(w)]. \qquad (13)$$

Secondly, from [12, Ch. 17] it follows that the process y can then be decomposed into a deterministic part $y_d$ and a stochastic part $y_s$, as follows:

$$y(t) = y_d(t) + y_s(t), \qquad y_d(t) = E[y(t) \mid H(w)], \qquad (14)$$

where H(w) denotes the closed subspace generated by the coordinate functions of w(s) for all s. Note that, as a consequence of (13) and (14),

$$E[y_d(t)\, y_s^T(s)] = 0 \quad \text{for all } t, s,$$

i.e., $y_d$ and $y_s$ are uncorrelated. Moreover, the process $y_s$ can be realised by a state-space system in forward innovation form,

$$x_s(t+1) = A_s x_s(t) + K_s e_s(t), \qquad y_s(t) = C_s x_s(t) + e_s(t).$$

Finally, from [12, Proposition 17.1.3] we get

$$H^-_t(y, w) = H^-_t(w) \oplus H^-_t(y_s), \qquad (17)$$

where ⊕ denotes the orthogonal sum. Now consider a similarity transformation T of (7) such that $\bar{A}_g = T A_g T^{-1}$, $\bar{K}_g = T K_g$ and $\bar{C}_g = C_g T^{-1}$ are upper block-triangular, see (8). From [10] it then follows that $(A_{2,2}, K_{2,2}, C_{2,2}, I, e_2)$ is a minimal Kalman representation of w; hence $e_2(t)$ is the innovation process of w, i.e.,

$$e_2(t) = w(t) - E[w(t) \mid H^-_t(w)]. \qquad (18)$$

Moreover, the transformed system (8) induces a relation between the output y and the input w. In detail, from the output equation of (8) we also have

$$e_2(t) = w(t) - C_{2,2}\, x_2(t). \qquad (19)$$

Hence, substituting (19) into (8) yields the following realisation of y:

$$\bar{x}(t+1) = \Big(\bar{A}_g - \begin{bmatrix} K_{1,2} \\ K_{2,2} \end{bmatrix} \bar{C}_w\Big) \bar{x}(t) + \begin{bmatrix} K_{1,2} \\ K_{2,2} \end{bmatrix} w(t) + \begin{bmatrix} K_{1,1} \\ 0 \end{bmatrix} e_1(t), \qquad y(t) = \bar{C}_y\, \bar{x}(t) + e_1(t), \qquad (20)$$

with $\bar{C}_y = [C_{1,1}\ C_{1,2}]$ and $\bar{C}_w = [0\ C_{2,2}]$ the block rows of $\bar{C}_g$. Note that $e_1(t)$ is the innovation process of y (with respect to w), i.e.,

$$e_1(t) = y(t) - E[y(t) \mid H^-_t(y, w)]. \qquad (21)$$
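The substitution step from (19) to (20) is purely algebraic and can be checked numerically. Below is a minimal sketch with illustrative block-triangular matrices (the numerical values are assumptions, not those of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical block-triangular matrices as in (8); illustrative values only.
A = np.array([[0.4, 0.2],
              [0.0, 0.5]])
K = np.array([[0.3, 0.1],
              [0.0, 0.6]])
C = np.array([[1.0, 0.5],    # block row for y: [C11 C12]
              [0.0, 1.0]])   # block row for w: [0   C22]
C_y, C_w = C[:1, :], C[1:, :]
K1, K2 = K[:, :1], K[:, 1:]  # columns multiplying e1 and e2

T = 200
e = rng.normal(size=(T, 2))  # joint innovation [e1, e2]

# Simulate the joint system (8).
x = np.zeros(2); y_joint = []; w_seq = []
for t in range(T):
    y_joint.append(C_y @ x + e[t, :1])
    w_seq.append(C_w @ x + e[t, 1:])
    x = A @ x + K @ e[t]

# Simulate the realisation (20) of y driven by w and e1 only,
# obtained by substituting e2(t) = w(t) - C_w x(t) into (8).
x = np.zeros(2); y_real = []
for t in range(T):
    y_real.append(C_y @ x + e[t, :1])
    x = (A - K2 @ C_w) @ x + K2 @ w_seq[t] + K1 @ e[t, :1]

assert np.allclose(y_joint, y_real)  # (20) reproduces y exactly
```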

Optimal estimate
The goal in this section is to derive the optimal estimate (in the sense of minimum error variance)

$$\hat{y}(t) = E[y(t) \mid H^-_{t+1}(w)],$$

where $H(e_2(t)) = \{\alpha^T e_2(t) \mid \alpha \in \mathbb{R}^q\}$ is the space spanned by the innovation process $e_2(t)$, considered only at time t. By definition we have

$$H^-_{t+1}(w) = H^-_t(w) \oplus H(e_2(t)), \qquad \hat{y}(t) = E[y(t) \mid H^-_t(w)] + E[y(t) \mid H(e_2(t))].$$

However, using the definition of $e_2(t)$ from (18), we have

$$E[y(t) \mid H(e_2(t))] = E[y(t)\, e_2^T(t)]\, Q_{2,2}^{-1}\, e_2(t),$$

and therefore, using (21), we can see that

$$E[y(t)\, e_2^T(t)] = E\big[(\bar{C}_y\, \bar{x}(t) + e_1(t))\, e_2^T(t)\big] = Q_{1,2},$$

where $E[\bar{x}(t)\, e_2^T(t)] = 0$ follows from (17). Now (20) can be used to derive a dynamical expression for $\hat{x}_g(t) = E[\bar{x}(t) \mid H^-_t(w)]$ as follows:

$$\hat{x}_g(t+1) = \bar{A}_g\, \hat{x}_g(t) + \bar{K}_g \begin{bmatrix} Q_{1,2} Q_{2,2}^{-1} \\ I \end{bmatrix} \big(w(t) - \bar{C}_w\, \hat{x}_g(t)\big).$$

Clearly $E[w(t) \mid H^-_{t+2}(w)] = w(t)$. For the state projection, note that the state vector $\bar{x}(t)$ can be expressed as an infinite sum of past noise terms using (20), where (17) is used to handle $e_s(t)$. Finally, from the decomposition (14) we observe that

$$E[y_s(t) \mid H^-_{t+1}(w)] = 0,$$

since $H(y_s) \perp H(w)$ by (14). Combining the above, we have obtained (9), the formula for the minimum prediction error variance estimate of y(t) based on $H^-_{t+1}(w)$ (the present and past of the process w).
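As a sanity check of optimality, the orthogonality principle states that the prediction error $y(t) - \hat{y}(t)$ must be uncorrelated with present and past values of w. A simulation sketch with illustrative matrices (all numerical values are assumptions, chosen so that both $A - KC$ and $A_{2,2} - K_{2,2} C_{2,2}$ are stable, as the innovation-form construction requires):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative block-triangular system of the form (8); the values are
# assumptions, not taken from the paper.
A = np.array([[0.4, 0.2], [0.0, 0.5]])
K = np.array([[0.3, 0.1], [0.0, 0.6]])
C = np.array([[1.0, 0.5], [0.0, 1.0]])
Q = np.array([[1.0, 0.3], [0.3, 1.0]])     # innovation covariance
C_y, C_w = C[:1, :], C[1:, :]
G = Q[:1, 1:] @ np.linalg.inv(Q[1:, 1:])   # Q12 Q22^{-1}
L = K @ np.vstack([G, np.eye(1)])          # estimator gain of (9)

T = 50_000
e = rng.multivariate_normal(np.zeros(2), Q, size=T)
x, xh = np.zeros(2), np.zeros(2)
errs, ws = [], []
for t in range(T):
    y = C_y @ x + e[t, :1]                 # outputs of (8)
    w = C_w @ x + e[t, 1:]
    e2h = w - C_w @ xh                     # innovation of w, cf. (19)
    errs.append((y - (C_y @ xh + G @ e2h))[0])
    ws.append(w[0])
    x = A @ x + K @ e[t]                   # true state update
    xh = A @ xh + L @ e2h                  # estimator state update, (9)

errs, ws = np.array(errs), np.array(ws)
# Orthogonality principle: the prediction error is (empirically)
# uncorrelated with present and past values of w.
for k in range(3):
    print(f"E[(y - y_hat)(t) w(t-{k})] ~ {np.mean(errs[k:] * ws[:T-k]):+.4f}")
```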
An explicit construction of a realization of y with input w has the potential to provide an alternative to existing system identification algorithms.There are many subtleties in the consistency analysis of subspace identification algorithms with inputs [12,11,5], so an alternative approach involving the identification of an autonomous model of (y, w) could be advantageous in some cases.
Before moving on to the numerical example, we mention that the proposed construction is very useful when learning estimators from data. If one has prior knowledge about some part of the generating system (7), then the results of the paper can be used to easily construct a parameterisation of the estimator (9). Secondly, as noted above, estimating the data generator (7) via autonomous system identification, and then using the results of the paper to obtain the estimate of y, can be more consistent: with the same amount of data, the estimated predictor can have better performance in the sense of lower validation mean square error.

Computational Example
The code for the following examples is available on GitLab [6]. To illustrate the findings, consider the system (37) with state matrix

$$A = \begin{bmatrix} 1.08 & -0.23 \\ 0.58 & 0.27 \end{bmatrix}$$

(see [6] for the full specification of (37)). The system (37) is such that w is feedback-free from y; later we will find the upper block-triangular form of the innovation representation of (37). Before that, we first find the forward innovation form of system (37) by following [12, Chapter 6]; for the sake of completeness, we reiterate the relevant statements of [12]. Note that [12] assumes v(t) in (37) to be normalised Gaussian. First find the state covariance P, which solves a Lyapunov equation; then find Π, which solves the associated algebraic Riccati equation of the forward innovation representation [12, Chapter 6]. The triangular form is obtained by applying an SVD to the observability matrix of $(A_g, C_w)$. Contrary to standard practice, we sort the singular values from lowest to highest, and permute the columns of the matrices U and V (containing the left and right singular vectors, respectively) accordingly. Using the transformation

$$T = V^T = \begin{bmatrix} -0.71 & -0.7 \\ -0.7 & 0.71 \end{bmatrix},$$

we obtain a system in the triangular form (8).
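Here is a sketch of the SVD-based triangularization step described above. The matrices are placeholders, not the innovation form of (37); they are constructed so that the state directions unobservable from w span an A-invariant subspace, as the no-feedback assumption guarantees:

```python
import numpy as np

def triangularizing_transformation(A_g, C_w):
    """Compute T from an SVD of the observability matrix of (A_g, C_w),
    sorting singular values lowest-to-highest as described in the text."""
    n = A_g.shape[0]
    # Observability matrix: rows C_w, C_w A_g, ..., C_w A_g^{n-1}.
    O = np.vstack([C_w @ np.linalg.matrix_power(A_g, k) for k in range(n)])
    U, s, Vt = np.linalg.svd(O)
    order = np.argsort(s)            # ascending, contrary to standard practice
    V = Vt.T[:, order]
    return V.T                       # T = V^T

# Placeholder matrices (not the example system (37)): the kernel of the
# observability matrix is A-invariant, mimicking the no-feedback structure.
A_g = np.array([[0.2, 0.2], [-0.4, 0.8]])
C_w = np.array([[-1.0, 1.0]])
T = triangularizing_transformation(A_g, C_w)
print(np.round(T @ A_g @ np.linalg.inv(T), 3))   # upper block-triangular
print(np.round(C_w @ np.linalg.inv(T), 3))       # [0  C22], cf. (8)
```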
The system (46) produces the least-squares estimate of y(t), at least when the generating system is fully known. Furthermore, the results of the paper are also useful for system identification.

System identification example
The proposed construction could be useful in parametric system identification methods, since a priori knowledge of the generating system (7) can easily be translated into information about the estimator. For example, say we know $(A_g, K_g, C_g, Q)$ except for one element of $A_g$, i.e., $(A_g)_{1,2} = \theta$; then the parameterised generator is given by

$$x(t+1) = \begin{bmatrix} 0.85 & \theta \\ 0 & 0.5 \end{bmatrix} x(t) + K_g e_g(t).$$
Then θ can be found by minimising the mean square error (MSE) on some collected data, i.e.,

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{t=1}^{N} \big\|y(t) - \hat{y}_{\theta}(t)\big\|_2^2.$$

For a given N, multiple trajectories have been generated, so we have a set of M data sets $S_{N,i}$ with N samples each; we then estimate the predictor from each data set $S_{N,i}$ using the four approaches discussed above, and compute the average training MSE and average validation MSE. Note: the training MSE is computed only on the N data points used to estimate the predictor, whereas the validation MSE reflects the true predictive power of the predictor.
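A sketch of the MSE fit, assuming for simplicity that the parameterised generator is already in the triangular form (8), so that the estimator (9) applies directly; the data arrays and all matrices other than θ's slot are placeholders, not values from the paper:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Placeholder known quantities (assumed): everything except A_g[0, 1] = theta
# is known, and the system is already in the triangular form (8).
K_bar = np.array([[0.3, 0.1], [0.0, 0.6]])
C_bar = np.array([[1.0, 0.5], [0.0, 1.0]])
Q = np.array([[1.0, 0.3], [0.3, 1.0]])

def predict(theta, w_seq):
    """Estimator (9) for the parameterised generator with A_g[0, 1] = theta."""
    A = np.array([[0.85, theta], [0.0, 0.5]])
    C_y, C_w = C_bar[:1, :], C_bar[1:, :]
    G = Q[:1, 1:] @ np.linalg.inv(Q[1:, 1:])
    L = K_bar @ np.vstack([G, np.eye(1)])
    xh, out = np.zeros(2), []
    for w_t in w_seq:
        e2 = w_t - C_w @ xh
        out.append((C_y @ xh + G @ e2)[0])
        xh = A @ xh + L @ e2
    return np.array(out)

def mse(theta, w_seq, y_seq):
    return np.mean((y_seq - predict(theta, w_seq)) ** 2)

# Hypothetical data; in practice w_data, y_data come from measurements.
rng = np.random.default_rng(3)
w_data = rng.normal(size=500)
y_data = rng.normal(size=500)
res = minimize_scalar(lambda th: mse(th, w_data, y_data),
                      bounds=(-1, 1), method="bounded")
print("estimated theta:", res.x)
```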
Since in this example y has 3 components, Table 1 reports the "Variance Accounted For" (VAF) for each component, i.e.,

$$\mathrm{VAF}_i = \Bigg(1 - \frac{\frac{1}{M}\sum_{m=1}^{M} \|y_{i,m} - \hat{y}_{i,m}\|_2^2}{\frac{1}{M}\sum_{m=1}^{M} \|y_{i,m}\|_2^2}\Bigg) \cdot 100\%,$$

with $y_{i,m}$, $\hat{y}_{i,m}$ as shorthand for the ith component of the trajectories $y(\omega_m)$ and $\hat{y}(\omega_m)$, respectively. This shows how well each component of y is estimated; for example, using 150 samples for identification we can at best account for 16.6% of the first component's variance, whereas if we had known the system (Case 0) we would be able to account for 22.7%. If, in this hypothetical example, we had collected N = 1000 samples for identification, then we could account for 22.3% of the first component's variance, much closer to the best possible (Case 0). We also report the VAF averaged over components. In summary, if we collect enough samples to avoid overparameterisation, then identifying the generator and then applying the construction defined in Section 3 yields better-performing estimators of y.
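A minimal sketch of the VAF computation as described above (the array shapes are our assumption):

```python
import numpy as np

def vaf(y, y_hat):
    """Variance Accounted For, in percent, per output component.

    y, y_hat : arrays of shape (M, T, p) -- M trajectories, T samples,
               p output components (shapes are an assumption for this sketch).
    """
    err = ((y - y_hat) ** 2).sum(axis=1).mean(axis=0)   # avg over trajectories
    pow_y = (y ** 2).sum(axis=1).mean(axis=0)
    return (1.0 - err / pow_y) * 100.0                  # one VAF per component
```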

Table 1: Summary of the results. Best values are marked in bold. All quantities are averaged over multiple system identifications with varying data, as explained in (50).