A scalable multi-step least squares method for network identification with unknown disturbance topology

Identification methods for dynamic networks typically require prior knowledge of the network and disturbance topology, and often rely on solving poorly scalable non-convex optimization problems. While methods for estimating network topology are available in the literature, less attention has been paid to estimating the disturbance topology, i.e., the (spatial) noise correlation structure and the noise rank in a filtered white noise representation of the disturbance signal. In this work we present an identification method for dynamic networks in which an estimate of the disturbance topology precedes the identification of the full dynamic network with known network topology. To this end we extend the multi-step Sequential Linear Regression and Weighted Null Space Fitting methods to deal with reduced-rank noise, and use these methods to estimate the disturbance topology and the network dynamics in the full measurement situation. As a result, we provide a multi-step least squares algorithm with parallel computation capabilities that relies only on explicit analytical solutions, thereby avoiding the usual non-convex optimizations involved. Consequently, we consistently estimate dynamic networks with a Box-Jenkins model structure while keeping the computational burden low. We provide a consistency proof that includes path-based data informativity conditions for the allocation of excitation signals in the experimental design. Numerical simulations performed on a dynamic network with reduced-rank noise clearly illustrate the potential of this method.


Introduction
Dynamic networks represent large-scale interconnected systems, and data-driven modeling of dynamic networks has received considerable attention in recent years. These networks can be considered as a set of measurable (node) signals interconnected through linear dynamic systems (the modules), driven by measured external excitation signals and/or unmeasured disturbance signals. Modeling of these networks plays an important role in biological systems [20,22], economic systems [26], power networks [30], and many other fields in science and engineering. The challenges addressed in the identification of dynamic networks can roughly be divided into three categories. The first is identifying the interconnection structure of the nodes in a dynamic network, referred to as network topology detection [8,26]. The second is the identification of a specific module in a network, referred to as local module identification. For this problem, closed-loop identification methods have been generalized to the dynamic network situation in [35], formulating the local module identification problem as a multi-input-single-output (MISO) problem. This has been further extended and generalized in, e.g., [10,12,16,19,28,31,32,37]. The third challenge is identification of the full network dynamics [9,17,41,44], where the problem is formulated as the identification of a (structured) multi-input-multi-output (MIMO) model.
In this paper we further explore the development of full network identification methods. As dynamic networks increase in complexity and size, and measurement data becomes increasingly accessible, there is a strong demand for accurate and scalable data-driven modeling methods. The joint direct method [42,44] predicts all node signals in the network jointly and achieves consistency and minimum variance properties in the situation that the network and disturbance topology are given a priori and the noise can be of reduced rank. However, it strongly relies on solving (constrained) non-convex optimization problems, which seriously limits its scalability to larger networks. There are multi-step convex identification methods available for full network identification, such as the Sequential Linear Regression (SLR) [9], Sequential Least Squares (SLS) [41] and extensions of Weighted Null Space Fitting (WNSF) [18] such as [17]. Moreover, methods such as SLR and SLS allow for splitting the MIMO optimization into multiple linear regressions, which contributes to a lower computational burden. The available convex methods are scalable to larger networks, but are limited to particular model structures of the network, and additionally they do not allow for handling reduced-rank noise. Particularly in large-scale network identification, stepping away from the typical assumption that all disturbance signals have their own independent noise source is appealing, and should be supported by an effective estimation algorithm. Handling this situation of reduced-rank noise can substantially reduce the variance of estimated models. However, it also introduces the problems of estimating the noise rank and noise correlation structure from data.
All available convex and non-convex methods for network identification require prior knowledge of the disturbance topology, i.e., the rank and spatial correlation structure of the disturbance model. While in dynamic factor analysis [14] attention has been paid to the estimation of the noise rank, in prediction error identification this does not yet appear to be included in the identification algorithms. For situations where the disturbance topology information is not readily available, it is attractive to develop methods that include estimating this information from data.
The topology estimation literature shows a variety of available methods to estimate the topology, such as Wiener filter based methods [26,27,29], Bayesian model selection techniques [8,34,40], or methods that infer the topology from parametric estimates [3,11,46]. While the main focus of the topology detection literature has been on estimating the network topology in the situation of a diagonal disturbance spectrum Φ_v(ω), extensions towards nondiagonal spectra have been presented in [4,15,39]. In [39] the network topology and the non-zero pattern in the disturbance spectrum are estimated jointly. In this paper we assume that we do not know the disturbance topology a priori, but we assume that the network topology is known, e.g., from its underlying physics, which is commonly the case for engineered systems. In the situation that the network topology is not known beforehand, it is possible to use any of the above cited methods to estimate it. We allow the process noise to be spatially correlated, i.e., the disturbance spectrum Φ_v(ω) is not necessarily diagonal. Additionally, the noise is allowed to be of reduced rank, i.e., Φ_v(ω) can be singular.
The objective is to develop a multi-step convex algorithm that estimates the disturbance topology and the dynamic modules in the network for general model structures, including the Box-Jenkins (BJ) structure, using computational algorithms that are scalable, while achieving favorable properties in terms of low experiment cost, consistency and reduced variance of the network estimates.
To this end we develop a multi-step algorithm to identify the network dynamics. In the first step the noise rank and the nonzero pattern in the corresponding disturbance model (noise shaping filter) are estimated. This is done through a (nonparametric) high-order ARX model, inspired by the SLR method [9]. Next, this information is used to develop a multi-step convex algorithm that can accurately identify the dynamics of the network in the situation of reduced rank noise and for a very general Box-Jenkins model structure, thereby combining the recently introduced multi-step convex identification methods SLR [9] and WNSF [17,18] and extending them to the described situation.
The paper proceeds with a definition of the considered dynamic network setup in Section 2. In Section 3 we present a new method for estimating the disturbance topology from data, followed in Section 4 by a multi-step identification algorithm that exploits the previously estimated disturbance topology. Section 5 presents the consistency analysis of the method, including graph-based conditions for data informativity. Results of numerical simulations are provided in Section 6, followed by conclusions in Section 7. The consistency proofs are collected in the Appendix.

Dynamic networks
Following the setting of [35], a dynamic network is defined by L nodes or internal variables w_j(t), j = 1, ..., L, that are scalar-valued measured signals. The underlying network is linear time invariant (LTI), and the nodes of the network can be expressed as

w_j(t) = Σ_{l∈N_j} G⁰_jl(q) w_l(t) + Σ_{k∈R_j} R⁰_jk(q) r_k(t) + v_j(t),

where
• q⁻¹ is the delay operator, i.e., q⁻¹ w_j(t) = w_j(t − 1);
• N_j defines the set of indices of measured node signals w_l, l ≠ j, for which G⁰_jl(q) ≠ 0, where G⁰_jl(q) is a strictly proper rational transfer function;
• R_j defines the set of indices of measured external excitation signals r_k for which R⁰_jk(q) ≠ 0, where R⁰_jk(q) is a known proper rational transfer function;
• v_j(t) is unmeasured process noise, where the disturbance vector is modeled as a wide-sense stationary stochastic process represented by v(t) = H⁰(q)e(t). Here e = [e_1 ⋯ e_p]ᵀ is a white noise process of dimension p ≤ L with covariance matrix Λ⁰ > 0, and H⁰(q) is a rational transfer function matrix.
The full network expression, with q and t omitted, is

w = G⁰w + R⁰r + H⁰e, (2)

with the matrices G⁰, R⁰ and H⁰ collecting the respective transfer functions, where we assume that the inverse (I − G⁰)⁻¹ exists and the network is well-posed, as used in [35]. In the situation p < L, i.e., when the noise is of reduced rank or singular, the disturbance model H⁰ is a nonsquare L × p transfer function matrix.
For a unique representation of reduced rank spectra that can be used to construct a predictor, we can adopt a result from [44], where the disturbance term is equivalently written as H̆⁰ȇ with H̆⁰ square.

Lemma 1 ([44]) Consider an L-dimensional disturbance process v with rank p. Then the disturbance signals v can be reordered in such a way that the following unique representation results:
• v = H̆⁰ȇ as in (4), where H̆⁰ is a stable proper rational transfer function matrix;
• the covariance matrix of ȇ is given by Λ̆⁰, where Λ⁰ ∈ R^{p×p} has rank p;
• if additionally H⁰_a is minimum phase, then H̆⁰ is monic, stable and minimum phase.
The result of the reordering of signals as indicated in the Lemma is that the first p components of the reordered signal constitute a full rank p process.
We assume that the data generating network satisfies the following properties.

Assumption 1
a. The network is well-posed, i.e., all principal minors of I − G⁰(∞) are nonzero [1].
b. (I − G⁰)⁻¹ is stable and causal.
c. All elements in G⁰(q) are strictly proper.
d. H⁰ is stable and has a stable left inverse.
e. H̆⁰ is square, monic and minimum phase.
f. The topology of G⁰ and R⁰, and the non-zero elements of R⁰, are fixed and known.
g. The matrix R⁰ has a block diagonal structure R⁰ = diag(R⁰_a, R⁰_b) in the situation of ordered nodes as meant in (4).
h. Measurements of all node signals w and all present excitation signals r are available.
i. The standard regularity conditions on the data that are required for consistency results of the prediction error identification method are satisfied.

The two main steps of the identification method that will be developed in this paper are:
• estimating the disturbance topology, i.e., the noise rank and the zero pattern in the disturbance model;
• estimating the dynamical components in the network for a given network and disturbance topology, while using a parametric BJ model structure.
In the next section we first focus on the disturbance topology estimation method, followed by the developed identification method in the section thereafter.

Disturbance topology estimation
Before we can use a unique disturbance model that is structured according to H̆⁰ in (4), we need to estimate the noise rank p, and we need to be able to reorder the node signals in such a way that a noise representation as in (4) can be used. This step is necessary since the unstructured disturbance model H⁰ is non-unique in the situation p < L. Therefore the disturbance topology estimation is performed in two main steps.
• A parametrized ARX model is chosen, in which all coefficients of the matrix polynomials Ă_k, B̆_k are vectorized and collected in the parameter vector ζ. The one-step-ahead predictor [24], defined as the conditional expectation of w(t) given the past data w^{t−1} := {w(0), w(1), ..., w(t−1)} and r^t := {r(0), r(1), ..., r(t)}, can then be written as a linear regression (9) in ζ, with the regressor ϕ(t) composed of the appropriate terms in w and r.
Note that for an actual network with representation G⁰, H̆⁰, R⁰, the one-step predictor will be given by (11). This implies that the polynomial predictor model (9) can only accurately approximate the rational filters that are present in (11) if the ARX order n is chosen very high. The ARX model is estimated according to the analytical least squares solution (12), yielding ζ̂_N^n. Since the network identifiability conditions of [43] are satisfied for the considered model set, the sample covariance of the resulting residuals will then, under mild regularity conditions, be a consistent estimate of the noise covariance Λ̆⁰. The rank p of the noise process can then be estimated through a rank test on Λ̂, e.g., through a singular value decomposition. Alternatively, other matrix factorizations or information-based criteria can be applied for estimating the rank, see e.g., [6]. When Λ̂ and the estimated rank p < L have been determined, the L signals can be reordered through a permutation matrix Π such that the first p components of the permuted noise vector have a rank-p covariance matrix, i.e., [I_p 0] Πᵀ Λ̂ Π [I_p 0]ᵀ has rank p.
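As an illustration, the rank-estimation step above can be sketched in a few lines of numerical code. This is a simplified sketch on assumed toy data (a static rank-2 mixing of two white noise sources), not the paper's implementation; all function and variable names are our own.

```python
import numpy as np

def fit_arx_node(w, r, j, n):
    """Least-squares fit of one row of a fully parametrized high-order
    ARX model (sketch). w: (N, L) node signals, r: (N, K) excitations,
    j: node index, n: ARX order. Returns the prediction residuals."""
    N = w.shape[0]
    rows = []
    for t in range(n, N):
        past_w = w[t - n:t][::-1].ravel()   # w(t-1), ..., w(t-n)
        past_r = r[t - n:t][::-1].ravel()
        rows.append(np.concatenate([past_w, past_r]))
    Phi = np.asarray(rows)
    y = w[n:, j]
    zeta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return y - Phi @ zeta                   # residuals for node j

def estimate_noise_rank(eps, tol=1e-3):
    """Rank test on the sample residual covariance via SVD."""
    Lam = np.cov(eps, rowvar=False)
    s = np.linalg.svd(Lam, compute_uv=False)
    return int(np.sum(s > tol * s[0])), Lam

# Toy data: L = 3 nodes driven by p = 2 white noise sources.
rng = np.random.default_rng(0)
N, L = 2000, 3
e = rng.standard_normal((N, 2))
H = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.3]])  # rank-2 mixing
w = e @ H.T                                          # v = H e, static network
r = rng.standard_normal((N, 1))
eps = np.column_stack([fit_arx_node(w, r, j, n=5) for j in range(L)])
p_hat, Lam = estimate_noise_rank(eps)
```

On this toy data the estimated rank is 2, matching the number of independent noise sources.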
Remark 1 Since the polynomials Ă(ζ) and B̆(ζ) are fully parametrized with independent parameters for each polynomial entry, the MIMO least squares optimization that leads to the solution (12) can also be decomposed into L separate linear regressions that minimize the residual ε_j(t, ζ) separately for each j. This is computationally attractive since the computations can be performed in parallel or sequentially.

Remark 2
The resulting estimation scheme will generally not provide us with consistent estimates of the ARX model. This is not only due to the fact that typically the order n of the ARX model would need to go to infinity, but also due to the fact that the solution ζ̂_N^n is non-unique in the situation p < L. However, this latter non-uniqueness does not affect the uniqueness and whiteness of the residual ε(t, ζ̂_N^n) since, according to the projection theorem, every solution ζ̂_N^n determines the same predictor [13]. The estimate Λ̂ is therefore consistent, i.e., Λ̂ → cov(ȇ) w.p. 1 as n, N → ∞.
Remark 3 Although a correct estimation of the noise rank p cannot be guaranteed, consistency results for estimating p are possible when applying information-based criteria for rank estimation, e.g., based on the BIC criterion [6]. In the next steps of our approach it will be assumed that a correct estimate of p has been obtained.
After reordering the node signals as described above, we can now adhere to a network representation with a unique disturbance model according to the structure in Lemma 1, where H0 can be parametrized by the transfer function matrices H a and H b .

Step 2: Estimating the noise correlation structure
In the second step we are going to estimate which entries in our disturbance model are nonzero. To this end we extend the SLR method [9] to the situation of reduced rank noise and show how the noise correlation structure can be obtained.

Step 2.1: Refining the nonparametric ARX model
With the noise rank p available and the nodes ordered, we have gained additional information on H̆⁰ (4): the last L − p columns are now known. We now perform the same high-order ARX identification approach as in the previous step, but utilizing the known entries in H̆⁰, leading to refined estimates of Ă(ζ̂_N^n) and B̆(ζ̂_N^n). In the analysis results of Section 5.1 it is shown that the known entries in H̆⁰ can simply be mapped to known entries in the parametrized polynomial B̆(ζ), and can therefore simply be taken into account in the least squares problem (12). In Section 5.1 it is shown that this leads to consistent estimates ζ̂_N^n for n, N → ∞.
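Mapping known entries of H̆⁰ into known coefficients of B̆(ζ) amounts to a least squares problem in which part of the parameter vector is fixed a priori. A minimal sketch of such a constrained regression (illustrative names, not the paper's code):

```python
import numpy as np

def lstsq_with_known(Phi, y, known):
    """Least squares with some coefficients fixed a priori (sketch).
    `known` maps column index -> fixed coefficient value, mirroring
    the known entries of the disturbance model mapped into B(zeta)."""
    idx_k = sorted(known)
    idx_u = [i for i in range(Phi.shape[1]) if i not in known]
    theta_k = np.array([known[i] for i in idx_k])
    # Move the known contribution to the left-hand side,
    # then solve for the remaining free coefficients only.
    y_red = y - Phi[:, idx_k] @ theta_k
    theta_u, *_ = np.linalg.lstsq(Phi[:, idx_u], y_red, rcond=None)
    theta = np.empty(Phi.shape[1])
    theta[idx_k] = theta_k
    theta[idx_u] = theta_u
    return theta

rng = np.random.default_rng(1)
Phi = rng.standard_normal((200, 4))
theta_true = np.array([0.5, -1.0, 2.0, 0.0])
y = Phi @ theta_true                      # noise-free toy data
theta = lstsq_with_known(Phi, y, known={3: 0.0})
```

With noise-free data the free coefficients are recovered exactly, and the fixed coefficient is untouched by the regression.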

Step 2.2: Predictor model with reconstructed innovation input
In this step we are going to use the estimated nonparametric ARX model to reconstruct the innovation signal. This allows us to use the reconstructed innovation signal as a measured input in the predictor model that will be used for estimating the structure of the disturbance model.
If there exists a parameter ζ⁰ such that the ARX model (Ă(ζ⁰), B̆(ζ⁰)) captures the dynamics of the network, then it follows from [44] that the corresponding prediction error equals the innovation, ε(t, ζ⁰) = ȇ(t). We can accordingly decompose ε(t, ζ) into the innovation ȇ(t) and a term that vanishes as ζ → ζ⁰, while the consistency property of ζ̂_N^n implies that ε(t, ζ̂_N^n) → ȇ(t) as n, N → ∞. We will refer to ε(t, ζ̂_N^n) as the "reconstructed innovation".
For a network with ordered nodes we evaluate a new one-step-ahead predictor that includes the innovation signal ȇ^{t−1} := {ȇ(0), ȇ(1), ..., ȇ(t − 1)} in the conditioning. Then it follows that

ŵ(t|t − 1) = G⁰(q)w(t) + (H̆⁰(q) − I)ȇ(t) + R⁰(q)r(t), (18)

This motivates the use of the following parametrized predictor model per node (20), where the terms G(η) and H̆(η) are parametrized versions of G⁰ and H̆⁰ respectively, and ε_a(t, ζ̂_N^n) is an estimate of the noise signal ȇ_a(t). Here G_jl(η) = Σ_{k=1}^n g^{jl}_k q^{−k} and H̆_js(η) = Σ_{k=1}^n h^{js}_k q^{−k} are parametrized as strictly proper polynomials of order n, the term Σ_{k∈R_j} R_jk r_k(t) is known, the sets N_j and R_j are known from the topology of G⁰ and R⁰, and V_j defines the set of indices of noise signals for which noise dynamics are present in the disturbance model. This leads to an ARX model, like in Step 1, but now with the reconstructed innovation ε_a(t, ζ̂_N^n) added as external predictor input signal, and the coefficients of the unknown polynomials collected in the parameter vector η. Our next objective is to determine the sets V_j for j = 1, ..., L. To this end we follow two approaches, namely the structure selection approach and the Glasso approach, which are presented next.
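The predictor with reconstructed innovation input can be sketched as a per-node linear regression in which past innovation samples appear as measured regressors. The toy network below is our own construction (assuming, for simplicity, that the innovations are available exactly); it shows that including the correct innovation channel in V_j lowers the residual variance:

```python
import numpy as np

def node_residual_var(w, eps_a, j, neighbors, vset, n):
    """Per-node ARX regression with the reconstructed innovation as a
    measured input (sketch of the per-node predictor); returns the
    residual variance, used later for structure selection."""
    rows = []
    for t in range(n, w.shape[0]):
        lags_w = w[t - n:t, neighbors][::-1].ravel()
        lags_e = eps_a[t - n:t, vset][::-1].ravel()
        rows.append(np.concatenate([lags_w, lags_e]))
    Phi = np.asarray(rows)
    y = w[n:, j]
    eta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    res = y - Phi @ eta
    return float(res @ res / len(res))

# Toy 3-node network driven by p = 2 innovations (illustrative only).
rng = np.random.default_rng(2)
N = 3000
e = rng.standard_normal((N, 2))
w = np.zeros((N, 3))
for t in range(1, N):
    w[t, 0] = e[t, 0] + 0.5 * e[t - 1, 0]
    w[t, 1] = e[t, 1]
    w[t, 2] = 0.4 * w[t - 1, 0] + e[t, 0] + 0.8 * e[t - 1, 1]

var_with = node_residual_var(w, e, j=2, neighbors=[0], vset=[1], n=3)
var_without = node_residual_var(w, e, j=2, neighbors=[0], vset=[], n=3)
```

Here node 2 depends on the past of innovation channel 1, so including that channel reduces the residual variance to (roughly) the variance of the node's own innovation.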

Structure selection
For a particular choice of V_j we evaluate the residual ε_j(t, η̂_{N,j}^n) := w_j(t) − ŵ_j(t|t − 1, η̂_{N,j}^n), where η̂_{N,j}^n is the estimated parameter that minimizes the quadratic criterion (1/N) Σ_{t=1}^N ε_j²(t, η_j), and that is obtained through an analytical solution, similar to (12). We test this residual for the possible combinations in the set V_j and employ model selection techniques such as AIC, BIC and cross-validation (CV) on the obtained estimates η̂_{N,j}^n [46], of which the BIC provides a consistent estimate [23,33]. Because we use ARX models to estimate η, model selection with AIC, BIC and CV relies only on convex estimation problems. Additionally, since we derive the disturbance topology per node, we have to test at most 2^L possible sets V_j for each of the L nodes. This results in a lower computational burden compared to detecting the topology in a MIMO setting, where we would have to test at most 2^(L²−L) possible sets V_j simultaneously for all j [46]. However, for large networks these model selection techniques can still become computationally heavy.
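The node-wise exhaustive search with BIC can be sketched as follows. We assume (hypothetically) that the reconstructed innovations are available as signals; on the toy network below the search recovers the single innovation channel whose past actually drives the node:

```python
import numpy as np
from itertools import combinations

def bic_select_vset(w, eps_a, j, neighbors, p, n):
    """Select V_j by exhaustive search with BIC (sketch): fit one ARX
    regression per candidate subset of the p innovation channels and
    keep the subset with the lowest BIC score."""
    y = w[n:, j]
    N = len(y)
    best_bic, best_vset = np.inf, ()
    for k in range(p + 1):
        for vset in combinations(range(p), k):
            rows = []
            for t in range(n, w.shape[0]):
                lags_w = w[t - n:t, neighbors][::-1].ravel()
                lags_e = eps_a[t - n:t, list(vset)][::-1].ravel()
                rows.append(np.concatenate([lags_w, lags_e]))
            Phi = np.asarray(rows)
            eta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
            sse = float(np.sum((y - Phi @ eta) ** 2))
            bic = N * np.log(sse / N) + Phi.shape[1] * np.log(N)
            if bic < best_bic:
                best_bic, best_vset = bic, vset
    return best_vset

# Toy data: node 2 is driven by the past of innovation channel 1 only.
rng = np.random.default_rng(2)
N = 3000
e = rng.standard_normal((N, 2))
w = np.zeros((N, 3))
for t in range(1, N):
    w[t, 0] = e[t, 0] + 0.5 * e[t - 1, 0]
    w[t, 1] = e[t, 1]
    w[t, 2] = 0.4 * w[t - 1, 0] + e[t, 0] + 0.8 * e[t - 1, 1]

v_hat = bic_select_vset(w, e, j=2, neighbors=[0], p=2, n=3)
```

The BIC penalty (n extra parameters per added channel) rules out channels whose past does not improve the fit, so the search returns the channel-1 singleton here.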

Sparse estimation with Glasso
For each node j, a Glasso (group lasso) estimate is computed by minimizing the following cost function over η_j for a fully parametrized disturbance model with p white noise inputs:

(1/N) Σ_{t=1}^N ε_j²(t, η_j) + λ_j Σ_i ||η_j^{(i)}||₂, (21)

with ε_j based on the one-step-ahead predictor (20), η_j the vector of parameters related to the modules G_ji for i ∈ N_j and to the modules H̆_js for s = 1, ..., p, the groups η_j^{(i)} collecting the parameters of each individual module, and λ_j the tuning parameter (penalization factor) of Glasso. The tuning of λ_j is described in the numerical illustrations in Section 6. The penalty term in (21) is a mixed ℓ1/ℓ2 norm. The Glasso estimate is a convex extension of the lasso that penalizes groups of estimated parameters [45], imposing sparsity at group level. Within a group, it does not yield sparsity [2]. If an appropriate penalization factor is chosen, only the dynamic modules that are actually present in the data generating network remain, while the non-present terms are forced to 0, thus providing an estimate of the structure of H̆.
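A minimal proximal-gradient implementation of a group-lasso estimate (a generic sketch, not the solver used in the paper; group layout and data are illustrative):

```python
import numpy as np

def group_lasso(Phi, y, groups, lam, n_iter=500):
    """Group lasso via proximal gradient (sketch): minimizes
    (1/N) * ||y - Phi @ eta||^2 + lam * sum_g ||eta_g||_2,
    where `groups` lists the column indices of each parameter group
    (one group per candidate module)."""
    N, d = Phi.shape
    eta = np.zeros(d)
    step = N / (2.0 * np.linalg.norm(Phi, 2) ** 2)  # 1/L of the smooth part
    for _ in range(n_iter):
        grad = (2.0 / N) * Phi.T @ (Phi @ eta - y)
        z = eta - step * grad
        for g in groups:                 # block soft-thresholding
            norm = np.linalg.norm(z[g])
            shrink = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
            z[g] = shrink * z[g]
        eta = z
    return eta

# Two groups of three parameters; only the first group is active.
rng = np.random.default_rng(4)
Phi = rng.standard_normal((300, 6))
eta_true = np.array([1.0, -1.0, 0.5, 0.0, 0.0, 0.0])
y = Phi @ eta_true + 0.05 * rng.standard_normal(300)
eta_hat = group_lasso(Phi, y, groups=[[0, 1, 2], [3, 4, 5]], lam=0.5)
```

With a suitable λ the inactive group is driven exactly to zero, while the active group survives (mildly shrunk), which is precisely the group-level sparsity used here for structure detection.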
With either of the methods of Sections 3.2.3 or 3.2.4, the structure V_j of the disturbance model can be estimated entirely with convex and thus scalable methods, employing nonparametric (high-order ARX) models. This structural information can be effectively used in the actual estimation of parametric dynamic models in the next section.
Remark 4 It is possible to add regularization when estimating the high-order ARX models presented in this section to guarantee stability of the estimates.

Estimating parametric network models
The next step in our identification procedure is the estimation of a parametric network model.

While in Steps 1 and 2 high-order (nonparametric) models of the same model order n are used, thus providing estimates with relatively high variance, in this step a parametric model is estimated from data, where we exploit a very flexible Box-Jenkins model structure. In Step 3 we extend the WNSF method [18], and its application to dynamic networks in [17], to the reduced rank noise case, such that we are able to obtain parametric models G(θ) and H̆(θ). The WNSF is in itself a three-step method that starts from a high-order model before estimating the parametric model.

Step 3.1: Refining the nonparametric model
By fixing the correctly estimated disturbance topology obtained in the previous section, we obtain consistent estimates of η_j using the one-step-ahead predictor (20), leading to a high-order ARX model with structured disturbance model. The conditions for consistency of η̂_{N,j}^n are derived in Section 5. By employing the structured disturbance model we reduce the variance of η̂_{N,j}^n, while the model order n remains the same. Using the consistent estimate η̂_{N,j}^n, we update the reconstructed innovation. Subsequently, we again update the high-order ARX model by replacing ε_a(t, ζ̂_{N,j}^n) with the updated reconstructed innovation ε_a(t, η̂_{N,j}^n) in (20), and use this updated predictor to re-estimate η_j. This latter estimate can be seen as the starting high-order model for the WNSF method. At this point we still have a high variance on the estimates of η, but negligible bias if the model order n throughout all the steps is chosen sufficiently large. In the next step we reduce the variance by reducing the number of parameters to estimate, making the step from a high-order (nonparametric) model to a parametric model.

Step 3.2: Parametric model estimate
On the basis of the nonparametric model estimate characterized by η̂_{N,j}^n, we are now going to estimate a parametric model of the dynamic network by utilizing a Box-Jenkins model structure (22), which can be rewritten as (23). From the G_jl(η̂_{N,j}^n) and H̆_js(η̂_{N,j}^n) that are obtained in the previous step through the predictor (20), we can derive a related estimate of H⁰(q) according to (19), with Γ(η̂_N^n) an estimate of the direct feedthrough term Γ⁰ of H⁰_b, which, based on the relation ȇ_b(t) = Γ⁰ȇ_a(t) from (4), is given by (24).
Following the WNSF approach, we are now going to fit the parametric Box-Jenkins model to the nonparametric model estimated in Step 3.1, by solving for θ in the equations (25). However, since these equations cannot be solved exactly, an optimization problem is formulated [18] that comes down to minimizing the quadratic residual vector of the equations (25) by solving (26) in node-wise notation, where Q_j^g and Q_j^h are block diagonal matrices whose entries are determined by the model orders m_i, i ∈ {l, f, c, d} according to (22), the top left corner of Ī_{n×m} is I_{m×m} with zeros otherwise, and T_{n×m}[X_ji(q)] is a lower triangular Toeplitz matrix whose first column contains the coefficients of X_ji(q). Problem (26) is solved in the first instance through the analytical least squares solution (29). However, a parameter estimate with smaller variance can be achieved if a weighted least squares criterion is applied³. This is introduced in the next step.
³ As an alternative, we can consider a weighted least squares criterion to obtain θ̂_j^{[0]} in (29), with the covariance matrix of the nonparametric model as weight.
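The reduction step, fitting rational model coefficients to a high-order estimate through linear (Toeplitz-structured) relations, can be illustrated on a single first-order module. This is our own simplified sketch of the idea behind the WNSF reduction, not the paper's multivariable implementation:

```python
import numpy as np

def wnsf_first_order(g):
    """Fit G(q) = b q^-1 / (1 + f q^-1) to high-order FIR coefficients
    g = (g_1, ..., g_n) by least squares on F(q)G(q) - B(q) = 0.
    The relations are  g_1 = b  and  g_k + f * g_{k-1} = 0 for k >= 2,
    i.e. a linear system in theta = (b, f)."""
    n = len(g)
    A = np.zeros((n, 2))
    y = np.zeros(n)
    A[0, 0] = 1.0                # equation  g_1 = b
    y[0] = g[0]
    A[1:, 1] = -g[:-1]           # equations g_k = -f * g_{k-1}
    y[1:] = g[1:]
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta                 # (b_hat, f_hat)

# Noisy high-order estimate of a module with b = 0.8, f = -0.5,
# whose impulse response is g_k = 0.8 * 0.5**(k-1).
rng = np.random.default_rng(3)
k = np.arange(1, 21)
g_true = 0.8 * 0.5 ** (k - 1)
g_noisy = g_true + 1e-3 * rng.standard_normal(20)
b_hat, f_hat = wnsf_first_order(g_noisy)
```

The high-order coefficients enter the regression matrix linearly, so the parametric coefficients follow from one analytical least squares solve, which is the property that keeps the full procedure convex.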

Step 3.3: Re-estimation of parametric model
In this step we reduce the variance further by re-estimating the obtained parametric models G(θ) and H̆(θ) defined in (23). For a statistically optimal solution of (26), instead of the standard least squares problem (26), a weighted least squares problem should be solved, where the optimal weight is given by the inverse of the covariance matrix of the residual η̂_{N,j}^n − Q_j(η̂_{N,j}^n)θ_j⁰, with θ_j⁰ the actual network coefficients related to node w_j. This is not directly applicable since θ_j⁰ is unknown. However, it can be shown [18] that this residual can be expressed as in (30), with η_j^{n,0} the real network coefficients related to the η-parametrized ARX model and T_j(θ) a block diagonal matrix with the denominator polynomials as entries, where T_{n×n}[X_ji(q)] is a lower triangular Toeplitz matrix whose first column contains the coefficients of X_ji(q). Result (30) motivates the use of a weighted least squares estimator with, as weighting matrix, the covariance matrix of the nonparametric model. This can be implemented in an iterative scheme in which the weight is recomputed from the estimate of the previous iteration. For consistency of the estimates of the parameter vector θ we refer to the proof of the WNSF method [18], with the actual model orders m_i, i ∈ {f, l, c, d} in (22) known.
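The iterative re-weighting scheme has the following generic shape (a sketch: `weight_fn` is a caller-supplied stand-in for the covariance-based weight constructed from T_j(θ) in [18]; the toy problem is noise-free, so any positive definite weight recovers the true parameters, which makes the mechanics easy to check):

```python
import numpy as np

def iterative_wls(A, y, weight_fn, theta0, n_iter=10):
    """Iteratively re-weighted least squares (sketch): at each
    iteration the weight matrix is rebuilt from the current parameter
    estimate and a weighted LS problem is solved analytically."""
    theta = theta0
    for _ in range(n_iter):
        W = weight_fn(theta)             # positive definite weight
        AtW = A.T @ W
        theta = np.linalg.solve(AtW @ A, AtW @ y)
    return theta

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = A @ theta_true                       # noise-free toy data

# Hypothetical parameter-dependent weight (illustrative only).
weight_fn = lambda th: np.diag(1.0 / (1.0 + (A @ th) ** 2))
theta_hat = iterative_wls(A, y, weight_fn, theta0=np.zeros(3))
```

Each iteration is again an explicit linear solve, so the re-estimation step adds no non-convex optimization to the procedure.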
Remark 5 Because in this final step we correct for the variance due to the modeling error (30), the final estimate will have a reduced variance.
Throughout the presented steps we split the MIMO optimization into L linear regressions that rely on explicit analytical solutions and allow for parallel computing. The algorithm is given as follows.
Algorithm 1 Algorithm for full network identification in dynamic networks, including disturbance topology detection.

We continue iterating until the convergence criterion, a change between consecutive iterates below 0.0001, is reached. This convergence criterion is also used in the simulation results in Section 6. In the next section we derive the conditions required for consistency of the estimates ζ̂_{N,j}^n and η̂_{N,j}^n.

Theoretical analyses
From here on we consider n = n(N), i.e., the model order n increases as the data length N increases, while n/N tends to 0 at a particular rate [18,25].
Next we derive the conditions under which the estimates ζ̂_N^n and η̂_N^n, and consequently the reconstructed innovation, are consistent.

Consistency of ζ̂_N^n in Step 2.1: Refining the nonparametric model
With the noise rank p available and the nodes ordered, we have gained structural information on the unique noise model H̆⁰(q) (4): for the reduced noise rank case p < L, the last L − p columns of H̆⁰(q) equal [0 I]ᵀ. Moreover, taking the inverse of H̆⁰(q) does not affect these last L − p columns. As a result, the term (H̆⁰(q))⁻¹R⁰(q) in the one-step predictor (11) has a structure in which the second block column consists of known terms only. In the parametrization of the predictor (9), this allows us to replace the square polynomial B̆(ζ) with a nonsquare polynomial B(ζ), leading to the predictor (35), with ϕ(t) composed of the appropriate terms in w and r_a. Note that for an actual network with representation G⁰, H̆⁰, R⁰, the one-step predictor is still given by (11), but now the predictor model (35) can use the known external excitation signals r_b(t). The ARX model is estimated according to (36). Note that Remark 1 holds, and therefore the predictor (35) can be decomposed into separate predictors for each node. The conditions for consistency are formulated in Proposition 1, and the proof is added in the appendix.
Remark 6 Conditions (1) and (2) of Proposition 1 are given for all signals present in the network. These conditions remain unchanged when we convert from a MIMO predictor to L linear regressions. Therefore the proof also holds for a predictor assessed per node.
Proof : See appendix.

Consistency of η̂_N^n in Step 3.1: Refining the nonparametric model
A refined nonparametric model is estimated by exploiting the information on the noise topology, in the form of a structured polynomial model B̆(η_j) for H̆_js(η_j) in the predictor (20), leading to the analytical solution (37), with ϕ(t) composed of the appropriate terms in w and ε(t, η̂_N^n). The conditions for consistency are formulated in Proposition 2.
Proposition 2 (Consistency of η̂_N^n) Consider a dynamic network that satisfies Assumption 1 and Proposition 1, and assume that the disturbance topology is estimated correctly. Additionally, consider the one-step-ahead predictor (20) for all j. Then the transfer function matrices G⁰(q) and H̆⁰(q) − I are consistently estimated with the analytical solution η̂_N^n (37), if the following conditions hold: (1) For all j, the spectral density Φ_κ(ω) of κ(t) satisfies Φ_κ(ω) > 0 for a sufficiently high number of frequencies ω.
Proof : See appendix.
Remark 7 Note that Condition 2 of Proposition 2 incorporates the condition that the noise rank p is chosen correctly, and the disturbance model is flexible enough to represent the exact disturbance topology of the network.
Following the line of reasoning in [36], the spectral conditions in Propositions 1 and 2, which are actually data informativity conditions, can generically be replaced by path-based conditions on the graph of the network model set.

Generic data informativity conditions
Condition (2) of Proposition 1 and Condition (1) of Proposition 2 are spectral data informativity conditions on internal node signals in w, which are difficult to interpret for an experimenter. In this section we replace the spectral conditions with path-based data informativity conditions in a generic sense⁴, i.e., independent of the numerical values of the network dynamics. By doing so we can evaluate whether data informativity is satisfied based on the network and disturbance topology and the properties of the external signals. Next we formulate the conditions in terms of properties and locations of the external signals, analogous to Lemma 1 and Proposition 1 from [36], by means of vertex-disjoint paths from external signals to internal node signals, where two paths are vertex-disjoint if they have no nodes in common, including their start and end nodes [38]. The consequences are illustrated in a 6-node example.

Vertex-disjoint paths
The generic version of Condition (2) of Proposition 1 is given in Proposition 3.
Proof: See appendix.
Proposition 3 gives a sufficient generic path-based condition that requires external excitation signals to be present at certain locations in the network, combining data informativity conditions with identifiability [36].
The set V denotes the set of indices of all the disturbing noise signals, where V_j is a subset of V. For the generic condition corresponding to Condition (1) of Proposition 2 we introduce the notation e_{X_j}(t), where X_j is the set of indices of all the disturbing noise signals excluding indices that are already present in the set V_j, i.e., X_j = V \ V_j.
Remark 8 If we want to identify only the j-th row of the network (or only part of the network), we can consider the predictor in Proposition 2 only for node j and satisfy the conditions in Propositions 2 and 4 for node j.
Next we elaborate the vertex-disjoint path conditions by means of an example where a network is subject to reduced rank noise.

Reduced rank noise example
We consider a 6-node network that satisfies Assumption 1 and is subject to reduced rank noise of rank p = 4, shown in Figure 1. This 6-node example is additionally used in the simulations in Section 6, and is further defined in Appendix E. The nodes are ordered such that the first p nodes are subject to full rank noise. Moreover, we assume the disturbance topology is correctly estimated.
• For w_1(t) with w_{N_1}(t) = w_4(t), there exists a vertex-disjoint path e_2(t) → w_4(t).
• For w_2(t) with w_{N_2}(t) = w_5(t), there exists a vertex-disjoint path e_3(t) → w_5(t).
• For w_4(t) with w_{N_4}(t) = w_2(t), there exists a vertex-disjoint path e_3(t) → w_5(t) → w_2(t).
• For w_5(t) with w_{N_5}(t) = [w_1(t) w_6(t)]ᵀ, there exist 2 vertex-disjoint paths e_1(t) → w_1(t) and e_4(t) → w_6(t).
• For w_6(t) with w_{N_6}(t) = w_3(t), there exists a vertex-disjoint path e_3(t) → w_3(t).
In order to satisfy Proposition 4 we therefore do not require additional external excitation signals r_k(t). Consequently, in order to identify the full network for the given example, it is sufficient to add external signals
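The vertex-disjoint path counts used in the example above can also be checked algorithmically. Below is a minimal sketch (not part of the paper's method) that counts vertex-disjoint paths from a set of source signals to a set of target nodes via Menger's theorem, i.e., unit-capacity max-flow with node splitting; the graph encoding and the helper name `count_vertex_disjoint_paths` are illustrative assumptions:

```python
from collections import defaultdict, deque

def count_vertex_disjoint_paths(edges, sources, targets):
    """Count vertex-disjoint paths from sources to targets via
    unit-capacity max-flow (Menger's theorem). Each node v is split
    into (v,'in') -> (v,'out') with capacity 1 so that internal nodes
    are used by at most one path; sources and targets are unrestricted."""
    cap = defaultdict(int)
    adj = defaultdict(set)

    def add(u, v, c):
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # residual (reverse) edge for augmentation

    nodes = set(sources) | set(targets)
    for u, v in edges:
        nodes |= {u, v}
    for v in nodes:
        c = len(nodes) if (v in sources or v in targets) else 1
        add((v, 'in'), (v, 'out'), c)
    for u, v in edges:
        add((u, 'out'), (v, 'in'), 1)
    S, T = 'S', 'T'
    for s in sources:
        add(S, (s, 'in'), 1)   # one path per external signal
    for t in targets:
        add((t, 'out'), T, 1)

    def bfs_path():
        prev = {S: None}
        q = deque([S])
        while q:
            u = q.popleft()
            if u == T:
                break
            for v in adj[u]:
                if v not in prev and cap[(u, v)] > 0:
                    prev[v] = u
                    q.append(v)
        if T not in prev:
            return None
        path, v = [], T
        while prev[v] is not None:
            path.append((prev[v], v))
            v = prev[v]
        return path

    flow = 0
    while True:
        path = bfs_path()
        if path is None:
            return flow
        for u, v in path:
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
        flow += 1
```

Two paths that share an internal node count as one, since the split internal node has capacity 1.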

Numerical simulations
In this section we show the results of the different steps in Algorithm 1. We assume R^0 = I, and consider the system given in Figure 1 and Appendix E.
For the simulation study we use normally distributed zero-mean white external signals, where {r(t)} has variance 5 and the vector of e-signals has variances {0.1, 0.2, 0.3, 0.4}. We simulate the nodes according to the data generating system and perform M = 100 Monte Carlo runs over five data lengths logarithmically spaced between 300 and 50000. For each of the data lengths N a specific value of the model order n is chosen according to n = 10, 20, 30, 40, 40, for increasing values of N. The actual model orders m_i, i ∈ {l, f, c, d}, can be derived from Appendix E.
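The experiment grid above can be sketched as follows (assuming NumPy; the node simulation itself, which depends on the system in Appendix E, is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Five data lengths logarithmically spaced between 300 and 50000,
# paired with the model order n used at each length.
N_grid = np.unique(
    np.round(np.logspace(np.log10(300), np.log10(50000), 5)).astype(int)
)
n_grid = [10, 20, 30, 40, 40]

def external_signals(N):
    """White external signals: r(t) with variance 5 and a rank-4
    e-vector with variances 0.1, 0.2, 0.3, 0.4 (values from the
    simulation setup)."""
    r = rng.normal(0.0, np.sqrt(5.0), size=N)
    e = rng.normal(0.0, np.sqrt([0.1, 0.2, 0.3, 0.4]), size=(N, 4))
    return r, e

for N, n in zip(N_grid, n_grid):
    r, e = external_signals(N)
    # ...simulate the nodes and fit an order-n nonparametric model here
```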
Next we describe the noise rank estimation results of step 1 of Algorithm 1.

Rank p and ordering of the nodes
In order to obtain the noise rank p we perform a rank test (singular value decomposition) on the covariance matrix Λ̂ in (13). Finally, with the noise rank p available, we can reorder the nodes such that [I_p 0] Π^T Λ̂ Π [I_p 0]^T has rank p.
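The rank test can be sketched as follows; the relative threshold on the singular values is an assumption for illustration, standing in for the singular value gap visible in the results below:

```python
import numpy as np

def estimate_noise_rank(Lambda_hat, rel_tol=1e-6):
    """Estimate the noise rank p from the estimated covariance matrix:
    count singular values above a relative threshold (the threshold
    value is an assumption; the gap in the spectrum is what matters)."""
    s = np.linalg.svd(Lambda_hat, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

# Averaged singular values reported for N = 300 in the simulations;
# a diagonal stand-in matrix reproduces this spectrum.
s300 = [0.37, 0.26, 0.21, 0.06, 2.13e-8, 1.96e-9]
p = estimate_noise_rank(np.diag(s300))  # p = 4: last two values vanish
```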
Next we show the disturbance topology detection results of step 2 of Algorithm 1.

Topology estimation of the disturbance model
For the topology detection we are interested in which indices belong to the set V_j for all j, where the indices indicate where the edges are located in the disturbance model. We evaluate the performance of the topology detection via the trade-off between overestimating and underestimating the number of edges, as is typically done with receiver operating characteristic (ROC) curves [21].
If an edge is present in both the data generating disturbance model and the estimated disturbance topology, we count this edge as a true positive (TP). If an edge is present in the estimated disturbance topology but does not exist in the data generating system, we count this edge as a false positive (FP). Additionally, we let Pos denote the total number of existing edges and Neg the total number of non-existing edges in the disturbance model. The ROC curve plots the true positive rate (TPR) versus the false positive rate (FPR), where the point (0, 1), i.e., FPR = 0 and TPR = 1, indicates that the topology is perfectly reconstructed. We evaluate the closeness to the point (0, 1) by utilizing a distance function.
For the structure selection procedure we test all possible combinations of the set V_j and employ AIC, BIC and CV.
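The paper's exact distance function is not reproduced here; a natural choice, assumed for illustration, is the Euclidean distance from the operating point (FPR, TPR) to the ideal point (0, 1):

```python
import math

def roc_distance(tp, fp, pos, neg):
    """Distance from the ROC operating point (FPR, TPR) to the ideal
    point (0, 1); smaller is better, and 0 means the topology is
    perfectly reconstructed. The Euclidean form is an assumption."""
    tpr = tp / pos
    fpr = fp / neg
    return math.hypot(fpr, 1.0 - tpr)
```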
For the AIC and the BIC, with n_{p_j} the number of estimated parameters for node j, we select the set V_j that gives the smallest AIC or BIC value. For the CV we split the data Z^N = {Z^(1), Z^(2)} into a training set Z^(1) of length (2/3)(N+1), on which we obtain the estimates η̂_j^(1) for the different combinations of the set V_j. With the validation set Z^(2), containing the remaining data of length N^(2) = (1/3)(N+1), we minimize the objective function and select the set V_j that gives the smallest root mean squared error RMSE_j(N, Z^(2)) (46).
For Glasso we fully parametrize the disturbance model, using the known topology of G^0 and the fixed R^0 = I. We inspect all elements of the disturbance model matrix that is parametrized with the Glasso estimates (21). If element H_ji(η̂_N) of the disturbance model matrix contains nonzero Glasso estimates, we say this element contains dynamics, and therefore an edge is present and i ∈ V_j.
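The structure selection over candidate sets V_j can be sketched as follows. The AIC/BIC expressions below are the standard Gaussian forms (the paper's exact formulas are not reproduced here, so these are assumptions), and `fit_fn` stands in for the node-wise least squares fit:

```python
import numpy as np

def aic(rss, N, n_params):
    # Standard Gaussian AIC (assumed form): data fit plus 2x penalty
    # on the number of estimated parameters for node j.
    return N * np.log(rss / N) + 2 * n_params

def bic(rss, N, n_params):
    # Standard BIC (assumed form): penalty grows with log N, which
    # makes the criterion consistent.
    return N * np.log(rss / N) + n_params * np.log(N)

def select_structure(candidates, fit_fn, criterion):
    """Fit each candidate index set V_j and keep the one minimizing
    the information criterion; fit_fn returns (rss, N, n_params)."""
    scores = {V: criterion(*fit_fn(V)) for V in candidates}
    return min(scores, key=scores.get)
```

Cross-validation replaces `criterion` by the RMSE on the held-out third of the data, with the fit performed on the first two thirds.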
To prevent arbitrarily small Glasso estimates from being interpreted as dynamics, we define a tolerance: the Glasso estimates are considered nonzero if the l2 norm of these estimates is larger than 10^-3. The choice to include the estimates of G_jl(η) in the penalization is due to the implementation of Glasso [5]. To obtain good estimates of the disturbance topology, we utilize the known topology of G^0 and handle the known R^0 r(t) signals appropriately.
Tuning of λ_j is done via a grid-based search similar to the CV structure selection. First we select a grid λ_j^grid = {0, 25, 50, ..., 2000} of λ_j values to test. For each grid point we estimate η̂_j^grid using Glasso, derive the topology by inspecting the disturbance model for dynamics as described above, and fix the topology Ĥ_j^grid per node. Next we apply CV using the topology Ĥ_j^grid and estimate RMSE_j. The grid point with the lowest RMSE_j is selected as the value of λ_j. Repeating the tuning procedure over a number of runs gives the minimally required value of λ_j. The tuning procedure is applied to all nodes for the different data lengths N.
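The tuning loop above can be sketched as follows; the three callables are placeholders for the paper's estimators (the Glasso fit, the tolerance-based topology inspection, and the CV scoring), not real library calls:

```python
import numpy as np

def tune_lambda(lambda_grid, glasso_fit, topology_of, cv_rmse):
    """Grid search for the Glasso regularization weight lambda_j:
    fit Glasso at each grid point, derive and fix the topology by
    keeping elements whose estimates exceed the 1e-3 norm tolerance,
    score that topology with cross-validation, and keep the best point."""
    best_lam, best_rmse = None, np.inf
    for lam in lambda_grid:
        eta = glasso_fit(lam)               # Glasso estimate at this lambda
        topo = topology_of(eta, tol=1e-3)   # edges with ||estimate||_2 > tol
        rmse = cv_rmse(topo)                # CV score with topology fixed
        if rmse < best_rmse:
            best_lam, best_rmse = lam, rmse
    return best_lam
```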
Figure 2 shows the topology detection results, with the distance averaged over 100 Monte Carlo runs. The BIC is a consistent information criterion [23,33], meaning that the estimated disturbance topology converges to the actual topology as N → ∞. However, as can be seen in Figure 2, full convergence of the BIC procedure is not reached for the given data lengths. Until the BIC procedure converges to the actual disturbance topology, it tends to underestimate the number of edges that are actually present; the mismatch in the distance function is therefore caused by not detecting all the TPs. The AIC is not a consistent information criterion, but it has a faster convergence rate than the BIC [47]. The AIC tends to overestimate the number of edges, meaning the mismatch is caused by detecting FPs. The CV is comparable to the AIC but has a slower convergence rate. Finally, the Glasso seems to combine the best of both the AIC and the BIC. However, these results depend heavily on the selected tuning parameter λ, and it is not guaranteed that a suitable λ exists.
Next we show the parametric estimation results of Step 3 of Algorithm 1, where we fix the estimated disturbance topology. Based on the results in Figure 2 we have fixed the correctly estimated disturbance topology obtained with Glasso for N = 50000, where TPR = 1 and FPR = 0.

Estimating the parametric model
Next we present the results of the estimation of the parametric model. Because Algorithm 1 is consistent, the bias is negligible and the mean squared error (MSE) represents the variance. For the simulations we use the correctly estimated disturbance topology from the previous step. Additionally, for Step 3.2 of Algorithm 1, we compute θ̂_j^[0] in (29) using the covariance matrix in (32). As the data length N increases, the number of iterations performed decreases; for N = 50000 the simulations typically perform k = 2 iterations. The MSE(N) improvement after the iterations is shown in Table 1, from which we can derive that we benefit most from iterating in the final step of Algorithm 1 if we do not have full excitation of the network with R^0 = I. In Figures 3 and 4 we see convergence between the solid and dotted lines as the data length N increases, indicating that the reconstructed innovation converges to the actual noise. Furthermore, all MSE results continue to converge towards zero, which is in line with the consistency proof.
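The Monte Carlo MSE curves can be computed as sketched below; the exact MSE definition used in the paper is not reproduced, so the squared-norm average over runs is an assumption:

```python
import numpy as np

def mse_over_runs(theta_hats, theta_0):
    """Monte Carlo MSE between the parameter estimates of M runs and
    the true parameter vector: the mean of ||theta_hat - theta_0||^2
    over runs (assumed definition)."""
    theta_hats = np.asarray(theta_hats, dtype=float)
    return float(np.mean(np.sum((theta_hats - theta_0) ** 2, axis=1)))
```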
The results of this simulation study support the consistency proof: we consistently estimate the BJ model structure while employing a row-wise optimization.

Conclusions
In this paper we present a multi-step least squares method for network identification that can handle reduced rank noise with low computational burden. We follow a stepwise procedure, first extending the SLR identification method to detect the disturbance topology, and thereafter extending the WNSF method to consistently identify networks of general model structure, including the BJ model structure. For a BJ network a non-convex MIMO identification method is usually needed; in this paper we show that we can identify the BJ network using analytical solutions only. Simulation results indicate that we can identify the disturbance topology of the given network with low error if the data length N is sufficiently large. We show that the presented method is consistent, and provide path-based data informativity conditions that guide where to allocate external excitation signals in the experimental design. For large networks subject to correlated and/or reduced rank noise, the presented method is promising due to its scalability and low variance results.

A Proof of Proposition 1
Consider the prediction error for the predictor ŵ(t|t−1, ζ) from (35), given in (A.1). With the data generating system (1), we can rewrite the prediction error accordingly. (2) Show that the global minimum is unique.
Step 1: By using the property that all ΔG- and ΔH-terms are strictly proper, it follows from (B.) that, with J_wa, J_wb, J_we appropriate transfer function matrices: since ρ = [r_a^T r_b^T e^T]^T is persistently exciting, i.e., Φ_ρ(ω) > 0 for all ω, it follows from Lemma 1 in [36] that κ is persistently exciting if and only if the matrix J has full row rank. Since full row rank of J is equivalent to full row rank of [J_wb J_we], the result of Proposition 1 in [36] then shows the equivalence with the condition that there are L vertex-disjoint paths from the inputs of [J_wb J_we], i.e., r_b and e, to its outputs, i.e., w. Similarly, with J_wr, J_wx, J_wv appropriate transfer function matrices: since ρ = [r^T e_{X_j}^T e_{V_j}^T]^T is persistently exciting, i.e., Φ_ρ(ω) > 0 for all ω, it follows from Lemma 1 in [36] that κ is persistently exciting if and only if the matrix J has full row rank. Since full row rank of J is equivalent to full row rank of [J_wr J_wx], the result of Proposition 1 in [36] then shows the equivalence with the condition that there are Cardinal{N_j} vertex-disjoint paths from the inputs of [J_wr J_wx], i.e., r and e_{X_j}, to its outputs, i.e., w_{N_j}. □

E System used in simulations
In the simulation results in Section 6 we use the data generating network whose graph is represented in Fig. 1.

Fig. 1. 6-node dynamic network with reduced rank noise of rank p = 4; no r(t) signals are shown. The arrows represent the edges for which G^0_ji ≠ 0 and H^0_ji ≠ 0, where the arrows indicated in red are examples of the two vertex-disjoint paths needed to satisfy Proposition 4 for output w_3(t).

For data length N = 300, the singular values averaged over the 100 Monte Carlo runs are svd(Λ̂_N) = [0.37 0.26 0.21 0.06 2.13·10^-8 1.96·10^-9], where we see that the last two singular values are close to zero. As the data length increases, the last two values converge even closer to zero. For N = 50000 we obtain the averaged singular values svd(Λ̂_N) = [0.59 0.40 0.39 0.10 4.04·10^-13 1.24·10^-13], showing a clear gap between the fourth and fifth singular values that points to a correct rank estimate of 4.

Fig. 2. Distance as a function of N, averaged over the Monte Carlo runs.

Fig. 3. MSE between θ̂_N and θ^0 as a function of sample size, averaged over the Monte Carlo runs, obtained with Algorithm 1 with R^0 = I, where subscript {t} indicates the use of the true (unknown) white noise as predictor input instead of the reconstructed innovation.

D Proof of Proposition 4
Similar to the line of reasoning in the proof of Proposition 3, the vector signal κ is written as
Input: …, (20), network topology. Output: disturbance topology, θ̂_N.
2.1 Obtain the consistent estimate ζ̂^n_N with the least squares solution (36), where the nodes are ordered by utilizing the estimated noise rank p.
2.2 Use the reconstructed innovation ε̂_a(t, ζ̂^n_N) as measured input in the one-step-ahead predictor (20), defined in (17), to estimate the noise correlation structure. We use (i) structure selection with AIC, BIC and CV, or (ii) Glasso, applied to the estimate η̂^n_{j,N} obtained with the least squares solution (37).
2.3 Refine the nonparametric ARX model and obtain the consistent estimate η̂^n_N with the one-step-ahead predictor (20), where the estimated disturbance topology is fixed, and update the reconstructed innovation to ε̂_a(t, η̂^n_N) to re-estimate η̂^n_{j,N} in each iteration.