Time Series Modeling on Dynamic Networks

This paper focuses on modeling the dynamic attributes of a dynamic network with a fixed number of vertices. These attributes are considered as time series whose dependency structure is influenced by the underlying network. They are modeled within a multivariate doubly stochastic time series framework; that is, we assume linear processes whose coefficient matrices are stochastic processes themselves. We explicitly allow for dependence in the dynamics of the coefficient matrices as well as between these two stochastic processes. This framework allows the attributes and the underlying network to be modeled separately. In this setting, we define network autoregressive models and discuss their stationarity conditions. Furthermore, an estimation approach is discussed in low- and high-dimensional settings, together with its application to forecasting. The finite sample behavior of the forecasting approach is investigated in a simulation study, and the approach is applied to real data with the goal of forecasting the GDP of 33 economies.


Introduction
Consider a vertex-labeled dynamic and weighted network with a fixed number d of vertices given by the set V = {1, . . . , d}. The weights lie in the interval [−1, 1]. Such a dynamic and weighted network with a fixed number of vertices can be described by a time-dependent adjacency matrix, here denoted by Ad = {Ad_t, t ∈ Z}, where Ad_t is [−1, 1]^{d×d}-valued. If no edge is present at time t, a zero weight is used. Thus, e_i' Ad_t e_j gives the weight of the edge at time t from vertex i to vertex j. The network is considered to be driven by some random mechanism; hence, the corresponding adjacency matrix process Ad is a stochastic process.
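As a concrete illustration, the adjacency-matrix view of such a network can be sketched in a few lines (a minimal Python sketch with illustrative sizes; the helper `edge_weight` is hypothetical and not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6  # illustrative: 4 vertices, 6 time points

# One [-1, 1]-valued adjacency matrix per time point; a zero entry encodes "no edge".
Ad = rng.uniform(-1.0, 1.0, size=(T, d, d))
Ad[:, rng.random((d, d)) < 0.5] = 0.0  # remove roughly half of the possible edges

def edge_weight(Ad_t, i, j):
    """Weight of the edge from vertex i to vertex j at one time point: e_i' Ad_t e_j."""
    e_i, e_j = np.eye(len(Ad_t))[i], np.eye(len(Ad_t))[j]
    return float(e_i @ Ad_t @ e_j)
```

Here e_i' Ad_t e_j is simply the (i, j) entry of Ad_t; a realization of the process Ad is then just a sequence of such matrices.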
The vertices are considered as actors (e.g. people, countries), and a network could describe a relationship structure among these actors. A (weighted) edge between two actors describes some connection between them. The weights can be interpreted as the strength of the connection. For example, consider economies as actors where a possible relationship between two economies could be given by their relative trade volume at a given time point. Further examples are social networks, see [10,23]. The actors in such networks often possess attributes. These attributes can be static (e.g. a person's name or birthday) or dynamic (e.g. personal income, time a person does sports, or political views).
In the example with the economies and their trade volume as a relationship between them, dynamic attributes of interest are macroeconomic measures such as the inflation rate or gross domestic product (GDP). These attributes may be affected by the attributes of other actors, especially by actors with which the considered actor is strongly connected. In this work, the dynamic attributes are denoted by a time series X = {X_t, t ∈ Z}. To simplify the notation, we focus on the case that each actor has only one attribute, meaning that X is d-dimensional. Nevertheless, this framework can also handle multiple attributes per actor, see Section 2 for details.
Each component of the time series X is assigned to a vertex (actor) of the underlying network. In the socio-economic literature, the influence of connected actors on an actor's attributes is referred to as a peer effect, see [8,20].
This work focuses on the dynamic attributes and not on the network itself. Consequently, this work is not about modeling a dynamic network. For modeling these dynamic networks, many models for static networks have been extended to the dynamic case as done by Hanneke et al. [9], Krivitsky and Handcock [15] for the Exponential Random Graph Models (ERGM), see Section 6.5 in [14], or by Xing et al. [38], Xu [39] for the stochastic block model (SBM). By contrast, this work gives a framework which models the dynamic attributes, that means modeling a time series on a dynamic network, in which the weighted edges influence the dependency of the time series. Knight et al. [12], Zhu et al. [42] have modeled these attributes for non-random edges, which mainly cover static networks. In the context of a static network, attributes can be considered as standard multivariate time series with additional information and can be modeled by using vector autoregressive (VAR) models with constraints, see Lütkepohl [19,Chapter 2 and 5]. However, VAR models have many parameters, which is why Knight et al. [12], Zhu et al. [42] focus on how to use the network structure to reduce the number of parameters so that high dimensions, meaning a large number of vertices, become feasible. By contrast, this work deals with a random network structure, and consequently, the process X cannot be modeled appropriately by using VAR models. Instead, we adopt a multivariate doubly stochastic time series framework, meaning two stochastic processes drive the time series. On the one hand, there is the innovation process of the time series. On the other hand, the coefficient matrices in linear processes or autoregressive models are stochastic processes themselves. Doubly stochastic time series models were introduced in Tjøstheim [31], Pourahmadi [27,28], and these authors assume that the two processes driving the time series are independent. 
That would mean the network could influence the attributes but not the other way around. This assumption could be too restrictive for most applications. In the example where the GDP is the attribute and the trade volume defines the underlying network, the influence goes in both directions. To capture such behavior, we allow dependency between both processes, see Section 2 for details. Hence, the network, in the form of the edges, can influence the attributes, and the attributes can influence the edges.
Knight et al. [11] extended the model of [12] to dynamic networks. Their model can be considered as a special case of the model considered in this paper, see Section 2 for details. They present an R-package to estimate their model; however, they present theoretical results only for the case of a static network, in which the attributes can be written as a VAR model with constraints. Thus, they do not derive theoretical results for the dynamic case. This gap is filled in this work.
This paper is structured as follows. In Section 2, time series on dynamic networks are defined and some basic properties are given. In Section 3, the focus is on statistical results of network autoregressive processes, and their applications to forecasting are discussed. Some of the forecasting results are underlined by a simulation study which is given in Section 4. In Section 5, we apply this setup to forecast the GDP using the trade volume between economies as an underlying network. Proofs can be found in Section 7.

Time Series Modeling on Dynamic Networks
To elaborate, we first fix some notation. For a random variable X, we write ‖X‖_{E,q} for (E|X|^q)^{1/q}, where q ∈ N. For a vector x ∈ R^d, x_j = e_j' x denotes its jth component, where e_j = (0, . . . , 0, 1, 0, . . . , 0)' denotes the unit vector with the one appearing in the jth position. Let 1 = (1, . . . , 1)' be a vector of ones. For a matrix B, the absolute value evaluated component-wise is denoted by |B|. Denote the largest eigenvalue of a matrix B by ρ(B) and ‖B‖_2^2 = ρ(BB'). The d-dimensional identity matrix is denoted by I_d. Furthermore, for two matrices A, B, the Kronecker product of A and B is denoted by A ⊗ B, see among others Appendix A.11 in [19]. Let A ⊙ B denote the component-wise multiplication of A and B, i.e. the Hadamard product. sign(·) denotes the signum function and |·|_+ = max(0, ·); both are evaluated component-wise for matrix arguments. Let I_{d;−I} ∈ R^{(d−|I|)×d} denote the d-dimensional identity matrix without the rows i ∈ I, and I_{d;I} = I_{d;−I^C}. An empty product denotes the neutral element, meaning ∏_{k=1}^{0} B_k = I_d. For a vector-valued time series {X_t}, we write X_{t;r} := e_r' X_t, and for a matrix-valued time series {Ad_t}, we write Ad_{t;rs} := e_r' Ad_t e_s.
With this, we can define a network linear process as follows.
Definition 1. Let {ε_t, t ∈ Z} be a d-dimensional i.i.d. innovation process such that {ε_s, s > t} and {Ad_s, s ≤ t} are independent for all t. A process X = {X_t, t ∈ Z} given by

X_t = ∑_{j=0}^∞ B_{t,j} ε_{t−j},    (1)

where the coefficient matrices B_{t,j} are stochastic processes driven by the network process Ad, is denoted a (generalized) network linear process (GNLP). Let p, q ∈ N and f_j, g_s : [−1, 1]^{d×d} → R^{d×d}, j = 1, . . . , p, s = 1, . . . , q, be measurable functions. A process X fulfilling

X_t = ∑_{j=1}^p f_j(Ad_{t−j}) X_{t−j} + ∑_{s=1}^q g_s(Ad_{t−s}) ε_{t−s} + ε_t    (2)

is denoted as a (generalized) network autoregressive moving average process of order (p, q) (GNARMA(p, q)). In this work, the focus is on the following network autoregressive process of order p (NAR(p)) given by

X_t = ∑_{j=1}^p (A_j ⊙ G_j(Ad_{t−j})) X_{t−j} + ε_t,    (3)

where A_1, . . . , A_p ∈ R^{d×d} are parameter matrices and G_j : [−1, 1]^{d×d} → [−1, 1]^{d×d}, j = 1, . . . , p, are some known measurable functions. Since (3) is a special case of (2), where the functions f_j appearing in (2) have a particular form, we drop the term generalized. Note that the causal solution of (3) fits into the framework (1), see Lemma 2.2. Since for an adjacency matrix Ad ∈ {0, 1}^{d×d} the entry e_r' Ad^j e_s gives the number of paths of length j from vertex r to vertex s, examples for G_j are polynomials. E.g., if for some lag j the direct neighbors as well as the neighbors of these neighbors should have a direct impact, then G_j can be chosen as G_j(Ad_{t−j}) = sign(Ad_{t−j} + Ad_{t−j}^2). Denote a vertex v a k-stage neighbor of u if there is a path from u to v of length k but no shorter one. That means a direct neighbor is a 1-stage neighbor, and a neighbor's neighbor which is not a direct neighbor is a 2-stage one. Given an adjacency matrix Ad ∈ {0, 1}^{d×d}, the k-stage neighborhood matrix is given by

N_k(Ad) = |sign(Ad^k) − sign(∑_{j=1}^{k−1} Ad^j)|_+ .

For direct edges, two natural concepts occur: the influence goes in the direction of the edge, or vice versa. Definition 1 can handle both concepts. E.g., if G̃_j(·) = G_j(·)', j = 1, . . . , p, is used in (3), one can switch between both concepts. If not specified otherwise, the concept that the influence goes in the direction of the edge is used in this work. That means the simplest choice for G_j in model (3) is G_j(X) = X'.
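For a static 0/1 adjacency matrix, the two ingredients just discussed — a polynomial choice of G_j and the k-stage neighborhood matrix — can be sketched numerically (a Python sketch; the clipped-difference form of N_k used here is an assumption about how "path of length k but no shorter one" is encoded):

```python
import numpy as np

def G_two_stage(Ad):
    """sign(Ad + Ad^2): both 1-stage and 2-stage neighbours get a direct impact."""
    return np.sign(Ad + Ad @ Ad)

def k_stage(Ad, k):
    """k-stage neighbourhood matrix |sign(Ad^k) - sign(sum_{j<k} Ad^j)|_+ :
    a path of length k exists, but no shorter one."""
    Ak = np.linalg.matrix_power(Ad, k)
    shorter = sum(np.linalg.matrix_power(Ad, j) for j in range(1, k))  # 0 if k == 1
    return np.maximum(np.sign(Ak) - np.sign(shorter), 0)

# Path graph 1 -> 2 -> 3: vertex 3 is a 2-stage neighbour of vertex 1.
Ad = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
```

On this path graph, `G_two_stage(Ad)` adds the (1, 3) entry that `Ad` itself lacks, while `k_stage(Ad, 2)` isolates exactly that 2-stage connection.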
An NAR(p) model can be written as a stacked NAR(1) process in the following way. Let 𝑿_t = (X_t', X_{t−1}', . . . , X_{t−p+1}')'. Then, the stacked NAR(1) process corresponding to (3) is given by 𝑿_t = 𝑨_t 𝑿_{t−1} + 𝜺_t, where 𝜺_t = (ε_t', 0, . . . , 0)' and

𝑨_t = ( A_1 ⊙ G_1(Ad_{t−1})  · · ·  A_p ⊙ G_p(Ad_{t−p}) ; I_{d(p−1)}  0 )

are the corresponding matrices of the stacked NAR(1) process. It is also possible to handle more than one attribute at a time by simply enlarging Ad. For two different attributes, we can replace Ad_t by the block matrix

( Ad_t  B_t ; C_t  Ad_t ),

where B_t and C_t describe the (time-dependent) relationship between the different attributes. E.g., if there shall only be an influence between the different attributes of the same actor, we set B_t = C_t = I_d, or if the different attributes shall influence each other in the same way as they are influenced by their own kind, we set B_t = C_t = Ad_t. Model (3) is inspired by Knight et al. [12]. Let s_j be the order number for lag j, which denotes the maximal stage of neighbors included for lag j. Let Ad be a static adjacency matrix and denote by N^{(k)}(r) = {i = 1, . . . , d : e_r' N_k(Ad) e_i = 1} the k-stage neighbors of vertex r. Then, for component i = 1, . . . , d, their autoregressive model is given by

X_{t;i} = ∑_{j=1}^p ( α_{j,i} X_{t−j;i} + ∑_{k=1}^{s_j} β_{j,k} ∑_{q ∈ N^{(k)}(i)} X_{t−j;q} / |N^{(k)}(i)| ) + ε_{t;i}.    (4)

Note that the k-stage neighborhood sets are disjoint for different k. Thus, the above model fits into the framework (3) with G_j given by a weighted sum of the normalized k-stage neighborhood matrices N_k(Ad), k = 1, . . . , s_j. Knight et al. [11] extended the model (4) to dynamic networks with potential covariates and edge weights. Apart from the covariates, their extended model also fits into the framework (3). As mentioned, in the case of a static network, model (3) can be considered as a vector autoregressive model with parameter constraints. For this case, Knight et al. [11] give conditions for stationarity and show consistency of the least squares approach. However, they give no theoretical results for the case of a dynamic network. In this work, the focus is on the dynamic case. Note that for this case model (3) cannot be considered as a VAR model with constraints anymore.
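The stacked (companion) form can be sketched as follows (a Python sketch; this is the standard VAR companion construction applied to the time-varying coefficient matrices A_j ⊙ G_j(Ad_{t−j})):

```python
import numpy as np

def companion(coeff_mats):
    """Stack an NAR(p) into NAR(1): the first block row holds the p coefficient
    matrices A_j * G_j(Ad_{t-j}) (Hadamard products), identities are shifted below."""
    p, d = len(coeff_mats), coeff_mats[0].shape[0]
    top = np.hstack(coeff_mats)
    if p == 1:
        return top
    bottom = np.hstack([np.eye(d * (p - 1)), np.zeros((d * (p - 1), d))])
    return np.vstack([top, bottom])

# p = 2, d = 2: the stacked state (X_t', X_{t-1}')' evolves by one matrix per step.
M1 = np.array([[0.5, 0.0], [0.1, 0.2]])
M2 = np.array([[0.0, 0.1], [0.0, 0.3]])
A_t = companion([M1, M2])
```

Multiplying A_t onto the stacked state (X_{t−1}', X_{t−2}')' reproduces the NAR(2) recursion in the first d coordinates and shifts X_{t−1} into the remaining ones.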
The condition that {ε_s, s > t} and {Ad_s, s ≤ t} are independent for all t ensures that (1) is a meaningful and causal representation. This condition allows for an interaction between network Ad and X in the sense that the network at time t can be influenced by {X_s, s ≤ t}. That means an underlying network given by the adjacency process Ad has to fulfill only this condition, strict stationarity, and later on some dependence measure conditions on the dynamic behavior, but we assume nothing about the inner structure of the network. Thus, it does not matter if it is a sparse or dense network, or if it has properties like the small-world phenomenon. Hence, this gives the flexibility that the time series and the underlying dynamic network can be modeled separately. One is not tied to a specific network model, as would be the case for a joint modeling approach. Instead, the idea is that the approach described here is used to model the time series X, and one of the several models for dynamic networks can be used to model the network Ad.
If Ad is a deterministic sequence, the GNARMA model is closely related to time-varying ARMA models, which, for instance, are used in the locally stationary framework; see [6,32]. Furthermore, if Ad is i.i.d., this framework reduces to the framework of random coefficient models, see for instance [25] and for the multivariate setting [24]. However, assuming independence between different time-points for the process Ad seems to be inappropriate in the framework of dynamic networks. Some form of influence of the recent history seems to be more reasonable, see among others [10].
Assumption 1 gives conditions which imply that (3) possesses a causal, stationary solution, see the following Lemma 2.2 for details. Assumption 1a) imposes only conditions on X, but no restrictions on the underlying dynamic network are required. If more about the underlying network is known, e.g. its weights and sparsity setting, one may work with Assumption 1b), which is more general but harder to verify without knowledge about the network. In Section 3.2, a simplified model is considered in which Assumption 1b) can be verified under simple conditions.
where for all t, ‖B_{t,j}‖_2 ≤ ‖|Ã|^j‖_2. The process has the following autocovariance function

Cov(X_{t+h}, X_t) = ∑_{j=0}^∞ ∑_{s=0}^∞ Cov(B_{t+h,j} ε_{t+h−j}, B_{t,s} ε_{t−s}).

The autocovariance as well as the mean of X is affected by the dynamic behavior of the underlying network. In order to get a better understanding of how the dynamic dependency of the network affects the time series, the following Lemma 2.3 presents the autocovariance structure in the simpler case that {Ad_t} and {ε_t} are mutually independent. Note that Assumption 1 implies for model (3) the conditions (ii), (iii) of the following Lemma.
The latter term of the autocovariance function, ∑_{j=0}^∞ ∑_{s=0}^∞ Cov(B_{h,j} μ, B_{0,s} μ), comes into play only for non-centered innovations and is driven by the linear dependency structure of the network. Consequently, it can be seen that the linear dependency of the network directly influences the linear dependency of the process X. As a consequence, even a network moving average process of order q may possess a nonzero autocovariance for lags higher than q. In order to better understand this, consider a small toy example with three vertices and two possible edges, (1,3) and (2,3), where only one edge is present at a time. Let {e_t, t ∈ Z} be i.i.d. random variables with uniform distribution on [0, 1], i.e., e_1 ∼ U[0, 1]. Which edge is present at time t is given by the random variables (e_t) in the following way. If Ad_{t−1;13} = 1, then Ad_{t;13} = 1 if e_t > 0.05, else Ad_{t;23} = 1. If Ad_{t−1;13} = 0 (that means Ad_{t−1;23} = 1), then Ad_{t;13} = 1 if e_t > 0.95, else Ad_{t;23} = 1. Consequently, this network flips between the edges (1,3) and (2,3), and if one edge is present at time t, it is more likely (with probability 0.95) to be present at time t + 1 than to flip to the other edge. We have dependency between different time points as well as between edges. Let {ε_t} be i.i.d. with ε_1 ∼ N(μ, I_3) and μ = (10, −10, 0)'. Let X be given by

X_t = ε_t + Ad_{t−1}' ε_{t−1}.    (8)

Thus, X is a network moving average process of order 1, and the influence goes in the direction of the edges. Since no edge goes into vertex 1 or 2, {X_{t;1}, t ∈ Z} and {X_{t;2}, t ∈ Z} are white noise. This can also be seen in the autocovariance function, which is displayed in its two parts in Figure 1. The left-hand side panels display the first part, ∑_{s=0}^∞ E(B_{h,s+h} Σ_ε B_{0,s}'). The dependency of the network has no influence on the first part; thus, this part would remain the same if a static model were considered where Ad is replaced by its expected value.
That is why this part of the autocovariance function has the structure one expects from a vector moving average (VMA) process of order 1. The right-hand side panels display the second part of the autocovariance function in Lemma 2.3, ∑_{j=0}^∞ ∑_{s=0}^∞ Cov(B_{h,j} μ, B_{0,s} μ). As already mentioned, this part is completely driven by the linear dependence structure of the network. For the two edges, we have the following linear dependency: Cov(Ad_{t+h;23}, Ad_{t;23}) = Cov(Ad_{t+h;13}, Ad_{t;13}) = 0.9^h/4 and Cov(Ad_{t+h;23}, Ad_{t;13}) = Cov(Ad_{t+h;13}, Ad_{t;23}) = −0.9^h/4. This explains the geometric decay in the autocovariance function of the third component of X, whereas the magnitude of this autocovariance is mainly given by the difference of the means of the innovations of the first two components. Hence, a greater difference of the innovation means makes it harder to identify the linear dependency (the first part of the autocovariance function in Lemma 2.3) between components 1 and 3, or 2 and 3, respectively. In this particular example with mean μ = (10, −10, 0)', no linear dependency between the different components can be identified for moderate sample sizes. A sample autocorrelation function as well as a realization of the third component of X is displayed in Figure 2 for a sample size n = 500. From the perspective of classical time series analysis, the sample autocorrelation function instead suggests three uncorrelated components, where the first two components are white noise and the third could be an AR(1) process. Hence, this example highlights two important aspects: firstly, the linear dependency of the network can directly influence the linear dependency of the time series. Secondly, the autocovariance function may not suffice to identify network linear processes such as network autoregressive models.
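The toy example can be reproduced in a few lines (a Python sketch following the description; seed and sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
T, mu = 600, np.array([10.0, -10.0, 0.0])

# Exactly one of the edges (1,3), (2,3) is present; it persists with probability 0.95.
Ad = np.zeros((T, 3, 3))
Ad[0, 0, 2] = 1.0
for t in range(1, T):
    e = rng.uniform()
    if Ad[t - 1, 0, 2] == 1.0:
        i = 0 if e > 0.05 else 1
    else:
        i = 0 if e > 0.95 else 1
    Ad[t, i, 2] = 1.0

# Network moving average of order 1, influence along the edges:
# X_t = eps_t + Ad_{t-1}' eps_{t-1}.
eps = rng.normal(size=(T, 3)) + mu
X = eps[1:] + np.einsum("tij,ti->tj", Ad[:-1], eps[:-1])
```

Since no edge goes into vertex 1 or 2, the first two components of X coincide with their innovations, i.e. they are white noise, while component 3 inherits the network's temporal dependence.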
Figure 1: The two parts (left- and right-hand side panels) of the autocovariance function (7) of process (8).

Statistical results
In this section we focus on the estimation of model (3). As seen in the example in Section 2, the autocovariance function is not helpful to identify such models.
Note further that even if Ad is Markovian, an NAR(1) process can generally not be written as a hidden Markov model (HMM). This is because X given Ad is not a sequence of conditionally independent variables and cannot be written as a noisy functional of Ad_{t−1} only, which is required by an HMM, see among others [2] for details on HMMs. Consequently, techniques used for HMMs cannot be applied here. Instead, the same setting as in Knight et al. [12,11], Zhu et al. [42] is considered. Thus, the process X as well as the network Ad is observed, leading to observations X_1, . . . , X_n and Ad_1, . . . , Ad_{n−1}. In such a setting, the consistency of a least squares estimator as well as asymptotic normality for model (3) is shown in the first subsection. The results are presented under general assumptions and in the asymptotic setting that d is fixed and n → ∞. Later on, we give dependence measure conditions for the underlying dynamic network such that the general assumptions hold. In the second subsection, a simplified version of model (3) is considered. This simplified model is suited for high-dimensional cases, and consequently, the theoretical estimation results are presented nonasymptotically.

Network autoregressive models
Networks usually come together with some form of sparsity, see among others Section 3.5 in [14]. This means that a vertex has a connection to only a limited number of other vertices, and E sign(Ad_1) could have some zero entries or might even be a sparse matrix. Thus, E sign(G_j(Ad_1)), j = 1, . . . , p, might be sparse matrices as well. That means the number of parameters of model (3) is given by ∑_{j=1}^p ‖vec(E|sign(G_j(Ad_1))|)‖_0 ≤ pd^2 and depends on the sparsity of the underlying network. Let

I(r) = {ĩ = i + (j−1)d, i = 1, . . . , d, j = 1, . . . , p : 1/n ∑_{t=p+1}^n e_r' |G_j(Ad_{t−j})| e_i > 0}, r = 1, . . . , d,

be the set of indices corresponding to the non-zero entries in row r of ∑_{t=p+1}^n |(G_1(Ad_{t−1}) : · · · : G_p(Ad_{t−p}))|, and

I(r)^E = {ĩ = i + (j−1)d, i = 1, . . . , d, j = 1, . . . , p : E e_r' |G_j(Ad_1)| e_i > 0}, r = 1, . . . , d,

the corresponding population quantity. Note that I(r) ⊆ I(r)^E for all n. Only parameters corresponding to indices of the set I(r)^E are well defined in the sense that they have an influence on the process. We set the other parameters, meaning those corresponding to indices of the set (I(r)^E)^C, to zero. Recall that I_{d;−I} ∈ R^{(d−|I|)×d} denotes the d-dimensional identity matrix without rows i ∈ I and I_{d;I} = I_{d;−I^C}. Let, for r = 1, . . . , d,

w_r = I_{dp;I(r)} (e_r' A_1, . . . , e_r' A_p)'  and  Y^{(r)}_{t−1} = I_{dp;I(r)} ((e_r' G_1(Ad_{t−1})) ⊙ X_{t−1}', . . . , (e_r' G_p(Ad_{t−p})) ⊙ X_{t−p}')'.

Then, for t = p+1, . . . , n and r = 1, . . . , d, we can write (3) as

X_{t;r} = w_r' Y^{(r)}_{t−1} + ε_{t;r}.

Thus, w_r and μ_r = E ε_{t;r} can be estimated by the following least squares approach:

(ŵ_r, μ̂_r) = argmin_{w, μ} ∑_{t=p+1}^n (X_{t;r} − w' Y^{(r)}_{t−1} − μ)^2.

For component r, this leads to the linear system (10) given by the corresponding normal equations. We show the consistency of the least squares estimators under general assumptions. Later on, we specify a dependence concept for the underlying network which ensures these general assumptions, see Lemma 3.2.
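The component-wise least squares step amounts to an ordinary regression with an intercept, which can be sketched as follows (a Python sketch; `fit_component` is a hypothetical helper name, and the rows of `Y_lagged` stand for the regressors Y^{(r)}_{t−1}):

```python
import numpy as np

def fit_component(Y_lagged, x_target):
    """Least squares for one component r:
    x_{t;r} = w_r' Y^{(r)}_{t-1} + mu_r + noise.
    Y_lagged: (n_eff, m) matrix whose rows are the regressors Y^{(r)}_{t-1};
    x_target: (n_eff,) vector of the observations X_{t;r}."""
    Z = np.hstack([Y_lagged, np.ones((Y_lagged.shape[0], 1))])  # intercept column
    coef, *_ = np.linalg.lstsq(Z, x_target, rcond=None)
    return coef[:-1], coef[-1]  # estimates of (w_r, mu_r)
```

On noiseless data this recovers the coefficients exactly; with noise, it solves the normal equations of the criterion above.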
Assumption 3 collects, for all r = 1, . . . , d and as n → ∞, convergence conditions on the sample moments entering the linear system (10). The results of Theorem 3.1 can be used to forecast the process X. If Ad_n is observed, let

Y^{(r)}_n = I_{dp;I(r)} ((e_r' G_1(Ad_n)) ⊙ X_n', . . . , (e_r' G_p(Ad_{n−p+1})) ⊙ X_{n−p+1}')',

and a one-step ahead forecast of X_{n+1} is given by X̂_{n+1;r} = ŵ_r' Y^{(r)}_n + μ̂_r, r = 1, . . . , d. If Ad_n is not observed, Ad_n itself needs to be predicted first. This could be done by fitting a dynamic network model to Ad_1, . . . , Ad_{n−1} and using this model to predict Ad_n. An h-step ahead forecast can be computed recursively, meaning a one-step ahead forecast is performed based on the observations and the results of the h−1, . . . , 1-step ahead forecasts.
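For the NAR(1) case, the one-step and recursive h-step forecasts just described can be sketched as follows (a Python sketch; the helper names are hypothetical, and the recursion simply feeds each forecast back in):

```python
import numpy as np

def forecast_one_step(A, G, Ad_n, X_n, mu):
    """One-step-ahead NAR(1) forecast: Xhat_{n+1} = (A * G(Ad_n)) X_n + mu,
    with * the Hadamard product, A the estimated coefficients and mu = E eps_t."""
    return (A * G(Ad_n)) @ X_n + mu

def forecast_h_steps(A, G, Ad_seq, X_n, mu):
    """Recursive h-step forecast; Ad_seq holds the (observed or predicted)
    networks Ad_n, ..., Ad_{n+h-1}."""
    x = X_n
    for Ad_t in Ad_seq:
        x = forecast_one_step(A, G, Ad_t, x, mu)
    return x
```

If the networks in `Ad_seq` are not observed, they would first be generated from a fitted dynamic network model, as described above.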
Assumption 2 mainly requires a √n convergence rate of the first and second sample moments of X. An absolutely summable autocovariance function of X is sufficient for the convergence of the first sample moments. As pointed out in Lemma 2.2 and Lemma 2.3, the autocovariance of X depends on the dependency structure of Ad. For simplicity, consider the network moving average process X_t = ε_t + Ad_{t−1} ε_{t−1}, whose autocovariance function contains the term Cov(Ad_h μ, Ad_0 μ). Hence, even for this simple moving average process, a summable autocovariance function can be obtained only if Ad possesses some sort of short-range dependence. Several general dependency concepts exist which could describe a short-range dependency structure, such as mixing [3], some weak dependence concepts [7], or physical dependence [33,35,37,17]. Since the concept of physical dependence also works well in the high-dimensional case, see among others [40,41], this concept is used to quantify the dependence structure of Ad.
To elaborate, let {ξ_t} be a sequence of i.i.d. random vectors of dimension d̃ such that {Ξ_t = (ε_t', ξ_t')'} is also an i.i.d. sequence. Furthermore, let Ad_t = H(Ξ_t, Ξ_{t−1}, . . . ), where H is some measurable function mapping to [−1, 1]^{d×d}. Denote by Ξ_0^* an i.i.d. copy of Ξ_0, let Ad_j^{(0)} = H(Ξ_j, . . . , Ξ_1, Ξ_0^*, Ξ_{−1}, . . . ) be the process with the innovation at time 0 replaced by this copy, and define for some q ≥ 1 the physical dependence coefficients δ_q(Ad, j) = max_{r,s} ‖Ad_{j;rs} − Ad^{(0)}_{j;rs}‖_{E,q} and Δ_q(Ad) = ∑_{j=0}^∞ δ_q(Ad, j). The process Ad is denoted as q-stable if Δ_q(Ad) < ∞. This property is preserved under some nonlinear transformations, e.g. polynomial transformations, see [33,34]. Without assuming any sparsity, an upper bound is given by C ≤ d. This dependency concept covers a wide range of processes, among them many nonlinear time series, see [33,35,37] for examples. Furthermore, this concept includes nonlinear Markov chains, meaning Ad can be given by Ad_t = H(Ad_{t−1}, Ξ_t). Zhang et al. [41] pointed out that a stable process is obtained if H possesses some form of Lipschitz continuity. Then, δ_q(Ad, j) = O(ρ^j) for some ρ ∈ (0, 1), see Example 2.4 in [41] or Example 2.1 in [5] for details. A stable vector autoregressive process also possesses such geometrically decaying physical dependence coefficients, see among others Example 2.2 in [5]. Note that many dynamic network models, e.g. Temporal ERGMs [9], are Markovian.
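For a Markov-type network Ad_t = H(Ad_{t−1}, Ξ_t), the coefficients δ_q(Ad, j) can be approximated by Monte Carlo with the coupled-copy construction (a Python sketch under the entry-wise-maximum reading of δ_q; the contracting update H below is a made-up example, chosen so that the decay is geometric):

```python
import numpy as np

def delta_q(H, j, q=2, reps=2000, seed=0):
    """Monte Carlo estimate of delta_q(Ad, j): run the chain once with input
    Xi_0 and once with an independent copy Xi_0', sharing all later inputs,
    and compare the adjacency matrices at time j."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(reps):
        xi = rng.normal(size=j + 1)
        xi0_copy = rng.normal()
        A = A_star = np.zeros((2, 2))                 # common starting state
        A, A_star = H(A, xi[0]), H(A_star, xi0_copy)  # time 0: coupled input
        for t in range(1, j + 1):                     # times 1, ..., j: shared inputs
            A, A_star = H(A, xi[t]), H(A_star, xi[t])
        diffs.append(np.abs(A - A_star) ** q)
    return float(np.max(np.mean(diffs, axis=0)) ** (1.0 / q))

# A Lipschitz (contracting) update: differences shrink by 0.5 per step,
# so delta_q(Ad, j) decays geometrically in j.
H = lambda A, x: 0.5 * A + 0.25 * np.tanh(x) * np.ones((2, 2))
```

For this H, the time-0 perturbation is multiplied by 0.5 at every later step, so the estimated coefficients fall off roughly like 0.5^j.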
Lemma 3.2. If Assumption 1a) holds, Ad_G = {G(Ad_t)} is 2q-stable, and max_r ‖ε_{0;r}‖_{E,2q} < ∞ for some q ≥ 1, then X is q-stable. If the above conditions hold for q ≥ 4, then Assumptions 2 and 3 hold.

Network autoregressive models for large dimension
The number of parameters in model (3) is of size O(pd^2). If the underlying network is not very sparse, a reasonable estimate can be obtained only if d ≪ n. Hence, in order to handle high-dimensional cases, meaning d is of the same order as n or even larger, we follow Knight et al. [11] and simplify model (3). For each component, the influence of its own lagged values is modeled separately; thus, we set e_s' G_j(·) e_s = 0 for all s = 1, . . . , d, j = 1, . . . , p. Then, the simplified model is given by

X_{t;r} = ∑_{j=1}^p (α_{j,r} X_{t−j;r} + β_{j,r} e_r' G_j(Ad_{t−j}) X_{t−j}) + ε_{t;r},    (11)

where α_{j,r}, β_{j,r} ∈ R, j = 1, . . . , p, r = 1, . . . , d, and E ε_{t;r} = μ_r. Hence, this simplified model possesses in total only d(2p + 1) parameters, or more precisely only 2p + 1 parameters for each component of the time series, independently of the dimension. The parameter α quantifies the linear influence of the same component and β the linear influence of the other components. Note that model (11) can be written as

X_t = ∑_{j=1}^p (diag(α_{j,1}, . . . , α_{j,d}) + diag(β_{j,1}, . . . , β_{j,d}) G_j(Ad_{t−j})) X_{t−j} + ε_t    (12)

and consequently fits into the framework (3). We denote the process as Large Network AutoRegression (LNAR) and the coefficient matrices occurring in (12) by A_{j,α,β}, j = 1, . . . , p. Since an LNAR is an NAR process, a stationary solution is given by Lemma 2.2 if det(I − ∑_{j=1}^p |A_{j,α,β}| z^j) ≠ 0 for all |z| ≤ 1 or ρ(Ã_G(·)) < 1. If no restrictions on the underlying network are imposed, the first condition implies that, in order to obtain a stationary solution, the parameter space depends on d. This is not the case we would like to consider here, which is why conditions on the underlying network are imposed. We require that ‖G_j(·)‖_∞ ≤ 1, j = 1, . . . , p, which means that the sum of weights of the edges going into a vertex does not grow with the dimension d. To simplify notation, we bound the sum of weights by 1. Knight et al. [11] require a similar condition in the case of a static network. Under this condition, we obtain a stationary solution if max_{r=1,...,d} ∑_{j=1}^p (|α_{j,r}| + |β_{j,r}|) ≤ C_λ < 1, see the following Lemma 3.3.
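A single LNAR step and the stationarity condition just stated can be sketched as follows (a Python sketch with p = 1 for brevity; the helper names are hypothetical, and G_Ad stands for a matrix G(Ad_{t−1}) with zero diagonal and ‖·‖_∞ ≤ 1):

```python
import numpy as np

def lnar1_step(alpha, beta, G_Ad, X_prev, eps_t):
    """One step of the simplified model (11) with p = 1:
    X_{t;r} = alpha_r X_{t-1;r} + beta_r e_r' G(Ad_{t-1}) X_{t-1} + eps_{t;r}."""
    return alpha * X_prev + beta * (G_Ad @ X_prev) + eps_t

def satisfies_stationarity(alpha, beta, C_lambda=1.0):
    """Sufficient condition (given ||G_j||_inf <= 1): max_r |alpha_r| + |beta_r| < C_lambda."""
    return bool(np.max(np.abs(alpha) + np.abs(beta)) < C_lambda)
```

Note that only the 2d parameter values enter `lnar1_step`; the dimension d affects the computation only through the matrix-vector product with G_Ad.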
Note that under the same conditions, Knight et al. [11] obtain a stationary solution in the case of a static network.

Lemma 3.3. If ‖G_j(·)‖_∞ ≤ 1 for j = 1, . . . , p, and max_{r=1,...,d} ∑_{j=1}^p (|α_{j,r}| + |β_{j,r}|) ≤ C_λ < 1, then (11) fulfills Assumption 1b) and possesses a stationary solution.

For component r = 1, . . . , d, let w_r = (α_{1,r}, β_{1,r}, . . . , α_{p,r}, β_{p,r})' ∈ R^{2p} and Y^{(r)}_{t−1} = (X_{t−1;r}, e_r' G_1(Ad_{t−1}) X_{t−1}, . . . , X_{t−p;r}, e_r' G_p(Ad_{t−p}) X_{t−p})'. Then, (11) can be written as X_{t;r} = w_r' Y^{(r)}_{t−1} + ε_{t;r}. This is the same framework as in Section 3.1, and the linear system (10) gives a least squares estimate. To cover a high-dimensional setting, we study the theoretical properties of this estimator in a nonasymptotic framework, as is done in the high-dimensional vector autoregressive case, see among others [1]. We make use of the Nagaev inequality for dependent variables, see Theorem 2 in [18], to formulate nonasymptotic error bounds. Again, the physical dependence concept is used to quantify the dependency structure of Ad, see the following Assumption 4.
The constants appearing here do not depend on the dimension d.
Note that ‖G(·)‖_∞ ≤ 1 holds by construction. If Ad_{G_1} possesses geometrically decaying physical dependence coefficients, then ∑_{j=0}^∞ δ_q(Ad_{G_1}, j) < ∞. This assumption implies that X as well as {Y^{(r)}_t, t ∈ Z} are q-stable and that their physical dependence quantity ∑_{j=0}^∞ δ_q(·, j) can be bounded independently of the dimension d, see Lemma 3.4 for details.
Lemma 3.4. If Assumption 4 holds and max_i ‖ε_{0;i}‖_{E,2q} < ∞, then X generated by model (11) is q-stable, and ∑_{j=0}^∞ δ_q(X, j) can be bounded independently of d.

With these results, we can formulate the nonasymptotic error bounds. In order to handle a high-dimensional setting, an important goal is an error bound which grows only moderately with d, the dimension of the process. Note that, in contrast to the estimation of a high-dimensional VAR system, e.g. [13], the dimension of the parameter vector does not depend on d. This enables us to obtain an error bound which does not depend on d at all, see the following Theorem 3.5 for details.

Theorem 3.5. Consider model (11) and the estimators given by the linear system (10). Then, for some y ∈ R, with probability at least (1 − c_q (n − p)^{1−q} y^{−q} − (c_q + 2) exp(−c̃_q (n − p) y^2))^4 =: C_q(n, y)^4, where c_q, c̃_q are constants depending on q only, error bounds of order O(y) hold for the estimation errors ŵ_r − w_r and μ̂_r − μ_r.

For y → 0 with √n y → ∞, the error bound tends to zero while the probability still tends to 1 with increasing n. This rate is independent of the dimension d. This enables us to use the LNAR model for forecasting in a high-dimensional framework. The forecasting procedure is analogous to the one for the NAR approach, see the end of the previous subsection.

Numerical Examples
In this section, the forecasting performance of the models presented in Section 2 is investigated in finite samples. For a low-dimensional example and a high-dimensional example, we forecast X_{n+1}, . . . , X_{n+h} based on observations X_1, . . . , X_n and Ad_1, . . . , Ad_{n−1}, where h = 4. The performance is measured by computing the mean squared error (MSE) averaged over all components via a Monte Carlo simulation using B = 1000 repetitions, meaning

MSE(h) = 1/(Bd) ∑_{i=1}^B ∑_{j=1}^d (X̂_{n+h,i;j} − X_{n+h,i;j})^2,

where X_{n,i;j} denotes the jth component of the nth observation of the ith Monte Carlo sample. In the following, we denote the approach using model (3) as NAR and the approach using model (11) as LNAR. As a benchmark, we use a vector autoregressive model given by X_t = ∑_{j=1}^p A_j X_{t−j} + ε_t, where A_1, . . . , A_p ∈ R^{d×d}. This approach is denoted by VAR. The three models considered have a tuning parameter p which specifies the lag order. For all three models, the Bayesian Information Criterion (BIC) is used to automatically choose the lag order p, see among others Section 5.5 in [4].
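The averaged MSE criterion can be computed as follows (a short Python sketch; the array shapes are assumptions about how the Monte Carlo results are stored):

```python
import numpy as np

def mse(X_hat, X_true):
    """Forecast MSE at one horizon, averaged over all d components and
    all B Monte Carlo repetitions; both inputs have shape (B, d)."""
    B, d = X_true.shape
    return float(np.sum((X_hat - X_true) ** 2) / (B * d))
```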
The approaches NAR and LNAR make use of the underlying network structure. That means, in order to compute X̂_{n+h}, the approaches NAR and LNAR require an observation or at least an estimate of the underlying network structure. Both cases are considered here: in the first case, we forecast Ad_n, . . . , Ad_{n+h−1} based on the observations Ad_1, . . . , Ad_{n−1}, and in the second case, we assume that Ad_n, . . . , Ad_{n+h−1} is observed. In order to distinguish between these two cases, we denote the forecast of approach NAR based on an estimated network by NAR(Âd) and the forecast based on a known network structure by NAR(Ad). An analogous notation is used for LNAR.
In the first example, a network with 4 vertices is considered. The adjacency matrix process Ad is a Markovian process, and the edges are independent from each other. The process X is an NAR(1) process. A realization of the network, the time series, and the sample autocovariance function are displayed in Figure 3. The edges (3, 1) and (1, 3) have a coefficient of 0; hence, whether they are present or not, they do not influence the time series X.
The model structure of Ad is used to compute a forecast of Ad. Thus, for each component of Ad, a discrete Markov chain is fitted to Ad_1, . . . , Ad_{n−1}, and this Markov chain is then used to obtain a forecast for Ad_n, . . . , Ad_{n+h−1}. For this, the R-package markovchain was used.
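The same per-edge Markov chain fit (the role played here by the R-package markovchain) can be sketched in Python; the helper names are hypothetical:

```python
import numpy as np

def fit_edge_chain(edge_series):
    """Estimate the 2x2 transition matrix of a {0,1}-valued edge process
    by counting observed transitions (maximum likelihood)."""
    P = np.zeros((2, 2))
    for a, b in zip(edge_series[:-1], edge_series[1:]):
        P[a, b] += 1.0
    rows = P.sum(axis=1, keepdims=True)
    return P / np.where(rows == 0, 1.0, rows)

def edge_forecast(P, last_state, h):
    """Probability that the edge is present h steps ahead, given its last state."""
    start = np.zeros(2)
    start[last_state] = 1.0
    return float((start @ np.linalg.matrix_power(P, h))[1])
```

Fitting one such chain per entry of Ad and thresholding (or sampling) the forecast probabilities yields predicted adjacency matrices Âd_n, . . . , Âd_{n+h−1}.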
The mean squared errors for the forecast horizons h = 1, …, 4 are displayed in Table 1. Note that an optimal one-step-ahead forecast for this process would possess a forecast error of 1. This is nearly achieved by NAR with a known network structure and moderate sample size. If the underlying network structure is unknown, NAR outperforms the other approaches for forecast horizons up to h = 3. For forecast horizons further ahead, VAR performs slightly better. This drop in performance for horizons further ahead is mainly caused by the estimation of the underlying network structure.

In the second example, a Separable Temporal Exponential Random Graph Model (STERGM) is considered, see Krivitsky and Handcock [15] and also [16] for the R package tergm. The network is generated using simulate.stergm of the R package tergm with dissolution coefficient 4, formation coefficient − log (d/5 − 1)(1 + exp(4)) − 1, and a mean density of 5/d. Networks of sizes d = 10, 33, 100, and 500 are considered, and for each network size the sample sizes n = 100, 200, and 500. For d = 500, such a network has about 4000 edge changes from t = 1 to t = 100. Let Ad_t be the adjacency matrix of such a network at time t and let B_t = diag((1/(1'Ad_t e_i))_{i=1,…,d}) be a diagonal matrix, where 1/0 is defined as 0. This defines the function G(Ad_t) = Ad_t B_t, and let AdG_t = G(Ad_t) = Ad_t B_t. This means that AdG_t apportions the weight 1 equally among the in-going edges of vertex i at time t, and we have ‖AdG_t‖_∞ = max_i Σ_{s=1}^{d} Ad_{t;si}/(Σ_{s=1}^{d} Ad_{t;si}) ≤ 1. The process X is then given by an LNAR(1) model with ε_t ∼ N(μ, 5Σ_ε), where μ = (1(−1), 2(−1)², 3(−1)³, …, d(−1)^d)' and Σ_ε is a banded matrix with ones on the diagonal and 0.25(−1)^{j+1}, j = 1, …, d − 1, on the first off-diagonal. The function G is considered as known, meaning AdG_1, …, AdG_{n−1} is observed.
Two approaches are used to obtain a forecast for Ad_n, …, Ad_{n+h−1}. The first approach fits a STERGM model to Ad_1, …, Ad_{n−1} and then generates Âd_n, …, Âd_{n+h−1} from the fitted model. Since a standard VAR model cannot be applied well in a high-dimensional setting, the VAR estimation is modified by adding sparsity constraints, meaning a coefficient A_{j,rc} is set to zero for j = 1, …, p if Σ_{t=1}^{n} e_r'AdG_t e_c = 0. This is motivated by the fact that in model (18), if E(e_r'AdG_t e_c) = 0, then X_{t−j;c}, j = 1, …, p, do not directly influence X_{t;r}. Furthermore, forecasts given by the R package BigVAR are included as additional benchmarks, see [26]. There the idea is that the underlying VAR model is sparse but the sparsity structure is unknown; it can be estimated by using a LASSO approach, see among others [1,13]. The forecast obtained by such a model of order p is denoted by BigVAR(p), where p = 1, 2.
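The network-induced sparsity constraint for the VAR benchmark can be sketched as follows: a coefficient position (r, c) is kept only if the corresponding edge appears at least once in the observed networks. This is an illustrative sketch with a hypothetical function name, assuming the networks are given as lists of rows.

```python
def sparsity_mask(AdG_list):
    """Boolean mask for VAR coefficients: mask[r][c] is True iff edge (r, c)
    is nonzero in at least one observed network AdG_1, ..., AdG_n, i.e. iff
    the sum over t of e_r' AdG_t e_c is nonzero. A_{j,rc} is set to zero
    for all lags j wherever the mask is False.
    """
    d = len(AdG_list[0])
    return [[any(AdG[r][c] != 0 for AdG in AdG_list) for c in range(d)]
            for r in range(d)]
```

The least-squares estimation is then carried out only over the unmasked coefficients, which reduces the number of free parameters in the high-dimensional setting.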
The mean squared errors of the forecasts are displayed in Tables 3 and 4. Note that an optimal one-step-ahead forecast would possess a forecast error of 5. With a known network structure at hand, LNAR is nearly able to achieve such optimal results independently of the dimension. If the future network is unknown, the forecasting performance of NAR and LNAR drops considerably. Especially for LNAR, the loss of performance due to an unknown network structure seems to increase with the dimension of the process. Of the two approaches used to forecast the network, the first performs better for NAR as well as LNAR. Given this network forecast, both approaches still outperform all others, namely VAR and BigVAR(p), p = 1, 2. As mentioned, VAR uses the underlying network structure to set sparsity constraints such that the number of parameters can be reduced. In all settings considered, VAR outperforms BigVAR(p), p = 1, 2. This indicates that for this process the network-induced sparsity constraints are more helpful than a free but unknown sparsity structure as given in BigVAR. However, note that the number of parameters estimated for the approaches NAR and VAR depends on the set I_n := {(i, r), i, r = 1, …, d : (1/n) Σ_{t=1}^{n−1} e_r'|Ad_t|e_i > 0}, and we have |I_n| ≤ |I_m| for m ≥ n. This could explain why these two approaches do not gain immediately from an increasing sample size for larger dimensions.

Real Data Example
In this section, we investigate further the example in which the actors are economies, their gross domestic product (GDP) is the attribute of interest, and their trade volume defines the underlying network. To elaborate, we consider the data set of [22]. This data set contains economic data of 33 economies in the time period 1980-2016. The 33 economies cover more than 90% of world GDP, see Table 6 for a list of the included economies. The economies are considered as actors, and the relationship between these actors is given by the IMF (International Monetary Fund) Direction of Trade statistics, see data.imf.org/DOT and also the trade matrix in [22]. For time t, the connection from actor i to actor j, given by e_j'Ad_t e_i, is defined as the sum of exports and imports between actors i and j at time t divided by the sum of all exports and imports of actor j at time t. The data set contains for each economy attributes such as real GDP (log transform), the inflation rate, and short- and long-term interest rates. Note that these attributes are given quarterly whereas the trade relations are only given annually. We assume here that the trade relations do not change within a year and perform the analysis on the quarterly sampling level. The focus here is on the attribute real GDP, and based on the data from 1980Q1-2014Q4 the goal is to forecast the GDP for the period 2015Q1-2016Q4. The indices 1, …, n denote the time period 1980Q1-2014Q4 and n + 1, …, n + 8 the time period 2015Q1-2016Q4. To perform a forecast, we use the models presented here, namely NAR given by (3) and LNAR given by (11), and include a VAR model as a benchmark. It is a solid benchmark, since Marcellino [21] compared VAR model GDP forecasts with various nonlinear alternatives and pointed out that even though a VAR model is a "simple" linear model, it can hardly be beaten if it is carefully specified. Let {Y_t} be the real GDP (log transform) of the 33 economies.
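Building the adjacency matrix from a trade matrix as defined above can be sketched as follows; the input layout (trade[i][j] = exports from economy i to economy j) and the function name are illustrative assumptions, not the format of the data set in [22].

```python
def trade_network(trade):
    """Ad[j][i] = (exports + imports between economies i and j)
                  / (total exports + imports of economy j),
    matching the definition e_j' Ad_t e_i above. A zero-trade economy
    gets zero weights (0/0 := 0)."""
    d = len(trade)
    total = [sum(trade[j][k] + trade[k][j] for k in range(d))
             for j in range(d)]
    return [[(trade[i][j] + trade[j][i]) / total[j] if total[j] else 0.0
             for i in range(d)]
            for j in range(d)]
```

By construction every row of Ad_t sums to 1 (or 0), so the entries can be read as the share of economy j's total trade accounted for by each partner i.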
Unit root tests applied to {Y_t} suggest that real GDP itself may not be stationary. We follow here the economic literature, see among others [21], and model instead the GDP growth rate given by X_t = Y_t − Y_{t−1}. This transformation can be inverted, and we obtain a forecast for {Y_t} by Ŷ_{n+s} = Y_n + Σ_{j=1}^{s} X̂_{n+j}, where X̂_{n+s}, s = 1, …, h, denote forecasts of {X_t}. The three models considered have a tuning parameter p which specifies the lag order; for all three models, the Bayesian Information Criterion (BIC) is used to choose the lag order p automatically. The models NAR and LNAR require a forecast Âd_n, …, Âd_{n+h−1} of the underlying network; a simple approach is used for this. The squared error and the absolute error are used to measure the forecast performance. In Table 5, the sum of all squared and absolute errors over the forecast period n + 1, …, n + h is displayed. Over all forecast horizons and economies, the additional network structure improves the forecast, and NAR performs best. NAR's forecast error is 25% (squared error) and 15% (absolute error) smaller than VAR's. Table 6 breaks the forecast error down by economy, and Table 7 by forecast horizon.

To sum up, the trade network delivers useful information for the GDP forecast. The models presented in this paper are able to benefit from this additional information such that they can outperform the VAR approach. Note that an NAR(p) model possesses the same number of parameters as a VAR(p) model. Thus, the additional information can be used without estimating additional parameters.

Table 7: For a given forecast horizon, the sum of the squared error and absolute error, respectively, over all 33 economies.
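Inverting the growth-rate transformation to recover level forecasts, Ŷ_{n+s} = Y_n + Σ_{j=1}^{s} X̂_{n+j}, can be sketched as a cumulative sum; the function name is a hypothetical choice for illustration.

```python
def invert_growth(y_n, x_forecasts):
    """Given the last observed log GDP y_n and growth-rate forecasts
    X̂_{n+1}, ..., X̂_{n+h}, return the level forecasts
    Ŷ_{n+s} = y_n + (X̂_{n+1} + ... + X̂_{n+s}) for s = 1, ..., h."""
    out, level = [], y_n
    for x in x_forecasts:
        level += x          # accumulate forecast growth onto the last level
        out.append(level)
    return out
```

The forecast errors for {Y_t} reported in Tables 5 to 7 are computed on these reconstructed levels.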

Conclusions
This paper models dynamic attributes of the vertices of a dynamic network. The attributes are modeled such that the underlying network structure can influence the attributes and vice versa. A linear time series framework is adopted, and network linear processes as well as network autoregressive processes are defined. This framework gives flexibility in the sense that the attributes and the underlying network can be modeled separately. The physical dependence framework is used to quantify the dependency structure of the underlying network such that this framework becomes feasible and statistical results can be derived in a low- and high-dimensional setting. These results can be used for forecasting, and, as can be seen in the numerical examples as well as in the real data example, the benefit of using the additional structure can be quite large.
Acknowledgments. The author is grateful to the editor, an associate editor, and one referee for their valuable and insightful comments that led to a considerably improved manuscript. The research of the author was supported by the Research Center (SFB) 884 "Political Economy of Reforms" (Project B6), funded by the German Research Foundation (DFG). Furthermore, the author acknowledges support by the state of Baden-Württemberg through bwHPC.
Let X_t = (Ã G(Ad_{t−1}))X_{t−1} + (e_1 ⊗ I_d)ε_t be the stacked NAR(1) process, where (6) takes the form X_t = Σ_{j=0}^{∞} Π_{s=1}^{j} (Ã G(Ad_{t−s}))(e_1 ⊗ I_d)ε_{t−j}. This is obviously a solution of the recursion equation. Given (6), the representation of the autocovariance function follows directly by taking into account that {ε_t} is an i.i.d. sequence and that {ε_s, s > t} and {Ad_s, s ≤ t} are independent for all t.
Proof of Lemma 2.3. Conditions (ii) and (iii) give the existence of the L²-limit of X_t, so that it can be written as X_t = Σ_{j=0}^{∞} B_{t,j} ε_{t−j}. We have B_{t,j} = f_j(Ad_{t−1}, …, Ad_{t−j}), and {ε_t, t ∈ Z} is i.i.d. and independent of the stationary process Ad. Thus, {ε_t, t ∈ Z} and (vec(B_{t,j}, j ∈ N))_{t∈Z} are independent. We have μ_x = Σ_{j=0}^{∞} E(B_{0,j})μ, and the representation of the autocovariance function follows.

Proof of Theorem 3.1. First note that, for n large enough, (10) holds. The √n-consistency of the estimators then follows by Assumption 2. Since √n/(n − p) Σ_{t=p+1}^{n} (Y^{(r)}_{t−1})'(w_r − ŵ_r) = μ'_{Y^{(r)}}(w_r − ŵ_r) + o_P(1), we have the second assertion by Assumption 3.
Proof of Lemma 3.2. To simplify notation, let AdG_t = G(Ad_t). Since Assumption 1a) gives a causal representation, see Lemma 2.2, we have for component r that X_{j;r} = Σ_{s=0}^{∞} e_r'B_{j,s}ε_{j−s} = H_r(Ξ_j, Ξ_{j−1}, …) for some measurable function H_r. Note that B_{j,0} ≡ I_p and ‖B_{j,s}‖ ≤ |Ã|_2^s for all j. Denote by * a coupled version with Ξ_0 being replaced by an i.i.d. copy Ξ'_0. Then, δ_q({X_{t;r}, t ∈ Z}, j) = ‖X_{j;r} − H_r(Ξ_j, Ξ_{j−1}, …, Ξ_1, Ξ'_0, Ξ_{−1}, Ξ_{−2}, …)‖_{E,q} = ‖X_{j;r} − X*_{j;r}‖_{E,q}. This is bounded by the triangle inequality and the Cauchy-Schwarz inequality for j ≥ 1. Let D_k = AdG_{j−k} − AdG*_{j−k}. Since Ad^G is 2q-stable, we have that max_{i,s} ‖e_i'D_k e_s‖_{E,2q} ≤ δ_{2q}(Ad^G, j − k). Note that Ad^G is a causal process and AdG_t − AdG*_t = 0 for all t < 0. Since the components are absolutely summable, see Lemma 2.2, we have C < ∞. Hence, the assertion that X is q-stable follows. Let k̃ = k + (s − 1)d, k = 1, …, d, s = 1, …, p. Since Ad^G is stationary and 2q-stable, we have 1((1/n) Σ_{t=p+1}^{n} e_r'|G_s(Ad_{t−s})|e_k > 0) = 1(E(e_k'|G_s(Ad_{−s})|e_r) > 0) + o_P(1), where 1(·) is the indicator function, meaning 1(x) = 1 if x is true and zero otherwise. This implies, as n → ∞, |I^{(r)}| →_P |I^{(r)}_E| < ∞. Consequently, for n large enough, we have I^{(r)}_E = I^{(r)}. Suppose in the following that this holds.
Note that for some vector v ∈ R^{2p} with ‖v‖_1 = 1, the physical dependence measure of {v'X_t} can be bounded in terms of δ_q({X_{·;s}}, j).