Generalized Yule-Walker Estimation for Spatio-Temporal Models with Unknown Diagonal Coefficients

We consider a class of spatio-temporal models which extend popular econometric spatial autoregressive panel data models by allowing the scalar coefficients for each location (or panel) different from each other. To overcome the innate endogeneity, we propose a generalized Yule-Walker estimation method which applies the least squares estimation to a Yule-Walker equation. The asymptotic theory is developed under the setting that both the sample size and the number of locations (or panels) tend to infinity under a general setting for stationary and alpha-mixing processes, which includes spatial autoregressive panel data models driven by i.i.d. innovations as special cases. The proposed methods are illustrated using both simulated and real data.


Introduction
The class of spatial autoregressive (SAR) models is introduced to model cross sectional dependence of different economic individuals at different locations (Cliff and Ord, 1973). More recent developments extend SAR models to spatial dynamic panel data (SDPD) models, i.e. adding time lagged terms to account for serial correlations across different locations. See, e.g. Lee and Yu (2010). Baltagi et al. (2003) consider a static spatial panel model where the error term is a SAR model. Lin and Lee (2010) show that in the presence of heteroskedastic disturbances, the maximum likelihood estimator for the SAR models without taking into account the heteroskedasticity is generally inconsistent and proposes an alternative GMM estimation method. Computationally the GMM methods are more efficient than the QML estimation (Lee, 2001). Lee and Yu (2010) classify SDPD models into three categories: stable, spatial cointegration and explosive cases. As pointed out by Bai and Shi (2011), the cases with a large number of cross sectional units and a long history are rare. Hence it is pertinent to consider the setting with short time spans in order to include as many * Correspondence to: Department of Statistics, London School of Economics, Houghton Street, London, WC2A 2AE, United Kingdom.
locations as possible. Both estimation method and asymptotic analysis need to be adapted under this new setting. Yu et al. (2008Yu et al. ( , 2012 investigate the asymptotic properties when both the number of locations and the length of time series tend to infinity for both the stable case and spatial cointegration case, and show that QMLE is consistent. Motivated by the evidence in some practical examples, we extend the model in Yu et al. (2008Yu et al. ( , 2012 by allowing the scalar coefficients for each location (or panel) different from each other. This increase in model capacity comes with the cost of estimating substantially more parameters. In fact that the number of the parameters in this new setting is in the order of the number of locations. The model considered in this paper has four additive components: a pure spatial effect, a pure dynamic effect, a time-lagged spatial effect and a white noise. Due to the innate endogeneity, the conventional regression estimation methods such as the least squares method directly based on the model lead to inconsistent estimators. To overcome the difficulties caused by the endogeneity, we propose a generalized Yule-Walker type estimator for estimating the parameters in the model, which applies the least squares estimation to a Yule-Walker equation. The asymptotic normality of the proposed estimators is established under the setting that both the sample size n and the number of locations (or panels) p tend to infinity. Therefore the number of parameters to be estimated also diverges to infinity, which is a marked difference from, e.g., Yu et al. (2012). We develop the asymptotic properties under a general setting for stationary and αmixing processes, which includes the spatial autoregressive panel data models driven by i.i.d. innovations as special cases.
The rest of the paper is organized as follows. Section 2 introduces the new model, its motivation and the generalized Yule-Walker estimation method. The asymptotic theory for the proposed estimation method is presented in Section 3. Simulation results and real data analysis are reported, respectively, in Sections 4 and 5. All the technical proofs are relegated to an Appendix.

Models
The model considered in this paper is of the following form: where y t = (y 1,t , . . . , y p,t ) T represents the observations from p locations at time t, D(λ k ) = diag(λ k1 , . . . , λ kp ) and λ kj is the unknown coefficient parameter for the jth location, and W is the p×p spatial weight matrix which measures the dependence among different locations. All the main diagonal elements of W are zero. It is a common practice in spatial econometrics to assume W known.
For example, we may let w ij = 1/(1+d ij ), for i ̸ = j, where d ij ≥ 0 is an appropriate distance between the ith and the jth location. It can simply be the geographical distance between the two locations or the distance reflecting the correlation or association between the variables at the two locations. In the above model, D(λ 0 ) captures the pure spatial effect, D(λ 1 ) captures the pure dynamic effect, and D(λ 2 ) captures the time-lagged spatial effect. We also assume that the error term ε t = (ε 1,t , ε 2,t , . . . , ε p,t ) T in (1) satisfies the condition Cov (y t−1 , ε t ) = 0. When λ k1 = · · · = λ kp for k = 0, 1, 2, (1) reduces to the model of Yu et al. (2008), in which there are only 3 unknown regressive coefficient parameters. In general the regression function in (1) contains 3p unknown parameters.
The extension to use different scalar coefficients for different locations is motivated by practical needs. For example, we analyze the monthly change rates of the consumer price index (CPI) for the EU member states over the years 2003-2010. The detailed analysis for this data set will be presented in Section 5. Fig. 1 presents the scatter-plots of the observed data y i,t versus the spatial regressor w T i y t and y i,t−1 , for some of the EU member states, where w T i is the ith row vector of the weight matrix W which is taken as the sample correlation matrix with all the elements on the main diagonal set to be 0. The superimposed straight lines are the simple regression lines estimated using the newly proposed method in Section 2.2. It is clear from Fig. 1 that at least Greece and Belgium should have a different slope from those of France or Iceland.

Generalized Yule-Walker estimation
As y t occurs on both sides of (1), Wy t and ε t are correlated with each other. Applying least squares method directly based on regressing y t on (Wy t , y t−1 , Wy t−1 ) leads to inconsistent estimators. On the other hand, applying the maximum likelihood estimation requires to profile a p × p nuisance parameter matrix ε = Var(ε t ), which leads to a complex nonlinear optimization problem. Furthermore when p is large in relation to n, the numerical stability is of concern. We propose below a new estimation method which applies the least squares method to each individual row of a Yule-Walker equation. To this end, let k = Cov(y t+k , y t ) for any k ≥ 0.
Note that we always assume that y t is stationary, see condition A2 and Remark 1 in Section 3. Then the Yule-Walker equation below follows from (1) directly.
where w i is the ith row vector of W, and e i is the unit vector with the ith element equal to 1. Note that (2) is a system of p linear equations with three unknown parameters λ 0i , λ 1i and λ 2i . Since Ey t = 0, we replace 1 and 0 by the sample (auto)covariance We estimate (λ 0i , λ 1i , λ 2i ) T by the least squares method, i.e. to solve the minimization problem The resulting estimators are called generalized Yule-Walker estimators which admit the explicit expression: where More explicitly, Then it holds that for i = 1, . . . , p,

A root-n consistent estimator for large p
When p/ √ n → ∞, the estimator (3) admits non-standard convergence rates (i.e. the rates different from √ n); see Theorems 2 and 4 in Section 3. Note that there are p equations with only 3 parameters in (2). Hence (3) can be viewed as a GMME for an overdetermined scenario. The estimation may suffer when the number of estimation equations increases. See, for example, a similar result in Theorem 1 of Chang et al. (in press, 2014b). A further compounding factor is that the estimation for the covariance matrices 0 , 1 using their sample counterparts leads to non-negligible errors even when n → ∞. Below we propose an alternative estimator which restricts the number of the estimation equations to be used in order to restore the √ n-consistency and the asymptotic normality.
Then ρ (i) k may be viewed as a measure for the correlation between y k,t−1 and (w T i y t , y i,t−1 , w T i y t−1 ) T . When ρ (i) k is small, say, close to 0, the kth equation in (2) carries little information on (λ 0i , λ 1i , λ 2i ). Therefore as far as the estimation for (λ 0i , λ 1i , λ 2i ) is concerned, we only keep the kth equation in (2) for large ρ (i) k . Let z i t−1 be the d i × 1 vector consisting of those y k,t−1 corresponding to the d i largest  ρ (i) k is defined as in (4) but with ( 1 , 0 ) replaced by (  1 ,  0 ). The new estimator is defined as where and Theorem 3 in Section 3 shows the asymptotic normality of the above estimator provided that the number of estimation equations

Theoretical properties
We introduce some notations first. For a p × 1 vector where F j i denotes the σ -algebra generated by {y t , i ≤ t ≤ j}.
Condition A2(c) limits the dependence across different spatial locations. It is implied by, for example, the conditions imposed in Yu et al. (2008). Lemma 1 in the Appendix shows that Condition A2 holds with γ = 4 under conditions A1 and B1-B3. Note that conditions B1-B3 are often directly imposed in the spatial econometrics literature including, for example, Lee and Yu (2010), and Yu et al. (2008).
The density function of ε i,t exists. B2. The row and column sums of |W| and   S −1 (λ 0 )   are bounded uniformly in p. B3. The row and column sums of Now we are ready to present the asymptotic properties for . . , p, with fixed p and n → ∞ first, and then p → ∞ and n → ∞.

Asymptotics for fixed p
Theorem 1. Let conditions A1-A3 hold and p ≥ 1 be fixed. Then as n → ∞, it holds that where V i and U i are given in (8) and (9).

Asymptotics with diverging p
When p diverges together with n, U i , V i in (9) and (8) Theorem 2 indicates that the standard root-n convergence rate prevails as long as p = o( √ n). However the convergence rate may be slower when p is of higher orders than √ n. Theorem 2 presents the convergence rates for the L 2 norm of the estimation errors. The rates also hold for the L 1 norm of the errors as well. Corollary 1 consider the estimation errors over p locations together, for which we have established the result for L 1 norm only.
Corollary 1. Let condition A1 hold, and conditions A2 and A3 hold for all i = 1, . . . , p. Then as n → ∞ and p → ∞, it holds that To derive the asymptotic properties of the estimators defined in (5), we introduce some new notation.
and the equation given in Box I. Theorem 3 indicates that the estimators defined in (5) are asymptotically normal with the standard √ n-rate as long as Note that it does not impose any conditions directly on the size of p.
The key assumption of Theorem 2 is A2(c), which decides the fact that the effect of the dimensionality p only comes from E 1 in Eq. (13) in the Appendix. We can relax this assumption by allowing E 2 to be affected by p as well. Under the new relaxed assumption, we may obtain a better convergent rate of estimator (3) by making use of the fact that (3) is invariant if we divide both the numerator and denominator by the same number, for example, a number relating to p. This will be presented in Theorem 4. We propose the new relaxed assumption: and the diagonal elements of V i defined in (8) is in the order of s 2 (p), where s 0 (p), s 1 (p) and s 2 (p) are numbers relating to p.
Denote C as a constant. When the number of nonzero elements (or elements bounded away from zero) in w i increases with p but Theorem 4. Let conditions A1, A2(a, b), A3 and A5 hold. As n → ∞, , which corresponds with Theorem 2. Theorem 4 indicates that under different situations of s 0 (p), s 1 (p) and s 2 (p), we may obtain different convergence rates. These observations are illustrated by simulation examples in Section 4.

Simulation study
To examine the finite sample performance of the proposed estimation methods, we conduct some simulation under different scenarios.

Scenario 1
λ 0i , λ 1i and λ 2i are generated from U(−0.6, 0.6). The spatial weight matrix W used is a block diagonal matrix formed by a √ p × √ p row-normalized matrix W * . We construct W * such that the first four sub-diagonal elements are all 1 and the rest elements are all 0 before normalizing. This kind of W corresponds to the pooling of √ p separate districts with similar neighboring structures in each district, see Lee and Yu (2013). The error ε i,t are independently generated from N(0, σ 2 i ), where we generate each σ i from U(0.5, 1.5).
For all scenarios, we generate data from (2.1) with different settings for n and p. We apply the proposed estimation method (2.3) and (2.5) (with d i = min(p, n 10/21 )) and report the mean absolute errors: We replicate each setting 500 times. Fig. 2 depicts two boxplots of MAE with p equal to, respectively, 25 and 100. As the sample size n increases from 100, 250, 500, 750 to 1000, MAE decreases for both methods. This is due to the fact that  Z T i  Z i is nearly singular for large p. Adding a ridge in the estimator certainly mitigates the deterioration when p increases; see the panel on the right in Fig. 3.

Scenario 2
λ 0i , λ 1i and λ 2i are generated from U(−0.6, 0.6). The spatial weight matrix W is constructed as follows. First, we construct a √ p× √ p row-normalized matrix W * , where W * is chosen such that the first two sub-diagonal elements are all 1 and the rest elements are all 0 before normalizing. Then we treat W as a √ p × √ p block matrix and put W * into the main diagonal, 2nd, 4th, 6th and etc. sub-diagonal block positions. This kind of W corresponds to the pooling of √ p districts (each district has √ p locations) which the evenly numbered districts are connected and the oddly numbered districts are connected but evenly numbered districts and oddly number districts are separated. Each district has similar neighboring structures. As p increases, the number of the locations influencing one specific location increases in the order of √ p. The error ε i,t are independently generated from N(0, σ 2 i ), where we generate each σ i from U(0.5, 1.5). Fig. 4 depicts two boxplots of MAE with p equal to, respectively, 25 and 100. As the sample size n increases from 100, 250, 500, 750 to 1000, MAE decreases for both methods. Fig. 5 depicts three boxplots as Fig. 3. The MAE for (2.3) increases steadily as p increases, which matches the result of Theorem 4 when, for instance, s 1 (p) ≍ √ p, s 0 (p) ≍ p and s 2 (p) ≍ p. The MAE for (2.5) after adding ridge penalty is slowly increasing as well. This might be caused by the fact that, similar to A2(c), quantities in condition A4(a) are also influenced by p since the number of nonzero elements in w i is in the order of √ p.

European consumer price indices
We analyze the monthly change rates of the consumer price index (CPI) for the EU member states, over the years 2003-2010. We use the national harmonized index of consumer prices calculated by Eurostat, the statistical office of the European Union.
For this data set, n = 96 and p = 31. Fig. 6 presents the time series plots of the monthly change rates of CPI for the 31 states. To line up the curves together, each series is centered at its mean value in Fig. 6. There exist clearly synchronizes on the fluctuations across different states, indicating the spatial (i.e. cross-state) correlations among different states. Also noticeable is the varying degrees of the fluctuation over the different states.
Let y t consist of the monthly change rates of CPI for the 31 states. We fit the proposed spatial-temporal model (1) to this data set with the parameters estimated by (3). We take a normalized sample correlation matrix of y t as the spatial weight matrix W = (w ij ), i.e. we let w ij be the absolute value of the sample correlation between the ith and jth states for i ̸ = j, and w ii = 0, and then replace w ij by w ij /  k w kj . Fig. 7 presents the scatter plots of y i,t against, respectively, the 3 regressors in model (1), i.e. w T i y t , y i,t−1 , w T i y t−1 , for four selected states Belgium, Greece, France and Iceland. We superimpose the straight line y =  λ ji x in each of those 3 scatter plots with, respectively, j = 0, 1, 2. It is clear that the estimated slopes are very different for those 4 states. Fig. 8 plots the true monthly change rates of the CPI for those 4 states together with the fitted values Overall  y i,t tracks its truth value reasonably well. Fig. 9 shows the out-of-sample forecasting performance of our model. For the sake of comparison, predictions are made using our model and the proposed generalized Yule-Walker estimator, and using the (constant) SDPD model of Yu et al. (2008) and their Quasi-Maximum Likelihood estimator. In particular, for each location,     we leave out from the sample the last six observations and we compute the (out-of-sample) forecasts with 1, 2, . . . , 6 step ahead forecasting horizon; then, we compute the average prediction error over time (i.e. the mean of the 6 prediction errors). On the left panel of Fig. 9, the two box-plots summarize the average prediction error for the 31 locations obtained with our YW estimator and the QML estimator of Yu et al. (2008), respectively. It is evident that our estimator produces unbiased predictions while the QML estimator appears to be biased. This advantage also reflects on the forecasting average square errors, reported on the right panel of Fig. 9. In conclusion, the SDPD model of Yu et al. (2008) has a satisfying forecasting performance because several locations have similar spatial structure and for those locations a model with constant parameters is sufficient. Anyway, a marginal improvement is observed for our estimator because several locations have quite different structures and our model is able to capture this difference. Finally, it is worthwhile to notice that the variability of the two predictors appears to be the same.
To further vindicate the necessity to use different coefficients for different states, we consider a statistical test for hypothesis H 0 : λ j1 = · · · = λ jp , j = 0, 1, 2 for model (1). Then the residuals resulting from the fitted model under H 0 will be greater than the residuals without H 0 . However if H 0 is true, the difference between the two sets of residuals should not be significant. We apply a bootstrap method to test this significance. Let  λ 0 ,  λ 1 ,  λ 2 be the estimates under hypothesis H 0 .

Define the test statistic
We reject H 0 for large values of U. To assess how large is large, we generate a bootstrap data from . . , n, and  y t consists of the components defined in (12). Now the bootstrap statistic is defined as where (λ * 0 , λ * 1 , λ * 2 ) are the estimated coefficients for the regression model y * t = λ 0 Wy t + λ 1 y t−1 + λ 2 Wy t−1 + ε t , t = 1, . . . , n.
The P-value for testing hypothesis H 0 is defined as P(U * > U|y 1 , . . . , y n ), which is approximated by the relative frequency of the event (U * > U) in a repeated bootstrap sampling with a large number of replications. By repeating bootstrap sampling 1000 times, the estimated P-value is 0, exhibiting strong evidence against the null hypothesis H 0 . Therefore the model with the equal slope parameters across different locations is inadequate for this particular data set.

Modeling mortality rates
Now we analyze the annual Italian male and female mortality rates for different ages (between 0 and 104) in the period of 1950-2009 based on the proposed model (1). The data were downloaded from the Human Mortality Database (see the website http://www.mortality.org/). Let m i,t be the log mortality rate of female or male at age i and in Year t. Those data are plotted in Fig. 10. Two panels on the left plot are the female and male mortality against different age in each year. More precisely the curves {m i,t , i = 1, . . . , 21} for t < 1970 are plotted in red, those for t > 1990 are in blue, those with 1970 ≤ t ≤ 1989 are in gray.
Those curves show clearly that the mortality rate decreases over the years for almost all age groups (except a few outliers at the top end). Two panels in the middle of Fig. 10 plot the log mortality for each age group against time with the following color code: black for ages not great than 10, gray for ages between 11 and 100, and green for ages greater than 100. They indicate that the mortality for all age groups decreases over time, the most significant decreases occur at the young age groups. Furthermore, the fluctuation of the mortality rates for the top age groups reduces significantly over the years, while the mean mortality rates for those groups remain about the same. This can be seen more clearly in the two panels on the right which plot differenced log mortality rates {y i,t , t = 1951, . . . , 2009}, using the same color code, where We fit the differenced log mortality data with model (1) with the parameters estimated by (5) and d i = 20. Note that now p = 104 and n = 59. Let the off-diagonal elements of the spatial weight matrix W be We then replace w ij by w ij /  i w ij . Moreover, we can also fix a threshold τ and set to zero all the elements of matrix W such that |x − w| > τ (for simplicity, we fix τ = 5 in this application, but the results are substantially invariant for different values of τ ). , against year for each age group between 0 and 104 (2 middle panels). Differenced log mortality rates are plotted against year for each age in 2 right panels.

Table 1
Estimated coefficients for a selection of cohorts of different ages. The left column is the estimated pure spatial coefficients  λ 0i ; The middle column is the estimated pure dynamic coefficient  λ 1i ; The right column is the estimated spatial-dynamic coefficients  λ 2i . The results of the estimation are shown in Table 1, for a selection of cohorts of different ages. Fig. 11 shows the fitted series for ages i = 60, 80, 100.

Final remark
We propose in this paper a generalized Yule-Walker estimation method for spatio-temporal models with diagonal coefficients. The setting enlarges the capacity of the popular spatial dynamic panel data models. Both the asymptotic results and numerical illustration show that the proposed estimation method works well, although the number of the estimation equations utilized should be of the order o( √ n).

Appendix. Proofs
We present the proofs for Theorem 2, Corollary 1 and Theorem 4 in this appendix. The proofs for Theorems 1 and 3 are similar and simpler than that of Theorem 2, and they are therefore omitted. We also present a lemma (i.e. Lemma 1) at the end of this appendix, which shows that condition A2 is implied by conditions A1 and B1-B3; see Remark 1. We use C to denote a generic positive constant, which may be different at different places.
Proof of Theorem 2. We first prove (i) of Theorem 2. We only need to prove the assertions (1) and (2) below, as then the required conclusion follows from (1) and (2) immediately. (1) To prove (1), it suffices to show that for any nonzero vector a = (a 1 , a 2 , a 3 ) T , the linear combination For term E 1 and k = 1, 2, . . . , p, by Proposition 2.5 of Fan and Yao Var(e T k y t−1 w T i y t ) where C is independent of p. Then it holds that Similarly, (14), we have Var( is asymptotic normal. Note that it holds that Now we calculate the variance of S n,p . It holds that and it follows from Calculating all the variance and covariance and summing up them, it follows from dominate convergence theorem that To prove the asymptotic normality of S n,p , we employ the smallblock and large-block arguments. We partition the set {1, 2, . . . , n} into 2k n + 1 subsets with large blocks of size l n , small blocks of size s n and the last remaining set of size n − k n (l n + s n ). Put Note that l n / √ n → 0 is important when we derive the Lindeberg condition of the truncated partial sum T L n,p defined in (16).
Then we can partition S n,p in the following way Note that α(n) = o(n − (2+γ /2)2 2(2+γ /2−2) ) and k n s n /n → 0, (l n +s n )/n → 0, by applying Proposition 2.7 of Fan and Yao (2003), it holds that We calculate the variance of T n,p . Similar to (15), it holds that Calculating all the variance and covariance and summing up them, by dominated convergence theorem and k n l n n → 1, it holds that Now it suffices to prove the asymptotic normality of T n,p . We partition T n,p into two parts via truncation. Specifically, we define Similarly, we have ξ (2)L j , ξ (2)R j and ξ (3)L j , ξ (3)R j . Then Similar to computing the Var(T n,p ), it holds that where Ω L is the sum of all the rest variance and covariance except where we denote σ 2 L as the asymptotic variance of T L n,p . Similarly, we have where i = √ −1 now. We bound M n,p as follows Following the same arguments as part 2.7.7 of Fan and Yao (2003), for any ϵ > 0, it holds that M n,p < ϵ as n, p → ∞. Hence which leads to the fact that To prove (2), let us look at the (1, 1)th element of  X T i  X i . We have Using the same arguments as (14), the first term is O p ( p n ) and the second term is O p ( 1 √ n ). Hence given p = o(n), it holds that Applying the same arguments to the other elements of  X T i  X i , it holds that To prove (ii) in Theorem 2, the required asymptotic result follows from (13) and (17) immediately when p = o(n) and √ n = O(p). The proof is completed.
Proof of Corollary 1. By Theorem 2, it holds that for all i. The required asymptotic result follows from the above result directly.
Proof of Theorem 4. Let us look at term E 1 and E 2 in (13) first under the new condition (A5). Similar to the proof of (14), it holds that The required result then follows directly.
Proof. It is apparent that part (a) of A2 is satisfied under A1 and B1-B3. y t is strictly stationary because ε i,t are i.i.d across i and t and condition B3. Since the density function of ε i,t exists, α(n) decays exponentially fast, see Pham and Tran (1985). Therefore  ∞ j=1 α(j) γ 4+γ < ∞. Now we prove A2(c) when γ = 4. We present a more general result first: for any p × 1 vector a satisfying sup p ∥a∥ 1 < ∞, it holds that Note that