Sparse Generalized Yule-Walker Estimation for Large Spatio-temporal Autoregressions with an Application to NO2 Satellite Data

We consider a high-dimensional model in which variables are observed over time and space. The model consists of a spatio-temporal regression containing a time lag and a spatial lag of the dependent variable. Unlike classical spatial autoregressive models, we do not rely on a predetermined spatial interaction matrix, but infer all spatial interactions from the data. Assuming sparsity, we estimate the spatial and temporal dependence in a fully data-driven manner by penalizing a set of Yule-Walker equations. This regularization can be left unstructured, but we also propose customized shrinkage procedures when observations originate from spatial grids (e.g. satellite images). Finite sample error bounds are derived and estimation consistency is established in an asymptotic framework in which the sample size and the number of spatial units diverge jointly. Exogenous variables can be included as well. A simulation exercise shows strong finite sample performance compared to competing procedures. As an empirical application, we model satellite-measured NO2 concentrations in London. Our approach delivers forecast improvements over a competitive benchmark and we find evidence of strong spatial interactions.


Introduction
Spatio-temporal models are powerful tools to explain and exploit dependencies between variables that are observed over both time and space, but they come with a number of challenges. In particular, endogeneity issues arise because the contemporaneous observations occur on both sides of the model equation. Furthermore, the inclusion of both spatial and temporal lags quickly results in heavily parameterized models. To circumvent these issues, a large part of the literature incorporates predetermined spatial weight matrices that govern the contemporaneous interactions between spatial units. Examples of this modelling strategy are: the spatial autoregressive model with a Gaussian quasi-maximum likelihood estimator (QMLE) by Lee (2004); the QMLE estimation of stationary spatial panels with fixed effects detailed in Yu et al. (2008); the extension of these spatial panels to include spatially autoregressive disturbances as in Lee and Yu (2010); a further extension to a non-stationary setting in which units can be spatially cointegrated in Yu et al. (2012); and the computationally beneficial generalized method of moments (GMM) estimator by Lee and Yu (2014). While the choice of the spatial weight matrix is a key element of the model specification, its selection process can feel somewhat arbitrary and/or tedious. The arbitrariness might prevail when practical considerations fail to suggest a particular mechanism for the spatial interactions.
Accordingly, more recent literature focuses on either incorporating multiple weight matrices (e.g. Debarsy and LeSage, 2018; Zhang and Yu, 2018) or, at the expense of estimating many parameters, directly inferring all spatial interactions from the data (e.g. Lam and Souza, 2019; Gao et al., 2019; Ma et al., 2021). We contribute to the latter strand of literature with the development of a new estimator that provides several unique benefits.
In this paper, we propose the SPatial LAsso-type SHrinkage (SPLASH) estimator as a fully data-driven estimator of spatio-temporal interactions. Apart from a generous bandwidth upper bound, this SPLASH estimator leaves the spatial weight matrix and autoregressive matrix unspecified while employing a lasso approach to recover sparse solutions. Building upon previous works by Dou et al. (2016) and Gao et al. (2019), we resolve the endogeneity problem in our spatio-temporal regressions by estimating the generalized Yule-Walker equations. Our contributions are five-fold.
First, assuming sparsity in the coefficient matrices and general mixing conditions on the innovations, we derive finite-sample performance bounds for the estimation and prediction error of our estimator.
We subsequently utilize these bounds to derive asymptotic consistency in a variety of settings.
For example, in the special case of a finitely bounded bandwidth and unstructured sparsity, it follows that the number of spatial units N may grow at any polynomial rate of the number of temporal observations T . Second, we adopt a banded estimation procedure for the autocovariance matrices that underlie the generalized Yule-Walker equations. The faster convergence rates of these banded autocovariance matrix estimators are shown to translate into better convergence rates of our SPLASH estimator. Third, we show that dependence between neighbouring units that are ordered on a spatial grid translates to diagonally structured forms of sparsity in the spatial weight matrix.
A tailored regularization procedure reminiscent of the sparse group lasso is proposed. Fourth, we generalize the model to include exogenous variables and demonstrate that an extended system of Yule-Walker equations continues to provide consistent estimators. Our simulation study confirms these points. Fifth and finally, we employ SPLASH to predict NO2 concentrations in London from satellite data.
Elaborating on the empirical application, we collect daily NO2 column densities from August 2018 to October 2020, recorded by the TROPOspheric Monitoring Instrument (TROPOMI) on board the Copernicus Sentinel-5 Precursor satellite. Each spatial unit is an aggregation of a small number of pixels on the satellite image. We find that SPLASH constructs more accurate one-step ahead predictions for all spatial units compared to the procedure in Gao et al. (2019), while outperforming a competitive penalized VAR benchmark for the majority of spatial units. In addition, we find evidence for spatial interactions between first-order neighbours and second-order neighbours (i.e. neighbours of neighbours).
There are two strands of literature that are closely linked to this work: the literature on the estimation of (nonparametric) spatial weight matrices and the literature on spatio-temporal vector autoregressions. Some similarities and differences are as follows. Lam and Souza (2014) consider a model specification where the spatial units depend linearly on a spatial lag and exogenous regressors.
The adaptive lasso is proven to select the correct sparsity pattern. To solve the endogeneity issue, they require the error variance to decay to zero as the time dimension grows large. Ahrens and Bhattacharjee (2015) solve the endogeneity problem using external instruments. Their two-step lasso estimation procedure selects the relevant instruments in the first step and the relevant spatial interactions in the second step. The theoretical properties of this estimator are derived using moderate deviation theory as in Jing et al. (2003). This approach requires the instruments and the idiosyncratic component to be serially independent. Clearly, a serial independence assumption is unrealistic for the spatio-temporal models we consider here. Finally, Lam and Souza (2019) augment a spatial lag model with a set of potentially endogenous variables (the augmenting set). They decompose the spatial weight matrix into a pre-determined component based on expert knowledge and a sparse adjustment matrix that represents specification errors. The adjustment matrix is sparsely estimated based on a penalized version of instrumental variables (IV) regression. If these instrumental variables are selected as temporal lags of the dependent variable, then their IV regressions are similar to generalized Yule-Walker estimation. In contrast to our approach, Lam and Souza (2019) do not regularize the interactions between the dependent variables and the variables in the augmenting set, and they assume the number of such interactions to be fixed. A fixed number of interactions is inappropriate in high-dimensional settings in which the number of spatial units is allowed to diverge.
Most closely related to our work are Gao et al. (2019) and Ma et al. (2021). Both papers consider the same model and an estimation procedure that relies on generalized Yule-Walker equations. The key difference with our paper lies in the method by which the model complexity is controlled during estimation. Gao et al. (2019) assume the coefficient matrices to be banded with a bandwidth that is small compared to the number of spatial units. The bandwidth is determined from the data and all parameters within the selected bandwidth are left unregularized. Our SPLASH estimator, however, has the ability to exploit (structured) sparsity within the bandwidth and thereby improve estimation and forecasting performance. In addition, apart from a generous upper bound on the bandwidth to ensure identification, SPLASH does not require an a priori choice regarding the bandwidth. The recently developed bagging approach in Ma et al. (2021) does allow for sparsity within the bands, yet it also requires the calculation of so-called solution paths. That is, a forward addition and backward deletion stage are needed to determine the variables that enter the final model specification. In contrast, the SPLASH estimator provides this solution at once. Furthermore, their approach is not designed to detect diagonally structured forms of sparsity, while the ability to do so results in clear performance improvements of SPLASH in both the simulations and the empirical application considered below.
This paper is organized as follows. Section 2 introduces the spatio-temporal vector autoregression and the banded autocovariance estimator that underlies the generalized Yule-Walker estimation approach. The SPLASH estimator and its theoretical properties are discussed in Section 3. The simulation results in Section 4 and the empirical application in Section 5 demonstrate the benefits of the SPLASH estimator. Section 6 concludes.

Notation
The indicator function 1{A} equals 1 if A is true and zero otherwise. For a vector x ∈ R^N, the L_p-norm of x is denoted ∥x∥_p = (∑_{i=1}^N |x_i|^p)^{1/p}.

The Spatio-temporal Vector Autoregression
As in the recent paper by Gao et al. (2019), we consider the spatio-temporal vector autoregression

y_t = A y_t + B y_{t−1} + ε_t,    (1)

where y_t = (y_{1t}, . . . , y_{Nt})' stacks the observations at time t over a collection of N spatial units. The contemporaneous spatial dependence between these spatial units is governed by the matrix A, while the matrix B incorporates dependence on past realizations. Finally, ε_t denotes the innovation vector. We impose the following assumptions on the DGP in (1).
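The stability logic behind the reduced-form representation of (1) can be illustrated on simulated data. The banded matrices A and B below are hypothetical choices for illustration only, scaled so that the reduced-form coefficient matrix is stable:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 25, 400

# Hypothetical banded coefficient matrices (illustrative, not the paper's designs):
# A carries contemporaneous spatial spillovers, B the temporal dynamics.
A = np.diag(np.full(N - 1, 0.2), k=1) + np.diag(np.full(N - 1, 0.2), k=-1)
B = 0.3 * np.eye(N) + np.diag(np.full(N - 1, 0.1), k=1)

# Reduced form: y_t = (I - A)^{-1} B y_{t-1} + (I - A)^{-1} eps_t = C y_{t-1} + u_t.
IA = np.eye(N) - A
C = np.linalg.solve(IA, B)
assert np.max(np.abs(np.linalg.eigvals(C))) < 1  # stability of the reduced form

Y = np.zeros((T, N))
for t in range(1, T):
    Y[t] = C @ Y[t - 1] + np.linalg.solve(IA, rng.standard_normal(N))
```

Here the maximum absolute row sums of A and B are 0.4 each, so the reduced-form matrix C has norm at most 0.4/(1 − 0.4) < 1, mirroring the spirit of the stability conditions discussed below.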
Remark 1. Assumption 1 is defined in terms of ∥·∥. Since ∥A'∥_1 = ∥A∥_∞ ≤ ∥A∥ for any matrix A, the norm ∥·∥ is convenient when bounding products of matrices containing transposes.
Assumption 1 ensures that y_t = Ay_t + By_{t−1} + ε_t has a stable reduced form VAR(1) specification. This follows from two observations. First, Assumption 1(a) bounds the maximum absolute row and column sums of A and thereby constrains the contemporaneous dependence between the time series. This assumption is reminiscent of the spatial econometrics literature, in which the spatial parameter λ is bounded from above and the prespecified spatial weight matrix W_N is standardized (see, e.g. Lee (2004) and Lee and Yu (2010)). Typically, the product λW_N, the natural counterpart of the matrix A, is required to fulfil conditions similar to ∥A∥ ≤ δ_A < 1.¹ We then infer that the absolute row and column sums of I_N − A are bounded; the latter is the logical counterpart of assumption B2 in Dou et al. (2016). Second, Assumption 1(b) controls serial dependence. Indeed, we conclude from ∥C∥_2 ≤ ∥C∥ ≤ C_B/(1 − δ_A) < 1 that both unit root and explosive behaviour of the reduced form specification are ruled out. The resulting stable VAR(1) representation is convenient to study the theoretical properties of our penalized estimator.
The assumptions on the innovation process {ε_t}, collected in Assumption 2, are closely related to those in Masini et al. (2019). Assumption 2(a) places restrictions on the time series properties of the error term through martingale difference (m.d.) and mixing assumptions. The m.d. assumption implies that E(ε_t y'_{t−j}) = 0 for j ≥ 1, while the mixing assumption controls the serial correlation in the data.
Polynomial or exponential tail decay of the distribution of the innovations is imposed through either Assumption 2(b1) or Assumption 2(b2), respectively. The type of tail decay will directly influence the growth rates we can allow for N and T . The discussions in Masini et al. (2019) demonstrate that Assumption 2 allows for a wide range of innovation models.
Any further structure being absent, there are (2N − 1)N unknown parameters in A and B to estimate. Three complications are encountered when estimating these parameters. First, if A ≠ O, then y_t occurs on both sides of the equation, and we face an endogeneity problem which renders OLS estimation inconsistent. Second, the number of unknown parameters grows quadratically in the cross-sectional dimension N. The model thus quickly becomes too large to estimate accurately without regularization. Finally, the multitude of parameters raises concerns about identifiability.
These three complications are addressed by: (1) imposing structure on the matrices A and B, and (2) estimating the unknown coefficients using the Yule-Walker equations (e.g. Brockwell and Davis, 1991, p. 420).
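The appeal of the Yule-Walker route can be checked numerically: post-multiplying (1) by the lagged observations and taking expectations removes the endogenous contemporaneous term. A minimal sketch, with hypothetical banded A and B and sample autocovariances standing in for their population counterparts:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 10, 100_000

A = np.diag(np.full(N - 1, 0.25), k=1)   # hypothetical spatial lag matrix
B = 0.3 * np.eye(N)                      # hypothetical temporal lag matrix
Ainv = np.linalg.inv(np.eye(N) - A)
C = Ainv @ B                             # reduced-form coefficient matrix

Y = np.zeros((T, N))
for t in range(1, T):
    Y[t] = C @ Y[t - 1] + Ainv @ rng.standard_normal(N)

Yb = Y[1000:]                                 # discard burn-in
S0 = (Yb[1:].T @ Yb[1:]) / (len(Yb) - 1)      # estimate of Sigma_0 = E[y_t y_t']
S1 = (Yb[1:].T @ Yb[:-1]) / (len(Yb) - 1)     # estimate of Sigma_1 = E[y_t y_{t-1}']

# The generalized Yule-Walker identity Sigma_1 = A Sigma_1 + B Sigma_0 should
# hold up to sampling error, even though OLS on (1) would be inconsistent.
resid = S1 - (A @ S1 + B @ S0)
```

The residual matrix shrinks towards zero as T grows, which is exactly the moment structure the estimators below exploit.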
There are several possibilities to introduce structure into A and B. Early spatial econometrics models, e.g. the spatial autoregressive (SAR) model or the spatial Durbin model (SDM), incorporate spatial effects through the product λW_N (with W_N pre-specified). The specification A = λW_N imposes substantial structure on A and leaves only the single parameter λ to estimate. Dou et al. (2016) consider a more general setting in which each row of W_N receives its own spatial autoregressive parameter. Specifically, they set A = diag(λ_0)W_N and B = diag(λ_1) + diag(λ_2)W_N, and estimate the 3N coefficients in (λ_0', λ_1', λ_2')'. Gao et al. (2019) require A and B to be banded matrices.² We employ a similar assumption.
¹ For instance, it is not uncommon to row-normalize W_N (each absolute row sum equal to 1) and restrict λ < 1, see pages 1903-1904 of Lee (2004). If W_N is symmetric, then also ∥λW_N∥ < 1.
² The matrix A has bandwidth k if the total number of nonzero entries in any row or column is at most k.
Assumption 3 serves two purposes. First, for each spatial unit i = 1, . . . , N, the matrices A and B are banded such that each equation contains no more than N unknown parameters. With N moment conditions for each i, Assumption 3(a) is key in identifying the parameters. Our discussion in Section 3 illustrates that this assumption is realistic when the data is observed on a regular grid.
The combination of Assumptions 3(a)-(b) is exploited in the Yule-Walker estimation approach. This approach requires estimation of the (N × N) autocovariance matrices Σ_j = E(y_t y'_{t−j}). Especially in our large N settings, it is crucial to rely on covariance matrix estimators that converge at a fast rate. If A, B, and Σ are banded, then the following result applies.
Theorem 1 shows that banded estimators for Σ_0 and Σ_1 provide an accurate approximation to V = [Σ_1 Σ_0]'. Each of these banded matrices has at most 2h + 1 nonzero elements in each column/row. In other words, given h, l_0 and k_0, Assumptions 1-3 guarantee that Σ_0 and Σ_1 can be well-approximated by matrices with bandwidths smaller than N. This improves the convergence rate of our estimator.
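The banding idea itself is easy to state in code. The sketch below defines a generic banding operator and applies it to a covariance matrix with geometrically decaying off-diagonal entries (a stylized stand-in for the autocovariances above); the approximation error in the maximum absolute row-sum norm decays quickly in the bandwidth:

```python
import numpy as np

def band(M, h):
    """Banding operator: keep entries with |i - j| <= h, zero out the rest."""
    n = M.shape[0]
    i, j = np.indices((n, n))
    return np.where(np.abs(i - j) <= h, M, 0.0)

# Stylized covariance with geometrically decaying dependence between units.
n = 50
i, j = np.indices((n, n))
Sigma = 0.5 ** np.abs(i - j)

# Approximation error (max absolute row sum) as the bandwidth h grows.
err = [np.max(np.sum(np.abs(Sigma - band(Sigma, h)), axis=1)) for h in range(10)]
```

With entry decay rate 0.5, the error is roughly halved for every unit increase in h, which is why a bandwidth far smaller than N can suffice.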

The SPLASH(α,λ) Estimator
Even under Assumption 3, the number of unknown parameters in A and B continues to grow quadratically in N . For large N , the accurate estimation of all these parameters becomes infeasible rather quickly. To alleviate this curse of dimensionality, we rely on sparsity. Sparsity naturally occurs when two spatial units do not interact with each other. We demonstrate, however, that a special, and exploitable, sparsity pattern arises whenever the spatial units are ordered in a structured way.
As an illustrative example, let us consider repeated measurements on the (5 × 5) spatial grids shown in the left column of Figure 1. The N = 25 spatial units are labelled y_1 up to y_25 and enumerated row-wise. This ordering of the spatial entities creates an implicit notion of proximity and we intuitively expect economic/physical interactions to be most pronounced at short length scales. In Figure 1(a) we start from the situation in which the spatial units are restricted to communicate horizontally. Blue arrows indicate explicitly that y_1 interacts with y_2, and y_14 interacts with both y_13 and y_15. Such interactions occur among all elements in the grid. More importantly, if only these horizontal interactions exist, then the (25 × 25) matrices A and B feature a sparsity pattern as shown in Figure 1(b). The blue elements are potentially nonzero whereas uncolored elements are zero. The nonzero elements in A and B are seen to cluster in specific, dense diagonals with the occasional zero when horizontal neighbours are absent (at the boundary of the grid).
This diagonal sparsity pattern is not an artifact of allowing horizontal interactions only. Figure 1(c) adds the vertical interactions and the accompanying sparsity pattern again manifests itself along diagonals (Figure 1(d)). Finally, with diagonal nearest neighbours being horizontal neighbours of vertical neighbours, we observe a "thickening" of the diagonals in Figure 1(f). Guided by these considerations we combine generalized Yule-Walker estimation with a sparse group penalty (e.g. Simon et al., 2013). The Yule-Walker estimator will control for endogeneity, while the sparse group penalty will shrink towards diagonal structures by including/omitting complete diagonals and thus selecting the required interactions. Compared to Gao et al. (2019), we hereby gain the ability to exploit sparsity within banded matrices.
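The mapping from grid neighbourhoods to matrix diagonals can be verified directly. The sketch below builds the rook-contiguity interaction pattern of a 5 × 5 grid under row-wise enumeration and lists on which diagonals the nonzero entries fall:

```python
import numpy as np

# Row-wise enumeration of a 5 x 5 grid: unit (r, c) gets index 5*r + c.
# Horizontal neighbours then differ by 1 in the index and vertical neighbours
# by 5, so all interactions land on a handful of matrix diagonals.
n = 5
N = n * n
W = np.zeros((N, N))
for r in range(n):
    for c in range(n):
        k = n * r + c
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n:
                W[k, n * rr + cc] = 1.0

offsets = sorted({j - i for i in range(N) for j in range(N) if W[i, j] != 0})
print(offsets)  # [-5, -1, 1, 5]
```

Note that offset ±1 pairs across a row boundary, such as units y_5 and y_6, are correctly absent: these are the "occasional zeros" within otherwise dense diagonals mentioned above.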
A formal definition of our estimator requires further notation. Part of this notation comes naturally if we briefly review the generalized Yule-Walker estimator. After post-multiplying (1) by y'_{t−1} and taking expectations, we find Σ_1 = AΣ_1 + BΣ_0 or, equivalently, a linear system in the unknown coefficients. The i-th column of C contains all coefficients that belong to the i-th equation in (1). Assumption 3 requires several of these coefficients to be zero, so we exclude these from the outset. We collect all remaining (possibly) nonzero coefficients in the i-th equation in the vector c_i, let V_i denote the corresponding columns of V, and obtain the equation-wise least-squares objective in (4). We will adjust this objective function in three ways. First, we define our estimator in terms of banded estimated covariance matrices, which allows us to exploit the results in Theorem 1.
Second, our group penalty penalizes parameters across equations, so we can no longer estimate the parameters equation-by-equation. We therefore define σ_h = vec(B_h(Σ_1)') and the block-diagonal matrix V^(d)_h, with blocks V_{i,h} being constructed similarly to V_i (see Figure 2 for an illustration).³ In this notation, the expression ∥σ_h − V^(d)_h c∥²_2 defines the joint objective function that sums the individual contributions in (4) over all equations. Finally, we construct the penalty function. We define an index set that partitions the vector c into sub-vectors, denoted {c_g}, that contain the non-zero diagonals of A and B that are admissible under Assumption 3, collected in the group sets G_A and G_B, respectively, and let G = G_A ∪ G_B. Based on this notation, we define our objective function in (6). The spatial lasso-type shrinkage estimator, abbreviated SPLASH(α,λ) or SPLASH in short, is defined as the minimizer of (6), i.e. ĉ = arg min_c L_α(c; λ). The importance of the penalty function P_α(c) is governed by the penalty parameter λ, and the second hyperparameter α balances group-structured sparsity against individual sparsity. At the extremities of α ∈ [0, 1] we find the group lasso (α = 0) and the lasso (α = 1). Intermediate values of α shrink both groups of diagonal coefficients in A and B and individual parameters. The SPLASH solution promotes completely sparse diagonals and sparse elements within nonzero diagonals, and thus shrinks towards sparsity patterns of the type displayed in Figure 1(b). As the structure of our estimator is similar to that of the Sparse Group Lasso (SGL), efficient algorithms are available to compute its solution (see, e.g. Simon et al., 2013). An R/C++ implementation of the SPLASH estimator based on this algorithm is available on one of the authors' websites.⁴
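The behaviour of the penalty at the two extremes of α can be sketched in a few lines. The grouping and the omission of group-size weights below are simplifications relative to the penalty in (6); the groups stand in for diagonals of A and B:

```python
import numpy as np

def sgl_penalty(c, groups, alpha):
    """Sparse-group-lasso-style penalty: alpha * ||c||_1 (element-wise sparsity)
    plus (1 - alpha) * sum of group L2 norms (whole groups in or out).
    Group-size weights are omitted here for simplicity."""
    l1 = np.sum(np.abs(c))
    l2 = sum(np.linalg.norm(c[g]) for g in groups)
    return alpha * l1 + (1 - alpha) * l2

c = np.array([0.0, 0.0, 1.0, -2.0, 0.5])
groups = [np.array([0, 1]), np.array([2, 3, 4])]  # e.g. two diagonals

print(sgl_penalty(c, groups, 1.0))  # pure lasso: |1| + |-2| + |0.5| = 3.5
print(sgl_penalty(c, groups, 0.0))  # pure group lasso: 0 + sqrt(5.25)
```

The first group is entirely zero and contributes nothing to either term, which is exactly the mechanism by which the group part of the penalty removes complete diagonals.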

Theoretical Properties of the SPLASH(α,λ) Estimator
In this section we derive the theoretical properties of the SPLASH estimator. First, however, we require an additional assumption on the DGP in order to ensure that A and B in (1) are uniquely identified. To this end, we leverage the bandedness assumption in Assumption 3, which enables unique identification of A and B via a straightforward full-rank condition on sub-matrices of the autocovariance matrices that appear in the generalized Yule-Walker equations.
Assumption 4 (Restricted minimum eigenvalue). The restricted minimum eigenvalue φ_min(x) is bounded away from zero. In words, Assumption 4 states that every sub-matrix containing N columns from V has full column-rank and a minimum singular value bounded away from zero. Related assumptions appear in Bickel et al. (2009, Section 4), who refer to φ_min(x) as a restricted eigenvalue and use this quantity to construct sufficient conditions for their restricted eigenvalue assumptions. Assumption 4 fits our framework particularly well, as the assumed maximum bandwidths of the matrices A and B in Assumption 3 imply that the diagonal blocks of the matrix V^(d) never contain more than N unique columns of V.
Using this property, we show in Lemma 1 of Appendix A that a Sparse Group Lasso compatibility condition is implied by Assumption 4.
Equipped with Assumption 4, we find the following finite-sample performance bounds on the prediction and estimation error of SPLASH.
Theorem 2 contains a finite-sample performance bound on the prediction and estimation error for the SPLASH(α,λ) estimator. It offers some interesting insights. First, we focus on the probability with which inequality (7) holds. For VAR estimation with a penalized least-squares objective function, such probabilities are governed by tail probabilities of the process {T^{-1} ∑_{t=1}^T y_{it} ε_{jt}} (see, e.g. lemma 4 in Kock and Callot (2015), or lemmas 5-6 in Medeiros and Mendes (2016)). Because Yule-Walker estimation relies primarily on autocovariance matrix estimation, our probability depends on the tail decay of the distribution of ∥V_h − V∥. Overall, the probability of (7) improves through faster tail decay of the innovation distribution (compare cases (a) and (b)) and banded autocovariance matrix estimation (Theorem 1). Second, we look closer at the performance upper bound itself. The right-hand side of (7) demonstrates that the upper bound on the prediction and estimation error is increasing in ω_α, which in turn is increasing in the bandwidths k_0 and l_0, increasing in the group sizes (α < 1), and increasing in the number of relevant interactions |S| (α > 0). Furthermore, the prediction and estimation error increases in the degree of penalization.
While this seemingly suggests minimizing λ so as to improve performance bounds, we emphasize that the effect of regularization in Theorem 2 is two-fold: increasing regularization deteriorates the performance bound, but increases the probability of the set on which the performance bound holds. Intuitively, shrinkage induces finite-sample bias which worsens accuracy, but simultaneously reduces sensitivity to noise, thereby enabling performance guarantees at higher degrees of certainty.
The aforementioned effects can also be demonstrated by means of an asymptotic analysis. Based on Theorem 2, we derive the conditions for convergence of the prediction and estimation errors in the following corollary. The exact convergence rates are also provided.
where q_λ, q_N, q_s, and q_k are fixed and positive constants. Maintain Assumptions 1-4 and assume that either (i) Assumption 2(b1) holds for some 0 < δ < 1, or (ii) q_λ < 1/2 − q_k and Assumption 2(b2) holds. Then, the prediction and estimation errors converge in probability at rates determined by these constants. Corollary 1 provides insights into the determinants of the convergence rate. In particular, the result confirms that the convergence rate decreases in the bandwidths k_0 and l_0, the number of spatial units N, the number of interactions |S| and the degree of penalization λ.⁵ To ensure that the set on which the performance bound in Theorem 2 holds occurs with probability converging to one, conditions (i) and (ii) impose that the degree of penalization does not decay too fast. The optimal convergence rate is obtained by choosing q_λ as large as possible without violating these conditions. Some concrete examples are provided in Remark 2.
Remark 2. Insightful special cases can be examined based on Corollary 1. For the sake of brevity, we consider two cases while focusing on the estimation error P_α(ĉ − c) and assuming errors with at least d finite moments (Assumption 2(b1)). In the absence of within-group shrinkage (α = 0), the fastest rate is obtained by choosing δ arbitrarily close to 1. The estimator then converges almost at rate T^{1/2−1/2d}/(N^{1+2/d}|G_S|). For fixed N and large d, this is close to the common √T-rate of fixed-dimensional settings without regularization. If shrinkage is imposed at the individual interaction level only (α = 1), then P_1(ĉ − c) = O_p(T^{q_s−q_λ}) and the estimation error converges almost at the rate T^{1/2−1/2d}/(|S|N^{2/d}). Noting that N^{1+2/d}|G_S| ≥ |S|N^{2/d}, we see that SPLASH(1,λ) attains a convergence rate at least as fast as SPLASH(0,λ), and possibly faster when the sparsity is unstructured or the diagonals are highly sparse.
Each vector x_{t,k} = (x_{1t,k}, . . . , x_{Nt,k})' augments the spatio-temporal vector autoregression with an extra regressor. This regressor may vary over time and it is exogenous, i.e. we have E(x_{t,k} ε_t') = O for k = 1, . . . , K. For notational brevity, we consider the situation in which the exogenous regressors x_{it,1}, . . . , x_{it,K} can only directly influence spatial unit i. This explains the diagonal structure in diag(β_k). In Remark 4 we argue that this simplification does not greatly hinder generality. In contrast to Ma et al. (2021), we allow β_k = (β_{1k}, . . . , β_{Nk})' to vary with location. We keep K fixed.
⁵ Recall that λ ∈ O(T^{−q_λ}), such that a higher q_λ implies a faster decay of the penalty term.
To account for the exogenous variables, we modify the generalized Yule-Walker estimator of Section 3.1. We recall Σ_j = E(y_t y'_{t−j}), and define the analogous (cross-)covariance matrices involving the exogenous regressors. Two sets of Yule-Walker equations, (9a) and (9b), are derived by post-multiplying the model by y'_{t−1} and x'_{t,k}, respectively, and taking expectations.
Compared to (3), the Yule-Walker equations in (9a) contain an additional term involving the exogenous regressors. If this term is uninformative (e.g. when x_{t,k} and y_{t−1} are uncorrelated), then (9a) alone will not identify β_k. We therefore add the additional Yule-Walker equations in (9b). To develop the estimator, we combine (9a) and (9b) into the stacked system (10). From this point onward, the development of the SPLASHX(α,λ) estimator mimics the reasoning of page 10 closely. First, we focus on the i-th spatial unit and collect all the nonzero coefficients of A and B (as stipulated by Assumption 3) in c_i. Letting V*_i denote the columns in V* related to c_i and defining σ* analogously, (10) implies a penalized objective function. This objective function allows for the estimation of β_1, . . . , β_K, sparse coefficients, completely sparse vectors β_k, and completely sparse diagonals in the coefficient matrices A and B. There is a clear mathematical resemblance between the SPLASH and SPLASHX estimators. Accordingly, under appropriate modifications to Assumptions 1-4, a finding similar to Theorem 2 is attainable. We provide this result as Theorem 3 and refer the reader to Supplement C for detailed assumptions and proofs.
Under Assumptions 1*-4* and ∥Q∥ ≤ C_Q, a performance bound analogous to (7) holds for some 0 < δ < 1. All constants (b_1, b_2, κ_1, etc.) are positive and independent of N and T, see Theorem 1.
Remark 3. The inclusion of exogenous variables affects the autocovariance structure of the data.
Clearly, E(y_t y_t') now also depends on the various second moments of the exogenous covariates. We do not make any a priori assumptions on E(x_{t,k} x'_{t,κ}) and thus define the SPLASHX(α,λ) estimator in terms of the unbanded autocovariance matrix estimators.
Remark 4. Defining the coefficient matrix in front of x_{t,k} as diagonal is not restrictive. That is, by letting x_{t,k+1} be a reordered version of x_{t,k}, the former's addition to the model can accommodate the situation in which the dependent variable is influenced by the exogenous variable x_{t,k} from multiple locations.

Simulation setting
In this section, we explore the finite sample performance of our estimator by Monte Carlo simulation.
The data generating process underlying the simulations is the spatio-temporal VAR in (1). We study T ∈ {500, 1000, 2000} and draw all errors ε_{it} independently from the N(0, 1) distribution. The matrices A and B and the cross-sectional dimension N are specified in the two designs below. All simulation results are based on N_sim = 500 Monte Carlo replications.
Design A (Banded specification): We revisit simulation Case 1 in Gao et al. (2019). The matrices A and B are banded with a bandwidth of k_0 = 3. Specifically, the elements (a_{ij})_{i,j=1}^N and (b_{ij})_{i,j=1}^N are generated according to the following two steps. Step 1: If |i − j| = k_0, then a_{ij} and b_{ij} are drawn independently from a uniform distribution on the two points {−2, 2}. All remaining elements within the bandwidth are drawn from the mixture distribution ω1_{{0}} + (1 − ω)N(0, 1) with P(ω = 1) = 0.4 and P(ω = 0) = 0.6.
Step 2: …
For each design, we report simulation results for three sets of estimators. The first set includes the estimators developed in this paper: (1) the SPLASH(0,λ) estimator promotes non-sparse groups only, (2) SPLASH(0.5,λ) provides equal weight to sparsity at the group and individual level,⁷ and (3) SPLASH(1,λ) imposes individual sparsity only. The second set contains the banded estimator of Gao et al. (2019), abbreviated GMWY, and its infeasible variant GMWY(k_0); as the latter is supplied with the true bandwidth k_0, it is given a comparative advantage.⁸ The third set solely contains the L_1-penalized reduced form VAR(1) estimator (abbreviated PVAR). In detail, we consider the reduced form VAR(1) specification y_t = Cy_{t−1} + u_t and estimate C by minimizing L_pvar(C) = ∑_{t=2}^T ∥y_t − Cy_{t−1}∥²_2 + λ ∑_{i,j=1}^N |c_{ij}|. This estimator is well-researched in the literature (see, e.g. Kock and Callot, 2015; Gelper et al., 2016; Masini et al., 2019), albeit in different settings. It will serve as a competitive benchmark for the forecasting performance of our proposed estimation procedure.
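The PVAR benchmark's objective function is straightforward to write down; a minimal sketch, with a noiseless sanity check in which the least-squares part vanishes and only the penalty remains:

```python
import numpy as np

def pvar_objective(C, Y, lam):
    """L1-penalized least-squares objective of the PVAR benchmark:
    sum_{t>=2} ||y_t - C y_{t-1}||_2^2 + lam * sum_{i,j} |c_ij|."""
    resid = Y[1:] - Y[:-1] @ C.T
    return np.sum(resid ** 2) + lam * np.sum(np.abs(C))

# Noiseless data generated from y_t = C0 y_{t-1}: the residual term is zero,
# so the objective reduces to the penalty lam * sum |c_ij|.
N = 3
C0 = 0.5 * np.eye(N)
Y = np.zeros((10, N))
Y[0] = 1.0
for t in range(1, 10):
    Y[t] = C0 @ Y[t - 1]

print(pvar_objective(C0, Y, lam=2.0))  # 2.0 * (3 * 0.5) = 3.0
```

Minimizing this objective over C yields the lasso-regularized reduced-form estimate used as the forecasting benchmark.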
The forecasting performance of each estimator will be assessed using the Relative Mean-Squared Forecast Error (RMSFE), where a superscript j indexes a specific Monte Carlo replication. As the SPLASH and GMWY procedures estimate A and B, we can also compare the estimation accuracy. Using the superscript j as before, we compute the Estimation Error (EE) of the coefficient matrices.
⁷ The choice for α = 0.5 is solely made to illustrate the effect of combining both group and individual penalties. For different designs, this choice may or may not be optimal.
⁸ In unreported simulation results (available upon request), we find that the results are insensitive to this choice.
Finally, a word on the selection of the penalty parameter. For the SPLASH estimator, we calculate the maximum penalty, λ_max, as the smallest value producing the zero solution for all values of α. Given λ_max, we define the smallest penalty λ_min as 10^{−4}λ_max (10^{−6}λ_max) for Design A (B) and construct a grid of candidate penalties in between. The penalty is then selected by time series cross-validation (TSCV, see, e.g. Hyndman and Athanasopoulos, 2018). In our implementation of TSCV, the first 80% of the data is used to fit multiple solutions, which are then evaluated based on the MSFE obtained on the latter 20% of the data. The preferred penalty is chosen as the solution that attains the smallest MSFE.⁹
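The TSCV scheme described above can be sketched as follows. The `fit` argument is a placeholder for any estimator mapping a training sample and a penalty to a coefficient matrix; the toy check uses hypothetical "fits" that shrink towards multiples of the identity:

```python
import numpy as np

def tscv_select(Y, lambdas, fit, train_frac=0.8):
    """Fit on the first 80% of the sample, evaluate one-step-ahead MSFE on
    the last 20%, and return the penalty attaining the smallest MSFE."""
    split = int(train_frac * len(Y))
    best_lam, best_msfe = None, np.inf
    for lam in lambdas:
        C = fit(Y[:split], lam)
        pred = Y[split - 1:-1] @ C.T       # one-step-ahead forecasts
        msfe = np.mean((Y[split:] - pred) ** 2)
        if msfe < best_msfe:
            best_lam, best_msfe = lam, msfe
    return best_lam

# Toy illustration: data from y_t = 0.5 y_{t-1} + eps_t, so among candidate
# shrinkage levels {0.0, 0.5, 0.9} the evaluation sample should favour 0.5.
rng = np.random.default_rng(7)
N, T = 4, 1000
Y = np.zeros((T, N))
for t in range(1, T):
    Y[t] = 0.5 * Y[t - 1] + rng.standard_normal(N)

chosen = tscv_select(Y, [0.0, 0.5, 0.9], lambda Ytr, lam: lam * np.eye(N))
```

Unlike ordinary cross-validation, the evaluation block always lies strictly after the training block, which respects the temporal ordering of the data.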

Simulation results
The results for Design A are reported in Table 1. First, we consider the predictive performance in Panel 1. For all methods, we observe a monotonic decrease in RMSFE when T increases. The SPLASH estimators and GMWY(k_0) exhibit the best overall forecast performance, with SPLASH outperforming for smaller sample sizes (T = 500). Among the SPLASH estimators, SPLASH(0,λ) attains the lowest RMSFE in the majority of specifications, but differences are generally marginal.
⁹ We also tried to select the penalty for the PVAR as the sparsest solution whose prediction error lies within one standard error of the minimum prediction error. This selection rule, however, did not lead to an improvement in forecast or estimation accuracy.
Table 1 notes: the RMSFE and EE are defined in (12) and (13), respectively. In general, lower numbers indicate better performance. As PVAR estimates a reduced form VAR, there are no model errors for A and B to report for this method.
The penalized VAR forecasts are less accurate than the aforementioned methods. An explanation is that sparsity patterns in the reduced-form representation are less prevalent and thus more difficult to exploit. Direct estimation of the contemporaneous spatial interactions thus delivers forecast improvements over regularized reduced-form estimation. The GMWY estimator is highly competitive when T = 2000 but performs notably worse for small N and T. As GMWY has a tendency to select too large a bandwidth (as in Gao et al. (2019), table 1), this is probably caused by the estimation of redundant parameters. Given that the majority of sparsity in this design comes from the small bandwidth of A and B, which is fully exploited by the infeasible GMWY(k_0) estimator, we consider it reassuring that the SPLASH estimators attain comparable, and occasionally better, forecast performance without necessitating an a priori specification of the bandwidth.
Next, we explore the estimation accuracy for A and B in Panels 2 and 3, respectively. As before, all estimators display an improvement in estimation accuracy when T increases. SPLASH(0,λ) attains a lower estimation error than the SPLASH(0.5,λ) estimator, which in turn performs better than the unstructured sparsity variant SPLASH(1,λ). The tight bandwidth in this design implies that many diagonals ought to be set to zero, which seems to be best effectuated by means of the group penalty. The GMWY(k_0) estimator appears to deliver somewhat more accurate estimates than SPLASH for larger values of T. This apparently slower convergence of the SPLASH estimator might, at least partly, be considered the price of not knowing the true sparsity pattern, as represented by the term ω_α in Theorem 2. It is worth mentioning, however, that the choice of penalty parameter is motivated by predictive performance, which may not be optimal from the perspective of estimation accuracy. Indeed, in an unreported analysis we find that the penalty that minimizes the estimation error is typically higher and delivers sparser solutions. Regarding the GMWY estimator, we note that the detrimental effect of overestimating the bandwidth in smaller sample sizes is again visible, with the estimation error being substantially larger for the T = 500 setting.
Simulation results for Design B are shown in Table 2. The high RMSFEs for the GMWY estimators are most striking. In the setting N = 25 and T = 500, the GMWY estimator frequently selects a bandwidth equal to 1, translating to inferior performance across all metrics. The GMWY(k_0) estimator, on the other hand, is based on the correct bandwidth. This method, however, forecasts far worse, while its estimation accuracy instead is competitive with SPLASH. Upon closer inspection, we find that the high RMSFE in this case is driven by a few extreme prediction errors. These prediction outliers in turn correspond to simulation trials in which the smallest absolute eigenvalues of the estimated matrix I − Â are close to zero (see Figure S1 in the Supplementary Appendix). This implies that the GMWY estimator may be prone to stability issues when the bandwidth is large relative to the dimension. 10 Apparently, owing to the implementation of sparsity, the SPLASH estimator does not suffer from such stability issues. For N = 25 and T = 2,000, the bandwidth
selection in GMWY improves, while its forecast performance ironically worsens as a result of the increasing stability issues. The remaining results tell the same story as in Design A: SPLASH(0,λ) and SPLASH(0.5,λ) forecast very close to the optimal forecast, and notably better than the PVAR. While the forecast performance of SPLASH(1,λ) comes across as equivalent to the SPLASH implementation with group penalization, the estimation accuracy is superior for the latter. Hence, the group penalty seems especially valuable for the purpose of model interpretation.
A small visual analysis provides further evidence on the favourable estimation accuracy obtained by SPLASH with shrinkage towards diagonally structured sparsity. We visualize the capability of recovering the correct sparsity pattern by displaying the absolute value of the coefficients as averaged across all N sim simulation runs. Figure 3 illustrates the similarity between the true matrix A and the average magnitude of the estimated coefficients.
Remark 5. In an elaborate, though unreported, visual analysis, we find that most zero diagonals are actually not estimated as exactly zero by the (sparse) group lasso. When tuning the penalty parameter by the BIC criterion, in which the number of estimated non-zeros is used as a proxy for the degrees of freedom, the true zero diagonals are typically estimated as exactly zero. However, the increased amount of regularization that is required to effectuate this is detrimental to the forecast performance.
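A minimal sketch of a BIC-type criterion of the kind mentioned in Remark 5, using the non-zero count as a degrees-of-freedom proxy (the authors' exact criterion is not reproduced here):

```python
import numpy as np

def bic_criterion(residuals, coef, tol=1e-8):
    """BIC with the number of estimated non-zeros as a proxy for the
    degrees of freedom: T * log(RSS / T) + log(T) * df."""
    T = residuals.shape[0]
    rss = np.sum(residuals ** 2)
    df = np.sum(np.abs(coef) > tol)   # estimated non-zero coefficients
    return T * np.log(rss / T) + np.log(T) * df
```

With equal residual sums of squares, the sparser solution attains the lower criterion value, which is why BIC tuning tends to deliver more zeros than TSCV.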

Simulations with exogenous regressors
In this section, we examine the estimation performance of our estimator in the presence of exogenous regressors. Simulated data are drawn from the spatio-temporal model with exogenous regressors, where A and B are generated analogously to Designs A and B in Section 4.1, and the coefficients of the exogenous regressors are given by β_1 = ι_N and β_2 = 0_N. Hence, only x_{t,1} contributes to the variation in y_t. Accordingly, we henceforth refer to x_{t,1} and x_{t,2} as the relevant and irrelevant exogenous regressor, respectively. All elements of the exogenous variables and innovations are drawn i.i.d. from N(0, 1). At each simulation trial, we implement the same estimators as before, with two changes. First, in the reduced-form representation (1), y_t = C y_{t−1} + [D diag(β_1)] x_{t,1} + [D diag(β_2)] x_{t,2} + D ε_t, the coefficient matrices in front of x_{t,1} and x_{t,2} are no longer diagonal. This would cause an unfair comparison with PVAR, so we decided to omit the penalized VAR approach from the comparison. Second, to avoid results that depend on the prediction method employed, we focus solely on the estimation accuracy.
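For concreteness, the data-generating process can be sketched as below. We assume the structural form y_t = A y_t + B y_{t−1} + Σ_k diag(β_k) x_{t,k} + ε_t and simulate through the reduced form with D = (I − A)^{−1} and C = DB; the function name and interface are our own.

```python
import numpy as np

def simulate_svar_x(A, B, betas, T, rng):
    """Simulate y_t = A y_t + B y_{t-1} + sum_k diag(beta_k) x_{t,k} + eps_t
    via the reduced form y_t = C y_{t-1} + D (sum_k diag(beta_k) x_{t,k} + eps_t)."""
    N = A.shape[0]
    D = np.linalg.inv(np.eye(N) - A)       # reduced-form multiplier
    C = D @ B
    K = len(betas)
    X = rng.standard_normal((T, K, N))     # i.i.d. N(0,1) exogenous regressors
    Y = np.zeros((T, N))
    for t in range(1, T):
        shock = rng.standard_normal(N) + sum(betas[k] * X[t, k] for k in range(K))
        Y[t] = C @ Y[t - 1] + D @ shock
    return Y, X
```

Note that diag(β_k) x_{t,k} is just the elementwise product of β_k and x_{t,k}, which is why the coefficient vectors can be applied directly.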
The estimation accuracy for the estimates of A and B is compared on the basis of the metrics EE_A and EE_B, as given in (13). In addition, we also report the average estimation errors of the coefficients for the relevant and irrelevant exogenous regressors, denoted EE_R and EE_IR and calculated analogously. The results for Designs A and B are reported in Tables 3 and 4, respectively.
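An estimation-error metric of this kind can be sketched as an average matrix-norm error over replications; the exact normalization in (13) is not reproduced here, so treat this as an illustrative stand-in.

```python
import numpy as np

def estimation_error(estimates, truth):
    """Average Frobenius-norm error of estimated coefficient matrices
    (one per Monte Carlo replication) against the true matrix."""
    return np.mean([np.linalg.norm(E - truth, 'fro') for E in estimates])
```

Lower values indicate more accurate recovery of the true coefficient matrix, matching the reading convention used in the tables.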
First, we consider the results for Design A. The first two panels display the average estimation errors in A and B, respectively. Reassuringly, all estimators display a clear monotonic decrease in estimation error as the sample size grows. Comparing the SPLASH estimators among each other, we observe that shrinkage towards group sparsity is most beneficial in high-dimensional settings (T = 500 or N = 100). In these instances, the SPLASH(0,λ) and SPLASH(0.5,λ) estimators obtain the lowest estimation error across all methods. Conversely, when the dimension is small (N = 25) and the sample size is large (T = 2000), we find little gain in penalizing towards structured sparsity and SPLASH(1,λ) outperforms all other methods. Furthermore, the SPLASH estimators attain a lower estimation error than the GMWY estimators in almost all settings, with the performance gains attained by SPLASH being most pronounced when the sample size is small, i.e. when the exploitation of sparsity matters most. Comparing the GMWY estimators, we find that, in lower-dimensional settings, a data-driven selection of the bandwidth performs comparably to relying on the true bandwidth. However, when N = 100 and T = 500, we find that the bandwidth selection procedure overestimates the true bandwidth in roughly 40% of the simulation trials. Accordingly, the GMWY estimator attains inferior estimation accuracy in this particular setting. Regarding the exogenous regressors, we observe a similar monotonic decrease in the estimation error for the coefficients of both the relevant and the irrelevant exogenous regressor.
The SPLASH estimator outperforms the GMWY estimators across all dimensions and sample sizes, with the performance gain again being most prominent in the higher-dimensional settings. The estimation error obtained by SPLASH for the irrelevant exogenous regressor is remarkably small, further demonstrating the benefits of the incorporated shrinkage.
The results for Design B depict a similar, if not more compelling, story. The SPLASH estimators again outperform across all settings, with the performance differentials between SPLASH and GMWY being more pronounced compared to Design A. Again, we observe that the SPLASH(1,λ)  estimator seems to outperform based on EE A and EE B for N = 25, whereas the SPLASH(0,λ) and SPLASH(0.5,λ) estimators do better when N = 100. We conjecture that shrinkage towards structured sparsity only becomes beneficial when the group sizes are sizable enough, at which point the accumulation of selection errors by SPLASH(1,λ) starts to deteriorate the overall estimation accuracy. Contrasting the performance of SPLASH to the GMWY estimators, we observe that the exploitation of sparsity within the bandwidth results in substantial performance gains across all specifications and coefficient matrices. Moreover, the bandwidth selection procedure of the GMWY estimator now frequently selects very small bandwidths. This negatively impacts the estimation accuracy when N = 25, whilst having a positive impact when N = 100. In the latter case, the number of parameters to estimate is simply too large without further regularization, such that one might be better off by forcing most diagonals to zero, even if some of those are relevant. Interestingly, the inability to exploit sparsity also affects the estimation accuracy for the relevant exogenous regressors, as the third panel reveals a sizeable difference in the EE R between the SPLASH and GMWY estimators. Regarding the irrelevant exogenous regressor, we find that the EE IR is substantially larger for GMWY, but comparable across the SPLASH and GMWY(k 0 ) estimators.
Overall, SPLASH unambiguously attains more accurate estimates of all coefficient matrices in the spatial VAR with exogenous regressors. In line with expectations, the performance gain of SPLASH is most notable in high-dimensional designs with substantial degrees of sparsity. However, even in lower-dimensional designs in which the degree of sparsity is less, SPLASH remains competitive to the GMWY estimators.

Empirical Application
Nitrogen dioxide (NO2) is emitted during the combustion of fossil fuels (e.g. by motor vehicles) and has been associated with adverse effects on the respiratory system. 11 The Air Quality Standards Regulations 2010 require regular monitoring of NO2 concentration levels in the UK. 12 Using satellite data, we examine the empirical performance of the SPLASH estimator when predicting daily NO2 concentrations in Greater London. This satellite data is publicly available via the Copernicus Open Access Hub and we consider the time span from 1 August 2018 to 18 October 2020. 13 The original NO2 concentrations are reported in mol/m2, which we convert to mol/cm2 to avoid numerical instabilities caused by small-scale numbers. The vast majority of measurements are captured between 11:00 and 14:00 UTC. The area of interest is divided into a (5 × 9) grid, implying that longitudes and latitudes increment by approximately 0.2 from cell to cell (see Figure 4, part (c)). All available NO2 measurements are averaged within each cell and within the same day. The resulting data set contains 0.8% missing observations, which we impute using the Multivariate Time Series Data Imputation (mtsdi) R package. 14 A rolling-window approach is used to assess the predictive power of the SPLASH estimator. Each window contains 80% of the data (641 days), allowing 160 one-step-ahead forecasts to be made. For each window, we proceed along the following four steps: (i) de-mean the data, (ii) determine the hyperparameters and estimate each model, (iii) produce a forecast for the de-meaned data, and (iv) add the means back to the forecast. In addition to the estimators described in the simulation section (Section 4, see page 16), we add another forecast: the window's mean.

11 The direct health effect of nitrogen dioxide is difficult to determine because its emission process is typically accompanied by the emission of other air pollutants (see, e.g. Brunekreef and Holgate, 2002).
This new forecast is abbreviated CONST and all other forecasts follow the notational conventions from the simulation section. For SPLASH, we follow the procedure described in Section 4.1 and set λ = 1.8 × 10 −4 λ max .
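The four rolling-window steps can be sketched as follows; `estimate` stands in for any of the compared estimators (a hypothetical interface returning a one-step coefficient matrix).

```python
import numpy as np

def rolling_one_step_forecasts(Y, window, estimate):
    """For each window: (i) de-mean, (ii) estimate, (iii) forecast the
    de-meaned series one step ahead, (iv) add the mean back."""
    T = Y.shape[0]
    forecasts = []
    for start in range(T - window):
        W = Y[start:start + window]
        mu = W.mean(axis=0)             # (i) window mean
        coef = estimate(W - mu)         # (ii) fit on de-meaned data
        fc = coef @ (W[-1] - mu)        # (iii) one-step-ahead forecast
        forecasts.append(fc + mu)       # (iv) re-add the mean
    return np.array(forecasts)
```

With a window of 80% of the sample, the loop produces exactly the remaining 20% of one-step-ahead forecasts, mirroring the 641-day/160-forecast split in the text.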
The spatial grid contains N = 5 × 9 = 45 spatial units, such that the SPLASH(α,λ) models contain 2N² − N = 4,005 parameters. For the purpose of identifiability, we band the spatial matrix A and the autoregressive matrix B such that a_ij = b_ij = 0 for |i − j| > ⌊N/4⌋ = 11. By ordering the spatial units vertically, this banding puts no restrictions on the vertical interactions but allows no more than second-order interactions between horizontal neighbours (see Figure S2 in Supplementary Appendix D.2 for details).
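The banding restriction amounts to a simple boolean mask on the coefficient matrices; a sketch:

```python
import numpy as np

def banding_mask(N, bandwidth):
    """Boolean mask of the free (unrestricted) entries under banding:
    entry (i, j) is estimable iff |i - j| <= bandwidth."""
    idx = np.arange(N)
    return np.abs(idx[:, None] - idx[None, :]) <= bandwidth

mask = banding_mask(45, 11)   # 45 spatial units, bandwidth 11
```

With the 5-row grid ordered vertically, a first-order horizontal neighbour sits 5 indices away, so a bandwidth of 11 admits second-order horizontal interactions (offset 10) but excludes third-order ones (offset 15), consistent with the description above.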
The forecast performance is measured along three metrics and is always expressed relative to the L 1 -penalized reduced form VAR(1) (PVAR) benchmark. That is, we report: (i) the number of spatial units that are predicted more accurately than the PVAR method (#wins), (ii) the number of spatial units that are predicted significantly more accurately based on a Diebold-Mariano test at a 5% significance level (#sign. wins), and (iii) the average loss relative to the penalized VAR over all spatial units. These three metrics are calculated based on two loss functions for the forecast errors, namely the mean squared forecast error (MSFE) and the mean absolute forecast error (MAFE).
We additionally report the MAFE because the NO 2 column densities display several abrupt spikes which may carry too much weight when relying on a squared loss function. The results are reported in Table 5.
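The per-unit Diebold-Mariano comparison can be sketched as follows. This is a generic textbook implementation with a normal approximation for one-step-ahead errors, not necessarily the authors' exact test.

```python
import numpy as np
from scipy import stats

def diebold_mariano(e1, e2, loss=np.square):
    """DM test of equal predictive accuracy for two forecast-error series.
    For one-step-ahead forecasts the sample variance of the loss
    differential suffices (no HAC correction)."""
    d = loss(e1) - loss(e2)                   # loss differential series
    dm = d.mean() / np.sqrt(d.var(ddof=1) / len(d))
    pval = 2 * stats.norm.sf(abs(dm))         # two-sided p-value
    return dm, pval
```

Passing `loss=np.abs` instead of the default squared loss yields the MAFE-based comparison used in the second half of Table 5.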
We first look at the mean squared forecast errors (MSFEs). The window-mean forecast (CONST) clearly does not improve upon the benchmark PVAR forecast for any spatial unit. However, this forecast still attains an RMSFE of 1.185, potentially indicating a low predictability of NO2 column densities.
The GMWY approach obtains the worst forecast performance, possibly because a large bandwidth is needed to allow for second-order horizontal interactions and, consequently, a large number of parameters must be estimated. With an RMSFE of 0.919, SPLASH(0,λ) does manage to improve upon the benchmark. In fact, the MSFEs for all 45 spatial units are smaller than those of the benchmark, 42 of which are found to be significant by a Diebold-Mariano test based on the squared forecast errors. Allowing for sparsity within groups does not seem to deliver additional forecast improvements, as SPLASH(0.5,λ) attains a slightly worse forecast performance and significantly outperforms the benchmark for only 39 locations. Completely omitting regularization at the group level results in a further deterioration of the forecast performance, with SPLASH(1,λ) attaining an RMSFE of 0.94 and significantly beating the benchmark for 30 out of 45 spatial units. We take this as evidence that the ability to promote diagonally structured sparsity is indeed beneficial in real-life spatial applications, although even estimating the spatial VAR with unstructured sparsity manages to improve upon regularized reduced-form VAR estimation.
Next, we focus on the mean absolute forecast error (MAFE). The results are qualitatively similar to those obtained based on the MSFE. In particular, GMWY still has the worst forecast accuracy and SPLASH(0,λ) continues to perform best. The GMWY method, while still standing out, does not score as poorly anymore based on the RMAFE. We conjecture that the absence of regularization may increase sensitivity to noise, thereby resulting in particularly high squared forecast errors in periods of atypical NO2 concentrations.
Finally, we illustrate the second key benefit of SPLASH-type estimators: interpretability. Recall that we convoluted satellite images to a (5 × 9) grid of spatial units. To examine the relevant interactions between these spatial units, we provide several visualizations of the spatial weight matrices estimated by the SPLASH(0.5,λ) estimator. First, in Figure 4(a), we visualize the absolute magnitude of the spatial interactions. A clear diagonal pattern emerges, with the two diagonals closest to the principal diagonal and the two outer diagonals containing the largest interactions. These four diagonals correspond to first-order vertical and second-order horizontal interactions, respectively. The additional two diagonals that are sandwiched in between the former contain the first-order horizontal interactions between spatial units, which surprisingly seem to be smaller in magnitude. In Figure 4(b), each cell indicates the proportion of rolling windows in which the corresponding spatial interaction is estimated as non-zero. These proportions are either one (yellow) or zero (purple), indicating very stable selection across samples. It becomes apparent that, in addition to the six diagonals that were clear from panel (a), two additional diagonals are always selected, which contain the first-order diagonal interactions between spatial units. To facilitate interpretation of this sparsity pattern, we provide a spatial plot of our region of interest with the spatial grid overlaid (Figure 4(c)). We explicitly visualize the interactions implied by Figure 4(b) for two pixels, pixel 1 (top left) and pixel 23 (center), using arrows whose thickness is determined by the average absolute magnitudes estimated in Figure 4(a). The emerging pattern clearly shows the interactions between NO2 concentrations of neighbouring districts in London.
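The selection-stability summary of Figure 4(b) is simple to compute: for each matrix entry, the share of rolling windows in which it is estimated as non-zero. A sketch:

```python
import numpy as np

def selection_proportions(A_hats, tol=1e-8):
    """Given estimated spatial matrices from each rolling window,
    return the share of windows in which each entry is non-zero."""
    stack = np.stack([np.abs(A) > tol for A in A_hats])
    return stack.mean(axis=0)
```

Proportions concentrated at exactly zero or one, as in Figure 4(b), indicate that the selected sparsity pattern is stable across estimation windows.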
The wider horizontal interactions, as well as the diagonal interactions, may be explainable by the "prevailing winds", which come from the West or South-West and are the most commonly occurring winds in London.
Overall, the intuitive sparsity patterns that arise, in combination with the improvement in forecast performance, are encouraging and provide empirical validation for the use of SPLASH on spatial data, especially when the spatial units follow a natural ordering on a spatial grid.

Conclusion
In this paper, we develop the Spatial Lasso-type Shrinkage (SPLASH) estimator, a novel estimation procedure for high-dimensional spatio-temporal models. The SPLASH estimator is designed to promote the recovery of structured forms of sparsity without imposing such structure a priori. We derive consistency of our estimator in a joint asymptotic framework in which the number of spatial units and the number of temporal observations diverge jointly. To solve the identifiability issue, we rely on a relatively non-restrictive assumption that the coefficient matrices in the spatio-temporal model are sufficiently banded. Based on this assumption, we consider banded estimation of high-dimensional spatio-temporal autocovariance matrices, for which we derive novel convergence rates that are likely to be of independent interest. The SPLASHX extension explains how to include exogenous variables. As an application, we use SPLASH to predict satellite-measured NO2 concentrations in London. We find evidence for spatial interactions between neighbouring regions. In addition, our approach delivers forecast improvements over a competitive benchmark.

Appendix A

Lemma 1. Then, under Assumption 4, it holds that

Proof. First, we show that (16) is bounded from below by 0.5 times the smallest singular value of V(d).
Next, the result in Lemma 1 follows by noting that V(d) = diag(V_1, . . . , V_N) is a block-diagonal matrix whose singular values correspond to those of its sub-blocks. Let N_i denote the number of columns of V_i and note that N_i < N by construction. Then, where (i) follows since N_i < N for all i = 1, . . . , N and (ii) holds by Assumption 4.
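The block-diagonal argument used here, namely that the singular values of V(d) = diag(V_1, . . . , V_N) are the union of those of its sub-blocks, is easy to verify numerically; a small sketch:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((4, 3)) for _ in range(3)]
V = block_diag(*blocks)   # block-diagonal stacking of the sub-blocks

sv_blocks = np.sort(np.concatenate(
    [np.linalg.svd(B, compute_uv=False) for B in blocks]))
sv_full = np.sort(np.linalg.svd(V, compute_uv=False))
assert np.allclose(sv_blocks, sv_full)   # union of block singular values
```

In particular, the smallest singular value of the full matrix equals the smallest singular value over all sub-blocks, which is what the lower bound in the lemma exploits.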
Proof. First, recall the construction of V̂. Then, where the last inequality holds on the set V(φ_0/4). Consequently, where (i) follows from the proof of Lemma 1, (ii) holds by (18), and (iii) from Lemma 1.

Appendix B Proofs of Main Results
Proof of Theorem 1. We first prove various intermediate results, see (a)-(d) below. We afterwards combine these results and recover Theorem 1.
We now bound ‖Σ̃_0^{r,s} − Σ_0‖_1. Assumption 1(b) imposes ‖C‖ ≤ δ_C for some 0 ≤ δ_C < 1. An inspection of (19) shows that additional upper bounds are required on ‖C_s^j‖, ‖C^j − C_s^j‖, and ‖Σ_{j=r+1}^∞ C^j D Σ_ε D′(C′)^j‖. For ‖C_s^j‖, using properties of matrix norms and Theorem 1(a), we obtain Furthermore, expanding the matrix powers provides Returning to (19) and inserting the upper bounds in (20)-(22), we find Assuming s is sufficiently large, that is, assuming C_1 δ_A^s + δ_C < 1, we subsequently use results on geometric series and conclude

The claim is thus indeed valid with
Now combine Cs ≤ C1δ s κ + δC ≤ 1 (since s is taken sufficiently large), Theorem 1(b), (21) with j = 1, We have because Σ r,s 0 is a banded matrix already and banding can only decrease the norm difference between Σ r,s 0 and Σ0. We consider the three terms in the RHS of (24) separately. There are at most 2h1 + 1 nonzero elements in any column/row of B h 1 ( Σ0 − Σ0) and thus and where the last inequality exploits Lemma S3 (see Supplement). Note that the probabilities in (26) where the last inequality follows from part (b) of this proof. A lower bound on the probability of B h 1 ( Σ0 − Σ0) ≤ occurring is immediately available from (26).
Lemma S4 implies that it remains to determine s and r such that Under the latter assumption, we determine s such that Similarly, for r, the choice r = log(C_4/ε)/(2|log(δ_C)|) suffices.
Proof of Theorem 2. The proof of the theorem relies on the properties of dual norms. Recall P_α(c) = α Σ_{g∈G} √|g| ‖c_g‖_2 + (1 − α) ‖c‖_1. Exploiting the properties of ‖·‖_1 and ‖·‖_2, it is straightforward to verify that P_α(·) is a norm for any 0 ≤ α ≤ 1. For any norm ‖·‖, we define its dual norm ‖·‖* through ‖c‖* = sup_{x≠0} |c′x| / ‖x‖. The dual-norm inequality states that c′x ≤ ‖c‖* ‖x‖ for all conformable vectors c and x.
For the norm P_α(c), its dual norm P*_α(c) is bounded, by convexity of the function f(x) = x^{-1} in step (i), and using for step (ii) both ‖c‖*_1 = ‖c‖_∞ and the self-duality of ‖·‖_2. We now start the actual proof.

Recalling the objective function L_α(c; λ), whose penalty term is λP_α(c), and noting that L_α(ĉ; λ) ≤ L_α(c; λ) by construction, it follows that where we used the dual-norm inequality (see (27)) and the triangle inequality for (dual) norms in the second and third inequality, respectively. Define the sets On the set H_1(λ/4) ∩ H_2(λ/4), we can scale (30) by a factor 2 to obtain We subsequently manipulate P_α(ĉ) and P_α(ĉ − c). Using the reverse triangle inequality, we have where G_S and G_{S^c} are defined in Lemma 1. Simple rewriting provides Combining results (32)-(34) yields and ĉ − c is thus a member of the set C_{N_c}(G, S) as defined in Lemma 1.
A straightforward rearrangement of (36) provides the inequality of Theorem 2.
It remains to determine a lower bound on the probability of H_1(λ/4) ∩ H_2(λ/4) ∩ V(φ_0/2). We rely on an elementary union-bound argument and bound the individual probabilities.
We start with P(H_1(λ/4)^c). The last inequality is true because, continuing from (28), we have for any vector c. Subsequently, exploiting block-diagonality of V̂, we have where the last line relies on the union bound and the fact that ‖c‖_∞ < 1 (implied by Assumption 1). Hence, the sets H_1(λ/4)^c and H_2(λ/4)^c admit the same probability bound.
Proof of Corollary 1. First, we derive the conditions under which the set on which the performance bound in Theorem 2 holds occurs with probability converging to one. Under Assumption 2(b1), along with the remaining assumptions in Theorem 2, this probability is given by 1 − P_1(f(λ, φ_0), N, T), where we recall from Theorem 1 that Given that λ ∈ O(T^{−q_λ}) with q_λ > 0, it follows immediately that In addition, following the remark below Theorem 1, it holds that Based on (42) and (43), it follows that the first RHS term in (41) converges to zero exponentially in T for any δ < 1. The second RHS term, however, converges to zero at most at a polynomial rate. Accordingly, where the second equality holds from the observation that for any δ < 1 the first two RHS terms in the first equality converge to zero at an exponential rate, whereas the third term may converge to zero at most at a polynomial rate. From (44), the first condition in Corollary 1 follows. In a similar fashion, we derive the conditions under which P_2(f(λ, φ_0), N, T) → 0, by noting that [2h(f(λ, φ_0)) + 1]^2 diverges at a polynomial rate in T. Making use of (42) and (43), it follows that which translates to the condition 1 − 2q_λ − 2q_k > 0, or q_λ < 1/2 − q_k. This establishes conditions (i) and (ii) in Corollary 1.
We proceed by deriving the order of the performance bound in Theorem 2. Noting that Then, by Theorem 2,

A Auxiliary Lemmas related to Theorem 1
Note: Lemmas S1-S3 are strongly related to Masini et al. (2019); small differences occur because additional constants originate from rewriting our spatial VAR into reduced form. To keep track of all constants, and in order to have all results in the notation of the main paper, we keep the full derivations in this Supplement.
Lemma S1 (Moment and Truncation Bounds). Define y (k) t by truncating the VMA(∞) representation for y t at the k th lag, that is y under Assumptions 1 and 2.
Proof. (a) We have y it = e i y t . Using the VMA(∞) representation of the spatio-temporal vector autoregression, we have for any Orlicz norm by the norm properties of the Orlicz norm, Assumption 1(a)-(b), and Assumption 2(b).
N k=1 e i C j De k k,t−j ψ and continue as in (S1). (c) A small variation on (S1) provides The proof is complete.
For all j = 1, . . . , p, the elements of Σ̂_j = (1/T) Σ_{t=p+1}^T y_t y′_{t−j} can all be expressed as (1/T) Σ_{t=p+1}^T ξ_{it} ξ_{jt} after an appropriate choice of (i, j). We will look at Σ_{t=p+1}^T ξ_{it} ξ_{jt} (so ignoring the multiplicative factor 1/T).
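The lag-j sample autocovariance matrix used here can be sketched directly:

```python
import numpy as np

def sample_autocov(Y, j, p):
    """Sigma_hat_j = (1/T) * sum_{t=p+1}^{T} y_t y_{t-j}',
    with Y holding y_1, ..., y_T in its rows."""
    T = Y.shape[0]
    # rows p..T-1 correspond to t = p+1, ..., T under 0-based indexing
    return Y[p:].T @ Y[p - j:T - j] / T
```

The matrix product sums the outer products y_t y′_{t−j} over the effective sample, which is exactly the elementwise sum described above.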
(a) We have (c) We have 1 T E|S 2T |≤ I T + II T + III T + IV T with (d) If Assumptions 1 and 2(b1) (polynomial tails) hold, then we have: (e) If Assumptions 1 and 2(b2) (subexponential tails) hold, then we have: Proof. (a) Start from ξ it ξ jt = ξ it ξ jt 1 {|ξitξjt|<C} + ξ it ξ jt 1 {|ξitξjt|≥C} and develop the first term as Subsequently, combine these results and sum over t = p + 1, . . . , T . (b) We have where the second line follows from an application of the Azuma-Hoeffding inequality (see, e.g. corollary 2.20 in Wainwright (2019)). (c) The result requires the use of the triangle inequality and some subtracting and adding of identical terms: The indicated terms coincide with those defined in Lemma S2(c). (d) For any random variable X with X d < ∞ (d > 1), it holds that using Hölder's and Markov's inequality in the first and second inequality, respectively. By the conditional Jensen inequality, (S3), Cauchy-Schwartz and Lemma S1, we can indeed bound I T as For II T , the triangle and the conditional Jensen's inequality give and its RHS is further bounded as The last inequality uses the results in Lemma S1 based on the Orlicz norm induced by ψ(x) = x 2 . As jt − ξ it ξ jt (triangle inequality), the bound for III T follows immediately from (S4). Finally, for IV T , we have jt } is α-mixing with exponentially decaying mixing coefficients α (k) m , we will use theorem 14.2 from Davidson (1994) with p = 1 and r = 2: 1 (e) We now use · ψ to denote the Orlicz norm induced by the function ψ(x) = exp(x) − 1. We first derive an upper bound for E ξ it ξ jt 1 {|ξitξjt|>C} . By the Cauchy-Schwartz inequality, we find jt is a weighted linear combination of t, t−1 , . . . , t−p−k+1 . According to theorem 14.1 in Davidson (1994), the mixing coefficients of ξ The moment bound on page 95 of van der Vaart and Wellner (1996) and the tail bound as in Exercise 2.18 in Wainwright (2019) jt 1 {|ξitξjt|>C} is the same and follows along the same steps. 
Overall, we have For II T , III T and IV T , we use the same methods of proof but express the moments in terms of µ ∞ , e.g. µ 2 ≤ 2! µ ∞ . The claims are immediate.
Proof. The starting point is Upper bounds on I T -IV T in the RHS of (S6) are already available in Lemma S2. It remains to decide on: the tail cut-off C, the number of terms in the finite order VMA approximation k, and the number of martingales differences that approximate the mixingale m. (a) From Lemma S2(d), we obtain where we assumed that p is not too large, more specifically p ≤ d−1 d m (thus m d ≤ m−p). If we match the exponents (to get the fastest decay rate), then we take γ c k = γα 2 ( m d − k), or k = 1 d γα γα+2γc m. Defining a 1 = 2(µ 2d c D c 1 ) 2d , a 2 = 4(µ 2 c D c 1 ) 2 + 6c 2 (µ 4 c D c 1 ) 2 , and a 3 = γc d γα γα+2γc , we conclude from (S6) that Again matching exponents, we get m = 1 2 T 2 a3C 2 1/3 and hence Set C = T δ/2 for some 0 < δ < 1 to complete the proof.
(a) Let A ∈ R N ×M1 and B ∈ R N ×M2 and construct C ∈ R N ×(M1+M2) as C = [A B]. We have Proof. (a) By definition, we have C 2 2 = λ max (C C). The nonzero eigenvalues of C C and CC are identical (see, e.g. Exercise 7.25(b) in Abadir and Magnus (2005)), and therefore The result follows since C 2 = A 2 2 + B 2 2 ≤ A 2 + B 2 by the c r -inequality. (b) The matrix D D is a principal submatrix of A A. From Exercise 12.48 in Abadir and Magnus (2005) we conclude that D The proof of Theorem 3 requires assumptions that are similar to those encountered in the main text. For the reader's convenience, we make the correspondence explicit by adhering to the original numbering while adding a "*". A short discussion of these assumption is found at the end of this section.
Assumption 4* (Restricted minimum eigenvalue). Define the ((K + 1)N × (K + 2)N ) matrix We assume that φ * min (x) := min Assumptions 1*(a)-(b) are also encountered in the main paper. Including exogenous regressors, the reduced-form representation of the model becomes y t = Cy t−1 + D t + K k=1 diag(β k )x t,k . Assumption 1*(c) merely assumes an upper bound on the magnitude of the exogenous regressor coefficients. Assumption 2* controls dependencies over time, in the cross-section, and with the exogenous regressors. Similarly to assumption A8(i) in Ma et al. (2021), we enforce exogeneity through Assumption 2*(a). Assumption 2*(b) is repeated from the main paper and its counterpart for the {x t,k }'s is encountered as Assumption 2*(c). All original moment conditions are also transferred to the exogenous regressors (Assumption 3*). Assumptions 1*-4* allow for an easy analogy with the assumptions in the main paper (at the cost of possibly being more restrictive than strictly necessary). That is, define * t = t + K k=1 diag(β k )x t,k and note the linearity of * t in t and {x t,1 , . . . , x t,K }. Due to fixed K, all mixing properties and moments conditions simply carry over to * t causing Lemma S3 to remain valid.

C.2 Proof of Theorem 3
Set β * = (β 1 , . . . , β K ) , U (d) = W function arê Exploiting the previously defined notation and (S8), we have We subsequently adjust the penalty function P α (c) to incorporate the penalty on the coefficients in front of the exogenous variables. Define the index set g k such that q g k = β k (k = 1 . . . , K) and enlarge G to G * = G ∪ K k=1 g k . As |β k | = |g k | = N , we have g∈G * |g| q g 2 + α q 1 =: P α (q).
To derive the probability of the inequality being true, we need the probability of the occurrence of the event H * 1 ( ). This probability is no smaller than and all probabilities in (S13) can be retraced to probabilities involving Q −Q . That is, bounding terms as in (38)-(39), we find and P H * 3 ( For (S15) and (S16), Assumptions 1*(a)-(b) and Assumption 1*(c) are needed to guarantee c 1 ≤ 1 and β * 1 ≤ C β , respectively. Also, P V * ( Given these bounds, (S13) translates to where f * (λ, φ * 0 ) = min All that remains is a lower bound for P Q − Q > x . We instead derive an upper bound for P Q − Q ≤ x as follows where ξ it denotes a generic element of an autocovariance matrix (as in Lemma S3). In accordance with Theorem 3, these RHS probabilities are equivalent to 1 − P * 1 (x, N, T ) (polynomial tails) and 1 − P * 2 (x, N, T ) (exponential tails).