Nonconcave penalized estimation in sparse vector autoregression model

High dimensional time series have received considerable attention recently; their temporal and cross-sectional dependency can be captured by the vector autoregression (VAR) model. To tackle the high dimensionality, penalization methods are widely employed. However, the existing theoretical studies of penalization methods mainly focus on i.i.d. data and therefore cannot quantify the effect of the dependence level on the convergence rate. In this work, we use the spectral properties of the time series to quantify the dependence and derive a nonasymptotic upper bound for the estimation errors. By focusing on nonconcave penalization methods, we establish the oracle properties of the penalized VAR model estimation while accounting for the effects of temporal and cross-sectional dependence. Extensive numerical studies are conducted to compare the finite sample performance of different penalization functions. Lastly, an air pollution dataset from mainland China is analyzed for illustration purposes. MSC 2010 subject classifications: Primary 62M10, 62J07; secondary 62F12, 62H12.


Introduction
Penalized estimation methods are fundamental tools for exploring high dimensional datasets. In particular, they have become increasingly popular for analyzing information-rich data in fields such as genomics, economics, and neuroscience. To facilitate the analysis, a reasonable assumption is that only a small number of covariates are associated with the response. Under this framework, to select the important covariates, the penalized regression estimation is usually conducted as
β̂ = argmin_{β ∈ R^p} { (2N)^{-1} ‖Y − Xβ‖_2^2 + Σ_{j=1}^p p_λ(|β_j|) }, (1.1)
where Y ∈ R^N is the response vector, X ∈ R^{N×p} is the design matrix, and β = (β_1, ..., β_p)^T ∈ R^p is the regression coefficient. Here p_λ(·) is the penalization function with tuning parameter λ. Popular choices of penalization functions include the L_1 penalty [23], the SCAD penalty [8], the MCP penalty [26], and many others. Despite the usefulness of the penalized regression method (1.1), its applications mainly focus on i.i.d. data [8,10]. However, in practice, data with complex dependence structures are frequently encountered. One particular type, high dimensional time series data, has received great attention. For example, in economics, structural analysis and economic trend prediction typically require a large number of macroeconomic variables [6,1]; in finance, efficient risk management and quantification usually draw on numerous financial statements [11,17]; in social network analysis, network influences among social media users are estimated from their dynamic behaviors [29,27]. To model the dynamics of such time series data, the vector autoregression (VAR) model is usually built and estimated. In a typical setting, a p-dimensional time series requires at least p^2 transition matrix parameters to be estimated, which can easily exceed the number of time periods T (i.e., the sample size). As a result, the estimation is infeasible unless we assume a certain type of low-dimensional structure.
As mentioned before, one of the most typical assumptions is the sparsity assumption. This enables us to incorporate the penalized regression framework (1.1) into the VAR model estimation.
Due to the special dependency structure of the high dimensional time series model, the theoretical framework of (1.1) has to be reformulated. In particular, the quantification of the temporal and cross-sectional dependence should be taken into consideration when studying the theoretical properties. In a recent paper, [2] use spectral properties to quantify the stability of a vector time series. Using this novel stability measure, they show that the convergence rate of the L_1-penalized (i.e., Lasso) VAR model estimator is closely related to the dependence level of the time series.
Despite its computational attractiveness and competitive predictive ability, the L_1-penalized (i.e., Lasso type) estimator requires rather stringent conditions on the design matrix. As an alternative, nonconcave estimators (e.g., SCAD and MCP) have received considerable attention in recent years [10,25]. Theoretically, such estimators enjoy the desirable oracle property. Namely, they identify the zero coefficients with probability tending to 1 and, in the meanwhile, estimate the nonzero coefficients as efficiently as if the sparsity pattern were known in advance. For i.i.d. data, the theoretical properties of nonconcave penalized estimators are well studied [8,12,13]. In particular, the feature dimensionality is allowed to grow exponentially fast with the sample size [10]. However, for data with dependency structures, the theoretical properties are unknown and need to be investigated.
To fill this gap, in this work, we study nonconcave penalized VAR model estimation methods. In particular, we follow [2] in using spectral density tools to quantify the temporal and cross-sectional dependence of high-dimensional time series. Using this stability measure, we explain how the convergence rate of the estimator is related to the dependence level of the time series. We contribute to the existing theory in the following three aspects: (a) we establish the selection consistency for the high dimensional sparse stable VAR model using nonconcave penalty functions, under a condition weaker than the irrepresentable condition; (b) we incorporate the dependence measures into the convergence rate of the estimated parameters, going beyond the i.i.d. setting in nonconcave penalized estimation; (c) we establish the oracle properties for the estimated parameters and prove asymptotic normality results, which are not achievable in the L_1-penalty setting. Lastly, a real data example about an air pollution index in mainland China, i.e., PM 2.5, is analyzed using the proposed method.
The rest of the article is organized as follows. Section 2 introduces the nonconcave penalization methods for VAR model estimation. Section 3 investigates the theoretical properties of the variable selection as well as the parameter estimation. Numerical studies are given in Section 4. The article is concluded with a brief discussion in Section 5. All technical details are left to the Appendix.

Model and notations
Consider a p-dimensional stationary time series {X_t} = {(X_1t, ..., X_pt)^T ∈ R^p}, observed at time points t = 1, ..., T. To model the dynamics of X_t, we consider a vector autoregression model of lag d [VAR(d)] with serially uncorrelated random errors,
X_t = A_1 X_{t−1} + A_2 X_{t−2} + · · · + A_d X_{t−d} + E_t, (2.1)
where E_t = (ε_1t, ..., ε_pt)^T ∈ R^p independently follows a multivariate normal distribution with mean 0 and covariance cov(E_t) = Σ_e ∈ R^{p×p}. Here A_1, ..., A_d are p × p unknown transition matrices. They provide deep insights into the temporal and cross-sectional relationships among the p time series. In practice, the time series is usually of high dimension. As a result, the number of parameters to be estimated, i.e., dp^2, can grow polynomially fast with p. To estimate the model (2.1), we consider a penalized least squares estimation method.
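As a quick illustration of model (2.1), the following sketch simulates a stationary VAR(1) process with Gaussian innovations. The helper name `simulate_var1` and the burn-in scheme are our own illustrative choices, not part of the paper.

```python
import numpy as np

def simulate_var1(A1, Sigma_e, T, burn=200, seed=0):
    """Simulate T observations of the stationary VAR(1) process
    X_t = A1 X_{t-1} + E_t, E_t ~ N(0, Sigma_e), after a burn-in period."""
    rng = np.random.default_rng(seed)
    p = A1.shape[0]
    L = np.linalg.cholesky(Sigma_e)      # to draw correlated Gaussian errors
    X = np.zeros(p)
    out = np.empty((T, p))
    for t in range(burn + T):
        X = A1 @ X + L @ rng.standard_normal(p)
        if t >= burn:
            out[t - burn] = X
    return out
```

For a stable A_1 (spectral radius below one), the burn-in lets the chain reach its stationary distribution before observations are recorded.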
To facilitate the discussion, we first rewrite the model (2.1) in a vector form. Let Y collect the responses X_t by rows, let X collect the corresponding lagged observations (X_{t−1}^T, ..., X_{t−d}^T), and let E collect the errors accordingly. In addition, define B = (A_1, ..., A_d)^T ∈ R^{dp×p} to be the parameter matrix. Then we have vec(Y) = vec(X B) + vec(E) = (I_p ⊗ X)β + vec(E), (2.2) where β = vec(B). By the vector form (2.2), the VAR model can be represented in a general regression form. Define Y = vec(Y) and Z = I_p ⊗ X ∈ R^{p(T−d+1)×q}, where q = dp^2. Then one can rewrite the VAR model (2.1) as a regression model with q-dimensional predictors as Y = Zβ + vec(E). To estimate the parameters, we minimize a regularized least squares objective, which yields the penalized least squares estimator
β̂ = argmin_{β ∈ R^q} { (2N)^{-1} ‖Y − Zβ‖_2^2 + Σ_{j=1}^q p_λ(|β_j|) },
where p_λ(·) is the penalization function and λ ≥ 0 is the corresponding regularization parameter. A popular choice of p_λ(·) is the L_1 regularization, i.e., p_λ(δ) = λ|δ|, which results in a Lasso type estimator. This approach and its statistical properties are studied by [2]. In this work, we restrict p_λ(·) to the family of nonconcave penalization functions. Popular nonconcave penalization functions include the SCAD penalty [8], the MCP penalty [26], and many others. Their functional forms are visualized in Figure 1 and given in the numerical studies; see equations (4.1) and (4.2). See [21] for a comprehensive discussion. To facilitate the discussion, we define the following notations.

Notation. Throughout this paper, we denote the cardinality of a set S by |S|. In addition, let S^c be the complement of the set S.
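The vectorization in (2.2) rests on the identity vec(XB) = (I_p ⊗ X)vec(B). The sketch below (the helper name `var_to_regression` is our own) builds the stacked response Y = vec(Y) and the block-diagonal design Z = I_p ⊗ X from a raw data matrix:

```python
import numpy as np

def var_to_regression(X, d=1):
    """Rewrite a VAR(d) sample as the regression Y = Z beta + vec(E),
    with Z = I_p kron (lagged design) and beta = vec(B), B = (A_1,...,A_d)^T."""
    T, p = X.shape
    resp = X[d:]                                                  # responses X_t
    lags = np.hstack([X[d - j: T - j] for j in range(1, d + 1)])  # (X_{t-1},...,X_{t-d})
    y = resp.flatten(order="F")   # vec() stacks columns (series by series)
    Z = np.kron(np.eye(p), lags)  # block diagonal design, one block per series
    return y, Z
```

Because Z is block diagonal, the least squares problem decouples across the p response series; the Kronecker form is used here mainly to match the notation of (2.2).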
For convenience we omit the subindex of the L_q norm when q = 2. Denote supp(v) as the support of a vector v. In particular, we use ‖v‖_0 to denote |supp(v)| = Σ_{j=1}^p 1(v_j ≠ 0) and ‖v‖_max to denote max_j |v_j|. In addition, denote v_S as the sub-vector of v given by v_S = (v_j : j ∈ S) ∈ R^{|S|}. For an arbitrary matrix M = (m_ij) ∈ R^{p_1×p_2}, denote M_S = (m_ij : 1 ≤ i ≤ p_1, j ∈ S) as the sub-matrix with columns in S. In addition, let M_{(S_1,S_2)} be the sub-matrix of M given by M_{(S_1,S_2)} = (m_ij : i ∈ S_1, j ∈ S_2) for two sets S_1 and S_2. We further write M_{(S,S)} as M_{(S)} for simplicity. Furthermore, denote ‖M‖_∞ = max_{1≤i≤p_1} Σ_{j=1}^{p_2} |m_ij| and ‖M‖_max = max_{1≤i≤p_1, 1≤j≤p_2} |m_ij|. For a symmetric or Hermitian matrix A, we use λ_max(A) and λ_min(A) for its maximum and minimum eigenvalues, respectively. For two arbitrary sequences {a_N} and {b_N}, we write a_N ≫ b_N if a_N/b_N → ∞ as N → ∞, and a_N ≪ b_N for the reverse. Lastly, we use e_i to denote the ith unit vector.

Measure of stability
Consider a p-dimensional VAR(d) time series model (2.1). Assume it is centered and covariance stationary. As a result, we could define the autocovariance function as Γ X (h) = cov(X t , X t−h ). Generally, the autocovariance function characterizes the temporal and cross-sectional dependence for the VAR(d) model.
Since X_t is a vector time series, it is of particular interest to investigate the stability properties of {X_t}. In classical time series analysis, the temporal dependence is usually controlled by imposing mixing conditions [14]. In the context of the VAR model, this essentially amounts to requiring the transition matrices to be suitably contractive (e.g., of operator norm less than one) [20,19,16]. However, this condition can be restrictive and is violated even by many stable VAR models. In addition, it fails to capture all the cross-sectional dependence in the high dimensional setting. In this work, we adopt the idea of [2] and use the spectral density of {X_t} to establish a measure of stability. Specifically, we assume the VAR(d) model satisfies the following condition.
(C1) The spectral density function exists, and its maximum eigenvalue is bounded on [−π, π], i.e., M(f_X) = ess sup_{θ ∈ [−π,π]} λ_max(f_X(θ)) < ∞.

The spectral density function f_X(θ) is closely related to the autocovariance function Γ_X(l). If Σ_{l=0}^∞ ‖Γ_X(l)‖_2 < ∞, then the spectral density exists. Furthermore, the spectral density is bounded and continuous, and the essential supremum, i.e., M(f_X), is actually its maximum. Generally, the existence of the spectral density yields the representation
f_X(θ) = (2π)^{-1} Σ_{l=−∞}^{∞} Γ_X(l) e^{−ilθ}, θ ∈ [−π, π]. (2.4)
The maximum eigenvalue of the spectral density function (2.4) can serve as a measure of stability of the process: typically, a higher M(f_X) implies a less stable process.
For the VAR(d) process, the spectral density function has a closed form [22,3]:
f_X(θ) = (2π)^{-1} A^{-1}(e^{−iθ}) Σ_e (A^{-1}(e^{−iθ}))^*, where A(z) = I_p − Σ_{j=1}^d A_j z^j.
Other than M(f_X), the lower extremum of the spectral density is also crucial when dealing with the dependence of the design matrix in the high dimensional setting, i.e., m(f_X) = ess inf_{θ ∈ [−π,π]} λ_min(f_X(θ)).
The maximum and minimum eigenvalues of the spectral density provide important bounds on the dependence of the process X = (X_1, ..., X_T)^T ∈ R^{T×p}. Define Υ_X = cov(vec(X)) ∈ R^{(Tp)×(Tp)}. Then it holds that [2]
2π m(f_X) ≤ λ_min(Υ_X) ≤ λ_max(Υ_X) ≤ 2π M(f_X),
which is free of the sample size N. Furthermore, for a stationary VAR(d) process, M(f_X) and m(f_X) can be further bounded in closed form as
M(f_X) ≤ (2π)^{-1} λ_max(Σ_e)/μ_min(A) and m(f_X) ≥ (2π)^{-1} λ_min(Σ_e)/μ_max(A),
where μ_max(A) = max_{|z|=1} λ_max(A^*(z)A(z)) and μ_min(A) = min_{|z|=1} λ_min(A^*(z)A(z)). Note that these closed-form bounds on M(f_X) and m(f_X) for the VAR(d) model clearly separate the two types of dependence of {X_t}: the temporal dependence captured by the transition matrices A_j, and the additional cross-sectional dependence characterized by Σ_e.
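Assuming the closed form of the VAR(1) spectral density above, M(f_X) and m(f_X) can be approximated numerically by scanning the extreme eigenvalues of f_X(θ) over a frequency grid. A minimal sketch (function names are ours):

```python
import numpy as np

def spectral_density_var1(A1, Sigma_e, theta):
    """f_X(theta) = (2*pi)^{-1} A(e^{-i theta})^{-1} Sigma_e A(e^{-i theta})^{-*},
    with A(z) = I - A1 z in the VAR(1) case."""
    p = A1.shape[0]
    Ainv = np.linalg.inv(np.eye(p) - A1 * np.exp(-1j * theta))
    return (Ainv @ Sigma_e @ Ainv.conj().T) / (2 * np.pi)

def stability_measures(A1, Sigma_e, n_grid=201):
    """Grid approximations of M(f_X) and m(f_X) over [-pi, pi]."""
    thetas = np.linspace(-np.pi, np.pi, n_grid)
    eig = np.array([np.linalg.eigvalsh(spectral_density_var1(A1, Sigma_e, t))
                    for t in thetas])
    return eig.max(), eig.min()
```

A more dependent process (A_1 closer to instability, or more correlated Σ_e) produces a spikier spectrum, i.e., a larger M(f_X)/m(f_X) ratio.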

Remark 1.
Note that the VAR(d) model can be re-expressed as a VAR(1) model via the companion form. Specifically, stacking X̃_t = (X_t^T, ..., X_{t−d+1}^T)^T yields X̃_t = Ã_1 X̃_{t−1} + Ẽ_t for a companion matrix Ã_1 ∈ R^{dp×dp}. Define Ã(z) = I_dp − Ã_1 z, so that μ_max(Ã) and μ_min(Ã) can be defined accordingly. The process {X̃_t} is stable if and only if the process {X_t} is stable [20].

Local optimality
In a general framework, note that the least squares VAR estimation is equivalent to the following optimization:
β̂ = argmin_{β ∈ R^q} { β^T Γ̂_Z β − 2 β^T γ̂_Z }, (3.1)
where γ̂_Z = Z^T Y/N = (I_p ⊗ X^T)Y/N and Γ̂_Z = Z^T Z/N = I_p ⊗ (X^T X/N). Adding the regularization leads to the objective
Q_p(β) = β^T Γ̂_Z β − 2 β^T γ̂_Z + Σ_{j=1}^q p_λ(|β_j|). (3.2)
The solution is then given by β̂ = argmin_{β ∈ R^q} Q_p(β).
In this section, we discuss the theoretical properties of the penalized least squares estimator. The first concern is whether the resulting estimator attains local optimality. We first discuss the existence of a local minimizer of (3.2), which is closely related to the form of the penalty function p_λ(·). Define ρ_λ(t) = λ^{-1} p_λ(t). We characterize the function ρ_λ(t) in the following condition.

(C2) ρ_λ(t) is increasing and concave in t ∈ [0, ∞), and has a continuous derivative ρ'_λ(t) on (0, ∞) with ρ'_λ(0+) > 0.
Following [21] and [26], we define the local concavity of the penalty function ρ_λ at v = (v_1, ..., v_s)^T as
κ(ρ; v) = lim_{ε→0+} max_{1≤j≤s} sup_{t_1 < t_2 ∈ (|v_j|−ε, |v_j|+ε)} −{ρ'_λ(t_2) − ρ'_λ(t_1)}/(t_2 − t_1).
Since ρ_λ(·) is assumed to take a concave form by Condition (C2), we have κ(ρ; v) ≥ 0. Moreover, if the second order derivative of ρ_λ(t) exists and is continuous, one can derive κ(ρ; v) = max_{1≤j≤s} −ρ''_λ(|v_j|). The following proposition establishes conditions for a strict local minimizer of the objective function.
Proposition 1 (Local Minimizer). Assume p_λ(·) satisfies Condition (C2). Then β̂ ∈ R^q with support S = supp(β̂) is a strict local minimizer of Q_p(β) with probability tending to 1 if
Z_S^T (Y − Zβ̂) = Nλ ρ̄'(β̂_S), (3.4)
‖Z_{S^c}^T (Y − Zβ̂)‖_max < Nλ ρ'_λ(0+), (3.5)
λ_min(Z_S^T Z_S) > Nλ κ(ρ; β̂_S), (3.6)
where ρ̄'(β̂_S) = (ρ'_λ(|β̂_j|) sgn(β̂_j) : j ∈ S)^T. The proof is given in Appendix A.2. Specifically, Conditions (3.4) and (3.6) ensure that β̂ is a strict local minimizer when constrained to the subspace {β ∈ R^q : β_{S^c} = 0}. Condition (3.5) further ensures that the sparse solution β̂ is a strict local minimizer on the whole space R^q.
Under the framework of the strict local minimizer, we discuss the theoretical properties of the nonconcave penalized VAR estimator. The main difficulty here, compared with the i.i.d. case, is that one needs to take the temporal as well as the cross-sectional dependency into consideration. To this end, deviation bounds for the design matrix and error terms are first established in Section 3.2; these are essential tools for developing the oracle properties. The oracle properties are then given in Section 3.3.

Deviation bounds
In this section, we establish deviation bounds for both the design matrix and the error terms. The deviation bounds are important for deriving the asymptotic properties of the nonconcave penalized estimator. In particular, they give conditions under which the design matrix and the error terms behave properly. We first give the results for the design matrix.
By (3.6), we know that a strict local minimizer requires the minimum eigenvalue of Z_S^T Z_S to be well bounded. Define ω = c_{1λ}/c_{2λ}, where c_{1λ} = λ_max^{-1}(Σ_e) μ_min(A) and c_{2λ} = λ_min^{-1}(Σ_e) μ_max(A). We have the following proposition.

Proposition 2. Consider a stable VAR(d) model (2.1) and its vectorization form (2.2). Assume Condition (C1) holds. We have
The proof of Proposition 2 is given in Appendix A.3. By Proposition 2, the extreme eigenvalues of Z_S^T Z_S/N are controlled through the dependence constants c_{1λ} and c_{2λ} via (3.7) and (3.8) with probability tending to 1.
Besides the eigenvalue bound of the design matrix, the maximum absolute value bound is another important quantity to control, and it is critical for deriving the variable selection consistency. Let Γ_X(0) ∈ R^{dp×dp} be the covariance of the stacked lag vector (X_{t−1}^T, ..., X_{t−d}^T)^T. Recall that Z = I_p ⊗ X and define Γ̄_Z = N^{-1} E(Z^T Z). Given the results in Proposition 2, we can derive bounds for the maximum absolute values of (Z_S^T Z_S)^{-1} and Z_{S^c}^T Z_S in the following proposition.

Proposition 3. Consider a stable VAR(d) model (2.1) and its vectorization form (2.2). Assume Condition (C1) holds. We have
The proof of Proposition 3 is given in Appendix A.4. Proposition 3 establishes upper bounds for ‖(Z_S^T Z_S)^{-1}‖_∞ and ‖Z_{S^c}^T Z_S‖_max respectively, which essentially restrict the covariance level among the covariates. Furthermore, one can bound ‖(Z_S^T Z_S)^{-1}‖_∞ ≤ c_μ and ‖Z_{S^c}^T Z_S‖_max ≤ c_Γ with high probability under certain conditions, where c_μ and c_Γ involve the dependence measures of the VAR model. For simple stable VAR(1) models with no dynamic and cross-sectional dependence (e.g., A_1 = 0.5 I_p and Σ_e = σ_e^2 I_p), these constants can be verified explicitly. Note that the positive constant δ appears on both the left side and the right side of (3.10). A higher δ will possibly result in a tighter upper bound for ‖Z_{S^c}^T Z_S‖_max, but in the meanwhile a smaller probability on the right side. With an appropriate choice of δ, the probability of the event {‖Z_{S^c}^T Z_S‖_max ≤ c_Γ} tends to 1 as N → ∞. Next, one may note that ‖Γ̄_Z^{(S^c,S)}‖_max is closely related to the irrepresentable assumption in the i.i.d. case [28], while for the VAR model it involves the dependence structures explicitly. We give a further comment about ‖Γ̄_Z^{(S^c,S)}‖_max in the following remark. Remark 2. Note that Γ̄_Z takes a block diagonal structure. Generally, ‖Γ̄_Z^{(S^c,S)}‖_max can be bounded in terms of c_{1λ}^{-1} with high probability, where a higher c_{1λ}^{-1} implies higher dependence levels. The upper bound given by (3.10) is tighter. It can further be concluded that, if for every column j of B there exists at most one element B_ij ≠ 0 for 1 ≤ i ≤ pd, then ‖Γ̄_Z^{(S^c,S)}‖_max admits an explicit bound. Hence, as one can see, the dependence level of the VAR model is explicitly involved here compared to the irrepresentable assumptions.
Lastly, we establish the upper bound for the error terms. Define the error vector ξ = N^{-1} Z^T vec(E). The convergence rate of the penalized VAR model estimator is largely determined by how concentrated ξ is around 0. In addition, note that ξ ∈ R^q naturally constitutes a high dimensional vector. As a result, it is important to control the diverging rate of ‖ξ‖_max. Specifically, we have the following proposition.

Proposition 4. Assume Condition (C1) holds. We then have
with c 0 as a finite positive constant.
The proof of Proposition 4 is given in Appendix A.5. By Proposition 4, the concentration of ξ around 0 is affected not only by the sample size and the parameter dimension, but also by the temporal and contemporaneous dependence of the time series. As we mentioned before, the temporal and contemporaneous dependencies are characterized by A and Σ e respectively. Particularly, the error bound is lower when the eigenvalues of A and Σ e behave more uniformly, i.e., λ max (Σ e ) and μ max (A) are smaller, and in the meanwhile λ min (Σ e ) and μ min (A) are larger. This results in a less spiky spectrum and leads to lower temporal and cross-sectional dependencies, i.e., lower Q(β, Σ e ) [2].

Oracle property
Given the bounds on the design matrix as well as the error bounds, we establish the oracle properties of the nonconcave penalized VAR estimator in this section. The weak oracle property, introduced by [21], is given first. It has a two-fold meaning: (a) with probability tending to 1, the zero coefficients are estimated to be exactly 0 (i.e., β̂_{S^c} = 0), and (b) the estimator is consistent in the sense of the L_∞ loss. Although the weak oracle property does not give the asymptotic distribution of the estimator, it provides insights into the asymptotic behavior of the penalized estimator. The following conditions are assumed.
First, Condition (C3) imposes restrictions on the correlation level between Z_S and Z_{S^c} as well as on the temporal and cross-sectional dependence. In the i.i.d. case, it is typical to directly assume that ‖Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1}‖_∞ is bounded by the right-hand side [10]. For the VAR model, we can obtain the upper bound from Proposition 3, which involves the dependence measures. However, for general stable VAR models, the situation can be more complicated. According to [2], even with the same spectral radius of the transition matrix, the dependence measures (e.g., c_{2λ} and ‖Γ̄_Z^{(S^c,S)}‖_max) can still be quite different and may lead to different convergence behaviors of the estimated parameters. Next, the upper bound Cρ'(0+)/ρ'(d_N) in (3.14) is closely related to the form of the penalty. If the L_1 penalty is used, then the left-hand side of (3.14) is required to be strictly less than 1.
In such a case, this condition leads to the strong irrepresentable condition proposed by [28], which is restrictive in practice. For a nonconcave penalty function, e.g., the SCAD penalty, the upper bound can grow to ∞, which makes this condition much easier to satisfy.
Next, Condition (C4) sets the assumption on the minimum signal and the tuning parameter λ_N, which is critical for deriving the convergence rate of the estimator. The requirements are mostly the same as in the i.i.d. case, except that they further include terms related to the dependency measures, e.g., Q(β_S, Σ_e) [10]. Specifically, in the VAR model setting, the condition on the minimum signal d_N is more restrictive. For instance, if the dynamic as well as the cross-sectional dependence is higher, this leads to a higher Q(β_S, Σ_e), and thus requires the signal strength to be stronger in order to be detected.
Lastly, Condition (C5) guarantees the local optimality of the estimator. The condition parallels the λ_min(Z_S^T Z_S) condition in the i.i.d. case [10], which controls the multicollinearity of Z_S. In the VAR model, λ_min(Z_S^T Z_S) is connected to the dependence measure c_{2λ} by (3.7), which results in Condition (C5). This condition can be easily satisfied by SCAD type penalties when λ_N ≪ d_N, which leads to κ_0 = 0 for sufficiently large N. We then establish the weak oracle property as follows.
Theorem 1 (Weak oracle property). Under Conditions (C1)–(C5), there exists a nonconcave penalized least squares estimator β̂ such that for sufficiently large N, with probability tending to 1, we have (a) (Sparsity) β̂_{S^c} = 0; (b) (L_∞ loss) β̂_S is consistent in the L_∞ norm. The proof of Theorem 1 is given in Appendix B.1. Note that the parameter dimension q is allowed to grow exponentially fast with the sample size N, with the growth rate controlled by α. In addition, the dependence term Q(β, Σ_e) is involved in the upper bound of the L_∞ loss ‖β̂_S − β_S‖_max. This implies that a time series process with lower temporal and cross-sectional dependence leads to a tighter upper bound. When γ = 1/2, this result is consistent with the findings of [2] for the L_1 penalty. Lastly, one may note that the convergence rate is slightly slower than the O_p(√(s/N)) rate in the i.i.d. case under the L_2 loss. This is because we use the L_∞ loss in the derivation of the consistency rate.
Next, we establish the oracle properties of the nonconcave penalized VAR estimator. Let α_λ = Q(β, Σ_e) λ_max(Σ_e) c_{2λ}. We first establish the existence of the nonconcave penalized VAR estimator and its convergence rate. To this end, we need the following condition.
Condition (C6) gives the minimal signal strength for the nonconcave penalized VAR estimator to achieve √(s/N)-consistency. Due to the assumption d_N ≫ λ_N, the SCAD type penalties satisfy the condition, while the L_1 penalty would require λ_N to be of the order of the minimum signal, which contradicts this assumption. This explains why the L_1 penalized estimator cannot achieve the consistency rate of O_p(√(s/N)) even with low dependence levels. Compared to the i.i.d. case [10], the condition involves further requirements on the minimum signal with respect to the dependence levels, i.e., α_λ. Specifically, higher dependence levels result in a higher α_λ, and thus require d_N to be larger for the signal to be detected. We give the convergence rate for the nonconcave penalized VAR estimator as follows.
The proof of Theorem 2 is given in Appendix B.2. Under the signal strength condition on d_N, Theorem 2 states the parameter dimension that the nonconcave penalized VAR estimation method can handle. Note that the dependence term α_λ is also involved in the upper bound of ‖β̂_S − β_S‖. This partially explains how the temporal and cross-sectional dependence affects the convergence rate of β̂_S − β_S. We further investigate the asymptotic normality of the nonconcave penalized VAR estimator, which is stated as follows.
Theorem 3 (Asymptotic normality). Under the conditions of Theorem 2, with probability tending to 1 as N → ∞, for a stable VAR(d) process, the nonconcave penalized least squares estimator satisfies the asymptotic normality in (3.15). The proof of Theorem 3 is given in Appendix B.3. Note that from Theorem 3, if s is finite, then by setting A_N = I_s we obtain the asymptotic normality of β̂_S − β_S directly.
In practice, one may consider to further conduct inference based on (3.15).
We then establish a theorem for the consistency of the estimated covariance Σ̂_e, which requires the sparsity level s to grow sufficiently slowly with N at a rate involving ω. That is mainly because the estimation of Σ_e requires estimating O(p^2) parameters, and therefore needs a larger sample size to obtain a consistent estimate. In addition, higher dependence levels also have an influence. Specifically, higher dependence implies larger α_λ, λ_max(Σ_e), c_{2λ}, and c_{1λ}^{-1} values, and thus restricts s to be smaller.
To examine the performance of the nonconcave penalized VAR estimation, we conduct a number of numerical studies, which are discussed in detail in the following section.

Simulation models
In this section, we evaluate the finite sample performance of the nonconcave penalized VAR model. To generate the data, we first generate the transition matrix A_1 = (a_{1,ij}) ∈ R^{p×p}, which has about 5% nonzero off-diagonal entries. Specifically, we set the off-diagonal nonzero elements to 0.3 and all the diagonal elements to 0.5. Following [2], we generate the innovation process as a Gaussian process with the following three covariance structures Σ_e: (1) Block I: Σ_e = (σ_{e,ij}) ∈ R^{p×p} with σ_{e,ii} = 1, σ_{e,ij} = ρ if 1 ≤ i ≠ j ≤ p/2, and σ_{e,ij} = 0 otherwise.
Here larger ρ values indicate that the innovation processes have higher correlation with each other.
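The Block I covariance can be constructed as follows (a small helper of our own, written for illustration):

```python
import numpy as np

def block_cov(p, rho):
    """'Block I' innovation covariance: unit variances, correlation rho
    among the first floor(p/2) coordinates, zero elsewhere."""
    Sigma = np.eye(p)
    h = p // 2
    Sigma[:h, :h] = rho           # fill the leading block with rho
    np.fill_diagonal(Sigma, 1.0)  # restore unit variances
    return Sigma
```

For moderate ρ the matrix stays positive definite, so it can be used directly to draw the Gaussian innovations E_t.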

Implementation of the algorithm
For comparison, we implement the following estimators: the Lasso estimator [23,2], the adaptive Lasso (ALasso) estimator [30,4], the SCAD estimator ((4.1) with a = 3.7) proposed by [8], and the MCP estimator ((4.2) with a = 1.5) proposed by [26]. Specifically, the penalty functions of SCAD and MCP are given by
p_λ(t) = λ|t| 1(|t| ≤ λ) + {2aλ|t| − t^2 − λ^2}/{2(a − 1)} 1(λ < |t| ≤ aλ) + {(a + 1)λ^2/2} 1(|t| > aλ), (4.1)
and
p_λ(t) = {λ|t| − t^2/(2a)} 1(|t| ≤ aλ) + (aλ^2/2) 1(|t| > aλ). (4.2)
Furthermore, in the numerical study, to make the computation more feasible and to increase efficiency, we use a screening method to reduce the parameter dimension before conducting model selection and estimation. Specifically, we apply the SIS method proposed by [9] to the regression form (2.2) of the VAR model and keep the q* = 200 covariates with the highest absolute correlations with the response variable.
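The SCAD and MCP penalty values can be evaluated numerically as below; this assumes the standard piecewise forms from [8] and [26] (function names are ours):

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty value p_lambda(t): linear near 0, quadratic transition,
    then constant beyond a*lam (Fan and Li, 2001)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam,
                    (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                    (a + 1) * lam ** 2 / 2))

def mcp_penalty(t, lam, a=1.5):
    """MCP penalty value p_lambda(t): quadratically attenuated shrinkage,
    constant beyond a*lam (Zhang, 2010)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), a * lam ** 2 / 2)
```

Both penalties coincide with the L_1 penalty near zero but flatten out for large coefficients, which is the source of the reduced estimation bias relative to the Lasso.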
Lastly, to optimize the objective function (3.2), we use the local linear approximation (LLA) algorithm proposed by [31]. To choose the tuning parameter, the HBIC criterion [25] is employed, which can be expressed as
HBIC(λ) = log(‖Y − Zβ̂_λ‖^2/N) + |M_λ| C_N log(q)/N, (4.3)
where M_λ = {(i, j) : â_{1,ij} ≠ 0} denotes the selected set, and C_N = log{log(N)} diverges slowly with N.
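The LLA iteration replaces the nonconcave penalty by its tangent line at the current estimate, so each step reduces to a weighted Lasso. The sketch below is a minimal illustration under our own assumptions (SCAD derivative weights, naive coordinate descent), not the authors' implementation:

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative p'_lambda(t) of the SCAD penalty (Fan and Li, 2001)."""
    t = np.abs(t)
    return lam * ((t <= lam)
                  + np.maximum(a * lam - t, 0) / ((a - 1) * lam) * (t > lam))

def lla(Z, y, lam, n_outer=3, n_inner=200):
    """Local linear approximation: iterate weighted-Lasso subproblems,
    each solved here by plain coordinate descent."""
    N, q = Z.shape
    beta = np.zeros(q)
    col_norm = (Z ** 2).sum(axis=0) / N
    for _ in range(n_outer):
        w = scad_deriv(beta, lam)   # linearized per-coefficient weights
        for _ in range(n_inner):
            for j in range(q):
                r = y - Z @ beta + Z[:, j] * beta[j]   # partial residual
                zj = Z[:, j] @ r / N
                beta[j] = np.sign(zj) * max(abs(zj) - w[j], 0) / col_norm[j]
    return beta
```

Starting from beta = 0, the first pass is an ordinary Lasso; subsequent passes assign zero weight to large coefficients (where the SCAD derivative vanishes), which removes the shrinkage bias on strong signals.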

Performance measurements and simulation results
We then evaluate the sparse recovery and estimation accuracy of the nonconcave penalized VAR model. For each simulation setting, we consider (1) Medium VAR (p = 30, d = 1, T = 80, 120) and (2) Large VAR (p = 50, d = 1, T = 100, 150). In addition, we set ρ = 0.5, 0.6 to reflect different levels of dependence. To obtain a reliable evaluation, the experiment is replicated R = 500 times. Denote the estimate of the transition matrix A_1 in the rth replication by Â_1^{(r)}. To evaluate sparse recovery, we compute the true positive (TP) number Σ_{i,j} I(â_{1,ij} ≠ 0, a_{1,ij} ≠ 0) and the false positive (FP) number Σ_{i,j} I(â_{1,ij} ≠ 0, a_{1,ij} = 0). Next, we evaluate the estimation accuracy. To this end, we calculate the root mean square error (RMSE) for the transition matrix A_1 as RMSE_A = (‖Â_1 − A_1‖_F^2/p^2)^{1/2}, where ‖·‖_F denotes the Frobenius norm of a matrix. The simulation results are given in Tables 1-3. First, under higher dependence levels, both sparsity recovery and model estimation performance deteriorate relative to the lower dependence levels. This corroborates the theoretical properties established in this work. Second, while the true positive numbers are similar for all the methods, the nonconcave penalization methods achieve lower false positive numbers, which leads to more parsimonious models. For instance, in Example 2 with p = 50, T = 100 and ρ = 0.5 (i.e., Table 2), the SCAD penalty controls the FP at about 5.8, while the FP for the Lasso method is 58.3, almost 10 times larger. In the meanwhile, in terms of estimation accuracy, the nonconcave penalization methods obtain relatively lower estimation errors. For example, the RMSE for the MCP penalized VAR model is about 0.020 with p = 50, T = 150, and ρ = 0.6 in Example 3 (i.e., Table 3), which is much smaller than that of the Lasso method (RMSE = 0.032).
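The TP/FP counts and the RMSE above can be computed with a small helper (names are ours):

```python
import numpy as np

def recovery_metrics(A_hat, A_true, tol=1e-8):
    """True/false positive counts and RMSE for an estimated transition matrix."""
    est = np.abs(A_hat) > tol   # estimated nonzero pattern
    tru = np.abs(A_true) > tol  # true nonzero pattern
    tp = int(np.sum(est & tru))
    fp = int(np.sum(est & ~tru))
    rmse = float(np.sqrt(np.mean((A_hat - A_true) ** 2)))
    return tp, fp, rmse
```

Dividing the squared Frobenius norm by p^2 is the same as averaging the squared errors over all entries, matching the RMSE_A definition above.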
Lastly, one observes that the adaptive Lasso is also a competitive method, performing better than the Lasso in terms of finite sample performance. Compared to the nonconcave penalization methods, we would like to comment that the nonconcave penalty is typically more flexible than the adaptive Lasso approach. This is because an element estimated as zero (e.g., by the SCAD penalty) can escape from zero in the next iteration, while the adaptive Lasso absorbs zeros in each iteration and always results in sparser solutions than the initial values [7]. Therefore, in practice, the adaptive Lasso method requires the initial values to be set more carefully.

Air pollution data analysis
In recent years, air pollution has become an increasingly serious problem in mainland China. Along with this issue, the PM 2.5 index has become a popular tool to quantify the air pollution level. It refers to particles with aerodynamic diameters of less than 2.5 micrometers. In particular, high concentrations of PM 2.5 can lead to severe clinical symptoms, such as lung morbidity and respiratory diseases [18,5]. It is therefore an important question to understand the spatial and temporal distribution and diffusion patterns of PM 2.5.

Data description
The

Model estimation and exploration
According to the concentration levels of the PM 2.5 index, we split the data into 3 time periods: January to March (Period I), April to September (Period II), and October to December (Period III). Period II, which mostly ranges from summer to early autumn, has lower PM 2.5 levels than the other two periods. Specifically, we use the log-transformed PM 2.5 levels as responses, centered to mean 0 for each city. Then, for each time period, we fit the nonconcave penalized VAR model with the SCAD penalty to obtain the results. Here we only consider the lag-1 response to maintain model simplicity. To further reduce the computational cost, an SIS screening procedure [9] is first conducted to keep the 300 edges with the highest absolute correlations with the response. The HBIC criterion (4.3) is used for model selection.
We visualize the estimated transition matrix Â_1 using heatmaps in Figures 4-6 for the three time periods. The estimated coefficients are all within [0, 1]. The cities in the figures are ordered roughly from north to south and east to west, so neighbouring cities are spatially close to each other. First, we observe that the patterns of Period I and Period III are similar and slightly different from Period II. In Period II, we observe relatively stronger momentum effects (i.e., higher diagonal elements in Â_1) and weaker between-city influences (i.e., sparser off-diagonal elements in Â_1), while in Period I and Period III the transition matrices exhibit denser between-city edges. This provides evidence of air pollution diffusion effects between neighbouring cities, especially in spring and winter. Next, one can observe that the estimated Â_1 is not symmetric. In Period I and Period III, there are more estimated nonzero elements in the lower triangle of Â_1 than in the upper triangle. This implies that the air diffusion direction is from north to south (recall that the cities are ordered roughly from north to south). Lastly, we further visualize the between-city coefficients of Â_1 on the map; see Figure 7. From the connection pattern, one can further confirm that the influences in Period I and Period III are more intense than in Period II. In addition, the between-city connections in Period II are found to be more local and mostly among neighbouring cities compared to the other two periods.
Lastly, we compare the VAR model estimates obtained by the nonconcave penalization methods and the Lasso penalty. The HBIC criterion (4.3) is used for model selection. We compare the performance in terms of model sparsity, measured by the number of nonzero estimates, and goodness of fit, measured by the fitted RMSE. The measurements are reported in Table 4. First, compared to the Lasso method, the nonconcave methods (e.g., SCAD and MCP) and the adaptive Lasso method obtain VAR models with fewer nonzero parameters. For example, in Period III, the number of nonzero estimates of Lasso is 91, while that of SCAD is 43, which is much smaller. Meanwhile, although the models obtained by the nonconcave methods are more parsimonious, they also achieve better fitting levels. For instance, the fitted RMSE of the SCAD method in Period II is 0.372, while the RMSE of the Lasso method in the same period is 0.388.
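The two measurements reported in Table 4 are straightforward to compute from an estimated coefficient vector and the fitted values. A minimal sketch (function names are ours, not from the paper):

```python
import math

def nonzero_count(beta, tol=1e-8):
    """Model sparsity: number of coefficient estimates exceeding tol."""
    return sum(1 for b in beta if abs(b) > tol)

def fitted_rmse(y, y_fit):
    """Root mean squared error of the fitted values."""
    n = len(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_fit)) / n)
```

Applying both functions to each method's estimate gives the (sparsity, RMSE) pairs compared in the table; a method dominates when it has both a smaller nonzero count and a smaller RMSE.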

Conclusion
In this work, we investigate nonconcave penalized VAR model estimation methods. Specifically, the estimation properties are established under the influence of both temporal and cross-sectional dependence. The oracle properties are given for the nonconcave penalties, where the feature dimensionality is allowed to grow exponentially fast with the sample size. Lastly, an air pollution dataset is analyzed for illustration purposes; it is found that the influence patterns among different cities in mainland China are highly related to both the specific areas and the seasons.
To conclude this work, we consider the following directions as future research topics. First, although the regularized VAR model estimation could help to recover the connection patterns among the nodes, the appropriate regularization level is not clear. Therefore, criteria (e.g., BIC-type criteria) should be designed to select the true model efficiently. Second, as mentioned in the numerical study, screening methods could be applied before the penalized estimation to reduce the computational complexity. As a result, efficient screening algorithms should be proposed to better suit the complex dependence structure of the data. In addition, one could observe from the numerical performance that the adaptive Lasso is also a competitive method; it is then of great interest to study the theoretical properties of other shrinkage methods. Lastly, the penalization methods could be revised when new information about the nodes is obtained. For example, when network relationships are observed among the nodes, new mechanisms to incorporate such known structural information should be further investigated.

Lemma 2. Assume the same conditions as in Proposition 3. Then we have
Proof of Lemma 2. The proof is similar to that of (3.10). By Proposition 2.4, (2.9), and (2.6) of [2], we have, where c * is a finite positive constant. Therefore, it can be concluded that, with probability at least 1 − 6 exp{−c * N min(η 2 , η)}, we have Further note that Γ 1λ by (2.5). By summing over all 1 ≤ i, j ≤ q − s and letting η = 1/3, we have which yields the result by letting c = c * /9.

Lemma 3. Assume the conditions in Theorem 4 hold. Then we have
Proof of Lemma 3. Note that Σ Ze = Σ e ⊗ Γ X (0) = ( Σ e ⊗ I dp ) Γ Z . We then prove P (H c 1 ) → 0 and P (H c 2 ) → 0 respectively as follows.
Let B be the matrix-form estimator obtained from β. Then we have Σ e = X (B − B). It then suffices to show (A.6)-(A.8). Note that the second one, (A.7), is implied by (A.6) and (A.8), so we only show the proofs of (A.6) and (A.8) respectively as follows. First, by (A.1) we have ≤ 2 exp{−cN min(η, η 2 ) + p log(21)}.
By letting η = 1/2 and p N , the results can be obtained.

A.2. Proof of Proposition 1
The proof is basically the same as that of Theorem 1 of [10]. For completeness, we state the basic idea as follows. First we give the necessary conditions. Define the unpenalized objective function as Q(β) = −β γ Z + 2 −1 β Γ Z β, where γ Z = (I p ⊗ X )Y/N and Γ Z = Z Z/N = (I ⊗ X X /N ). Then we have By classical optimization theory, if β = ( β 1 , · · · , β q ) is a local minimizer of the penalized objective function (3.2), then the Karush-Kuhn-Tucker (KKT) conditions hold. Namely, Note that β is also a local minimizer of (3.2) on the subspace {β ∈ R q : β S c = 0}. From the second-order condition, we have where κ(ρ; β S ) is defined in (3.3). Equivalently, (A.9) can be expressed as which corresponds to (3.4) and (3.5). Next, we show the sufficient conditions. First, constrain the penalized objective function (3.2) on the subspace defined by S, i.e., B = {β ∈ R q : β S c = 0}. From (3.6), we could conclude that the penalized objective function is strictly convex in a ball N 0 in the subspace B centered at β. Along with (3.4), this implies that β is a critical point of Q p (β) and also the unique minimizer of Q p (β) in N 0 .
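For readability, the KKT conditions invoked above can be written out explicitly. The following is a reconstruction assuming the standard form of the penalized objective, $Q_p(\beta) = Q(\beta) + \sum_{j=1}^{q} p_\lambda(|\beta_j|)$, with the symbols $\gamma_Z$, $\widehat\Gamma_Z$, and $S$ as defined in the surrounding proof:

```latex
% Penalized objective (cf. (3.2)):
%   Q_p(\beta) = -\beta^\top \gamma_Z + 2^{-1}\beta^\top \widehat\Gamma_Z \beta
%                + \sum_{j=1}^{q} p_\lambda(|\beta_j|).
% KKT conditions at a local minimizer \widehat\beta:
\begin{aligned}
& -\gamma_{Z,j} + \bigl(\widehat\Gamma_Z \widehat\beta\bigr)_j
    + p_\lambda'\bigl(|\widehat\beta_j|\bigr)\,\operatorname{sgn}(\widehat\beta_j) = 0,
  && j \in S \ (\widehat\beta_j \neq 0), \\
& \Bigl| -\gamma_{Z,j} + \bigl(\widehat\Gamma_Z \widehat\beta\bigr)_j \Bigr|
    \le p_\lambda'(0+),
  && j \in S^c \ (\widehat\beta_j = 0).
\end{aligned}
```

The stationarity equations on $S$ and the subgradient bound on $S^c$ are what the proof refers to as (3.4) and (3.5), respectively.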
Then we only need to prove that the sparse vector β is a strict local minimizer of Q p (β) on the whole space R q . To this end, define a sufficiently small ball N 1 in R q centered at β such that N 1 ∩ B ⊂ N 0 . It then suffices to show that Q p ( β) < Q p (γ 1 ) for any γ 1 ∈ N 1 \N 0 . Let γ 2 be the projection of γ 1 onto the subspace B. Then γ 2 ∈ N 0 , which implies Q p ( β) < Q p (γ 2 ) if γ 2 ≠ β, since β is the strict minimizer of Q p (β) in N 0 . Therefore it suffices to show that Q p (γ 2 ) < Q p (γ 1 ).

Proof of (3.7). Let
Combining the results, we obtain (3.7).

Proof of (3.8). Similarly, for any v ∈ R q , by (A.4) we have By the same technique as in the proof of (3.7), letting η = 1, the result can be obtained.

A.5. Proof of Proposition 4
Note that for any vector ξ, ξ ∞ = ξ max . Therefore we only show that the results hold for the maximum norm of ξ S and ξ S c . By Proposition 2.4 (2.11) of [2], we have First, by letting η = c, this proves (3.12).
Along with the definition of H 1 , it yields Define the vector-valued function In addition, let where u = −(Z S Z S ) −1 (ξ S − η). By (3.9) and (B.2), we then have By the conditions of Theorem 1, we have s 1/2 c 2λ = o(N 1/2−γ (log N ) 1/2 ). Thus for the first term we have Furthermore, by Condition (C4), we have By Miranda's existence theorem, there exists a solution β S in N such that Ψ(θ) = 0, which is also a solution of Ψ(θ) = 0.
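For completeness, the version of Miranda's existence theorem invoked above can be stated as follows (this is the standard Poincaré-Miranda formulation, adapted to our notation; the radius $r$ corresponds to the ball $\mathcal{N}$ in the proof):

```latex
% Poincar\'e--Miranda existence theorem (standard form).
% Let $\Psi = (\Psi_1, \dots, \Psi_s)^\top : [-r, r]^s \to \mathbb{R}^s$
% be continuous. If, for every $i = 1, \dots, s$,
\begin{aligned}
\Psi_i(\theta) &\ge 0 \quad \text{whenever } \theta_i = -r, \\
\Psi_i(\theta) &\le 0 \quad \text{whenever } \theta_i = +r,
\end{aligned}
% then there exists $\theta^* \in [-r, r]^s$ with $\Psi(\theta^*) = 0$.
```

In the proof, the sign conditions on the faces of the cube are supplied by the bounds derived from (3.9), (B.2), and Condition (C4), which guarantee a zero of Ψ inside N.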
Step 2: Verification of Condition (3.5). Let β ∈ R q with β S ∈ N being a solution to (3.4) and β S c = 0. We next show that β satisfies (3.5). Note that On the event H 2 , we have o(1). Recall that β S solves the equation Ψ(θ) = 0. As a result, we have Consequently, by (B.2) we have Therefore, for the second and third terms in the above expression of z ∞ , we have, by Condition (C3), where the first inequality is due to Nλ N = N α+1/2 log N Q(β, Σ e ) of Condition (C4), and the second is due to Condition (C3) for sufficiently large N .
To this end, we need to analyze the function Q(θ) on the boundary ∂N τ . Let N be sufficiently large that α λ s/N τ ≤ d N , since d N α λ s/N by Condition (C6). By Taylor's expansion we have where θ * = Z S θ * , and θ * lies on the line segment joining θ and β S . Note that the second-order derivative of the penalty p λ does not necessarily exist. One could verify that the second part of D can be replaced by a diagonal matrix with maximum absolute element bounded by λ N κ 0 . Recall that for any θ ∈ ∂N τ , we have θ − β S = α λ s/N τ . Since θ * ∈ ∂N τ by Condition (C6), we then have Consequently, we have Step 2. (Sparsity) Let β ∈ R q with β S ∈ N τ ⊂ N 0 and β S c = 0. It suffices to show that β is a strict local minimizer of Q(β) on the space R q . From (3.5), it suffices to check that z ∞ < ρ (0+), where ξ = Z (Y − Zβ). We deal with the two parts in z respectively. First we consider the event For the second part, we consider the event By (3.8) we have P {N −1 λ max (Z S Z S ) ≥ 2c −1 1λ } ≤ 2 exp{−cN + s log 21}. In addition, by (A.5) of [2], P {max i |e i (Z S c Z S c )e i | ≥ 2c −1 1λ } ≤ 2 exp{−cN + 2 log q}. By summing over i = 1, · · · , q − s, we have P (T 2 ) ≥ 1 − 2 exp(−cN + s log 21 + 3 log q), which converges to 1 by the assumptions that log q = o(N α ) and s = o(N min(1, ω 2 )) in Condition (C6). Under the event T 2 , we have given the condition λ N s/N α λ c −1 1λ in Condition (C6).

B.3. Proof of Theorem 3
By Theorem 2, we only need to prove the asymptotic normality of β S . As shown in the proof of Theorem 2, β S is a strict local minimizer of Q(θ) and β S c = 0. Therefore we have ∂Q( β S ) = 0. It can be derived that ∂Q(θ) = −2 γ + 2 Γ Z θ + p λ (θ), where p λ (θ) = (p λ (θ 1 ), · · · , p λ (θ s )) . By the conditions in Theorem 3, we have We then have Γ (S) where E = vec(E) and Z S denotes the submatrix of Z with column indexes in S. Further it can be derived that To prove the result, it suffices to show that for any A N ∈ R m×s with A N A N → G, we have N (0, G).
For the first term, using the Cauchy-Schwarz inequality, it suffices to verify that λ min (Σ Ze ) ≥ λ min (Γ X (0))λ min (Σ e ) ≥ c −1 2λ λ min (Σ e ). Consequently, the last term can be dominated by the second term. We then show N −1/2 A N (Γ (S) Since m is finite, it suffices to show that, for any η with η = 1, Define the σ-field F t = σ{ε is , 1 ≤ i ≤ N, −∞ < s ≤ t}. As a result, the sequence { t s=1 ξ s , F t } constitutes a martingale array. To show the asymptotic normality of T t=1 ξ t , we employ the central limit theorem for martingale difference arrays and verify the two conditions in Corollary 3.1 of [15].
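For reference, the two conditions of the martingale central limit theorem invoked above take the following standard form for a martingale difference array (this is the usual statement of Corollary 3.1 of [15], written in our notation with the filtration $\mathcal{F}_t$ defined above; the limit variance $\sigma^2$ here corresponds to the target covariance after the normalization by $A_N$):

```latex
% Conditional Lindeberg condition:
\sum_{t=1}^{T}
  \mathbb{E}\bigl[\xi_t^2 \,\mathbf{1}\{|\xi_t| > \epsilon\}
  \,\big|\, \mathcal{F}_{t-1}\bigr]
  \xrightarrow{\;p\;} 0
  \quad \text{for every } \epsilon > 0;
% Convergence of the conditional variance:
\sum_{t=1}^{T}
  \mathbb{E}\bigl[\xi_t^2 \,\big|\, \mathcal{F}_{t-1}\bigr]
  \xrightarrow{\;p\;} \sigma^2.
% Together these yield \sum_{t=1}^{T} \xi_t \xrightarrow{d} N(0, \sigma^2).
```

Verifying these two displays for the array $\{\xi_t, \mathcal{F}_t\}$ is exactly what the remainder of the proof carries out.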