Testing for high-dimensional network parameters in auto-regressive models

High-dimensional auto-regressive models provide a natural way to model influence between $M$ actors given multi-variate time series data over $T$ time intervals. While there has been considerable work on network estimation, there is limited work on inference and hypothesis testing. In particular, prior work on hypothesis testing in time series has been restricted to linear Gaussian auto-regressive models. From a practical perspective, it is important to determine suitable statistical tests for connections between actors that go beyond the Gaussian assumption. In the context of \emph{high-dimensional} time series models, confidence intervals present additional challenges since most estimators, such as the Lasso and the Dantzig selector, are biased, which has led to \emph{de-biased} estimators. In this paper we address these challenges and provide convergence in distribution results and confidence intervals for the multi-variate AR(p) model with sub-Gaussian noise, a generalization of Gaussian noise that broadens applicability and presents numerous technical challenges. The main technical challenge lies in the fact that, unlike for Gaussian random vectors, zero correlation does not imply independence for sub-Gaussian vectors. The proof relies on an intricate truncation argument to develop novel concentration bounds for quadratic forms of dependent sub-Gaussian random variables. Our convergence in distribution results hold provided $T = \Omega((s \vee \rho)^2 \log^2 M)$, where $s$ and $\rho$ refer to sparsity parameters, which matches existing results for hypothesis testing with i.i.d. samples. We validate our theoretical results with simulations for both block-structured and chain-structured networks.


Introduction
Vector auto-regressive models arise in a number of applications including macroeconomics (see e.g. Ang and Piazzesi [2003], Hansen [2003], Shan [2005]), computational neuroscience (see e.g. Goebel et al. [2003], Seth et al. [2015], Harrison et al. [2003], Bressler et al. [2007]), and many others (see e.g. Michailidis and d'Alché-Buc [2013], Fujita et al. [2007]). Recent years have seen substantial development in the theory and methodology of high-dimensional auto-regressive models with respect to parameter estimation (see e.g. Song and Bickel [2011], Basu et al. [2015], Davis et al. [2016], Medeiros and Mendes [2016], Mark B. and R. [2018]). In particular, if there are $M$ dependent time series (e.g. voxels in the brain, actors in a social network, measurements at different spatial locations), time series network models allow us to model temporal dependence between actors/nodes in a network.
More precisely, consider the following time series auto-regressive network model with lag $p$, where $\{X_t\}_{t=0}^{T} \subset \mathbb{R}^M$ is the time series data we have access to, $\{A^*(j) \in \mathbb{R}^{M \times M}, j = 1, \ldots, p\}$ are the network parameters of interest, and $\epsilon_t \in \mathbb{R}^M$ is zero-mean noise. We consider the high-dimensional setting where the number of nodes $M$ in the network is much larger than the sample size $T$. Prior work in Basu et al. [2015] has addressed the question of how to estimate the network parameter $A^*$ with Gaussian noise $\epsilon_t$ under sparsity assumptions and various structural constraints. In this paper, we focus on inference and hypothesis testing for the parameter $A^*$ given the data $(X_t)_{t=0}^{T}$.
In high-dimensional statistics, there has recently been a growing body of work on confidence intervals and hypothesis testing under structural assumptions such as sparsity. Since the widely used Lasso estimator for sparse linear regression is asymptotically biased, one-step estimators based on bias correction have been studied in works such as Zhang and Zhang [2014], Van de Geer et al. [2014] and Javanmard and Montanari [2014], referred to as the LDPE, de-sparsified and de-biased estimators, respectively. Low-dimensional components of these estimators are asymptotically normal and can thus be used for constructing hypothesis tests and confidence intervals.
Both $\gamma$ and $I_{\theta\gamma} I_{\gamma\gamma}^{-1}$ are substituted by suitable estimators, and it is shown in Ning et al. [2017] that the resulting decorrelated score function is asymptotically normal.
• We also construct a semi-parametrically efficient confidence region for multivariate parameters of fixed dimension;
• Finally, we support our theoretical guarantees with a simulation study using bounded noise, which is sub-Gaussian but not Gaussian.

Related Work
In the literature on inference for high-dimensional VAR models, most work focuses on the estimation problem. Song and Bickel [2011] investigate penalized least squares algorithms for different penalties, under externally imposed assumptions on the temporal dependence. Theoretical guarantees for Dantzig-type and Lasso-type estimators are studied in Han et al. [2015] and Basu et al. [2015], but with Gaussian noise. Barigozzi and Brownlees [2018] consider inference for the stationary dependence structure among variables, rather than for the parameters of the VAR model. In our work, we control the error bounds of Lasso-type and Dantzig-type estimators for the parameter matrices with sub-Gaussian noise, and then establish the asymptotic distribution of the test statistic based on these bounds.
In the high-dimensional hypothesis testing literature, there is work on testing for high-dimensional mean vectors (Srivastava [2009]), covariance matrices (Chen et al. [2010], Zhang et al. [2013]) and independence among variables (Schott [2005]). For testing regression parameters, most work assumes i.i.d. samples. Lee et al. [2016], among others, propose methods to test whether a covariate should be selected conditioning on the selection of some other covariates. A penalized score test depending on the tuning parameter $\lambda$ is considered in Voorman et al. [2014]. Our work follows the line of work of Zhang and Zhang [2014], Van de Geer et al. [2014], Javanmard and Montanari [2014] and Ning et al. [2017], i.e., the de-sparsified or decorrelated score literature. We construct a VAR version of the decorrelated score test proposed by Ning et al. [2017]. Chen and Wu [2018] also tackle a hypothesis testing problem for time series data, but they test for a trend in the time series rather than for the auto-regressive parameters that encode the influence structure among variables.
As mentioned earlier, our work is most closely related to the prior work of Neykov et al. [2018], which provides a hypothesis testing framework with high-dimensional Gaussian time series as a special case. In our work, we consider the more general and technically challenging case of sub-Gaussian vector auto-regressive models. Throughout this paper, we compare our results to those derived in that work for the Gaussian case.

Organization of the Paper
Section 2 explains the problem setup and proposes our test statistic. Theoretical guarantees are presented in Section 3. Specifically, Sections 3.1 and 3.2 present the weak convergence rates of the test statistic under the null and alternative hypotheses $H_0$ and $H_A$. Section 3.3 proposes feasible estimators that satisfy the required assumptions and can be plugged into the test statistic. Section 3.4 considers the case where the noise variance is unknown, and we construct a confidence region for multivariate parameter vectors in Section 3.5. We consider the special case of the AR(1) model with Gaussian noise, and a detailed comparison with Neykov et al. [2018] is provided in Section 3.6. Section 4 provides simulation results and Section 5 contains the proofs of the two main theorems. Much of the remaining proof detail is deferred to the appendices.

Notation
We define the following norms for vectors and matrices. For a vector $u = (u_1, \ldots, u_d) \in \mathbb{R}^d$, the $\ell_p$-norm, $p \geq 1$, is $\|u\|_p = \left(\sum_{i=1}^d |u_i|^p\right)^{1/p}$. For a matrix $U \in \mathbb{R}^{m \times n}$, $\|U\|_p$ denotes the operator norm induced by the vector $\ell_p$-norm, and the Frobenius norm is $\|U\|_F = \left(\sum_{i=1}^m \sum_{j=1}^n U_{ij}^2\right)^{1/2}$. We also use the notation $\|U\|_{1,1}$ to denote the entry-wise $\ell_1$ penalty on $U$, namely $\sum_{i=1}^m \sum_{j=1}^n |U_{i,j}|$. Furthermore, if $U$ is symmetric, the trace norm of $U$ is $\|U\|_{tr} = \mathrm{tr}(\sqrt{U^2})$.
Throughout the paper, we assume that the entries of the noise vectors $\{\epsilon_{ti}, 1 \leq i \leq M\}_{t=-\infty}^{\infty}$ are independent sub-Gaussian variables with constant scale factor. A univariate centered random variable $X$ has a sub-Gaussian distribution with scale factor $\tau$ if its moment generating function satisfies $M_X(t) \triangleq \mathbb{E}[\exp(tX)] \leq \exp(\tau^2 t^2/2)$ for all $t \in \mathbb{R}$.

Problem Setup
We consider a general vector auto-regressive time series with lag $p$, where $p$ is known, finite and independent of $T$ and the other dimensions: $X_t \in \mathbb{R}^M$, $\epsilon_t \in \mathbb{R}^M$ is zero-mean, entry-wise independent sub-Gaussian noise with identity covariance matrix, and $A(j) \in \mathbb{R}^{M \times M}$, $j = 1, \cdots, p$, are the parameters of interest. Define the matrix $A^* = (A(1), \cdots, A(p)) \in \mathbb{R}^{M \times pM}$ and the stacked vector $X_t = (X_t, \cdots, X_{t-p+1}) \in \mathbb{R}^{pM}$ (the $p$ most recent observations stacked into one vector); then we can also write (2) as $X_{t+1} = A^* X_t + \epsilon_t$.
For notational convenience, we assume that time series data X t has time range 1 − p ≤ t ≤ T .
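To make the setup concrete, the following minimal sketch simulates the VAR($p$) recursion and forms the stacked regression pairs $(X_{t+1}, X_t)$ described above; the helper names, the zero initialization, and the Uniform$(-1,1)$ noise (one sub-Gaussian choice, used again in Section 4) are our own illustration rather than the paper's code.

```python
import numpy as np

def simulate_var(A_list, T, rng=None):
    """Simulate X_s = sum_{j=1}^p A(j) X_{s-j} + eps_{s-1} for s = 1, ..., T.

    A_list holds the p coefficient matrices [A(1), ..., A(p)].  The first p
    rows of the output (times 1-p, ..., 0) are initialized at zero for
    simplicity; noise entries are i.i.d. Uniform(-1, 1).
    """
    rng = np.random.default_rng(rng)
    p, M = len(A_list), A_list[0].shape[0]
    X = np.zeros((T + p, M))
    for s in range(p, T + p):
        X[s] = sum(A_list[j - 1] @ X[s - j] for j in range(1, p + 1)) + rng.uniform(-1, 1, M)
    return X  # rows correspond to times 1-p, ..., T

def build_regression(X, p):
    """Pair responses X_{t+1} with stacked covariates (X_t, ..., X_{t-p+1})."""
    n = X.shape[0]
    Z = np.stack([np.concatenate([X[i - j] for j in range(p)]) for i in range(p - 1, n - 1)])
    return X[p:], Z  # Y[t] = X_{t+1},  Z[t] = stacked lags at time t
```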
Based on the data $(X_t)_{t=1-p}^{T}$, we test the hypothesis that a subset of the entries of $A^*$ are 0. Let $A^*_m$ denote the $m$th row vector of $A^*$. Without loss of generality, suppose the entries we test are in rows $1, \cdots, k$. Define $D_m \subset \{1, \cdots, pM\}$ as the set of columns we test in the $m$th row, with $d_m = |D_m|$ and $d = \sum_{m=1}^k d_m$. We test the null hypothesis that these entries are zero, i.e., $H_0: A^*_{m, D_m} = 0$ for all $1 \leq m \leq k$. We also assume that $d$ is finite and does not increase with $T$. In the work of Neykov et al. [2018], $d$ is assumed to be 1.

Stationary distribution
Since we are developing a hypothesis testing framework based on the decorrelated score test, it is important to specify a stationary distribution for $X_t$. Using standard notation for auto-regressive time series models, define the matrix polynomial $A(z) = I_M - \sum_{j=1}^p A(j) z^j$, where $I_M$ is the $M \times M$ identity matrix and $z$ is a complex number. To guarantee the existence of a stationary solution to (3), we assume $\det(A(z)) \neq 0$ for all $|z| \leq 1$.
Then we can write $A(z)^{-1} = \sum_{j=0}^{\infty} \Psi_j z^j$, where the $\Psi_j$, $j \geq 0$, are real-valued matrices that are polynomial functions of $A(i)$, $1 \leq i \leq p$. Note that in the special case $p = 1$, $\Psi_j = (A^*)^j$.
It can be shown that the unique stationary solution to (2) is $X_t = \sum_{j=0}^{\infty} \Psi_j \epsilon_{t-j-1}$.
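As a quick check of the stationarity condition, $\det(A(z)) \neq 0$ on $|z| \leq 1$ is equivalent to the companion matrix of the VAR($p$) having spectral radius strictly less than one, and the matrices $\Psi_j$ can be generated by the recursion $\Psi_0 = I$, $\Psi_j = \sum_{i=1}^{\min(j,p)} A(i)\Psi_{j-i}$; the sketch below (function names are ours) implements both.

```python
import numpy as np

def companion(A_list):
    """Companion matrix of the VAR(p); the process is stable iff all of its
    eigenvalues lie strictly inside the unit circle."""
    p, M = len(A_list), A_list[0].shape[0]
    C = np.zeros((p * M, p * M))
    C[:M, :] = np.hstack(A_list)
    C[M:, :-M] = np.eye((p - 1) * M)
    return C

def is_stable(A_list):
    return np.max(np.abs(np.linalg.eigvals(companion(A_list)))) < 1.0

def ma_coefficients(A_list, n_terms):
    """Psi_0 = I, Psi_j = sum_{i=1}^{min(j,p)} A(i) Psi_{j-i} (from A(z)^{-1})."""
    p, M = len(A_list), A_list[0].shape[0]
    Psi = [np.eye(M)]
    for j in range(1, n_terms):
        Psi.append(sum(A_list[i - 1] @ Psi[j - i] for i in range(1, min(j, p) + 1)))
    return Psi
```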

Decorrelated Score Function
Using the framework developed in Ning et al. [2017] for independent design, we consider the decorrelated score test. First we define the score function $S(A^*) \in \mathbb{R}^{M \times pM}$ entry-wise. As pointed out in Ning et al. [2017], the standard score function is infeasible in high dimensions, so we consider the decorrelated score function, with each $S_m \in \mathbb{R}^{d_m}$ corresponding to the tested row and index set $(m, D_m)$. Here $X_{t,D_m} \in \mathbb{R}^{d_m}$ is composed of the entries of $X_t$ whose indices lie in the set $D_m$, $X_{t,D_m^c} \in \mathbb{R}^{pM - d_m}$ is defined similarly, and $w^*_m \in \mathbb{R}^{(pM - d_m) \times d_m}$ is chosen to satisfy a decorrelation condition with respect to the remaining covariates $X_{t,D_m^c}$. Specifically, $w^*_m$ is defined as a function of $\Upsilon = \mathrm{Cov}(X_t) \in \mathbb{R}^{pM \times pM}$.
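Assuming the standard decorrelation choice $w^*_m = (\Upsilon_{D_m^c, D_m^c})^{-1}\Upsilon_{D_m^c, D_m}$, which makes $X_{t,D_m} - w_m^{*\top} X_{t,D_m^c}$ uncorrelated with $X_{t,D_m^c}$ (the explicit display is omitted above, so this is our reading of the definition), the population direction can be computed as:

```python
import numpy as np

def decorrelation_direction(Upsilon, D):
    """w*_m = Upsilon[Dc, Dc]^{-1} Upsilon[Dc, D] (assumed explicit form); it
    makes X_{t,D} - w*_m' X_{t,Dc} uncorrelated with X_{t,Dc}."""
    Dc = np.setdiff1d(np.arange(Upsilon.shape[0]), D)
    return np.linalg.solve(Upsilon[np.ix_(Dc, Dc)], Upsilon[np.ix_(Dc, D)])
```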

Test Statistic
Based on the decorrelated score function $S_m$, we first define the statistic $V_{T,m} \in \mathbb{R}^{d_m}$, with $\Upsilon^{(m)} \in \mathbb{R}^{d_m \times d_m}$ defined as the covariance matrix of the decorrelated covariates $X_{t,D_m} - w_m^{*\top} X_{t,D_m^c}$. Let $V_T$ be the $d$-dimensional vector obtained by concatenating the $V_{T,m}$'s. One of the main results of the paper is to show that $V_T$ is asymptotically Gaussian. Define $U_T = \|V_T\|_2^2$; then $U_T$ is asymptotically $\chi^2_d$. Since we do not know $\epsilon_t$, $w^*_m$, and $\Upsilon^{(m)}$, we later define estimators for these quantities. Formally, we define our test statistic $\hat U_T$ in terms of $\hat\Upsilon^{(m)} \in \mathbb{R}^{d_m \times d_m}$, an estimator of $\Upsilon^{(m)}$, and $\hat S_m \in \mathbb{R}^{d_m}$, which is defined using $\hat A_m \in \mathbb{R}^{pM}$ and $\hat w_m \in \mathbb{R}^{(pM - d_m) \times d_m}$ estimating $A^*_m$ and $w^*_m$. Here we are not worried about invertibility of $\hat\Upsilon^{(m)}$, since $\Upsilon^{(m)}$ is a low-dimensional covariance matrix. To guarantee good estimation of the high-dimensional parameters $A^*_m$ and $w^*_m$, we impose sparsity conditions on them; specifically, for each $1 \leq m \leq k$ we denote the corresponding sparsity levels by $s_m$ and $\rho_m$ (with $s$ and $\rho$ their maxima over $m$), and note that they both depend on $A^*$.
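A schematic of how the plug-in statistic can be assembled from the estimated quantities; since the defining displays are omitted above, the residual-based score and the normalization below reflect our reading of the construction, and all function names are hypothetical.

```python
import numpy as np

def decorrelated_score(Y_m, Z, A_hat_m, w_hat_m, D):
    """hat S_m = (1/T) sum_t (Y_m[t] - Z[t]'A_hat_m) * (Z[t,D] - w_hat_m' Z[t,Dc])."""
    Dc = np.setdiff1d(np.arange(Z.shape[1]), D)
    resid = Y_m - Z @ A_hat_m              # length-T vector of regression residuals
    decorr = Z[:, D] - Z[:, Dc] @ w_hat_m  # T x d_m decorrelated covariates
    return decorr.T @ resid / Z.shape[0]

def test_statistic(scores, Upsilon_hats, T):
    """hat U_T = T * sum_m S_m' (hat Upsilon^(m))^{-1} S_m."""
    return T * sum(S @ np.linalg.solve(U, S) for S, U in zip(scores, Upsilon_hats))
```

Under $H_0$, the value returned by `test_statistic` would be compared with the $(1-\alpha)$ quantile of $\chi^2_d$.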
The sparsity of $w^*_m$ is implied by the sparsity of $\Upsilon^{-1}$, which is a common condition in the high-dimensional hypothesis testing literature (see, e.g., Van de Geer et al. [2014]). Specifically, the following lemma shows that when the lag $p = 1$ and $A^*$ is symmetric, the sparsity of $w^*_m$ is implied by the sparsity of $A^*$. The proof of Lemma 2.1 is included in Appendix E.

Theoretical guarantee
In this section, we present uniform convergence results for the test statistic $\hat U_T$ under $H_0$ and $H_A$, for $A^*$ and estimators satisfying suitable conditions. We also provide feasible estimators and prove that they satisfy the corresponding conditions in Section 3.3. Unknown variance and confidence region construction are discussed in Sections 3.4 and 3.5. In Section 3.6 we derive the consequences of our theory for the AR(1) model with Gaussian noise and compare our results with Neykov et al. [2018].
Recall the null hypothesis $H_0$. For the alternative hypothesis, as in Ning et al. [2017], we consider alternatives of the form $A^*_D = T^{-\phi}\Delta$ for some constant $\phi > 0$ and constant vector $\Delta \in \mathbb{R}^d$, and we write $\Delta = (\Delta_1^\top, \cdots, \Delta_k^\top)^\top$, where each $\Delta_m \in \mathbb{R}^{d_m}$. The reason why $T^{-\phi}\Delta$ rather than $\Delta$ is considered in (12) is that we expect the test to become more sensitive as the sample size increases. We will see how the value of $\phi$ influences the convergence of $\hat U_T$ in Theorem 3.2.
We still assume the $\epsilon_{ti}$'s are i.i.d. sub-Gaussian random variables, and we also consider the special case $\epsilon_t \sim N(0, I)$. We compare our result in the Gaussian case to the results in Neykov et al. [2018].
First we define the sets $\Omega_0$ and $\Omega_1$ of feasible parameter matrices $A^*$ under $H_0$ and $H_A$ respectively. To control the stability of $\{X_t\}$ in model (3), we impose the condition $\left(\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\|\Psi_{i+j}\|_2^2\right)^{1/2} \leq \beta$. We also impose error-bound conditions on the estimators (Assumptions 3.1–3.3), required to hold for $1 \leq m \leq k$ with probability at least $1 - c_1\exp\{-c_2\log M\}$.
These are standard error bounds for Lasso estimators and the Dantzig selector under independent design. In this paper we verify Assumption 3.1 in Section 3.3, along with the remaining two assumptions, when we have dependent sub-Gaussian random variables, as in our vector auto-regressive setting.
Similar to Assumption 3.1, we will show that both the Lasso estimator and the Dantzig selector under model (3) satisfy Assumption 3.2.
Note that $\Upsilon^{(m)} \in \mathbb{R}^{d_m \times d_m}$ is a low-dimensional matrix, and thus it is computationally feasible to use the sample covariance matrix of $X_{t,D_m} - \hat w_m^\top X_{t,D_m^c}$ as an estimator for $\Upsilon^{(m)}$. We show in Section 3.3 that, as long as $\hat w_m$ is a reliable estimator of $w^*_m$, $\hat\Upsilon^{(m)}$ satisfies an even tighter bound than (19); the looser bound required in Assumption 3.3 therefore allows more choices of estimators for $(\Upsilon^{(m)})^{-1}$, as shown in Section 3.5.

Uniform convergence under null hypothesis
Based on these assumptions, we have the following main theorem.
when T > C for some constant C. Here the constants C i 's depend on p, d, β, τ .
Theorem 3.1 establishes weak convergence of $\hat U_T$ to $\chi^2_d$. The uniform convergence rate can be understood as follows: the first term is the rate obtained from the martingale CLT, where we obtain $T^{-\frac{1}{8}}$ rather than $T^{-\frac{1}{2}}$ due to the dependence; the remaining two terms arise from estimation error, the second being the error bounds and the third being the probability that the error bounds fail to hold. If we assume Gaussianity, we can improve the first term in the rate of convergence from $T^{-\frac{1}{8}}$ to $T^{-\frac{1}{4}+\alpha}$ for any $\alpha > 0$. To the best of our knowledge, ours is the first work that formally characterizes these rates of convergence.
Remark 3.1. Compared to the theoretical result for independent design in Ning et al. [2017], the only additional condition we add is $\left(\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\|\Psi_{i+j}\|_2^2\right)^{1/2} \leq \beta$, which is used to control the strength of dependence uniformly. We also consider multivariate testing, which is more general, and derive the explicit convergence rate.
Remark 3.2. The test statistics proposed in Van de Geer et al. [2014] and Javanmard and Montanari [2014] for independent design share similar ideas with our test statistic. Instead of imposing a sparsity assumption on $w^*_m$, Van de Geer et al. [2014] assume $\Upsilon^{-1}$ to be row-wise sparse, which is equivalent to the sparsity assumption on $w^*_m$ in the univariate case. Javanmard and Montanari [2014] do not require the sparsity condition on $\Upsilon^{-1}$, but it is hard to extend their theory to the time series setting due to the difficulty of applying the martingale CLT.
Remark 3.3. The theoretical guarantee we obtain here is more general and stronger than the result achieved in Neykov et al. [2018]. A more detailed comparison is presented in Section 3.6.

Uniform convergence under alternative hypothesis
Recall the definition of $\Omega_A$ in (16). The following theorem establishes the asymptotic behavior of $\hat U_T$ for $A^* \in \Omega_A$ for different values of $\phi$. First we define the noncentrality quantity in terms of $\Upsilon^{(m)}$, which is defined in (8).
Here C i 's are constants depending on p, d, β, ∆, τ .
Theorem 3.2 gives the threshold value of $\phi$ for $H_A$ to be detectable. When $\phi > \frac{1}{2}$, we cannot distinguish $H_0$ and $H_A$, since in both cases $\hat U_T$ converges to $\chi^2_d$. When $\phi < \frac{1}{2}$, $\hat U_T$ diverges to $+\infty$ in probability, so it is easy to detect $H_A$. When $\phi = \frac{1}{2}$, $\hat U_T$ converges to a non-central $\chi^2_d$ distribution with noncentrality parameter determined by the constant vector $\Delta$ and $\Upsilon = \mathrm{Cov}(X_t)$, which determines the power of the test. Note that (23) also holds for the trivial case $\phi < 0$, since we do not use the fact $\phi > 0$ in the proof.
Remark 3.4. Theorem 3.2 is consistent with the threshold value of $\phi$ given by Ning et al. [2017] for linear regression with i.i.d. samples. However, Ning et al. [2017] assume additional conditions on the scaling of the sample size, the number of covariates and the sparsity of $w^*_m$ for proving asymptotic power. Our conditions are exactly the same as those under $H_0$, owing to a more specific model and a careful analysis.

Feasible Estimators
Both the estimation of $w^*_m$ and of $A^*$ can be viewed as high-dimensional sparse regression problems, so we can use the Lasso or the Dantzig selector. Formally, we define a Lasso estimator (25) and a Dantzig selector estimator (26) for $A^*$, and similarly, for $1 \leq m \leq k$, Lasso and Dantzig estimators (27)–(28) for $w^*_m$. For estimating $\Upsilon^{(m)}$, since this is the low-dimensional covariance matrix of $X_{t,D_m} - w_m^{*\top} X_{t,D_m^c}$, we can directly use the sample covariance of $X_{t,D_m} - \hat w_m^\top X_{t,D_m^c}$ as $\hat\Upsilon^{(m)}$ in (29), for $1 \leq m \leq k$; the $\hat w_m$ in (29) can be either the Lasso or the Dantzig estimator. As shown in the following lemmas, estimators (25) to (29) all satisfy Assumptions 3.1 to 3.3 under the model setting stated in (3). In particular, the bounds for the estimators defined in (25) and (26) hold with probability at least $1 - c_1\exp\{-c_2\log M\}$ when $T > Cs^2\log M$.
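For concreteness, a row-wise Lasso fit in the spirit of (25) and (27) and the plug-in sample covariance (29) might be implemented as below; we use scikit-learn's Lasso, whose penalty parametrization differs from the paper's $\lambda_A$, $\lambda_w$ by a normalization factor, so treat this as a sketch rather than the exact estimator.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_row_estimates(Y, Z, lam):
    """Estimate each row A*_m by an l1-penalized regression of Y[:, m] on Z."""
    A_hat = np.zeros((Y.shape[1], Z.shape[1]))
    for m in range(Y.shape[1]):
        A_hat[m] = Lasso(alpha=lam, fit_intercept=False).fit(Z, Y[:, m]).coef_
    return A_hat

def nuisance_direction(Z, D, lam):
    """Estimate w_m column-by-column by regressing Z[:, D] on Z[:, Dc]."""
    Dc = np.setdiff1d(np.arange(Z.shape[1]), D)
    w_hat = np.column_stack(
        [Lasso(alpha=lam, fit_intercept=False).fit(Z[:, Dc], Z[:, j]).coef_ for j in D]
    )
    return w_hat, Dc

def upsilon_hat(Z, D, w_hat, Dc):
    """Sample covariance of the decorrelated covariates X_{t,D} - w_hat' X_{t,Dc}."""
    R = Z[:, D] - Z[:, Dc] @ w_hat
    return R.T @ R / R.shape[0]
```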
Note that Lemma 3.3 is stronger than Assumption 3.3. The proofs of these lemmas are deferred to Appendix A. Combining these lemmas with Theorems 3.1 and 3.2, we arrive at the following corollary: with the estimators (25)–(29) and $T > C$ for some constant $C > 0$, the bounds (20) to (24) from Theorems 3.1 and 3.2 hold.

Variance Estimation
In this section, we consider the case where $\sigma^{*2} = \mathrm{Var}(\epsilon_{ti})$ is unknown under model (3). If $\sigma^*$ is known (not necessarily equal to 1), it is straightforward to extend Theorems 3.1 and 3.2 to the test statistic defined with rescaled data: if we consider $Y_t = X_t/\sigma^*$, the time series $Y_t$ satisfies the same model with unit-variance noise.
When $\sigma^{*2}$ is unknown, we plug in an estimator $\hat\sigma^2$ and define the test statistic accordingly. We show that this test statistic has the same convergence results we derived for the unit-variance noise case.
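One natural residual-based choice for $\hat\sigma^2$ (the display defining the paper's estimator (31) is omitted above, so this is only an assumed form) is the average squared one-step-ahead residual:

```python
import numpy as np

def sigma2_hat(Y, Z, A_hat):
    """hat sigma^2 = (1 / (T M)) * sum_{t,m} (Y[t, m] - Z[t] @ A_hat[m])^2."""
    resid = Y - Z @ A_hat.T
    return np.mean(resid ** 2)
```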
Theorem 3.3. Consider model (3) with i.i.d. sub-Gaussian noise $\epsilon_{ti}$ of variance $\sigma^{*2} = \mathrm{Var}(\epsilon_{ti}) \geq \sigma_0^2 > 0$ and scale factor $\tau\sigma^*$. Then Theorems 3.1 and 3.2 hold for the test statistic with estimated variance under each corresponding condition, and the constants $C_i$ also depend on $\sigma_0$. Theorem 3.3 shows that when we have to estimate the unknown $\sigma^{*2}$, the test statistic maintains the same asymptotic behavior as in the known-variance case, provided that all the assumptions on the estimation errors are satisfied and $\sigma^*$ is bounded below by a constant.
Remark 3.5. With sub-Gaussian noise $\epsilon_{ti}$, if we assume the scale factor $\tau\sigma^*$ of $\epsilon_{ti}$ to be bounded by a constant, then Lemmas 3.1 to 3.3 still hold. Thus the assumptions imposed on the estimation errors of $\hat A$, $\hat w_m$ and $\hat\Upsilon^{(m)}$ are all satisfied. However, if we do not assume $\sigma^*$ to be bounded, then the tuning parameters $\lambda_A$ and $\lambda_w$ have to scale with $\sigma^*$.
Remark 3.6. Neykov et al. [2018] propose another estimator for the variance of $\epsilon_{ti}$, based on the fact that $\Sigma = A\Sigma A^\top + \mathrm{Cov}(\epsilon_t)$. Both estimators are consistent and lead to convergence in distribution results.

Semi-parametric Optimal Confidence Region
In this section, we construct a confidence region for $A_D$ under model (3) with unknown noise variance $\sigma^{*2}$. Similar to Ning et al. [2017], we consider a one-step estimator $\hat a(m)$ for each $(A^*_m)_{D_m}$, based on the decorrelated score function, where $\hat A_m$ is any estimator satisfying Assumption 3.1 on the error bound for $\hat A_m - A^*_m$; both the Lasso and the Dantzig estimator of $A^*_m$ are suitable. The covariance estimator used here takes a slightly different form and gives another estimator of $\Upsilon^{(m)}$. We will show that $\hat a(m) - (A^*_m)_{D_m}$ is asymptotically Gaussian with covariance matrix $(\Upsilon^{(m)})^{-1}$. Thus we construct a confidence region for $A_D$ with asymptotic confidence coefficient $1 - \alpha$: a $d$-dimensional elliptical ball with center vector $(\hat a(1)^\top, \ldots, \hat a(k)^\top)^\top$. The following theorem gives the weak convergence result for the associated statistic $R_T$. Theorem 3.4. Under model (3) with i.i.d. sub-Gaussian noise $\epsilon_{ti}$ with variance $\sigma^{*2} = \mathrm{Var}(\epsilon_{ti}) \geq \sigma_0^2 > 0$ and sub-Gaussian scale factor $\tau\sigma^*$, Theorems 3.1 and 3.2 hold for $R_T$ under each corresponding condition, and the constants $C_i$ also depend on $\sigma_0$. Remark 3.8. We obtain exactly the same theoretical result for $\hat U_T$ and $R_T$, owing to the close relationship between the two quantities; in particular, the alternative covariance estimator also satisfies Assumption 3.3 as an estimator for $(\Upsilon^{(m)})^{-1}$.
Remark 3.9. The one-step estimator $\hat a(m)$ is asymptotically unbiased and shares a similar form with the de-biased estimators proposed by Zhang and Zhang [2014] and Van de Geer et al. [2014]. The de-biased estimator in Van de Geer et al. [2014] would take an analogous form under our setting, where $\hat\Theta$, computed by node-wise regression, is an estimator for $\Upsilon^{-1}$. When $d_m = |D_m| = 1$, this is essentially the same as our estimator $\hat a(m)$, but it differs slightly in the multivariate case. Note that the asymptotic covariance matrix of $\hat a(m)$ equals the partial information matrix $I^*(A_{m,D_m} \mid A_{m,D_m^c})$, and $\hat a(m)$ is therefore semi-parametrically efficient, while $\hat b_m$ is only efficient when it is a scalar.
Remark 3.10. $R_T$ is also very similar to the test statistic proposed by Neykov et al. [2018] for the VAR model with lag 1. The only differences lie in the estimation of $\mathrm{Var}(\epsilon_{ti})$, and in the fact that they only consider the Dantzig selector for estimating $A^*$ and $w^*_m$. We provide a detailed comparison between their theoretical results and ours in Section 3.6.

Special case: AR(1) with Gaussian noise
Our theoretical guarantees cover VAR models with lag $p$ and sub-Gaussian noise, of which the AR(1) model and Gaussian noise are special cases. Here we explain the consequences of our results in this special case and provide a comparison with Neykov et al. [2018].
When we consider lag $p = 1$, the constraint on $A^*$ simplifies accordingly. The two sparsity conditions and the sample size requirement are included in the conditions imposed by Neykov et al. [2018]. In addition, they assume further conditions for some $0 < \varepsilon < 1$. Note that we do not require these conditions, among which the first and third are quite strong. Until now, the discussion has focused on the case where the $\epsilon_{ti}$ are i.i.d. sub-Gaussian with scale factor $C\sigma^*$, with $(\sigma^*)^2$ the variance of $\epsilon_{ti}$ bounded below by a constant. Thus our setting covers the case $\epsilon_t \sim N(0, (\sigma^*)^2 I)$ with $\sigma^* \geq c$. If $\epsilon_t \sim N(0, \Psi)$ with $\Psi_{ii} \geq c$, as assumed in Neykov et al. [2018], we can still prove the same theoretical guarantees, under an even weaker condition based on the spectral density, owing to the concentration bounds established in Basu et al. [2015].

Numerical Experiments
In this section, we provide a simulation study to validate our theoretical results. For simplicity, our simulations are based on the AR(1) model $X_{t+1} = A^* X_t + \epsilon_t$, where $A^* \in \mathbb{R}^{M \times M}$ is set to be row-wise sparse. Symmetry is not required by our theory, but in order to ensure the sparsity of $w^*_m$, we focus on symmetric matrices under $H_0$ and slightly asymmetric ones under $H_A$. The eigenvalues of $A^*$ all fall within the unit circle of the complex plane, which ensures the existence of a stationary solution to this model. The white noise $\epsilon_{ti}$ is simulated as independent Uniform$(-1,1)$ random variables in order to satisfy the sub-Gaussianity condition. Other distributions were also used but are not reported, since the results were very similar.
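A sketch of the matrix generation used in this kind of experiment; the helper below (its name and the exact sparsity mechanism are ours) produces a symmetric, row-wise sparse $A^*$ rescaled to a target spectral norm below one, and data can then be generated with `simulate_var([A], T)` from the sketch in Section 2.

```python
import numpy as np

def random_sparse_symmetric(M, row_sparsity, spec_norm=0.75, rng=None):
    """Random symmetric A* with roughly `row_sparsity` nonzeros per row,
    rescaled so its largest absolute eigenvalue equals `spec_norm` (< 1),
    which guarantees a stationary AR(1) process."""
    rng = np.random.default_rng(rng)
    A = np.zeros((M, M))
    for i in range(M):
        cols = rng.choice(M, size=row_sparsity, replace=False)
        A[i, cols] = rng.uniform(-1, 1, size=row_sparsity)
    A = (A + A.T) / 2
    return A * (spec_norm / np.max(np.abs(np.linalg.eigvalsh(A))))
```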
To consider multivariate test sets, throughout the simulations we test an index set $D$ with $d = |D| = 6$, which involves three different rows and two columns in each row. The null hypothesis takes the form $H_0: A_D = \mu$ for some $d$-dimensional vector $\mu$. Correspondingly, we consider the alternative hypothesis $H_A: A_D = \mu + T^{-\phi}\Delta$, with $\Delta$ drawn from a $d$-dimensional Gaussian distribution and $\phi$ ranging from 0.25 to 1.2.
Under $H_0$, we generate $A^*$ with different row-wise sparsity levels and structures, and for each $A^*$ the vector $\mu$ may differ depending on the corresponding $A_D$. Under $H_A$, the matrices $A^*$ are the same as under $H_0$, except that the tested entries $A_D$ are shifted by $T^{-\phi}\Delta$. The experiments are repeated under different settings of $A^*$, $\Delta$, $M$, $T$ and $\phi$.
We use the Lasso estimators defined in (25) and (27) for the estimation of $A^*$ and $w^*_m$, $1 \leq m \leq k$, and the tuning parameters $\lambda_A$, $\lambda_w$ are selected by cross-validation. In the cross-validation, the training sets are composed of consecutive time series observations, with the remaining 10% of the original data serving as test sets. Under $H_0$, 1000 simulations are carried out for each parameter setting, while under $H_A$ we run 100 simulations. In the following sections, we examine the false positive rate (FPR) and true positive rate (TPR) of the test statistics $\hat U_T$ and $R_T$ as defined in (32) and (36), with the level of the test set to $\alpha = 0.05$.
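Because the observations are dependent, the cross-validation split keeps time order; a minimal sketch of selecting $\lambda$ on the last 10% of the series (the exact fold construction in the experiments may differ) is:

```python
import numpy as np
from sklearn.linear_model import Lasso

def choose_lambda_ts(Y_col, Z, lambdas, val_frac=0.1):
    """Pick lambda by fitting on the first (1 - val_frac) of the series
    and scoring one-step-ahead prediction error on the held-out tail."""
    n_train = int((1 - val_frac) * len(Y_col))
    errs = []
    for lam in lambdas:
        fit = Lasso(alpha=lam, fit_intercept=False).fit(Z[:n_train], Y_col[:n_train])
        errs.append(np.mean((Y_col[n_train:] - Z[n_train:] @ fit.coef_) ** 2))
    return lambdas[int(np.argmin(errs))]
```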

Under the Null Hypothesis
(1) Varying sparsity. Here we summarize the experiments with randomly generated $A^*$ that are symmetric and row-wise sparse, with different sparsity levels $\rho$ as defined in (10). Figure 1 shows how the FPR of $\hat U_T$ and $R_T$, averaged over 1000 experiments, varies with $\sqrt{T}$. We can see that when $T$ increases to about 500, the FPR becomes stable and close to $\alpha = 0.05$ regardless of $\rho$, $M$, and the choice between $\hat U_T$ and $R_T$.
When the sample size $T$ is small, the test tends to be conservative, which is a consequence of estimating the variance $\sigma^{*2}$ and the covariances $\Upsilon^{(m)}$. In the simulations we use the naive estimators for these two quantities defined in (31) and (29), which tend to be smaller than the true parameters; this is because the regression typically also fits part of the noise, as noted by Fan et al. [2012]. As shown in these figures, $R_T$ is less conservative than $\hat U_T$ when $T$ is small, probably because it uses a better estimator for $\Upsilon^{(m)}$. We also summarize the FPR when the variance $\sigma^{*2}$ of $\epsilon_{ti}$ is known in Figure 2: $\hat U_T$ is still slightly conservative when $T$ is small, while $R_T$ with $\hat\sigma^2$ replaced by $\sigma^{*2}$ is not conservative.
(2) Different graph structures. If we view the $M$ actors in the time series as nodes in a network, and a nonzero $A^*_{ij}$ as a directed edge from $j$ to $i$, then each matrix $A^*$ corresponds to an $M$-dimensional directed graph. We experiment with different structures of $A^*$, corresponding to different graph structures, including a block graph and a chain graph. Specifically, we consider matrices with $\ell_2$ norm equal to 0.75: $A^{(1)}$, a block-diagonal matrix, which is a block graph; $A^{(2)}$, a chain-graph matrix with nonzero entries equal to a constant $c$ chosen to ensure $\|A^{(2)}\|_2 = 0.75$; and $A^{(3)}$, a randomly generated symmetric matrix with sparsity level $\rho = 2$ and largest eigenvalue equal to 0.75. Figure 3 (FPR under the different graph structures; block refers to $A^{(1)}$, chain refers to $A^{(2)}$ and random refers to $A^{(3)}$) shows the differences among these three structures. The block graph is less accurate than the other two, due to a larger variance in that setting. Investigating how the graph structure theoretically influences testing performance remains an open and interesting direction.
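For reproducibility, the two structured designs can be constructed as below; the $2\times 2$ block values $[[1/4, 1/2],[1/2, 1/4]]$ are inferred from the matrix fragment shown above (their spectral norm is indeed $0.75$), and the chain matrix is taken to be tridiagonal with a constant off-diagonal entry, so both should be read as our reconstruction.

```python
import numpy as np

def block_graph(M):
    """Block-diagonal A with 2x2 blocks [[1/4, 1/2], [1/2, 1/4]]; ||A||_2 = 0.75 (M even)."""
    block = np.array([[0.25, 0.5], [0.5, 0.25]])
    return np.kron(np.eye(M // 2), block)

def chain_graph(M, spec_norm=0.75):
    """Tridiagonal (path-graph) A with constant off-diagonal c, scaled so ||A||_2 = spec_norm."""
    A = np.eye(M, k=1) + np.eye(M, k=-1)
    return A * (spec_norm / np.linalg.norm(A, 2))
```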

Alternative Hypothesis
First we look at how the true positive rate (TPR) varies with $\|T^{-\phi}\Delta\|_2$; since we set $H_A: A_D = \mu + T^{-\phi}\Delta$, the quantity $\|T^{-\phi}\Delta\|_2$ may be viewed as a measure of the distance from the null hypothesis. Figure 4 presents only the simulation results for $A^* = A^{(1)}$ and $M = 300$; the other choices of $A^*$ and $M$ produce very similar results. We can see from these figures that as $\|T^{-\phi}\Delta\|_2$ increases, the TPR approaches 1. The slope increases when the sample size $T$ gets larger, or when the test statistic changes from $R_T$ to $\hat U_T$. This aligns with intuition, since a larger $T$ should allow us to distinguish $H_0$ from $H_A$ better, and $\hat U_T$ is more conservative than $R_T$, as shown in Section 4.1.
We also examine the influence of $\phi$. Figure 5 shows how the TPR changes as $T$ increases, with $\|\Delta\|_2$ and $\phi$ fixed. If $\phi < 0.5$, the TPR converges to 1 very quickly, while if $\phi > 0.5$, the TPR converges to 0.05, with slower convergence as $\phi$ or $\|\Delta\|_2$ increases. When $\phi = 0.5$, Theorems 3.3 and 3.4 state that $\hat U_T$ and $R_T$ converge to a noncentral $\chi^2_{d,\|\Delta\|_2^2}$ distribution, so the TPR should converge to some value between 0.05 and 1, depending on $d$ and $\|\Delta\|_2^2$. The black lines in Figure 5 indicate this limiting value, but since the test tends to be conservative when $T$ is not large enough, the TPR for $\phi = 0.5$ usually lies above the black line. The conservativeness is more severe under $H_A$, since the deviation $\Delta$ is also multiplied by the estimated variances, which exaggerates the conservative tendency. However, this may not be a big concern under $H_A$, since we always want the TPR to be large.

Proof Overview
One of the main contributions of this work is the proof technique, which addresses a number of technical challenges and develops novel concentration bounds for dependent sub-Gaussian random vectors. In this section, we present and discuss key lemmas for the proof and provide the main steps for proving Theorems 3.1 and 3.2, deferring the more technically intensive steps to the supplement.

Key Lemmas
The major technical challenge lies in proving the following two concentration bounds for dependent sub-Gaussian random vectors.
Lemma 5.1 states that, with probability at least $1 - c_1\exp\{-c_2\log M\}$, $\big\|\frac{1}{T}\sum_{t=0}^{T-1}\epsilon_t X_t^\top\big\|_\infty \leq C\sqrt{\frac{\log M}{T}}$; it is a standard deviation bound used for proving estimation error bounds of Lasso-type or Dantzig-selector-type estimators. We apply this lemma in the proofs of Theorems 3.1 and 3.2 and of Lemma 3.1. (Caption of Figure 5: results for graph sizes $M$ from 30 to 300 are combined and the average TPR is taken; the red line is the significance level $\alpha$, the value the TPR should converge to when $\phi < 0.5$, while the black line is the convergence point specified in Theorem 3.2 when $\phi = 0.5$.)
Lemma 5.2 considers model (3) with sub-Gaussian noise $\epsilon_{ti}$ with constant scale factor $\tau$. It provides a concentration bound for the sample average of the general quadratic form $X_t^\top B X_t$, and is very helpful for proving the martingale CLT under our setting, the restricted eigenvalue condition, Lemma 3.3, etc.
In the Gaussian case, both of these lemmas follow from prior work in Basu et al. [2015], which relies on the fact that dependent Gaussian vectors can be rotated to be independent. Since dependent sub-Gaussian random variables cannot be rotated to be independent (only uncorrelated), we exploit the independence of the $\epsilon_t$ by representing each $X_t$ as a linear function of the infinite series $\{\epsilon_i\}_{i=-\infty}^{t}$ and then using a careful truncation argument: we analyze sufficiently many terms in the summation and control the infinite residual.
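Schematically, the truncation splits the moving-average representation at a level $m$ of our choosing (the exact splitting used in the appendix may differ):
\[
X_t \;=\; \underbrace{\sum_{j=0}^{m} \Psi_j\,\epsilon_{t-j-1}}_{\text{finite part}} \;+\; \underbrace{\sum_{j=m+1}^{\infty} \Psi_j\,\epsilon_{t-j-1}}_{\text{residual}},
\]
so that a quadratic form such as $\frac{1}{T}\sum_t X_t^\top B X_t$ decomposes into a quadratic form in finitely many independent noise vectors, handled by the Hanson–Wright inequality, plus cross and residual terms whose sub-exponential norms are driven to zero by taking $m$ large.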

Proof of Theorem 3.1
Proof. Suppose A * ∈ Ω 0 . We will use C i , c i to refer to constants that only depend on p, d, β, τ (not M or T ), and different constants might share the same notation.
The proof can be divided into two major parts: showing the convergence of $U_T$ to $\chi^2_d$, and bounding the estimation error $\hat U_T - U_T$. Formally, this holds for any $\varepsilon > 0$. In the following, we bound each of the three terms. The following lemma shows the uniform weak convergence rate of $\|V_T + \mu\|_2^2$ to $\chi^2_{d,\|\mu\|_2^2}$, of which the convergence of $U_T = \|V_T\|_2^2$ to $\chi^2_d$ is a special case.
Lemma 5.3 (Convergence rate of $\|V_T + \mu\|_2^2$). Under model (3) with $\epsilon_{ti}$ being sub-Gaussian noise with scale factor $\tau$, for any $A^* \in \Omega_0$ and any $\mu \in \mathbb{R}^d$, when $T > C$ for some absolute constant $C$, the stated bound holds, where $C(\|\mu\|_2)$ is a constant depending on, and non-decreasing in, $\|\mu\|_2$.
This lemma is proved in Section C by applying a uniform martingale central limit theorem. Thus, by Lemma 5.3, if $T > C$ for some constant $C$, the corresponding bound follows, since $\chi^2_d$ has bounded density. Now we only need to choose a proper $\varepsilon$ and bound $\mathbb{P}(|\hat U_T - U_T| > \varepsilon)$.
We can bound $\|V_{T,m}\|_2$ using Lemma 5.3 and (19), while for bounding the estimation-induced error $\|E_m\|_2$, we first apply the following lemma to bound the eigenvalues of $\Upsilon^{(m)}$.
Lemma 5.4. Consider model (2) with independent noise $\epsilon_{ti}$ of unit variance and $A^*$ satisfying (13); then the eigenvalues of $\Upsilon$ can be bounded as follows. Lemma 5.4 is proved using established results in Basu et al. [2015]. Note that we assume unit variance in Theorems 3.1 and 3.2, so we can apply Lemma 5.4 here. Since $(\Upsilon^{(m)})^{-1} = (\Upsilon^{-1})_{D_m,D_m}$, applying Lemma 5.4 leads to the corresponding eigenvalue bounds. The following two lemmas provide deviation bounds for the relevant empirical processes. Lemma 5.1 is a common condition in high-dimensional regression problems, usually referred to as a deviation bound; we prove it in Section C.
Lemma 5.6 (Deviation bound for $w^*_m$). With probability at least $1 - c_1\exp\{-c_2\log M\}$, for all $1 \leq m \leq k$, the stated bound holds. Lemma 5.6 can also be viewed as a deviation bound if we consider a regression problem with $X_{t,D_m}$ as response and $X_{t,D_m^c}$ as covariates; it is also proved in Section C. Applying Assumptions 3.1 and 3.2, with probability at least $1 - c_1\exp\{-c_2\log M\}$, Assumptions 3.1 and 3.2 imply $Q_1 \leq C\rho_m\frac{\log M}{T}$ and $Q_2 \leq Cs_m\frac{\log M}{T}$. The former is not straightforward: to see why it holds, let $\hat h_m = \hat A_m - A^*_m$ and $H = \frac{1}{T}\sum_{t=0}^{T-1}X_tX_t^\top$. Here we apply Assumption 3.1 and an auxiliary fact; the last inequality is due to Lemma 5.4 and the following lemma. Lemma 5.7. With probability at least $1 - c_1\exp\{-c_2\log M\}$, the stated bound holds. Therefore, by taking a union bound, we show that the bound holds for any $1 \leq m \leq k$, with probability at least $1 - c_1\exp\{-c_2\log M\}$.
Meanwhile, by applying Lemma 5.3, one can show that for $y > \sqrt{5d}$, the corresponding bound holds, where the second inequality is due to a $\chi^2_d$ tail bound established in Laurent and Massart [2000] (see Lemma 1 there), and the third inequality comes from the fact that for any constant $C_1 > 0$ there exists a constant $C_2$ such that $\sup_{y\geq 0} y^2 e^{-C_1 y^2} \leq C_2$.
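For reference, the $\chi^2_d$ tail bound of Lemma 1 in Laurent and Massart [2000] used here states that for $Z \sim \chi^2_d$ and any $x > 0$,
\[
\mathbb{P}\left(Z \geq d + 2\sqrt{dx} + 2x\right) \leq e^{-x}.
\]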
Let $y$ be of order $(s\vee\rho)\log M \cdot T^{-\frac{1}{4}}$ and plug it into (41); then with Assumption 3.3, we can show that the stated bound holds with high probability, provided $(s\vee\rho)\log M = o(\sqrt T)$ and $T > C$ for some constant $C$. Therefore, applying (38), we obtain the claimed rate. Since the constants $C_i$ only depend on $d$, $\beta$ and $\tau$, this bound also holds for the supremum over $A^* \in \Omega_0$ and $x \in \mathbb{R}$. Note that, for clarity of presentation, we are not showing the sharpest bound, which can be obtained by choosing a different $y$.

Proof of Theorem 3.2
Proof of Theorem 3.2. We prove the result case by case. We will use $C_i$, $c_i$ to refer to constants that depend only on $d, \beta, \Delta, \phi$, and different constants might share the same notation.
Similar to the proof of Theorem 3.1, the major part of the proof is devoted to bounding $|\hat U_T - \|V_T + \mu\|_2^2|$ with high probability for some vector $\mu \in \mathbb{R}^d$.
(1) $\phi = \frac{1}{2}$. Suppose $A^* \in \Omega_1$. Using a similar derivation as in the proof of Theorem 3.1, for any $\varepsilon > 0$, we decompose the probability as before. (a) Bounding the first two terms. The first term is the convergence rate of $\|V_T - \Delta\|_2^2$ to $\chi^2_{d,\|\Delta\|_2^2}$, controlled by Lemma 5.3; the last inequality is due to the uniform bound on $\|\Delta\|_2$ and an upper bound on $\Lambda_{\max}(\Upsilon^{(m)})$ from (42).
Bounding the second term in (46) is not as straightforward as in the proof of Theorem 3.1, since $\Delta$ is not a constant vector when $A^*$ ranges over $\Omega_1$; we only have a uniform bound on $\|\Delta\|_2$, as shown above. One can show the required bound, where $Z$ is a $d$-dimensional standard Gaussian random vector with density $\varphi(z) = C(d)\exp\{-\|z\|_2^2/2\}$; the last inequality holds because the corresponding Gaussian probability can be controlled uniformly over sets $\mathcal{C} \subset \mathbb{R}^d$. Similar to (41) in the proof of Theorem 3.1, it is straightforward to show the analogous bound, with $\tilde S_m \in \mathbb{R}^{d_m}$ and $W^*_m \in \mathbb{R}^{d_m \times M}$ defined below. Therefore, the last inequality applies (42). Meanwhile, the first equality and second inequality come from the definitions of $W^*_m$ and $w^*_m$; the third inequality is because $\|\Upsilon_{\cdot i}\|_2^2 = (\Upsilon^2)_{ii}$; the fourth inequality is due to $(\Upsilon^2)_{ii} = e_i^\top \Upsilon^2 e_i \leq \Lambda_{\max}(\Upsilon)^2$; and the last inequality is obtained from Lemma 5.4. Applying Lemma 5.7 leads to the corresponding bound. We can write $\hat S_m - S_m$ accordingly; due to Lemmas 5.4 and 5.7, this further implies the required control. Applying Assumptions 3.1 to 3.3 and Lemmas 5.1 and 5.6, one can show that, with probability at least $1 - c_1\exp\{-c_2\log M\}$, the bound holds, by the same arguments as for bounding $\|\hat S_m - S_m\|_2$ under $H_0$.
Applying Lemma 5.3 leads, for any $y \geq 0$, to the stated bound, where $Z \sim N(0, I_d)$. We apply the tail bound for $\chi^2_d$ (Lemma 1 in Laurent and Massart [2000]) as in (45). Plugging (51) and (19) into (47), and then applying (46), one can show the claimed bound. Since the constants $C_i$ only depend on $d, \beta, \Delta, \tau$, this bound also holds for the supremum over $A^* \in \Omega_1$ and $x \in \mathbb{R}$.
(2) $0 < \phi < \frac{1}{2}$. First we provide a lower bound for $\hat U_T$ that holds with high probability. Since the bounds in Assumptions 3.1 to 3.3 and Lemmas 5.1 to 5.7 hold with probability at least $1 - c_1\exp\{-c_2\log M\}$, we apply these bounds directly in the following derivation. Meanwhile, we always assume $(\rho\vee s)\log M = o(\sqrt T)$ and $T > C$ for the desired constant $C$. Under these conditions, one can derive a lower bound; the third line is due to Assumption 3.3, which yields control of $\hat\Upsilon^{(m)}$. We then provide a lower bound for the quadratic form in $(\Upsilon^{(m)})^{-1}$ and find upper bounds for the error terms; using the same argument as for bounding $\|\hat S_m - S_m\|_2$ in the proof of Theorem 3.1, we have the following.

To lower bound $\|E^{(2)}_m\|_2$, first note that we apply (49), Lemma 5.7, Assumption 3.2, and a bound on $\|\frac{1}{T}\sum_{t=0}^{T-1}X_tX_t^\top\|_\infty$ obtained by the same argument as in (50). Thus, since $\Delta_m$ is a constant vector and $\Lambda_{\min}(\Upsilon^{(m)})$ is lower bounded by a constant as in (42), the lower bound follows.

Applying these bounds for $E^{(2)}_m$, plugging them into (52) and applying Lemma 5.3, we obtain the stated bound, where in the last line we apply the $\chi^2_d$ tail bound as in (45). Since the constants here only depend on $d, \beta, \Delta, \tau$, this bound holds when taking the supremum over $A^* \in \Omega_1$ and $x \in \mathbb{R}$.
(3) $\phi > \frac{1}{2}$. The proof of this case is similar to that of Theorem 3.1. The only differences lie in the choice of $\varepsilon$ and in bounding $\mathbb{P}(|\hat U_T - U_T| > \varepsilon)$ via the bound (41). We directly apply the bounds in Assumptions 3.1 to 3.3 and Lemmas 5.1 to 5.7 in the following. First we write out the decomposition; note that the first three terms are exactly the same as in (43) and thus can be bounded as in the proof of Theorem 3.1, so we only have to tackle the last term. By (53), and going through the same arguments as for bounding $\|\hat S_m - S_m\|_2$ under $H_0$, we obtain the bound with probability at least $1 - C\exp\{-c\log M\}$. Recall that in (45), when $y > C$ for some constant $C$, the tail bound applies; then by (41) one can show the claimed bound with high probability, provided $T > C$ for some constant $C$. Therefore, applying (38), and since the constants $C_i$ only depend on $d, \beta, \tau, \Delta$, this bound also holds for the supremum over $A^* \in \Omega_1$ and $x \in \mathbb{R}$.

Conclusion
In this paper, we have provided theoretical guarantees for hypothesis tests for sparse high-dimensional auto-regressive models with sub-Gaussian innovations. Specific upper bounds on the convergence rates of the test statistics are given. Importantly, our results go beyond the Gaussian assumption and do not rely on mixing assumptions. As a consequence of our theory, we also develop novel concentration bounds for quadratic forms of dependent sub-Gaussian random variables using a careful truncation argument.
It would be of interest to consider other variance estimation methods, e.g., the scaled Lasso (Sun and Zhang [2012]) or cross-validation based methods (Fan et al. [2012]), and to establish the corresponding theoretical guarantees. There also remain a number of open questions and challenges, including extensions to generalized linear models, heavy-tailed innovations, and incorporating hidden variables in the time series setting.

A Proof of Lemmas in Section 3.3
Proof of Lemma 3.1. We prove the error bounds for each $\hat A_m$ and then take a union bound. Without loss of generality, we consider the estimation of $A^*_1 \in \mathbb{R}^{pM}$. With a slight abuse of notation, let $S = \mathrm{supp}(A^*_1)$, $\hat h = \hat A_1 - A^*_1$, and $H = \frac{1}{T}\sum_{t=0}^{T-1}X_tX_t^\top$ (here $S$ is not the decorrelated score function defined earlier). We bound $\|\hat h\|_1$, $\|\hat h\|_2$ and $\hat h^\top H \hat h$ in two cases separately: (1) $\hat A = \hat A^{(L)}$.
Here we adopt the standard proof framework for the Lasso. By (25), $\hat A_1 \in \mathbb{R}^{pM}$ satisfies the basic inequality of the Lasso; rearranging the terms and applying Lemma 5.1, with probability at least $1 - c_1\exp\{-c_2\log M\}$ the stated bound follows. We also have the following restricted eigenvalue condition for $H$.
Proof of Lemma 3.2. Without loss of generality, we consider the estimation of (w * 1 ) ·,1 and then take a union bound.
The following proof is almost identical to the proof of Lemma 3.1 in the case $\hat A = \hat A^{(L)}$, except for some differences in notation and in the lemmas applied. One can show that

Rearranging the inequality gives us
By Lemma 5.6, with probability at least $1 - c_1\exp\{-c_2\log M\}$, the deviation bound holds. Let $\tilde h \in \mathbb{R}^{M}$ be defined as follows. By Lemma A.1, when $T \geq Cs\log M$, with probability at least $1 - 2\exp\{-cT\}$, the restricted eigenvalue condition holds, and hence the stated bounds follow with probability at least $1 - c_1\exp\{-c_2\log M\}$.
(2) $\hat w_m = \hat w^{(D)}_m$. This proof is also similar to the proof of Lemma 3.1 in the case $\hat A = \hat A^{(D)}$. By Lemma 5.6, the deviation bound holds with probability at least $1 - c_1\exp\{-c_2\log M\}$; meanwhile, (58) gives the corresponding bound. Recalling the definition of $\tilde h$ in (57), by Lemma A.1, (59) and (57), when $T \geq Cs\log M$, the stated bounds hold with probability at least $1 - c_1\exp\{-c_2\log M\}$.

Since the bounds above hold for each column, taking a union bound over $\{\hat w_m : m = 1, \cdots, k\}$ and over all columns of each $\hat w_m$, the proof is complete.
Proof of Lemma 3.3. The following established result can be applied here. Since $\|I\|_2 = 1$, one can show the bound for $1 \leq m \leq k$ using (42). In the following we bound $\|\hat\Upsilon^{(m)} - \Upsilon^{(m)}\|$, where $W^*_m$ is defined as in (48); this is the maximum over deviations of certain quadratic forms from their expectations. The following lemma provides a bound for the quadratic form $\frac{1}{T}\sum_{t=0}^{T-1}X_t^\top B X_t$, with $B \in \mathbb{R}^{M \times M}$ being any symmetric matrix.
By Lemma 5.2, we only need to bound the trace norm and operator norm of the relevant matrix. The following lemma establishes the relationship between $\|\cdot\|_{tr}$ and $\|\cdot\|_2$ for symmetric matrices.
Lemma A.3. For any symmetric matrix U of rank r, U tr ≤ r U 2 .
Meanwhile, similarly to (49), the second inequality is due to the bound on $\|\Upsilon_{\cdot,i}\|_2$. By Lemma 5.6 and Assumption 3.2, with probability at least $1 - c_1\exp\{-c_2\log M\}$, the stated bound holds; the second line is because the submatrix indexed by $D_m^c$ is symmetric and positive semi-definite, so we can apply the Cauchy–Schwarz inequality. This holds when $T \geq Cs^2\log M$.
Therefore, taking a union bound over $1 \leq m \leq k$, the bound holds with probability at least $1 - c_1\exp\{-c_2\log M\}$ when $T \geq Cs^2\log M$.

B Proof of Theorem 3.3 and Theorem 3.4
Proof of Theorem 3.3. Now we consider model (3), with unknown σ * 2 = Var( ti ) ≥ σ 2 0 . Under this model, we use the notation U T for the quantity defined in the following: As explained in Section 3.4, U T satisfies Theorem 3.1 and 3.2 under each corresponding condition. We show in the following that we only need to control the estimation error ofσ 2 . Note that for any 0 < δ < 1, For any distribution function F (x), Recall that Theorem 3.1 and 3.2 establish bounds for 2 (x) when φ = 1 2 , and for P U T ≤ x when 0 < φ < 1 2 . Thus we only need to bound P σ 2 < σ * 2 1+δ , P σ 2 > σ * 2 By Assumption 3.1 and Lemma 5.1, with probability at least 1 − c 1 exp{−c 2 log M }, Also, since ti are independent sub-Gaussian random variables with scale factor Cσ * , the first term can be bounded by Bernstein type inequality of sub-exponential random variables(see proposition 5.16 in Vershynin [2010]): Here Z ∈ R d is a standard Gaussian random vector, the third line is due to that the density of Z is (2π) − d 2 e − z 2 2 /2 , and the fourth line applies the fact that when 0 < δ < 1 2 , Meanwhile, when x(1 − δ) < µ 2 , and when x(1 − δ) ≥ µ 2 , To see why all the bounds for U T still hold for U T , note that we only need to add C ρ log M T + 2 exp {−c 1 ρM log M } + c 2 exp{−c 3 log M } to the bounds under H 0 , and under H A when φ ≥ 1 2 , which only changes the constant factors of the previous bounds. For the bound under H A when 0 < φ < 1 2 , we substitute x by x 1−δ with δ = C log M T , and add 2 exp {−c 1 ρM log M } + c 2 exp{−c 3 log M }, which only changes the constant factors as well. Therefore, all the conclusions for U T in Theorem 3.1 and 3.2 still hold for U T under each corresponding condition.
Proof of Theorem 3.4. First we show the connection between R T and U T . Note that and the only difference between R T and U T is that we substitute Υ (m) We only need to prove that Υ (m) Recall that when proving Lemma 3.3, we already upper bound Υ (m) − Υ (m) ∞ by C log M T with probability at least 1 − c 1 exp{−c 2 log M }. Thus for any vector u ∈ R dm s.t u 2 = 1, We bound E ∞ in the following. One can show that Applying (42), (62), Lemma 5.7, we have Thus, with Lemma 5.6, Assumption 3.2, and (63), we show that with probability at least 1 − c 1 exp{−c 2 log M }, Therefore, using the same arguments as in the proof of Lemma 3.3, By Lemma A.2,

C Proof of Lemmas in Section 5
Proof of Lemma 5.3. Let Define filtration F T,t = σ(X −p+1 , X −p+2 , · · · , X t+1 ), then (ξ T t , F T t ) 0≤t≤T −1 is a martingale difference sequence, and V T = T −1 t=0 ξ T,t . To bound the convergence rate, we are going to use a modified version of Lemma 4 in Grama and Haeusler (2006).
By Lemma C.1, to bound sup x>0, P( V T + µ 2 2 ≤ x) − F d, µ 2 2 (x) , we only need to bound Here the second line is due to Λ min (Υ (m) ) ≥ 1, and the third line is due to f (x) = x 1+δ is a convex function. More specifically, While for the last line, since t,m is sub-Gaussian with parameter τ , E| t,m | 2+2δ ≤ C(δ). Note that d, β, τ are all viewed as constants here. Due to the sub-Gaussianity of t,i 's, we have the following lemma.
Lemma C.2. Therefore, for $N_{T,d}^{\delta}$: the second line holds because $(\Upsilon^{(m)})^{-\frac12} B_m (\Upsilon^{(m)})^{-\frac12}$ has rank at most $d_m$, so we can apply Lemma A.3, and for the last line, by Lemma 5.2, we only need to bound the operator norm and trace norm of this matrix. By (61) and (62), we have the corresponding bounds, and applying Lemma 5.2 and then Lemma C.1, for any $x \geq 0$, $\mu \in \mathbb{R}^d$, and $0 \leq \delta \leq \frac12$, when $T > C(\delta)$, the stated bound holds. The best rate is achieved when $\delta = \frac12$, and thus holds when $T > C$. Proof of Lemma 5.4. We prove the lower and upper bounds for the eigenvalues of $\Upsilon$ by establishing a connection between our stability condition (13) and another, spectral-density-based condition proposed in Basu et al. [2015]. First we introduce the following lemma, which is a direct result of Proposition 2.3 and (2.6) in Basu et al. [2015] under our setting.
By Lemma C.3, we only need to prove that condition (13) implies a lower bound for µ min (A) and upper bound for µ max (A). First note that where the last equality is due to that (A * (z)) −1 . Meanwhile, for any |z| = 1, where we apply condition (13) in the last inequality. Thus µ min (A) ≥ β −2 .
While for bounding µ max (A), we start by bounding A n 2 for 0 ≤ n ≤ p. Here we define A 0 = I M ×M , and A n = 0 for all n > p. Since one can show that Ψ 0 = I, and n i=0 Ψ i A n−i = 0 for n ≥ 1. Thus and A n 2 ≤ n i=1 Ψ i 2 A n−i 2 . We have the following claim: For 0 ≤ n ≤ p, A n 2 ≤ β n ∨ 1.
This can be proved by induction. It is clear that $\|A_0\|_2 = \|I\|_2 = \beta^0$, and if (64) holds for $0 \leq n = k \leq p$, the inductive step follows. Therefore, $\mu_{\max}(A)$ can be bounded accordingly, and with Lemma C.3 we conclude the eigenvalue bounds. Next, defining $\Psi^{(p)}_j \in \mathbb{R}^{pM \times M}$ as appropriate, we can also write $X_t$ as the infinite sum $X_t = \sum_{j=0}^{\infty}\Psi^{(p)}_j \epsilon_{t-j-1}$. Without loss of generality, we consider the first entry of $\frac{1}{T}\sum_{t=0}^{T-1}\epsilon_t X_t^\top$. In the following, we tackle the infinite sum in (66) by focusing the analysis on a finite sum and letting the residual converge to 0. Rigorously, for any positive integer $m$, let $\tilde\epsilon$ and $e^{(t)} \in \mathbb{R}^{(T+m+1)M}$ satisfy $e^{(t)}_i = \mathbb{1}(i = (t+m)M + 1)$; we will let $m$ be sufficiently large in the later argument. The following arguments are divided into two parts: bounding $E_1$ and $E_2$.
(1) Bounding $E_1$. Since all entries of $\tilde\epsilon$ are independent sub-Gaussian with constant parameter, we can apply the following Hanson–Wright inequality. Lemma C.4. Let $X = (X_1, \ldots, X_n) \in \mathbb{R}^n$ be a random vector with independent components $X_i$ which satisfy $\mathbb{E}(X_i) = 0$ and $\|X_i\|_{\psi_2} \leq K$. Let $A$ be an $n \times n$ matrix. Then, for every $t \geq 0$, $\mathbb{P}\left(|X^\top A X - \mathbb{E}[X^\top A X]| > t\right) \leq 2\exp\left[-c\min\left(\frac{t^2}{K^4\|A\|_F^2}, \frac{t}{K^2\|A\|_2}\right)\right]$. This lemma is a result of Rudelson et al. [2013]. By Lemma C.4, we only need to bound the operator and Frobenius norms of the relevant matrix. For any $u, v \in \mathbb{R}^{(T+m+1)M}$ with unit $\ell_2$ norm, one can show the required bound. Since $\Gamma$ is a Toeplitz matrix, we will use the following lemma to bound its $\ell_2$ norm.
Lemma C.5. Let $f(\lambda)$ be the Fourier series with coefficients $t_k$, and define a sequence of Toeplitz matrices $T_n$ with $(T_n)_{i,j} = t_{i-j}$; then the operator norm of $T_n$ is bounded in terms of $\operatorname{ess\,sup} f$, the essential supremum of $f$. This is Lemma 4.1 in Gray et al. [2006], and we apply it directly here. By Lemma C.5 we bound the operator norm, while the Frobenius norm is bounded directly. Therefore, by Lemma C.4, for any $\delta > 0$, the stated tail bound holds. (2) Bounding $E_2$. First note the following. Recall the definitions of $\|\cdot\|_{\psi_1}$ and $\|\cdot\|_{\psi_2}$ in the proof of Lemma C.2. Since $\|\epsilon_{t,1}^2\|_{\psi_1} \leq 2\|\epsilon_{t,1}\|_{\psi_2}^2 \leq 2\tau^2$, we can apply a Bernstein-type inequality for sub-exponential random variables (see Proposition 5.16 in Vershynin [2010]).

Now we bound the second term
where we apply the fact that $\|\,\|\epsilon_t\|_2\,\|_{\psi_2} \leq C\sqrt{M}\tau$, which is shown in the proof of Lemma C.2. Thus we obtain the bound from the tail bound for sub-exponential random variables (see also Vershynin [2010]). Let $m$ be sufficiently large such that $\sum_{j=t+m+1}^{\infty}\alpha_j^2 \leq \frac{1}{MT}$; then we arrive at the following. Letting $\delta = C\sqrt{\frac{\log M}{T}}$ and taking a union bound over the $pM^2$ entries of $\frac{1}{T}\sum_{t=0}^{T-1}\epsilon_t X_t^\top$, the conclusion follows.
Proof of Lemma 5.6. Without loss of generality, consider any $1 \leq i \leq d_m$ and $j \in D_m^c$. Similar to the proof of Lemma 5.1, we can write the quantity as a quadratic form in $\frac{1}{T}\sum_t$, where $W^*_m$ is defined as in (48). Since $\frac12\left((W^*_m)_{i\cdot}^\top e_j^\top + e_j (W^*_m)_{i\cdot}\right)$ has rank at most 2, and we have bounded $\|(W^*_m)_{i\cdot}\|_2$ in (62), applying Lemma A.3 leads to the stated bound. The following functions are defined as a smooth relaxation of the indicator function. Let $f^*$ be defined as follows, where $C$ is a normalizing constant. The function $f^*$
is infinitely many times differentiable on R, and since f * (z) is constant when z ≤ 0 or z ≥ 1, for any fixed order, the derivative of f * (z) is bounded. For any z ∈ R d , let f l,µ,r,ε (z) = f * (g l,µ,r,ε (z)), where In the following proof, we will denote f l,µ,r,ε (z) and g l,µ,r,ε (z) as f l (z) and g l (z), l = 1, 2 for brevity. Therefore, Thus, Actually, when r ≤ 3ε, the right hand side of (68) can be substituted by and To bound E(f l (M n n+1 ) − f l (Z)), we will use the following lemma.
The proof of this lemma is deferred to Appendix E. In the following proof, we will always assume the condition l = 1 or l = 2 and r > 3ε hold. Therefore, for any m ∈ Z * , where u = z + t 1 y for some 0 ≤ t 1 ≤ 1. Meanwhile, where v = z + t 2 y for some 0 ≤ t 2 ≤ 1. Thus, for any δ > 0, Letw nk , 1 ≤ k ≤ n be i.i.d. standard Gaussian random vectors that are independent of G n,n+1 , w nk = (b nk ) 1 2w nk , for k = 1, · · · , n + 1, where b nk = E(m nk m nk |G n,k−1 ). Define Then W n 1 follows standard Gaussian distribution. Let U n k = M n k−1 + W n k+1 , then ).
Proof of Lemma C.2. First we introduce the following two norms: For any random variable X, These two norms are related to sub-exponential and sub-Gaussian random variables, and the following lemma shows the connections between the two norms and the scale factor for sub-Gaussian r.v.
Lemma D.3. For any sub-Gaussian r.v. X with scale factor τ , the following hold: with some absolute constants c, C, and This is an established result in Vershynin [2010]. By Lemma D.3, bounding W * m X t 2 2 ψ 1 would be sufficient, and we start from bounding E (exp {λ (W * m ) i· X t }) for any λ ∈ R. Recall that X t = Ψ . The relationship betweenα k and α k = Ψ k 2 can be established as follows:α if we define α i = 0 when i < 0. We now prove that exp {|λ| ∞ k=0 (W * m ) i· 2α k t−k 2 } is integrable so that we can use Dominated Convergence Theorem. Since ti 's are all independent sub-Gaussian random variables with parameter τ , where the second inequality is due to Minkowski's inequality. Thus, where the first equality is due to Monotone Convergence Theorem, and the last line is due to (62) and the fact that Therefore, by Dominated Convergence Theorem, j is defined in (65). Similar from the proof of Lemma 5.1, for any positive integer m, we can write down 1 T T −1 t=0 X t BX t as the following: Then we can bound each E i from its expectation separately, and m will be chosen to be sufficiently large later.
(1) Bounding E 1 − E(E 1 ) Let Θ (t) ∈ R pM ×(T +m)M and˜ ∈ R (T +m)M be defined as Then t=0 Θ (t) BΘ (t) ˜ , and by Lemma C.4 we only need to bound the operator norm and Frobenius norm of 1 For any unit vector u, v ∈ R (t+m)M , , and Γ ∈ R (t+m)×(t+m) be defined as Γ ij = ∞ k=0α |i−j|+kαk , then Thus we only need to bound Λ max (Γ). Applying Lemma C.5, the largest eigenvalue of Toeplitz matrix Γ can be bounded by where the third inequality is due to Cauchey-Schwartz inequality. Due to (75), we can further obtain Λ max (Γ) ≤2 where the fourth line is due to Cauchey-Schwartz inequality. Therefore, Now we apply Lemma C.4, and arrive at (2) Bounding E 2 − E(E 2 ) We will show that |E 2 − E(E 2 )| vanishes when m is large enough. First we bound E 2 ψ 1 . Since by (75) and (76), Meanwhile, For any δ > 0, let m be sufficiently large such that ∞ j=m−p α 2 j < δ 2p B tr , E 2 ψ 1 ≤ C B 2 T , then by tail bound of sub-exponential random variable (see Vershynin [2010]), (3) Bounding E 3 − E(E 3 ) One can show that The first line is due to the following fact: For any two sub-Gaussian random variables X and Y , XY ψ 1 ≤ 2 X ψ 2 Y ψ 2 . We can prove this in the following: where the first line applies Cauchey-Schwartz inequality. Thus, with large enough m, E 3 ψ 1 ≤ B 2 T . Also, E(E 3 ) = 0, therefore implies the same bound for E 3 − E(E 3 ) as the one for E 2 − E(E 2 ): In conclusion, for any δ > 0, if we choose some m accordingly, Proof of Lemma A.1. Here we apply some results in Basu et al. [2015] with a little change in notation. These results simplifies the original problem to finding a upper bound for v (H − Υ)v with any fixed unit vector v. Specifically, the following lemmas are useful: Lemma D.4. For any J ⊂ {1, · · · , pM }, and κ > 0, where K(l) = {v ∈ R pM : v 0 ≤ l, v 2 ≤ 1} for any positive integer l.
If we denote the non-zero eigenvalues of U as λ 1 , . . . , λ r , then