Performance bounds for parameter estimates of high-dimensional linear models with correlated errors

This paper develops a systematic theory for high-dimensional linear models with dependent errors and/or dependent covariates. To study properties of estimates of the regression parameters, we adopt the framework of functional dependence measures ([43]). For the covariates two schemes are addressed: the random design and the deterministic design. For the former we apply the constrained ℓ_1 minimization approach, while for the latter the Lasso estimation procedure is used. We provide a detailed characterization of how the error rates of the estimates depend on the moment conditions that control the tail behaviors, the dependencies of the underlying processes that generate the errors and the covariates, the dimension and the sample size. Our theory substantially extends earlier results by allowing dependent and/or heavy-tailed errors and covariates. As our main tools, we derive exponential tail probability inequalities for dependent sub-Gaussian errors and Nagaev-type inequalities for dependent non-sub-Gaussian errors that arise from linear or nonlinear processes.


Introduction
During the past two decades there has been substantial development of high-dimensional linear regression models. Consider the model

y_i = x_i^⊤ β + e_i, i = 1, . . . , n, (1.1)

where y_i, x_i and e_i are the response variable, the p × 1 covariate vector and the error term respectively, and β is a p-dimensional regression parameter vector.
In matrix notation, we can write it as Y = Xβ + e, where Y is the n × 1 response vector, X is the n × p design matrix, and e is the n × 1 vector of errors. The covariate x_i can be random or deterministic. Here the dimension p can be much larger than the sample size n. Clearly in this case the classical least squares method fails to estimate β since the matrix X^⊤X is singular. Under certain sparsity conditions on β = (β_1, . . . , β_p)^⊤, namely if only a small number of components of β are non-zero, one can apply the ℓ_1-penalized least squares (Lasso) procedure [37]. A closely related approach is the Dantzig-type estimator [9], which is the optimizer of a certain objective function under linear inequality constraints. Other variants include the SCAD estimator [13] and the MCP estimator [47], among others. Theoretical properties of those estimators have been extensively studied in the literature; see for example [4], and the recent book [5] for a thorough treatment and further references.
In most of the theoretical investigations of model (1.1), it is assumed that the errors (e_i, i = 1, . . . , n) are independent and identically distributed (i.i.d.) Gaussian, sub-Gaussian or sub-exponential random variables, which have finite exponential moments. Similar assumptions are also adopted for the covariates (x_i, i = 1, . . . , n) in the case of random design. The associated tools for obtaining performance bounds are exponential-type concentration inequalities, including, among others, the Bennett, Bernstein and Hoeffding inequalities; see Chapter 14 of [5] for a review. With the help of such inequalities, assuming certain sparseness conditions, one can deal with the case in which p is much larger than n and still obtain consistency under the very mild condition log p = o(n).
Despite the extensive literature on Lasso and Dantzig-type estimates, there has been very limited research on theoretical properties of the estimates when the errors (e_i) or the covariates (x_i) are dependent and/or non-sub-Gaussian. If the data are observed over time or space, the independence assumption for the errors (e_i) or the covariates (x_i) is violated. The sub-Gaussian assumption is also questionable, as the errors and the covariates may be heavy-tailed and may not have finite exponential moments. In econometric analysis of vector autoregressive processes, Sims [34] cautioned that fat tails can affect the validity of the associated statistical inference. In terms of dependent errors, [41] proposed a Lasso estimator when the errors follow an autoregressive model, and Gupta [15] analyzed the Lasso estimator for weakly dependent errors; both papers mainly deal with the case where n is greater than p. Ravikumar et al. [33] applied Rosenthal's inequality [35]. Recently, [18] studied the Lasso with long-memory errors with very light tails such that the Cramér condition is met, Loh [24] considered M-estimators for linear models with i.i.d. data, and Basu and Michailidis [3] investigated theoretical properties of Lasso estimates for high-dimensional Gaussian processes.
The goal of this paper is threefold: (i) to lay a theoretical foundation for high-dimensional inference in situations in which the errors or the covariates can be dependent; (ii) to develop sharp inequalities for tail probabilities of dependent and/or non-sub-Gaussian processes under mild and easily verifiable conditions; and (iii) to apply our inequalities to the Lasso and constrained ℓ_1 minimization estimators of β in model (1.1). It is expected that our framework, inequalities and tools will be useful in other high-dimensional inference problems that involve dependent errors.
In our theoretical framework, we shall adopt the dependence concept of [43]. Assume that the errors (e_i) in (1.1) have the form

e_i = g(. . . , ε_{i−1}, ε_i), (1.2)

where ε_i, i ∈ Z, are i.i.d. random variables, and g(·) is a measurable function. This representation has a clear physical meaning, where (ε_i) are the inputs and (e_i) are the outputs, and it is very natural for modeling time series. It was studied by [42] for representing stationary and ergodic processes, and it is sometimes called the nonlinear Wold representation. The framework (1.2) is general enough to include a wide range of stochastic processes ([38, 44]). It subsumes linear processes, their nonlinear transforms, as well as Volterra processes that involve interactions between the innovations. The representation (1.2) also includes recursive models of the form e_i = G(e_{i−1}, ε_i), which cover Markov chain models and nonlinear autoregressive models such as threshold autoregressive models, autoregressive models with conditional heteroscedasticity (ARCH), exponential autoregressive models, bilinear autoregressive models, etc. Therefore, there is not much loss of generality in assuming the representation (1.2). One advantage of the representation (1.2) is that it enables us to define physically meaningful and easily workable dependence measures for the process (e_i). Since the inputs (ε_i) are i.i.d., all the dependencies among the outputs (e_i) are caused by the input-output transformation g(·). Therefore, we can define the dependence measures in terms of how the outputs are affected by the inputs, or how a change of the inputs leads to a change in the outputs. Specifically, assuming ‖e_i‖_q := (E|e_i|^q)^{1/q} < ∞, q ≥ 1, define the functional dependence measure

δ_{i,q} = ‖e_i − e_i^*‖_q, where e_i^* = g(. . . , ε_{−1}, ε_0^*, ε_1, . . . , ε_i), (1.3)

F_i = (· · · , ε_{i−1}, ε_i), and the coupled process (e_i^*) is obtained with ε_0^*, ε_j, j ∈ Z, being i.i.d. Intuitively, δ_{i,q} measures the dependence of e_i on ε_0, i.e., how replacing ε_0 by an i.i.d. copy while freezing all other innovations affects the output e_i. We can interpret δ_{i,q} as a nonlinear impulse response function.
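As a simple illustration (ours, not one of the paper's examples), consider the causal linear process e_i = Σ_{j≥0} f_j ε_{i−j} with ε_j ∈ L^q. Since the coupled process shares all innovations except ε_0,

e_i − e_i^* = f_i (ε_0 − ε_0^*), so that δ_{i,q} = |f_i| ‖ε_0 − ε_0^*‖_q ≤ 2 |f_i| ‖ε_0‖_q, i ≥ 0.

In particular, in the AR(1) case f_j = ρ^j, |ρ| < 1, the dependence of e_i on ε_0 decays geometrically: δ_{i,q} = |ρ|^i ‖ε_0 − ε_0^*‖_q.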
We shall assume short-range dependence in the sense that

Δ_{m,q} := Σ_{i=m}^∞ δ_{i,q} < ∞, m ≥ 0. (1.5)

For fixed m, Δ_{m,q} measures the cumulative effect of ε_0 on (e_i)_{i≥m}, and condition (1.5) assumes that this cumulative effect is finite. As a closely related concept, we define the predictive dependence measure

θ_{l,q} = ‖P_0 e_l‖_q, where P_j · := E(· | F_j) − E(· | F_{j−1})

is the projection operator. Note that θ_{l,q} also measures the input-output dependence in the representation (1.2) by quantifying how much the prediction of e_l changes when ε_0 is concealed from F_0. Similar to Δ_{m,q}, we can also define the cumulative predictive dependence measure Θ_{m,q} := Σ_{l=m}^∞ θ_{l,q}. Compared to mixing conditions such as the α-, β-, φ- and ρ-mixing conditions in the literature, our δ_{i,q} and θ_{l,q} are more physically meaningful and easier to use. In many situations theoretical results based on them are optimal or nearly optimal. They lead to natural definitions of norms of random processes that adjust for dependence; see the dependence-adjusted L^q norm (2.8) and the sub-exponential norm (2.21). Equipped with the above dependence measures, we shall establish inequalities for tail probabilities, including an exponential inequality and Nagaev-type inequalities for dependent random variables. The latter generalize the classical Nagaev inequality [30], which deals with independent random variables. Using the functional and predictive dependence measures δ_{i,q} and θ_{l,q} introduced above, we show that if the dependence does not exceed a threshold, then our Nagaev-type inequality is as sharp as the original Nagaev inequality under independence. If the dependence of (e_i) is stronger, then the tail can be heavier and a correction should be used. Our form of tail probability inequalities is neat, easy to use and, in many cases, sharp.
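The coupling in (1.3) is straightforward to simulate. The following minimal R sketch (our own illustration; the function name is ours) estimates δ_{i,q} for an AR(1) process by replacing ε_0 with an i.i.d. copy and re-running the recursion:

```r
# Monte Carlo estimate of delta_{i,q} = ||e_i - e_i*||_q for the AR(1) process
# e_i = rho * e_{i-1} + eps_i: replace eps_0 by an i.i.d. copy, keep all other
# innovations fixed, and compare the two outputs at time i.
delta_iq <- function(i, q, rho, burn = 200, reps = 5000) {
  diffs <- replicate(reps, {
    eps  <- rnorm(burn + i + 1)    # innovations eps_{-burn}, ..., eps_0, ..., eps_i
    eps2 <- eps
    eps2[burn + 1] <- rnorm(1)     # coupled copy: only eps_0 is replaced
    e1 <- Reduce(function(prev, z) rho * prev + z, eps)   # AR(1) recursion
    e2 <- Reduce(function(prev, z) rho * prev + z, eps2)  # coupled recursion
    abs(e1 - e2)                   # |e_i - e_i*| for this realization
  })
  mean(diffs^q)^(1 / q)            # (E |e_i - e_i*|^q)^(1/q)
}
delta_iq(i = 5, q = 2, rho = 0.5)  # theory: 0.5^5 * sqrt(2), about 0.044
```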
With our probability inequalities as primary tools, we can analyze the properties of the Lasso and the constrained ℓ_1 minimization estimators under dependent and/or non-sub-Gaussian errors and covariates in the context where p is much larger than n. In comparison with the traditional situation where the errors (e_i) are i.i.d. with finite exponential moments, we shall show that the allowed range of the dimension p can be narrower in our setting, though it still covers the high-dimensional situation with p > n. Roughly speaking, if e_i has only a finite polynomial moment, then p can be at most a power of n, and the power is related to the moment condition on e_i. Also, the convergence can become slower due to the dependencies in the errors as well as in the covariates. We shall give a detailed description of how the dependence measures and moment conditions of the errors and the covariates affect the rates of convergence and the selection consistencies of the estimators.
The rest of the paper is structured as follows. Section 2 presents exponential and Nagaev inequalities, using the framework of functional and predictive dependence measures. Section 3 deals with the constrained ℓ_1 minimization estimators in the random design scheme, in which the covariate process (x_i) in (1.1) is a high-dimensional stationary process. Lasso estimators with deterministic covariates are treated in Section 4. In both sections we present rates of convergence, model selection consistency and support recovery results. A simulation study is carried out in Section 5. We now introduce some notation. For a matrix A = (a_{ij})_{i≤I, j≤J}, we define the entrywise norms |A|_q = (Σ_{i,j} |a_{ij}|^q)^{1/q}, q > 0, and |A|_∞ = max_{i,j} |a_{ij}|. We use C, C_1, C_2, . . . to denote constants that do not depend on p and n and may vary from place to place.

Probability inequalities under dependence
Exponential inequalities play a fundamental role in high-dimensional inference. In this section we shall present new and powerful inequalities for tail probabilities of weighted sums of dependent and/or non-sub-Gaussian random variables. In Sections 2.1 and 2.2 we provide Nagaev inequalities (cf. Theorems 1 and 2) for linear and nonlinear processes, respectively. The processes can be non-sub-Gaussian ones that do not have finite exponential moments. If the error process satisfies the stronger moment condition of having finite moments of all orders, then under suitable dependence conditions we obtain an exponential inequality which is optimal in view of [21]; cf. Theorem 3 in Section 2.3. The functional dependence measure provides a convenient framework and greatly facilitates the formulation of such inequalities.
The Nagaev inequality for tail probabilities is a useful result in probability theory. However, it appears to be little known in the statistical community. As a result, some of the performance bounds obtained via the Markov inequality in the statistical literature under polynomial moment conditions are not sharp. Let X_1, . . . , X_n be mean-0 independent random variables with μ_{n,q} := Σ_{i=1}^n E|X_i|^q < ∞, q > 2, and let c_q = 2e^{−q}(q + 2)^{−2}. By Corollary 1.7 in [30], for x > 0, the tail probability of S_n = Σ_{i=1}^n X_i satisfies

P(|S_n| ≥ x) ≤ (1 + 2/q)^q μ_{n,q} x^{−q} + 2 exp(−c_q x²/μ_{n,2}). (2.1)

Inequality (2.1) implies two types of bounds for P(|S_n| ≥ x). For moderate deviations, if x² is around the variance μ_{n,2}, then one can use the Gaussian-type tail. For large deviations, namely if x² is much bigger than the variance μ_{n,2}, then the polynomial tail dominates. If one applies the Markov and the Rosenthal [35] inequalities instead, one only obtains a bound of the form

P(|S_n| ≥ x) ≤ C_q (μ_{n,q} + μ_{n,2}^{q/2}) x^{−q}.

Here we shall present probability inequalities for dependent errors. Consider the weighted sum of the form S_n = a_1 e_1 + . . . + a_n e_n, where a_1, . . . , a_n are fixed coefficients. Write a = (a_1, . . . , a_n)^⊤. In Theorems 1, 2 and 3 below we assume the normalization |a|_2² = Σ_{i=1}^n a_i² = n.
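To see concretely why the Rosenthal-based bound is not sharp, take (our own illustrative calculation) μ_{n,2} = n and x = (2 c_q^{−1} n log n)^{1/2}. The Nagaev bound (2.1) then gives

P(|S_n| ≥ x) ≤ (1 + 2/q)^q μ_{n,q} (2 c_q^{−1} n log n)^{−q/2} + 2 n^{−2},

whereas the Markov–Rosenthal bound contains the term μ_{n,2}^{q/2} x^{−q} ≍ (log n)^{−q/2}, which does not even decay polynomially in n: the term μ_{n,2}^{q/2} swamps the bound exactly in the moderate-deviation regime where the Gaussian term of (2.1) is tiny.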

Nagaev inequality for linear processes
To begin with, in Theorem 1, we assume that (e_i) follows the linear process

e_i = Σ_{j=0}^∞ f_j ε_{i−j},

where ε_j, j ∈ Z, are i.i.d. with mean zero and ε_j ∈ L^q, q > 2, and the f_j are real coefficients with |f|_2² := Σ_{j=0}^∞ f_j² < ∞, so that by Kolmogorov's three-series theorem e_i exists. Linear processes are widely used in practice and include the popular ARMA processes.
Then there exist constants C_1, C_2, depending only on q and β, such that the required bound holds; (2.5) then follows.

Nagaev inequality for nonlinear processes
For nonlinear processes, with the functional dependence measure Δ_{m,q} in (1.5), we have Theorem 2, a Nagaev-type inequality for S_n = a_1 e_1 + . . . + a_n e_n. The special case of Theorem 2 with a_1 = . . . = a_n = 1 was treated in [23]. As an important improvement, in the stronger-dependence case with slow decay of Δ_{m,q}, Theorem 2 provides a sharper bound than the one in the latter paper. To account for dependence, for the process e_· = (e_i)_{i=−∞}^∞ we introduce the following dependence-adjusted norm (DAN):

‖e_·‖_{q,α} := sup_{m≥0} (m + 1)^α Δ_{m,q}, α ≥ 0. (2.8)

It can happen that, due to dependence, ‖e_·‖_{q,α} = ∞ while ‖e_i‖_q < ∞. Since e_0 = Σ_{l=−∞}^0 P_l e_0, we have Δ_{0,q} = ‖e_·‖_{q,0}, and θ_{i,q} ≤ δ_{i,q} by stationarity, Jensen's inequality and the fact that P_0 e_i = E(e_i − e_i^* | F_0). If the e_i ∈ L^q are i.i.d. with mean 0, then δ_{i,q} = 0 for all i ≥ 1 and δ_{0,q} = ‖e_0 − e_0^*‖_q. Note that in this case the dependence-adjusted norm ‖e_·‖_{q,α} and the L^q norm ‖e_0‖_q are equivalent in the sense that ‖e_0‖_q ≤ δ_{0,q} ≤ ‖e_0‖_q + ‖e_0^*‖_q = 2‖e_0‖_q.
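Continuing the AR(1) illustration from Section 1 (our own example): if δ_{i,q} = |ρ|^i ‖ε_0 − ε_0^*‖_q with |ρ| < 1, then

Δ_{m,q} = Σ_{i=m}^∞ δ_{i,q} = |ρ|^m (1 − |ρ|)^{−1} ‖ε_0 − ε_0^*‖_q,

so ‖e_·‖_{q,α} = sup_{m≥0} (m + 1)^α Δ_{m,q} < ∞ for every α ≥ 0, since the geometric decay of Δ_{m,q} beats any polynomial factor (m + 1)^α. Geometrically decaying dependence thus allows an arbitrarily large dependence index α.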

The Nagaev inequality of the form (2.9) is a very natural extension of the classical inequality (2.1), in that the dependence-adjusted qth-order norm ‖e_·‖_{q,α} plays the role of the L^q norm ‖X_i‖_q, while the dependence-adjusted second-order norm ‖e_·‖_{2,α} plays the role of the L² norm ‖X_i‖_2.
Proof of Theorem 2. In the proof the constants C_1, C_2, . . . may change from line to line; they depend only on q and α, and are independent of n, (a_i)_{i=1}^n and x. It suffices to deal with the case in which

x ≥ √n ‖e_·‖_{2,α}, (2.10)

since otherwise (2.9) trivially holds. Let S_{n,m} = Σ_{k=1}^n a_k e_{k,m} and consider the corresponding decomposition; the proof is based on this decomposition. Note that the summands a_k e_{k,0} of S_{n,0} are independent, so the Nagaev inequality (2.1) applies, where c_q is a constant depending only on q that may vary at each occurrence. By the Burkholder inequality [7], which in view of the Markov inequality yields (2.13), and by the Burkholder and Hölder inequalities, we obtain (2.14). By the τ_l-dependence, (2.14) and (2.1), we can bound P(|A_n| ≥ y), and a similar inequality holds for P(|B_n| ≥ y); since M_{n,l} = A_n + B_n, the two bounds combine. By the definitions of τ_l and λ_l, we have φ := min_{l≥1} λ_l² τ_l^{2α} > 0, and by elementary manipulations there exists a constant C_7 > 1 such that the required estimate holds for all u ≥ 1. Hence, by (2.11), (2.12), (2.13), (2.16) and (2.17), all three cases c < 0, c = 0 and c > 0 of Theorem 2 follow.

Remark 1.
In the stronger-dependence case 0 < α < 1/2 − 1/q, when a_1 = . . . = a_n = 1, [23] obtained an inequality of the form (2.18), where C_1, C_2, C_3 are constants that may depend on the dependence condition. The neater and simpler bound exp(−C_3 x²/(n‖e_·‖²_{2,α})) in Theorem 2(ii) is sharper than the one in (2.18). Additionally, our form (2.9) is easier to use since the constants C_1, C_2, C_3 therein depend only on α and q.

Remark 2.
To appreciate the sharpness of inequality (2.9), assume that all a_i > 0 and that the e_i are i.i.d. with mean 0, variance 1 and, for some constant h_0 > 0, a tail P(e_i ≥ u) ∼ h_0 u^{−q} as u → ∞, so that the polynomial term in (2.9) is asymptotically exact. Hence inequality (2.9) provides a nearly optimal bound for both large and small x.

Exponential tail bounds
If e_i satisfies a moment condition stronger than the existence of a finite qth moment, we expect that a stronger form of (2.9) holds. Indeed, we have the following Theorem 3, an exponential inequality, which is a generalization and an improvement of Theorem 2 in [43], by allowing weights and by providing an explicit closed-form upper bound. Write Θ_q = Θ_{0,q}. We shall impose a stronger moment condition by requiring Θ_q < ∞ for all q > 0, and we further assume that Θ_q increases no faster than Cq^ν for some ν ≥ 0, in the following sense:

‖e_·‖_{ψ_ν} := sup_{q≥2} q^{−ν} Θ_q < ∞. (2.21)

In view of its definition, we can interpret Θ_q as the predictive persistence of the process (e_i). In the very special case in which the e_i are i.i.d., we have Θ_q = ‖e_0‖_q, and ν = 1 (resp. ν = 1/2) if the e_i are sub-exponential (resp. sub-Gaussian). In this case ‖e_·‖_{ψ_ν} is the sub-exponential or sub-Gaussian norm of a random variable; see for example Section 5.2.3 in [39]. Hence ‖e_·‖_{ψ_ν} can be naturally interpreted as a dependence-adjusted sub-exponential or sub-Gaussian norm.

Remark 3. Note that condition (2.21) is equivalent to (2.26). Let t̃_0 = (eαγ^α)^{−1}. By the argument in the proof of Theorem 3, the exponential bound there also holds for all t < t̃_0. Since γ ≤ γ_0, the range t < t̃_0 is wider than the range t < t_0 in Theorem 3.

Remark 4.
The exponential inequality in Theorem 3 is optimal for martingale differences with finite exponential moment. Let D_i, i ∈ Z, be a stationary martingale difference sequence with E(D_i²) = 1 and finite exponential moment E exp(|D_0|) < ∞. By (2.23), there exist c_1, c_2 > 0 such that (2.28) holds for all x > 0, by letting u = √n x. In comparison with Theorem 3.2 in [21], (2.28) is optimal up to a constant. They also proved that the inequality P(|S_n| ≥ n) ≤ c_1 exp(−c_2 n^{1/3}) is the best possible under the condition E exp(|D_0|) < ∞. If D_i is bounded (say by b > 0), using Azuma's inequality [1], one has the bound P(|S_n| ≥ nx) ≤ 2 exp(−nx²/(2b²)). Again in this case our inequality (2.23) is sharp up to a multiplicative constant, by letting α = 2, since Θ_q = ‖D_0‖_q ≤ b for all q, so that (2.21) holds with γ = b.
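The boundedness case is easy to check numerically. The following R sketch (our own sanity check, with i.i.d. bounded summands as a special martingale difference sequence) compares the empirical tail with Azuma's bound:

```r
# Empirical tail of S_n = D_1 + ... + D_n for bounded summands |D_i| <= b,
# compared with Azuma's bound P(|S_n| >= n*x) <= 2 * exp(-n * x^2 / (2 * b^2)).
set.seed(1)
n <- 200; b <- 1; x <- 0.15; reps <- 1e5
S <- replicate(reps, sum(runif(n, -b, b)))   # D_i ~ Uniform(-b, b)
c(empirical = mean(abs(S) >= n * x),         # roughly 2e-4 here
  azuma     = 2 * exp(-n * x^2 / (2 * b^2))) # roughly 0.21: a valid upper bound
```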
Constrained ℓ_1 minimization with random design

In our random design setting we assume that in model (1.1) the covariate process (x_i, i = 1, . . . , n) is a high-dimensional stationary process of the form

x_i = h(. . . , ε_{i−1}, ε_i), (3.2)

where the ε_i are i.i.d. random vectors and h(·) = (h_1(·), . . . , h_p(·)) is a measurable function taking values in R^p. With x_i defined in (3.2) and ε_i being i.i.d. random vectors, we can allow models with homogeneous or heteroscedastic errors; see Example 2. In the former, homogeneous case, the covariate process (x_i) and the error process (e_i) can be independent of each other. Similar to δ_{i,q} in (1.3), we assume x_i ∈ L^ι, ι > 2, and define the functional dependence measure of the jth component process by

δ_{i,ι,j} = ‖x_{ij} − x_{ij}^*‖_ι, where x_i^* = h(. . . , ε_{−1}, ε_0^*, ε_1, . . . , ε_i).

Similar to Δ_{m,q}, the corresponding cumulative dependence measures are defined analogously and assumed finite. Two important cases, vector autoregressive processes and linear stochastic models, are given in Examples 1 and 2 below; both are widely used in practice.
Example 1 (Vector autoregressive processes). For each j ≤ p, consider the linear regression model (3.6), which is of the form (1.1) with Z_{i,j} and x_i being independent. Here b_j is a pd × 1 parameter vector. Observe that the constrained ℓ_1 minimization estimation of the coefficient matrices A_1, . . . , A_d of model (3.5) can thus be decomposed into the p sub-problems of estimating b_j, 1 ≤ j ≤ p, in (3.6), allowing for parallel computation; see also [8, 16] for similar treatments. A small sketch of this decomposition is given below.
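The sketch (ours, with d = 1 and hypothetical dimensions for brevity) shows that all p sub-problems share the same lagged design matrix and can be fitted in parallel:

```r
# Row-wise estimation of a VAR(1): each coordinate z_{i,j} is regressed on the
# common lagged design x_i = z_{i-1}, so the p sub-problems can run in parallel.
set.seed(2)
p <- 20; n <- 100
A <- matrix(rnorm(p * p, sd = 0.3 / sqrt(p)), p, p)  # a (typically) stable coefficient matrix
z <- matrix(0, n + 1, p)
for (i in 2:(n + 1)) z[i, ] <- A %*% z[i - 1, ] + rnorm(p)
X <- z[1:n, ]                       # shared design: lagged observations
fits <- lapply(1:p, function(j) {
  y_j <- z[2:(n + 1), j]            # response of the j-th sub-problem
  lm(y_j ~ X - 1)                   # placeholder fit; a penalized fit is needed when p*d > n
})
```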

Example 2 (Linear stochastic models with heteroscedastic errors). Let
Let η_i, i ∈ Z, be i.i.d. random variables with mean 0 and variance 1, let ξ_i, i ∈ Z, be i.i.d. random vectors, and assume that (η_i) and (ξ_i) are independent. Let

x_i = h_0(. . . , ξ_{i−1}, ξ_i) and e_i = η_i σ(. . . , ξ_{i−1}, ξ_i),

where h_0(·) and σ(·) are measurable functions such that x_i and e_i are properly defined. It is clear that, by choosing appropriate h(·) and g(·), x_i and e_i can be written in the form of (3.2) and (1.2), respectively. If σ(·) ≡ a constant, then x_i and e_i are independent. If σ(·) is not a constant function, then the errors e_i and the covariates x_i are dependent, and thus (1.1) is a model with heteroscedastic errors.

Theorem 4. Suppose that condition (3.9) holds. Then for all a > 0 and b ≥ λ/n, the bound (3.8) holds, where C_1, C_2 > 0 and the implied constants depend only on ν and ι.
Before proving Theorem 4, we shall provide two examples of high-dimensional time series for which one can bound N_X = max_{j≤p} ‖x_{·j}‖_{ι,α_X}, a key step in applying this theorem.
Example 3 (High-dimensional linear process). Let ζ_{ij}, i, j ∈ Z, be i.i.d. random variables with mean 0, variance 1 and finite ιth moment, ι > 2; let A_0, A_1, . . . be p × p matrices with real entries such that Σ_{j=0}^∞ tr(A_j A_j^⊤) < ∞. Write ε_i = (ζ_{i1}, . . . , ζ_{ip})^⊤. Then by Kolmogorov's three-series theorem (see for example Corollary 5.1.3 in [11]) the p-dimensional linear process

x_i = Σ_{j=0}^∞ A_j ε_{i−j}

is well-defined. The above process is a special case of (3.2) with a linear functional h(·). Let A_{l,j} be the jth row of A_l. Then ‖x_{·j}‖_{ι,α_X} can be bounded via Burkholder's inequality, where the resulting constant c depends only on θ and ι.

Remark 5.
The argument in the proof of Theorem 4 implies that, if λ = λ_n is chosen such that (pn^π)^{1/τ} = o(λ_n) and λ_n ≥ C(n log p)^{1/2} for a sufficiently large constant C, then for the true parameter β, |X^⊤Xβ − X^⊤Y|_∞ ≤ λ_n holds with probability going to 1. Namely, for the event B with b = λ/n, we have P(B) → 1.

Remark 6.
If the two processes (e_l) and (x_l) are independent of each other, then we can let τ = min(q, ι). Additionally, for the model (3.6) in Example 1, if Z_i ∈ L^q with q > 2, then ι = q and we can let τ = q instead of q/2, since e_l x_{lj} ∈ L^q and the functional dependence measure of (e_l x_{lj}) decays exponentially.
The bound (3.8) reveals two different decay behaviors: if a is small then, taking χ = 1, the Gaussian-type bound e^{−C_1 na²} dominates; if a is large, the dominating term is the polynomial tail p²n/(na)^{ι/2}. A similar claim can be made for the term involving b. The borderline of this phase-transition phenomenon is at a = a_n and b = b_n, where a_n (resp. b_n) is the solution of the equation n(na)^{−ι/2} = e^{−na²} (resp. n(nb)^{−τ} = e^{−nb²}); see the numerical sketch below.
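The borderline a_n can be located numerically; a minimal R sketch (our own illustration, for one assumed pair (n, ι)) follows.

```r
# Solve n * (n * a)^(-iota/2) = exp(-n * a^2) for the borderline a_n. We work on
# the log scale; f > 0 means the polynomial term exceeds the Gaussian-type term.
borderline <- function(n, iota) {
  f <- function(a) log(n) - (iota / 2) * log(n * a) + n * a^2
  a_min <- sqrt(iota / (4 * n))          # minimizer of f; f must dip below 0 there
  uniroot(f, lower = a_min, upper = 10)$root
}
borderline(n = 200, iota = 8)            # the polynomial tail dominates for a above this
```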
Theorem 4 immediately leads to the following result on the rate of convergence and support recovery.

Proof of Corollary 1. Since Ω_X Σ_X = Id, (3.25) follows from Theorem 4.
The setting in our Theorem 4 and Corollary 1 is very general, as it allows dependent and/or non-sub-Gaussian error processes and also allows heteroscedasticity, in that the error process and the covariate process can be dependent. Han and Liu [16] considered the special case of estimating A_1, . . . , A_d in model (3.5) under the assumption that the Z_i are i.i.d. Gaussian. Sims [34] mentioned several challenges in the inference of vector autoregressive models: the possibility of fat tails in the innovations, and the low degrees of freedom due to the estimation of possibly extremely many parameters. The latter problem has been widely recognized in the analysis of vector autoregressive processes; see for example [17, 20, 2], among others. Our setting allows both fat tails and a large number of parameters. Additionally, by checking the non-zero entries of the estimate β̂, we can infer economic relations between variables, a theory-free principle that was advocated in [34].

Lasso estimators with deterministic design
Following [37], one can estimate the unknown parameter vector β by minimizing the ℓ_1-penalized criterion function (4.1), where λ > 0 is the regularization parameter. In this section we assume that x_i is deterministic and (e_i) is of the form (1.2). For convenience, we scale the diagonal entries of the Gram matrix Ψ_n = X^⊤X/n to be 1. Then |X|_2 = (np)^{1/2}. Let β̂ be the minimizer of (4.1). Consistency properties of β̂ are discussed in [4, 5], where the e_i are i.i.d. and sub-Gaussian.

Convergence rate of the Lasso estimator
Equipped with the probability inequalities established in Section 2, we shall study properties of the Lasso estimator β̂ in (4.1), in particular the rate of convergence of β̂ − β. Let X be the design matrix and write (1.1) as Y = Xβ + e. We assume that the true parameter vector β has at most s non-zero entries. We shall also assume that the restricted eigenvalue assumption RE(s, 3) of [4] holds with constant κ = κ(s, 3), namely that

κ(s, c_0) := min_{J⊆{1,...,p}, |J|≤s} min_{u≠0, |u_{J^c}|_1 ≤ c_0 |u_J|_1} |Xu|_2 / (√n |u_J|_2) > 0 (4.2)

holds with c_0 = 3, where u_J is the modification of u obtained by setting its elements outside J to zero. This condition is weaker than the restricted isometry property of [9]; see also [6, 28, 48, 49] for related conditions. Theorem 5 shows how the rate of convergence of |β̂ − β|_1 and the prediction error |X(β̂ − β)|_2² depend on q and α, which correspond to the moment condition and the dependence condition respectively. Let the regularization parameter be λ = 2r.

Theorem 5. Assume (4.2). Assume that the error sequence (e_i) has a finite qth moment, q > 2, and dependence-adjusted norm ‖e_·‖_{q,α} < ∞. (i) If α > 1/2 − 1/q, then with probability at least 1 − C_1 B^{−q} − C_2 p^{1−C_3 A²}, the bounds (4.4) and (4.5) hold. (ii) In the stronger-dependence case 0 < α < 1/2 − 1/q, with a suitably modified choice of r, (4.4) and (4.5) hold with the same probability as in (i).
Proof. As in the proof of Lemma B.1 in [4], since β̂ minimizes (4.1), we have the basic inequality (4.7). Let δ = β̂ − β. On the event A, inequality (4.7) implies the bounds (4.4) and (4.5), so they follow if we can control the probability P(A). For (i), by inequality (2.9) of Theorem 2 with n = 1, we can bound the tail of each coordinate, and under our choice of r the required bound on P(A) holds. Following the argument in [4], with probability at least 1 − C_1 B^{−q} − C_2 p^{1−C_3 A²}, we have (4.4) and (4.5). Case (ii) can be proved similarly.
Theorem 5 indicates how the allowed dimension breaks down if the moment condition weakens or the dependence becomes stronger. Assume that |X|_q ≍ (np)^{1/q} and ‖e_·‖_{q,α} < ∞ with α > 1/2 − 1/q; then by (4.3), the requirement r → 0 necessarily implies that (np)^{1/q}/n → 0, or p = o(n^{q−1}). In comparison, if the e_i are i.i.d. sub-Gaussian, then the condition log p = o(n) suffices. In the stronger-dependence case (ii), the more restrictive condition p = o(n^{qα+q/2}) is needed; the latter range of p is substantially narrower.
It is interesting to compare the two terms in r. Assume that |X|_q ≍ (np)^{1/q}. In the relatively low-dimensional case with p ≤ n^{q/2−1}(log n)^{q/2}, the Gaussian part, which corresponds to (n^{−1} log p)^{1/2}, is larger. On the other hand, if the dimension p is large, with p > n^{q/2−1}(log n)^{q/2}, then the tail part n^{−1}|X|_q ≍ n^{1/q−1}p^{1/q} dominates, and it is larger than the penalty of the form A(n^{−1} log p)^{1/2} that is used in the Gaussian-errors case.
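The stated crossover can be obtained by equating the two terms: with |X|_q ≍ (np)^{1/q},

(n^{−1} log p)^{1/2} ≍ n^{1/q−1} p^{1/q} ⟺ p ≍ (log p)^{q/2} n^{q/2−1},

and since log p ≍ log n in this regime, the tail part takes over once p exceeds roughly n^{q/2−1}(log n)^{q/2}.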
For sub-Gaussian errors, we have the following theorem; its proof is similar to that of the corresponding result in [4] and is omitted.

Model selection consistency
In this section we shall consider sign consistency for model selection based on the Lasso. Zhao and Yu [46] introduced the concept of sign consistency; see also [40, 27]. We write β = (β_1, . . . , β_s, β_{s+1}, . . . , β_p)^⊤, where β_i ≠ 0 if i ≤ s and β_i = 0 if i > s. Correspondingly we write X = (X_1, X_2), where X_1 is the n × s sub-matrix corresponding to the predictors with non-zero coefficients, and X_2 is the remaining n × (p − s) sub-matrix. We scale the diagonal entries of the Gram matrix Ψ_n = X^⊤X/n to be 1. We have the following Theorem 7, which extends the result in [46] to models with dependent errors. Note that, even in the independence case, our bound is sharper. We use the same conditions as those in [46]. The quantity η ∈ (0, 1) in Theorem 7 is from the strong irrepresentable condition in [46], namely

|X_2^⊤ X_1 (X_1^⊤ X_1)^{−1} sign(β_{(1)})|_∞ ≤ 1 − η,

where β_{(1)} = (β_1, . . . , β_s)^⊤ and the sign function, applied componentwise, is sign(u) = 1 if u > 0, −1 if u < 0 and 0 if u = 0. A small numerical check of this condition is sketched below.
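For a given design, the strong irrepresentable condition can be checked directly; a minimal R sketch (ours, for a hypothetical Gaussian design) is:

```r
# Check |X2' X1 (X1' X1)^{-1} sign(beta_(1))|_inf <= 1 - eta for a simulated design.
set.seed(3)
n <- 200; p <- 50; s <- 5
X <- matrix(rnorm(n * p), n, p)
X <- scale(X) * sqrt(n / (n - 1))       # scale the diagonal of X'X/n to 1, as in the text
X1 <- X[, 1:s]; X2 <- X[, (s + 1):p]
sgn <- rep(1, s)                        # sign(beta_(1)) when beta_1, ..., beta_s > 0
irr <- max(abs(t(X2) %*% X1 %*% solve(crossprod(X1), sgn)))
irr                                     # condition holds with eta = 1 - irr whenever irr < 1
```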
In the special case in which the e_i are i.i.d., a slightly improved version of Theorem 7 can be obtained. Let μ_q = ‖e_i‖_q, Γ_1 = (Σ_{j=1}^n max_{i≤s} |a_{ij}|^q)^{1/q} and Γ_2 = (Σ_{j=1}^n max_{i≤p−s} |b_{ij}|^q)^{1/q}. Note that Γ_1 ≤ |H_1|_q and Γ_2 ≤ |H_2|_q.

A simulation study
This section presents a simulation study to illustrate the effects of heavy tails and of dependence in the error and/or covariate processes on the performance of the Clime and Lasso estimators, with stochastic and deterministic designs respectively. For the former, we consider model (1.1) with the residual process (5.1), where the η_i are i.i.d. Student t_κ random variables with κ > 2 degrees of freedom and −1 < ρ < 1. The parameters ρ and κ control the dependence and the heaviness of the tails of (e_i), respectively; observe that e_i has mean 0 and variance 1. For the regressor process x_i, we use the vector autoregressive model (5.2), generating the entries used to build the coefficient matrix A as i.i.d. N(0, 1) variables. We choose a realization of A such that its spectral norm ‖A‖_2 < 1, so that (5.2) has a stationary solution. Once A is simulated, we keep it fixed throughout the simulation.
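Since the displays (5.1) and (5.2) are not reproduced above, the following R sketch of the data-generating mechanism is our reconstruction from the surrounding text; the exact scalings (e.g. the factor that fixes Var(e_i) = 1) are assumptions.

```r
set.seed(4)
n <- 25; p <- 100; rho <- -0.75; kappa <- 3
# Error process in the spirit of (5.1): AR(1) with normalized t_kappa innovations,
# so that e_i has mean 0 and variance 1 (assumed scaling).
eta <- rt(n + 100, df = kappa) / sqrt(kappa / (kappa - 2))
e <- as.numeric(stats::filter(sqrt(1 - rho^2) * eta, rho, method = "recursive"))
e <- e[-(1:100)]                          # drop burn-in
# Regressor process in the spirit of (5.2): stationary VAR(1), x_i = A x_{i-1} + z_i.
A <- matrix(rnorm(p * p), p, p)
A <- A * (0.9 / svd(A)$d[1])              # rescale so that the spectral norm ||A||_2 < 1
x <- matrix(0, n + 100, p)
for (i in 2:(n + 100)) x[i, ] <- A %*% x[i - 1, ] + rnorm(p)
X <- x[-(1:100), ]                        # the n x p design matrix
```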
We choose n = 25, p = 100 and β = (β_1, . . . , β_p)^⊤ with β_j = 1 if j ≤ 5 and β_j = 0 if j > 5. In our simulation we use the R flare package [23] to compute the Clime estimate. To study how the dependence and the heavy tails affect the convergence speed, we consider the tail probability ratio function

R_1(t) = P(|β̂ − β|_1 ≥ t) / P(|β̂† − β|_1 ≥ t), (5.3)

where β̂† is the Clime estimator of β in model (1.1) with x_i being i.i.d. standard normal R^p random vectors and e_i being i.i.d. N(0, 1) random variables, and β̂ is the Clime estimator with error process (5.1) and regressor process (5.2) under various dependence and tail conditions. The denominator in (5.3) can be viewed as a benchmark probability. The flare program suggests that the threshold value λ is around 0.6; hence in our simulation we use λ = 0.6. In the benchmark setting, based on 10^6 repetitions, the 99% and 99.9% quantiles of the L_1 distance |β̂† − β|_1 are estimated as 11.781 and 12.495, respectively. Table 1 presents the simulated values of the tail probability ratio function R_1(t) with t = 11.781 and t = 12.495, which correspond to the ratio between P(|β̂ − β|_1 ≥ t) under various dependence and moment conditions and the benchmark tail probabilities 0.01 and 0.001, respectively. For each combination of (ρ, ν, κ), the tail probability P(|β̂ − β|_1 ≥ t) is estimated by the proportion of the 5,000 values of |β̂ − β|_1 that are larger than t. Table 1 suggests the following phenomena, as expected from our theoretical results: (i) heavier tails (smaller ν or κ) can lead to larger P(|β̂ − β|_1 ≥ t), thus inflating the tail probability ratio R_1; (ii) the upper tail probability P(|β̂ − β|_1 ≥ t) with larger t is affected more than the one with smaller t. For example, with (ρ, ν, κ) = (−.75, 3, 3), the latter probability can be R_1(t) = 61.8 times larger than the nominal level 0.001 obtained under i.i.d. standard normal distributions.
For Lasso estimation with fixed design, we shall also focus on the tail behavior of the ℓ_1 error of the estimated parameters. We set p = 800 and n = 100, 200 and 400. In each of the three (n, p) combinations, we generate the n × p design matrix X with entries x_{ij} that are i.i.d. N(0, 1) random variables, and we fix the design matrix for all of the 3,000 repetitions of the simulation study. For the true value of the parameter vector β, we let the first s = 10 elements be non-zero and the rest be zero; each non-zero element equals b = 10 or −b, each with probability 1/2. We fix the marginal variance of e_i at Var(e_i) = 5². We use the R glmnet package for the Lasso computation [14], with the option that the intercept is set to 0. We first experiment with independent errors, where the e_i are i.i.d. Student t_ν, which has a polynomial tail; a t_ν variable has variance ν/(ν − 2), which we use to normalize the variance of e_i. We examine the performance of the Lasso over a range of values of the regularization parameter λ. For each value of λ, we compute the 99% quantile of the ℓ_1 error |β̂ − β|_1, estimated from the 3,000 repetitions. Figure 1 shows the results for ν = 100, 3 and 2.5 and for different sample sizes n = 100, 200 and 400; the t_{100} distribution is close to normal. It can be seen that the tail of the ℓ_1 error becomes heavier as ν decreases. A small-scale sketch of one cell of this experiment is given below.
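The sketch (our own small-scale version, with our own choice of λ and fewer repetitions) uses glmnet with the intercept set to 0, as in the text:

```r
library(glmnet)
set.seed(5)
n <- 100; p <- 200; s <- 10; b <- 10; nu <- 2.5; reps <- 200
X <- matrix(rnorm(n * p), n, p)                  # fixed design, kept across repetitions
beta <- c(sample(c(-b, b), s, replace = TRUE), rep(0, p - s))
lam <- 0.5                                       # one point on the lambda grid
l1err <- replicate(reps, {
  e <- 5 * rt(n, df = nu) / sqrt(nu / (nu - 2))  # t_nu errors normalized to Var = 5^2
  fit <- glmnet(X, X %*% beta + e, lambda = lam, intercept = FALSE)
  sum(abs(as.numeric(coef(fit))[-1] - beta))     # l1 error, dropping the intercept row
})
quantile(l1err, 0.99)                            # the tail functional reported in Figure 1
```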
We then continue to study the behavior of the Lasso for dependent and heavy-tailed errors. We consider a nonlinear autoregressive construction. We first generate a Gaussian autoregressive process ẽ_i = ρ ẽ_{i−1} + (1 − ρ²)^{1/2} ϵ_i, where the ϵ_i are i.i.d. N(0, 1), so that marginally ẽ_i ∼ N(0, 1). We then let e_i = g(ẽ_i), where the nonlinear transformation g(z) = F^{−1}(Φ(z)) turns ẽ_i into a random variable e_i that follows the t_ν distribution with ν = 2.5; here F is the cdf of t_ν and Φ is the standard normal cdf. We then normalize the marginal variance so that Var(e_i) = 5². Figure 2 shows the results for ρ = 0, 0.5 and 0.9. As the autocorrelation becomes stronger or the sample size n gets smaller, the tail of the ℓ_1 error becomes heavier.
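The transformation step is easy to reproduce; a minimal R sketch (ours; burn-in handling is simplified) is:

```r
# Gaussian AR(1) pushed through g(z) = F^{-1}(Phi(z)) to get dependent t_nu errors.
set.seed(6)
n <- 400; rho <- 0.9; nu <- 2.5
eps <- rnorm(n + 100)
zt <- as.numeric(stats::filter(sqrt(1 - rho^2) * eps, rho, method = "recursive"))
zt <- zt[-(1:100)]                    # approximately stationary N(0, 1) after burn-in
e <- qt(pnorm(zt), df = nu)           # marginally t_nu, serially dependent
e <- 5 * e / sqrt(nu / (nu - 2))      # normalize so that Var(e_i) = 5^2
```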
Our simulation studies show that the performance of the Lasso deteriorates if the errors have a heavy-tailed distribution and if there are dependencies among the errors. This is qualitatively consistent with the theoretical results we have obtained, although it is difficult for simulation studies to capture the rate of the tail decay quantitatively.