Variable Screening for High Dimensional Time Series

Variable selection is a widely studied problem in high dimensional statistics, primarily since estimating the precise relationship between the covariates and the response is of great importance in many scientific disciplines. However, most of the theory and methods developed towards this goal for the linear model invoke the assumption of iid sub-Gaussian covariates and errors. This paper analyzes the theoretical properties of Sure Independence Screening (SIS) (Fan and Lv [J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 849-911]) for high dimensional linear models with dependent and/or heavy tailed covariates and errors. We also introduce a generalized least squares screening (GLSS) procedure which utilizes the serial correlation present in the data. By utilizing this serial correlation when estimating our marginal effects, GLSS is shown to outperform SIS in many cases. For both procedures we prove sure screening properties, which depend on the moment conditions and the strength of dependence in the error and covariate processes, amongst other factors. Additionally, combining these screening procedures with the adaptive Lasso is analyzed. Dependence is quantified by functional dependence measures (Wu [Proc. Natl. Acad. Sci. USA 102 (2005) 14150-14154]), and the results rely on the use of Nagaev-type and exponential inequalities for dependent random variables. We also conduct simulations to demonstrate the finite sample performance of these procedures, and include a real data application of forecasting the US inflation rate.


Introduction
With the advancement of data acquisition technology, high dimensionality is a characteristic of data being collected in fields as diverse as health sciences, genomics, neuroscience, astronomy, finance, and macroeconomics. Applications where we have a large number of predictors for a relatively small number of observations are becoming increasingly common. For example, in disease classification we usually have thousands of variables, such as expression of genes, which are collected, while the sample size is usually in the tens. Other examples include fMRI data, where the number of voxels can number in the thousands and far outnumber the observations. For an overview of high dimensionality in economics and finance, see [22]. For the biological sciences, see [24,5] and references therein. The main goals in these situations, according to [4], are:
• To construct as effective a method as possible to predict future observations.
• To gain insight into the relationship between features and response for scientific purposes, as well as, hopefully, to construct an improved prediction method.
More formally, we are dealing with the linear model

y = Xβ + ε,     (1)

where y = (Y_1, . . ., Y_n)^T is an n-vector of responses, X = (x_1, . . ., x_n)^T is an n × p_n random design matrix, and ε = (ε_1, . . ., ε_n)^T is a random vector of errors. In addition, when the dimensionality of the predictors (p_n) is large, we usually make the assumption that the underlying coefficient vector (β) is sparse. Sparsity is a characteristic that is frequently found in many scientific applications [20], [31]. For example, in disease classification it is usually the case that only a small number of genes are relevant to predicting the outcome.
Indeed, there is a wealth of theoretical results and methods devoted to this issue. Our primary focus is on screening procedures. Sure Independence Screening (SIS), as originally introduced in [20], was applicable to the linear model, and is based on a ranking of the absolute values of the marginal correlations of the predictors with the response. This method allows one to deal with situations in which the number of predictors is of an exponential order of the number of observations, which they termed ultrahigh dimensionality. Further work on the topic has expanded the procedure to cover the case of generalized linear models [25], non-parametric additive models [19], the Cox proportional hazards model [18], single index hazard rate models [27], and varying coefficient models [23]. Model-free screening methods have also been developed. For example, screening using distance correlation was analyzed in [35], a martingale difference correlation approach was introduced in [42], and additional works include [57], [30], among others. For an overview of works related to screening procedures, one can consult [37]. The main result introduced with these methods is that, under appropriate conditions, we can reduce the predictor dimension from size p_n = O(exp(n^α)), for some α < 1, to a size d_n, while retaining all the relevant predictors with probability approaching 1.
Another widely used class of methods is based on the penalized least squares approach. An overview of these methods is provided in [21] and [5]. Examples of methods in this class are the Lasso [45] and the adaptive Lasso [58]. Various theoretical results have been established for this class of methods. They broadly fall into analyzing the prediction error |X(β̂ − β)|_2^2, the parameter estimation error |β̂ − β|_1, model selection consistency, as well as limiting distributions of the estimated parameters (see [9] for a comprehensive summary). Using screening procedures in conjunction with penalized least squares methods, such as the adaptive Lasso, presents a powerful tool for variable selection. Variable screening can allow us to quickly reduce the parameter dimension p_n significantly, which weakens the assumptions needed for model selection consistency of the adaptive Lasso [29,39].
A key limitation of the results obtained for screening methods is the assumption of independent observations. In addition, it is usually assumed that the covariates and the errors are sub-Gaussian (or sub-exponential). However, there are many examples of real world data where these assumptions are violated. Data observed over time and/or space, such as meteorological data, longitudinal data, and economic and financial time series, frequently exhibit covariates and/or errors which are serially correlated. One specific example is the case of fMRI time series, where there can exist a complicated spatial-temporal dependence structure in the errors and the covariates (see [48]). Another example is in forecasting macroeconomic indicators such as GDP or the inflation rate, where we have a large number of macroeconomic and financial time series, along with their lags, as possible covariates. Examples of heavy tailed and dependent errors and covariates can be found most prominently in financial, insurance, and macroeconomic data.
These examples stress why it is extremely important for variable selection methods to be capable of handling scenarios where the assumption of independent sub-Gaussian (or sub-exponential) observations is violated. Some works related to this goal for the Lasso include [46], which extended the Lasso to jointly model the autoregressive structure of the errors as well as the covariates. However, their method is applicable only to the case where p_n < n, and they assume an autoregressive structure where the order of the process is known. Meanwhile, [53] studied the theoretical properties of the Lasso assuming a fixed design in the case of heavy tailed and dependent errors. Additionally, [3] and [33] investigated theoretical properties of the Lasso for high-dimensional Gaussian processes. Most recently, [39] analyzed the adaptive Lasso for high dimensional time series while allowing for both heavy tailed covariate and error processes, with the additional assumption that the error process is a martingale difference sequence.
Some works related to this goal for screening methods include [36], which allows for heavy tailed errors and covariates. Additionally, [10], [54], and [57] also relax the Gaussian assumption, with the first two requiring the tails of the covariates and the response to be exponentially light, while the latter allows for heavy tailed errors provided the covariates are sub-exponential. Although these works relax the moment and distributional assumptions on the covariates and the response, they still remain in the framework of independent observations. A few works have dealt with correlated observations in the context of longitudinal data (see [13], [56]). However, the dependence structure of longitudinal data is too restrictive to cover the type of dependence present in most time series. Most recently, [12] proposed a non-parametric kernel smoothing screening method applicable to time series data. In their work they assume a sub-exponential response, covariates that are bounded and have a density, as well as that the sequence {(Y_i, x_i)} is strong mixing, with the additional assumption that the strong mixing coefficients decay geometrically. These assumptions can be quite restrictive; they exclude, for example, heavy tailed time series and discrete valued time series, which are common in fields such as macroeconomics, finance, and neuroscience, amongst others [16].
In this work, we study the theoretical properties of SIS for the linear model with dependent and/or heavy tailed covariates and errors. This allows us to substantially increase the number of situations in which SIS can be applied. However, one of the drawbacks to using SIS in a time series setting is that the temporal dependence structure between observations is ignored. In an attempt to correct this, we introduce a generalized least squares screening (GLSS) procedure, which utilizes this additional information when estimating the marginal effect of each covariate. By using GLS to estimate the marginal regression coefficient for each covariate, as opposed to the OLS used in SIS, we correct for the effects of serial correlation. Our simulation results show that the advantage of GLSS over SIS is most pronounced when we have strong levels of serial correlation and weak signals. Using the adaptive Lasso as a second stage estimator after applying the above screening procedures is also analyzed. Probability bounds for our combined two stage estimator being sign consistent are provided, along with comparisons between our two stage estimator and the adaptive Lasso as a stand-alone procedure.
Compared to previous work, we place no restrictions on the distribution of the covariate and error processes besides the existence of a certain number of finite moments. In order to quantify dependence, we rely on the functional dependence measure framework introduced by [49], rather than the usual strong mixing coefficients. Comparisons between functional dependence measures and strong mixing assumptions are discussed in Section 2. For both GLSS and SIS, we present the sure screening properties and show that the range of p_n can vary from the high dimensional case, where p_n is a power of n, to the ultrahigh dimensional case discussed in [20]. We detail how the range of p_n and the sure screening properties are affected by the strength of dependence and the moment conditions of the errors and covariates, the strength of the underlying signal, and the sparsity level, amongst other factors.
The rest of the paper is organized as follows: Section 2 reviews the functional and predictive dependence measures which will allow us to characterize the dependence in the covariate (x_i, i = 1, ..., n) and error processes. We also discuss the assumptions placed on the structure of the covariate and error processes; these assumptions are very mild, allowing us to represent a wide variety of stochastic processes which arise in practice. Section 3 presents the sure screening properties of SIS under a range of settings. Section 4 introduces the GLSS procedure and presents its sure screening properties. Combining these screening procedures with the adaptive Lasso is discussed in Section 5. Section 6 covers simulation results, while Section 7 discusses an application to forecasting the US inflation rate. Lastly, concluding remarks are in Section 8, and the proofs for all the results follow in the appendix.

Preliminaries
We shall assume the error sequence (ε_i) is a strictly stationary, ergodic process of the form

ε_i = g(. . ., e_{i−1}, e_i),     (2)

where g(·) is a real valued measurable function and e_i are iid random variables. This representation includes a very wide range of stochastic processes, such as linear processes, their non-linear transforms, Volterra processes, Markov chain models, and non-linear autoregressive models such as threshold autoregressive (TAR), bilinear, and GARCH models, among others (for more details see [50], [49]). This representation allows us to use the functional and predictive dependence measures introduced in [49]. The functional dependence measure for the error process is defined as

δ_q(ε_i) = ||ε_i − g(F*_i)||_q,     (3)

where F*_i = (. . ., e_{−1}, e*_0, e_1, . . ., e_i), with e*_0, e_j, j ∈ Z, being iid. Since we are replacing e_0 by e*_0, we can think of this as measuring the dependence of ε_i on e_0, as we are keeping all other inputs the same. The cumulative functional dependence measure is defined as Δ_{m,q}(ε) = Σ_{i=m}^∞ δ_q(ε_i). We assume weak dependence of the form

Δ_{0,q}(ε) < ∞.     (4)

The predictive dependence measure is related to the functional dependence measure, and is defined as

θ_q(ε_i) = ||E(ε_i|F_0) − E(ε_i|F_{−1})||_q,     (5)

where F_i = (. . ., e_{−1}, e_0, e_1, . . ., e_i), with e_i, i ∈ Z, being iid. The cumulative predictive dependence measure is defined as Θ_q(ε) = Σ_{l=0}^∞ θ_q(ε_l), and by Theorem 1 in [49] we obtain Θ_q(ε) ≤ Δ_{0,q}(ε).
Similarly, the covariate process is of the form

x_i^{(n)} = h(. . ., η_{i−1}, η_i) = (h_1(. . ., η_{i−1}, η_i), . . ., h_{p_n}(. . ., η_{i−1}, η_i)),     (6)

where h(·) is a measurable function and η_i, i ∈ Z, are iid random vectors. The superscript (n) denotes that the dimension of the vectors is a function of n; however, for presentational clarity we suppress the superscript (n) from here on and use x_i and η_i instead. Let H*_i = (. . ., η_{−1}, η*_0, η_1, . . ., η_i), with η*_0, η_j, j ∈ Z, being iid. As before, the functional dependence measure is δ_q(X_ij) = ||X_ij − h_j(H*_i)||_q, and the cumulative dependence measure for the covariate process is defined as

Φ_{m,q}(x) = max_{j≤p_n} Σ_{i=m}^∞ δ_q(X_ij).

The representations (2) and (6), along with the functional and predictive dependence measures, have been used in various works, including [52], [55], and [53], amongst others. Compared to strong mixing conditions, which are often difficult to verify, the above dependence measures are easier to interpret and compute, since they are related to the data generating mechanism of the underlying process [50]. In many cases using the functional dependence measure also requires less stringent assumptions. For example, consider the case of a linear process, ε_i = Σ_{j=0}^∞ f_j e_{i−j}, with e_i iid. Sufficient conditions for a linear process to be strong mixing involve: the density function of the innovations (e_i) being of bounded variation, restrictive assumptions on the decay rate of the coefficients (f_j), and invertibility of the process (see Theorem 14.9 in [14] for details). Additional conditions are needed to ensure strong mixing if the innovations of the linear process are dependent [17].
As a result, many simple processes can be shown to be non-strong mixing. A prominent example is an AR(1) model with iid Bernoulli(1/2) innovations, which fails to be strong mixing. These cases can be handled quite easily in our framework, since we are not placing distributional assumptions on the innovations, e_i, such as the existence of a density. For linear processes with iid innovations, representation (2) clearly holds, and (4) is satisfied if ||e_i||_q < ∞ and Σ_{j=0}^∞ |f_j| < ∞. For dependent innovations, suppose we have e_i = h(. . ., a_{i−1}, a_i), where h(·) is a real valued measurable function and a_i, i ∈ Z, are iid. Then ε_i = Σ_{j=0}^∞ f_j e_{i−j} has a causal representation, and satisfies (4) under summability conditions on the coefficients (f_j) and the cumulative dependence of (e_i) (see [51]).
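For the linear process example above, the functional dependence measure can be computed explicitly, which is what makes conditions such as (4) easy to check. Writing ε*_i for ε_i with e_0 replaced by e*_0, a short derivation gives:

```latex
\delta_q(\epsilon_i) \;=\; \|\epsilon_i - \epsilon_i^{*}\|_q
 \;=\; |f_i|\,\|e_0 - e_0^{*}\|_q ,
\qquad
\Delta_{m,q}(\epsilon) \;=\; \sum_{i=m}^{\infty}\delta_q(\epsilon_i)
 \;=\; \|e_0 - e_0^{*}\|_q \sum_{i=m}^{\infty} |f_i| .
```

So Δ_{0,q}(ε) < ∞ holds as soon as ||e_0||_q < ∞ and the coefficients are absolutely summable, and a polynomial decay f_i = O(i^{−β}), β > 1, yields Δ_{m,q}(ε) = O(m^{−(β−1)}).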

SIS with Dependent Observations
Sure Independence Screening, as introduced by Fan and Lv [20], is a method of variable screening based on ranking the magnitudes of the p_n marginal regression estimates. Under appropriate conditions, this simple procedure is shown to possess the sure screening property. The method is as follows. Let

ρ̂_j = (Σ_{t=1}^n X²_tj)^{−1} Σ_{t=1}^n X_tj Y_t,  j = 1, . . ., p_n.

Therefore, ρ̂_j is the OLS estimate of the linear projection of Y_t onto X_tj. Now let M* = {1 ≤ j ≤ p_n : β_j ≠ 0}, and let |M*| = s_n << n be the size of the true sparse model. We then sort the elements of ρ̂ = (ρ̂_1, . . ., ρ̂_{p_n}) by their magnitudes. For any given γ_n, define the sub-model

M̂_{γ_n} = {1 ≤ j ≤ p_n : |ρ̂_j| ≥ γ_n},

and let |M̂_{γ_n}| = d_n be the size of the selected model. The sure screening property states that for an appropriate choice of γ_n, we have P(M* ⊂ M̂_{γ_n}) → 1.
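The screening step itself is straightforward to implement. The sketch below (with toy data and tuning values that are our own illustrative choices, not from the paper) ranks the marginal OLS coefficients ρ̂_j and keeps the d_n largest in magnitude:

```python
import numpy as np

def sis_screen(X, y, d_n):
    """Rank predictors by |marginal OLS coefficient| and keep the top d_n."""
    # rho_hat_j = (sum_t X_tj^2)^(-1) * sum_t X_tj * Y_t, for each column j
    rho_hat = (X * y[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
    keep = np.argsort(-np.abs(rho_hat))[:d_n]
    return set(keep.tolist()), rho_hat

# Toy illustration: p = 50 predictors, only the first three are active.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [1.5, -1.0, 0.8]
y = X @ beta + rng.standard_normal(n)
M_hat, rho_hat = sis_screen(X, y, d_n=10)
```

On this iid toy design the three active predictors have the largest marginal coefficients and survive the cut; in the dependent-data settings studied here, only the theory above changes, not the computation.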
Throughout this paper let Y_t = Σ_{i=1}^{p_n} X_ti β_i + ε_t, x_t = (X_t1, . . ., X_tp_n), Σ = cov(x_t), and let X_k be the k-th column of X. In addition, we assume Var(Y_t), Var(X_tj) = O(1), ∀j ≤ p_n. Note that x_t can contain lagged values of Y_t. Additionally, let ρ_j = (E(X²_tj))^{−1} E(X_tj Y_t). For a vector a = (a_1, . . ., a_n), sgn(a) denotes its sign vector, with the convention that sgn(0) = 0, and |a|_p^p = Σ_{i=1}^n |a_i|^p. For a square matrix A, let λ_min(A) and λ_max(A) denote the minimum and maximum eigenvalues respectively. For any matrix A, let ||A||_∞ and ||A||_2 denote the maximum absolute row sum of A and the spectral norm of A respectively. Lastly, we will use C, c to denote generic positive constants which can change between instances.
Condition A is standard in screening procedures; it assumes the marginal signals of the active predictors cannot be too small. Condition B assumes the covariates and the errors are contemporaneously uncorrelated. This is significantly weaker than the independence between the error sequence and the covariates that is usually assumed. Condition C presents the structure, dependence, and moment conditions of the covariate and error processes. Notice that higher values of α_x, α_ε indicate weaker temporal dependence.
Examples of error and covariate processes which satisfy Condition C are as follows. If ε_i is a linear process, ε_i = Σ_{j=0}^∞ f_j e_{i−j}, with e_i iid, ||e_i||_q < ∞, and f_j = O(j^{−β}) for some β > 1, then Δ_{m,q}(ε) = O(m^{−(β−1)}) and α_ε = β − 1. We have a geometric decay rate in the cumulative functional dependence measure if ε_i satisfies the geometric moment contraction (GMC) condition; see [41]. Conditions needed for a process to satisfy the GMC condition are given in Theorem 5.1 of [41]. Examples of processes satisfying the GMC condition include stationary, causal finite order ARMA, GARCH, ARMA-GARCH, bilinear, and threshold autoregressive processes, amongst others (see [50] for details).
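Geometric decay rates of this kind are easy to visualize for a stable VAR(1) covariate process x_t = A x_{t−1} + η_t (the coefficient matrix below is our own illustrative example, not from the paper): the dependence of x_m on the innovation η_0 is channeled through A^m, so ||A^m||_2 decays at the rate of the spectral radius of A:

```python
import numpy as np

# A stable VAR(1) coefficient matrix; its eigenvalues are 0.6 and 0.3,
# so the spectral radius is 0.6 < 1.
A = np.array([[0.5, 0.2],
              [0.1, 0.4]])

# x_m depends on eta_0 through A^m, so the functional dependence of x_m
# on eta_0 is governed by ||A^m||_2, which decays geometrically.
norms = [np.linalg.norm(np.linalg.matrix_power(A, m), 2) for m in range(30)]
spectral_radius = max(abs(np.linalg.eigvals(A)))
```

Here ||A^20||_2 is already below 0.7^20, illustrating the geometric decay that drives the geometric bound on the cumulative dependence measure for stable vector autoregressions.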
For the covariate process, suppose x_i is a vector linear process,

x_i = Σ_{l=0}^∞ A_l η_{i−l},

where A_l are p_n × p_n coefficient matrices and η_i = (η_i1, . . ., η_ip_n)^T are iid random vectors with cov(η_i) = Σ_η. For simplicity, assume η_ij (j = 1, . . ., p_n) are identically distributed; then δ_q(X_ij) = O(|A_{i,j}|_2), where A_{i,j} is the j-th column of A_i^T. In particular, for stable VAR(1) processes, Φ_{m,2}(x) decays geometrically in m [11]. For stable VAR(k) processes, we can rewrite the process in its VAR(1) companion form, x̃_t = B̃_1 x̃_{t−1} + η̃_t, and by Section 11.3.2 in [38], the process x̃_t is stable if and only if x_t is stable. Therefore, if B̃_1 is diagonalizable, we have Φ_{m,2}(x) = O(a^m), where a represents the largest eigenvalue in magnitude of B̃_1; by the stability of x_t, a ∈ (0, 1). Additional examples of error and covariate processes which satisfy Condition C are given in [52] and [53] respectively. For ease of presentation, we define the quantities K_{x,r} and K_{ε,q}, which depend on the cumulative dependence measures of the covariate and error processes, and the exponents ω and ι, which collect the corresponding dependence and moment parameters. The following theorem gives the sure screening properties, and provides a bound on the size of the selected model.

Theorem 1. Suppose Conditions A, B, C hold.
(i) For any c_2 > 0, we obtain bounds of two types: for large n the polynomial terms dominate, whereas for small values of n the exponential terms dominate. The covariate dimension (p_n) can be as large as o(min(s_n(n/s_n)^{r/2−rκ/2}/n^ω, n^{τ−τκ}/n^ι)). The range of p_n depends on the dependence in both the covariate and the error processes, the strength of the signal (κ), the moment conditions, and the sparsity level (s_n). If we assume s_n = O(1), r = q, and α_ε ≥ 1/2 − 2/r, then p_n = o(n^{r/2−rκ/2−1}). For the case of iid errors and covariates, we would replace K_{x,r}, K_{ε,q} in Theorem 1 with max_{j≤p_n}||X_ij||_{r/2} and ||ε_i||_q respectively. Therefore, for the case of weaker dependence in the covariate and error processes (i.e. α_x > 1/2 − 2/r and α_ε > 1/2 − 2/q), our range for p_n is reduced only by a constant factor. However, our range for p_n is significantly reduced in the case of stronger dependence in the error or covariate processes (i.e. either α_x < 1/2 − 2/r or α_ε < 1/2 − 2/q). For instance, if α_x = α_ε = α and q = r, our range for p_n is reduced by a factor of n^{r/4−αr/2} in the case of stronger dependence.
In the iid setting, to achieve sure screening in the ultrahigh dimensional case, [20] assumed the covariates and errors are jointly normally distributed. Subsequent works applicable to the linear model, such as [25] and [19] among others, relaxed this Gaussian assumption, but generally assumed the tails of the covariates and errors are exponentially light. Compared to the existing results for iid observations, our moment conditions preclude us from dealing with the ultrahigh dimensional case. However, our setting is far more general in that it allows for dependent and heavy tailed covariates and errors. In addition, we allow the covariate and error processes to be dependent on each other, with the mild restriction that E(X_tj ε_t) = 0, ∀j ≤ p_n.

Ultrahigh Dimensionality under dependence
It is possible to achieve the sure screening property in the ultrahigh dimensional setting with dependent errors and covariates. However, we need to make stronger assumptions on the moments of both the error and covariate processes. Until now we have assumed the existence of a finite qth moment, which restricted the range of p_n to a power of n. If the error and covariate processes are assumed to satisfy a stronger moment condition, such as Δ_{0,q}(ε) < ∞ and Φ_{0,q}(x) < ∞ for arbitrary q > 0, we can achieve a much larger range of p_n, which will cover the ultrahigh dimensional case discussed in [20]. More formally, we have:

Condition D: Assume the error and the covariate processes have representations (2) and (6) respectively. Additionally, assume υ_x = sup_{q≥2} q^{−α̃_x} Φ_{0,q}(x) < ∞ and υ_ε = sup_{q≥2} q^{−α̃_ε} Δ_{0,q}(ε) < ∞, for some α̃_x, α̃_ε ≥ 0.

By Theorem 3 in [53], Condition D implies the tails of the covariate and error processes are exponentially light. There is a wide range of processes which satisfy the above condition. For example, if ε_i is a linear process, ε_i = Σ_{j=0}^∞ f_j e_{i−j}, with e_i iid and sub-Gaussian, we have α̃_ε = 1/2. Similarly, if e_i is sub-exponential, we have α̃_ε = 1. More generally, for ε_i = Σ_{j=0}^∞ f_j e^p_{i−j}, if e_i is sub-exponential, we have α̃_ε = p. Similar results hold for the vector linear processes discussed previously.
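To make Condition D concrete for the linear-process example: sub-Gaussian e_i gives ||e_0 − e*_0||_q = O(q^{1/2}), so, assuming Σ_j |f_j| < ∞,

```latex
\Delta_{0,q}(\epsilon) \;=\; \|e_0 - e_0^{*}\|_q \sum_{j=0}^{\infty} |f_j|
 \;=\; O\big(q^{1/2}\big)
\qquad\Longrightarrow\qquad
\sup_{q \ge 2}\, q^{-1/2}\,\Delta_{0,q}(\epsilon) \;<\; \infty ,
```

which yields the exponent 1/2 in Condition D for the error process; sub-exponential e_i instead gives ||e_0 − e*_0||_q = O(q), and hence exponent 1, matching the rates quoted above.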
Condition D is primarily a restriction on the rate at which ||ε_i||_q and max_{j≤p_n}||X_ij||_q increase as q → ∞. We remark that, for any fixed q, we are not placing additional assumptions on the temporal decay rate of the covariate and error processes besides requiring Δ_{0,q}(ε), Φ_{0,q}(x) < ∞. In comparison, in the ultrahigh dimensional setting, [12] requires geometrically decaying strong mixing coefficients, in addition to requiring sub-exponential tails for the response. As an example, if we assume ε_i = Σ_{j=0}^∞ f_j e_{i−j}, geometrically decaying strong mixing coefficients would require the coefficients, f_j, to decay geometrically. Whereas in Condition D, the only restriction we place on the coefficients, f_j, is absolute summability.
Theorem 2. Suppose Conditions A, B, D hold, and define α = 2/(1 + 2α̃_x + 2α̃_ε) and α* = 2/(1 + 4α̃_x). Then, for any c_2 > 0, sure screening holds with exponential-type probability bounds analogous to those of Theorem 1. From Theorem 2, we infer that the covariate dimension (p_n) can reach the ultrahigh dimensional range. As in Theorem 1, the range of p_n depends on the dependence in both the covariate and the error processes, the strength of the signal (κ), the moment conditions, and the sparsity level (s_n).
For the case of iid covariates and errors, we would replace υ_x and υ_ε with µ_{r/2} = max_{j≤p_n}||X_ij||_{r/2} and ||ε_i||_q respectively. In contrast to Theorem 1, temporal dependence affects our range of p_n only by a constant factor. If we assume s_n = O(1) and both the covariate and error processes are sub-Gaussian, we obtain p_n = o(exp(n^c)) for some c ∈ (0, 1) depending on α, with a smaller exponent in the sub-exponential case. In contrast, Fan and Lv [20], assuming independent observations, allow for a larger range p_n = o(exp(n^{1−2κ})). However, their work relied critically on the Gaussian assumption. Fan and Song [25] relax the Gaussian assumption by allowing for sub-exponential covariates and errors, and our rates are similar to theirs up to a constant factor. Additionally, in our work we relax the sub-exponential assumption, provided the tails of the covariates and errors are exponentially light.

Generalized Least Squares Screening (GLSS)
Consider the marginal model

Y_t = X_tk ρ_k + ε_{t,k},     (14)

where ρ_k is the coefficient of the linear projection of Y_t onto X_tk. In SIS, we rank the magnitudes of the OLS estimates of this projection. In a time series setting, if we are considering the marginal model (14), it is likely that the marginal errors (ε_{t,k}) will be serially correlated. This holds even if we assume that the errors (ε_t) in the full model (1) are serially uncorrelated. A procedure which accounts for this serial correlation, such as generalized least squares (GLS), will provide a more efficient estimate of ρ_k. We first motivate our method by considering a simple univariate model. Assume Y_t = βX_t + ε_t, where the errors follow an AR(1) process, ε_t = ρε_{t−1} + θ_t, with θ_t and X_t iid standard Gaussian. We set β = 0.5, n = 200, and estimate the model using both OLS and GLS for values of ρ ranging from 0.5 to 0.95. The mean absolute errors of both procedures are plotted in Figure 1. We observe that the performance of OLS steadily deteriorates for increasing values of ρ, while the performance of GLS stays constant. This suggests that a screening procedure based on GLS estimates will be most useful in situations where we have weak signals and high levels of serial correlation.
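The motivating experiment can be reproduced in a few lines. The sketch below is our own minimal version: it fixes ρ = 0.9 rather than sweeping a grid, and implements the infeasible GLS by quasi-differencing with the true ρ:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, beta, rho):
    """Y_t = beta * X_t + eps_t, with AR(1) errors eps_t = rho*eps_{t-1} + theta_t."""
    x = rng.standard_normal(n)
    eps = np.zeros(n)
    for t in range(1, n):
        eps[t] = rho * eps[t - 1] + rng.standard_normal()
    return x, beta * x + eps

def ols(x, y):
    return (x @ y) / (x @ x)

def gls_ar1(x, y, rho):
    # Infeasible GLS: whiten both series by quasi-differencing with the true rho.
    xs, ys = x[1:] - rho * x[:-1], y[1:] - rho * y[:-1]
    return (xs @ ys) / (xs @ xs)

beta, rho, n, reps = 0.5, 0.9, 200, 300
err_ols, err_gls = [], []
for _ in range(reps):
    x, y = simulate(n, beta, rho)
    err_ols.append(abs(ols(x, y) - beta))
    err_gls.append(abs(gls_ar1(x, y, rho) - beta))
mae_ols, mae_gls = np.mean(err_ols), np.mean(err_gls)
```

With iid regressors and strongly autocorrelated errors, the whitened regression has much lower variance, so the GLS mean absolute error comes out well below the OLS one, consistent with Figure 1.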
The infeasible GLS estimate of ρ_k is

β̃^M_k = (X_k^T Σ_k^{−1} X_k)^{−1} X_k^T Σ_k^{−1} y,

where X_k is the k-th column of X, and Σ_k = (γ_{i−j,k})_{1≤i,j≤n} is the autocovariance matrix of ε_k = (ε_{t,k}, t = 1, . . ., n). Given that Σ_k needs to be estimated to form our GLS estimates, we use the banded autocovariance matrix estimator introduced in [52], which is defined as

Σ̂_{k,l_n} = (γ̂_{i−j,k} 1_{|i−j|≤l_n})_{1≤i,j≤n},

where l_n is our band length and γ̂_{r,k} is the sample autocovariance at lag r of the residuals ε̂_{t,k} = Y_t − ρ̂_k X_tk, with ρ̂_k the OLS estimate of ρ_k. Our GLS estimator is now

β̂^M_k = (X_k^T Σ̂_{k,l_n}^{−1} X_k)^{−1} X_k^T Σ̂_{k,l_n}^{−1} y.

When E(ε_k|X_k) = 0, by the Gauss-Markov theorem it is clear that β̃^M_k is efficient relative to the OLS estimator. [1] showed that, under non-stochastic regressors and appropriate conditions on the error process, a two stage sieve type GLS estimator has the same limiting distribution as the infeasible GLS estimator β̃^M_k. In the appendix, we provide the appropriate conditions under which our GLS estimator, β̂^M_k, and the infeasible GLS estimate, β̃^M_k, have the same asymptotic distribution.
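The pieces above — pilot OLS fit, banded residual autocovariance, and the GLS solve — fit together as in the following sketch for a single covariate. The triangular taper and the small ridge term are our own numerical safeguards for small samples, not part of the paper's asymptotic theory:

```python
import numpy as np

def banded_autocov(resid, l_n):
    """Banded n x n autocovariance estimate built from the sample
    autocovariances of the marginal residuals (a sketch of [52])."""
    n = len(resid)
    gamma = [resid[: n - r] @ resid[r:] / n for r in range(l_n + 1)]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(max(0, i - l_n), min(n, i + l_n + 1)):
            S[i, j] = gamma[abs(i - j)]
    return S

def glss_coef(xk, y, l_n):
    """Feasible GLS estimate of the marginal coefficient rho_k (a sketch)."""
    rho_ols = (xk @ y) / (xk @ xk)      # pilot OLS fit
    resid = y - rho_ols * xk            # marginal residuals
    S = banded_autocov(resid, l_n)
    n = len(y)
    # Triangular (Bartlett) taper keeps the banded estimate positive definite;
    # the tiny ridge is purely a numerical safeguard of ours.
    idx = np.arange(n)
    R = np.maximum(1.0 - np.abs(idx[:, None] - idx[None, :]) / l_n, 0.0)
    Si = np.linalg.inv(S * R + 1e-3 * np.eye(n))
    return (xk @ Si @ y) / (xk @ Si @ xk)

# Illustration on a single covariate with strongly autocorrelated errors.
rng = np.random.default_rng(2)
n = 200
xk = rng.standard_normal(n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.9 * eps[t - 1] + rng.standard_normal()
y = 0.5 * xk + eps
est = glss_coef(xk, y, l_n=10)
```

Applied across all p_n columns, ranking |glss_coef| in place of the marginal OLS estimates gives the GLSS screening step.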
Although Σ_k is positive definite, the banded estimate of Σ_k is not guaranteed to be positive definite; however, it is asymptotically positive definite (see Lemma 1). For small samples, we can preserve positive definiteness by using the tapered estimate Σ̂_k * R_{l_n}, where R_{l_n} is a positive definite kernel matrix and * denotes coordinate-wise multiplication. For example, we can choose R_{l_n} = (max(1 − |i−j|/l_n, 0))_{1≤i,j≤n}. We need the following conditions for the sure screening property to hold:

Condition H: Assume ε_{t,k}, ε_t are of the form (2), and the covariate process is of the form (6). Additionally, we assume the decay rates Δ_{m,q}(ε) = O(m^{−α_ε}) and Φ_{m,r}(x) = O(m^{−α_x}), for some α_x, α_ε > 0, with α = min(α_x, α_ε) and q̃ = min(q, r) ≥ 4.
In Condition E, we can let the order K diverge to infinity at a slow rate, e.g., O(log(n)); for simplicity we set K to be a constant. Assuming a finite order AR model for the marginal error process is reasonable in most practical situations, since any stationary process with a continuous spectral density function can be approximated arbitrarily closely by a finite order linear AR process (see Corollary 4.4.2 in [7]). For further details on linear AR approximations to stationary processes, see [1] and [8]. We remark that, compared to previous works [1,34], knowledge about the structure of the marginal errors is not necessary in estimating β^M_k, since we use a non-parametric estimate of Σ_k. Therefore, Condition E is assumed strictly for technical reasons.
For Condition F, from (14), we obtain a decomposition of the marginal error ε_{t,k}. Under an assumption on the cross covariance between the covariates and the errors, we believe the advantage in using GLSS is due to the GLS estimator being robust to serial correlation in the marginal error process (see the appendix for details).
For Condition H, since ε_{t,k} = Y_t − X_tk ρ_k, we have ε_{t,k} = r_k(. . ., θ_{t−1}, θ_t), where r_k(·) is a measurable function and θ_t = (η_t, e_t). If we assume e_t and η_i are independent for i ≠ t, then the θ_i are iid, and ε_{t,k} has representation (2). Given Condition H, it follows that K_{x,r}, K_{ε,q} < ∞. For the case of exponentially light tails, we define φ = 2/(1 + 2α̃_x + 2ϕ), φ* = 2/(1 + 4ϕ), and α* = 2/(1 + 4α̃_x). Lastly, for ease of presentation, we introduce shorthand notation for the resulting rates. We first present the following lemma, which provides deviation bounds on ||Σ̂_{k,l_n} − Σ_k||_2. This lemma, which is of independent interest, will allow us to obtain deviation bounds on our GLSS estimates.
Lemma 1. Assume the band length l_n = c log(n) for sufficiently large c > 0.
(i) Assume Condition H holds. Then for κ ∈ [0, 1/2), a polynomial-type deviation bound holds for ||Σ̂_{k,l_n} − Σ_k||_2. (ii) Assume Condition I holds. Then for κ ∈ [0, 1/2), an exponential-type deviation bound holds.

The following theorem gives the sure screening properties of GLSS:

Theorem 3. Assume the band length l_n = c log(n) for sufficiently large c > 0. (i) Assume Conditions E, F, G, H hold; then for any c_2 > 0, a polynomial-type sure screening bound holds. (ii)-(iii) Assume Conditions E, F, G, I hold; then for any c_2 > 0, exponential-type bounds hold. (iv) Assume Conditions E, F, G, I hold; then for γ_n = c_5 n^{−κ} with c_5 ≤ c_6/2, the sure screening property holds.

In Lemma 1, the rate of decay also depends on the band length (l_n). The band length primarily depends on the decay rate of the autocovariances of the process ε_{t,k}. Since we are assuming an exponential decay rate, we can set l_n = c log(n). We omit the exponential terms in the bounds for part (i) of Lemma 1, and parts (i) and (ii) of Theorem 3, to conserve space and provide a cleaner result. For GLSS, the range of p_n also depends on the band length (l_n), in addition to the moment conditions and the strength of dependence in the covariate and error processes. For example, if we assume r = q and α ≥ 1/2 − 2/r, then, compared to SIS, the range of p_n is lower by a factor of l_n^{r/2+1}. We conjecture that this is due to our proof strategy, which relies on a deviation bound on ||Σ̂_{k,l_n} − Σ_k||_2 and uses the functional dependence measure, rather than autocorrelation, to quantify dependence. In practice, we believe that using GLSS, which corrects for serial correlation and uses an estimator with lower asymptotic variance, will achieve better performance. We illustrate this in more detail in our simulations section, and in the appendix (Section 9.2).
Similar to SIS, we can control the size of the model selected by GLSS. For the case when β^M_k = ρ_k ∀k, the bound on the selected model size is the same as in SIS. However, we need an additional assumption in order to bound the selected model size by the model size selected by SIS; this is formalized in Theorem 4.

Second Stage Selection with Adaptive Lasso
The adaptive Lasso, as introduced by [58], is the solution to

β̂ = argmin_β (1/2n)|y − Xβ|_2^2 + λ_n Σ_{j=1}^{p_n} ŵ_j|β_j|,

where ŵ_j = 1/|β̂_{I,j}| and β̂_{I,j} is our initial estimate. For sign consistency when p_n >> n, the initial estimates can be the marginal regression coefficients, provided the design matrix satisfies the partial orthogonality condition as stated in [29], or we can use the Lasso as our initial estimator, provided the restricted eigenvalue condition holds (see [39]). Both of these conditions can be stringent when p_n >> n. This makes the adaptive Lasso a very attractive option as a second stage variable selection method, after using screening to significantly reduce the dimension of the feature space. We have the following estimator:

β̂_{M̂_{γ_n}} = argmin_β (1/2n)|y − X_{M̂_{γ_n}}β|_2^2 + λ_n Σ_{j=1}^{d_n} ŵ_j|β_j|,

where X_{M̂_{γ_n}} denotes the n × d_n submatrix of X obtained by extracting its columns corresponding to the indices in M̂_{γ_n}. We additionally define X_{M²_{γ_n}} accordingly. Our initial estimator β̂_I = (β̂_{I,1}, . . ., β̂_{I,d_n}) is obtained using the Lasso. Let Σ̂_{M̂_{γ_n}} = X^T_{M̂_{γ_n}} X_{M̂_{γ_n}}/n, and let Σ_{M̂_{γ_n}} be its population counterpart. Our two stage estimator, β̂_{M_{γ_n}}, is then formed by inserting zeroes for the covariates which were excluded in the screening step, and inserting the adaptive Lasso estimates, β̂_{M̂_{γ_n}}, for the covariates which were selected by the screening step. We need the following conditions for the combined two stage estimator to achieve sign consistency:

Condition J: The matrix Σ_{M²_{γ_n}} satisfies the restricted eigenvalue condition RE(s_n, 3) (see [6] for details), where the corresponding restricted eigenvalue is denoted φ_0.

Condition K: Let λ_n and λ_{I,n} be the regularization parameters of the adaptive Lasso and the initial Lasso estimator respectively. For some ψ ∈ (0, 1), we place rate restrictions on λ_n and λ_{I,n}.

Condition L: Let β_min = min_{i≤s_n}|β_i|, and w_max = max_{i≤s_n} w_i > 0. Assume β_min exceeds constant multiples of λ_n w_max and λ_{I,n}√s_n/φ_0; that is, the smallest non-zero coefficient may decay with the sample size, but not too quickly.
Condition J allows us to use the Lasso as our initial estimator. Notice that we placed the RE($s_n$, 3) assumption on the matrix $\Sigma_{M_{\gamma_n/2}}$, rather than the matrix $\Sigma_{\hat{M}_{\gamma_n}}$, given that the indices in $\hat{M}_{\gamma_n}$ are random as a result of our screening procedure. Recall that for SIS, $M_{\gamma_n/2} = \{j \le p_n : |\rho_j| \ge \gamma_n/2\}$, and for GLSS we have a similar definition. Therefore, we are placing the RE($s_n$, 3) assumption on the population covariance matrix of a fixed set of $d_n$ predictors. Conditions K and L are standard assumptions, and are similar to the ones used in [39]. Condition K primarily places restrictions on the rate of increase of $\lambda_n$ and $\lambda_{I,n}$. Condition L places a lower bound on the magnitude of the non-zero parameters which decays with the sample size.
The next theorem deals with the two stage SIS-adaptive Lasso estimator. A very similar result applies to the two stage GLSS-adaptive Lasso estimator if we replace Conditions A, B, C (resp. D) with Conditions E, F, G, H (resp. I); to avoid repetition we omit the result. For the following theorem, the terms $\iota$, $\omega$, $K_{x,r}$, and $K_{\epsilon,q}$ have been defined in the paragraph preceding Theorem 1, and $\alpha_x$, $\alpha_\epsilon$ have been defined in Theorem 2.
Theorem 5. (i) Assume Conditions A, B, C, J, K, L hold; then for $\gamma_n = c_3 n^{-\kappa}$ with $c_3 \le c_1/2$, we have: (ii) Assume Conditions A, B, C, J, K, L hold; then for $\gamma_n = c_3 n^{-\kappa}$ with $c_3 \le c_1/2$, we have: To achieve sign consistency for the case of finite polynomial moments we require: For the case of exponential moments, we require: From Conditions M, N, and Theorem 3, we see an additional benefit of using the two stage selection procedure as opposed to using the adaptive Lasso as a stand alone procedure. For example, if we assume $d_n \le n^{2\kappa}$, $\lambda_{\max}(\Sigma) = O(n)$, and that both the error and covariate processes are sub-Gaussian, we obtain $p_n = o(\exp(n^{\iota}))$ for the two stage estimator. By setting $d_n = p_n$, we obtain the result when using the adaptive Lasso as a stand alone procedure, with the Lasso as its initial estimator. Under the scenario detailed above, the dimension of the feature space, which depends on $\lambda_n$ and $\psi$, for the stand alone adaptive Lasso can be at most $p_n = o(\exp(n^{1/6}))$. Therefore for $\kappa < 1/4$, we obtain a larger range for $p_n$ and a faster rate of decay using the two stage estimator. For $\kappa \ge 1/4$, it is not clear whether the two stage estimator has a larger range for $p_n$ compared to using the adaptive Lasso alone.
The sign consistency of the stand alone adaptive Lasso estimator in the time series setting was established in [39]. Their result was obtained under strong mixing assumptions on the covariate and error processes, with the additional assumption that the error process is a martingale difference sequence. Additionally, in the ultrahigh dimensional setting they require a geometric decay rate on the strong mixing coefficients. In contrast, we obtain results for both the two stage and stand alone adaptive Lasso estimators, and our results are obtained using the functional dependence measure framework. Besides assuming moment conditions, we are not placing any additional assumptions on the temporal decay of the covariate and error processes other than $\Delta_{0,q}(\epsilon), \Phi_{0,q}(x) < \infty$. Furthermore, we weaken the martingale difference assumption they place on the error process, thereby allowing for serial correlation in the error process. Finally, by using Nagaev type inequalities introduced in [53], our results are easier to interpret and also allow us to obtain a higher range for $p_n$.

Simulations
In this section, we evaluate the performance of SIS, GLSS, and the two stage selection procedure using the adaptive Lasso. For GLSS, instead of using the banded estimate for $\Sigma_k$ we use a tapered estimate, $\hat{\Sigma}_k \ast R_{l_n}$, where $\hat{\Sigma}_k = (\hat{\gamma}_{i-j,k})_{1 \le i,j \le n}$, $R_{l_n} = (\max(1 - \frac{|i-j|}{l_n}, 0))_{1 \le i,j \le n}$ is the triangular kernel, and $\ast$ denotes the entrywise product. We fix $l_n = 15$, and we observed the results were fairly robust to the choice of $l_n$. In our simulated examples, we fix $n = 200$, $s_n = 6$, and $d_n = n - 1$, while we vary $p_n$ from 1000 to 5000. We repeat each experiment 200 times. For the screening procedures, we report the proportion of times the true model is contained in our selected model. For the two stage procedure using the adaptive Lasso, we report the proportion of times there was a $\lambda_n$ on the solution path which selected the true model. Consider model (1); for the covariate process we have:
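The tapered autocovariance estimate and the resulting marginal GLS coefficient can be sketched as follows. This is a minimal numpy illustration, not the authors' code: the pilot OLS residual step and the small ridge added before inversion are implementation conveniences.

```python
import numpy as np

def tapered_cov(resid, l_n=15):
    """Tapered autocovariance matrix estimate: the sample autocovariance
    Toeplitz matrix multiplied entrywise by the triangular kernel R_{l_n}."""
    n = len(resid)
    # sample autocovariances gamma_hat_h, h = 0, ..., n-1
    gamma = np.array([resid[: n - h] @ resid[h:] / n for h in range(n)])
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    taper = np.maximum(1.0 - idx / l_n, 0.0)
    return gamma[idx] * taper

def glss_marginal(x, y, l_n=15):
    """Marginal GLS coefficient of y on a single covariate x, with the
    error covariance estimated from pilot OLS residuals."""
    beta_ols = (x @ y) / (x @ x)                 # pilot OLS fit
    Sigma = tapered_cov(y - beta_ols * x, l_n)
    Sigma += 1e-6 * np.eye(len(y))               # tiny ridge for stability
    Sinv = np.linalg.inv(Sigma)
    return (x @ Sinv @ y) / (x @ Sinv @ x)
```

GLSS then ranks the covariates by the magnitude of these marginal GLS estimates and retains the top $d_n$.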
The results are displayed in Table 1. The entries below "Gaussian" correspond to the setting where both $e_i$ and $\eta_i$ are drawn from a Gaussian distribution. Accordingly, the entries under "$t_5$" correspond to the case where $e_i$ and $\eta_i$ are drawn from a $t_5$ distribution. We see from the results that the performance of SIS and GLSS is comparable when $p_n = 1000$, with moderate levels of temporal dependence, along with Gaussian covariates and errors. Interestingly, in this same setting, switching to heavy tails has a much larger effect on the performance of SIS than on that of GLSS. In all cases, the performance of GLSS appears to be robust to the effects of serial correlation in the covariate and the error processes, whereas for SIS the performance severely deteriorates as we increase the level of serial correlation. For example, at our highest levels of serial correlation, SIS nearly always fails to contain the true model.
The results are displayed in Tables 2 and 3, respectively. In scenario A, we have a Toeplitz covariance matrix for the predictors, and moderate levels of serial dependence in the predictors. The trends are similar to the ones we observed in case 1. The performance of SIS is sensitive to the effects of increasing the serial correlation in the errors, with the effect of serial dependence being more pronounced as we encounter heavy tailed distributions. In contrast, increasing the level of serial dependence has a negligible impact on the performance of GLSS. For scenario B, we observe similar trends as in scenario A.
In both scenarios we have a high degree of correlation between the predictors and a low signal to noise ratio, along with mild to moderate levels of serial correlation in the covariate and error processes. The results are displayed in Tables 4 and 5 for scenarios A and B, respectively. We observe that the two stage estimator outperforms the standalone adaptive Lasso in both scenarios, with the difference being more pronounced in scenario B. For both scenarios, going from mild to moderate levels of serial correlation in the errors appears to significantly deteriorate the performance of the adaptive Lasso. This affects our results for the two stage estimator primarily at the second stage of selection. This sensitivity to serial correlation appears to increase as we encounter heavy tailed distributions.

Real Data Example: Forecasting Inflation Rate
In this section we focus on forecasting the 12 month ahead inflation rate. We use two major monthly price indexes as measures of inflation: the consumer price index (CPI), and the producer price index less finished goods (PPI). Specifically, we are forecasting $y^{12}_{t+12} = 100 \times \log(CPI_{t+12}/CPI_t)$, or $y^{12}_{t+12} = 100 \times \log(PPI_{t+12}/PPI_t)$. Therefore the above quantities are approximately the percentage change in CPI or PPI over 12 months. Our data was obtained from the supplement to [32], and it consists of 132 monthly macroeconomic variables from January 1960 to December 2011, for a total of 624 observations. Apart from $\log(CPI)$ and $\log(PPI)$, which we are treating as I(1), the remaining 130 macroeconomic time series have been transformed to achieve stationarity according to [32]. Treating $\log(CPI)$ and $\log(PPI)$ as I(1) has been found to provide an adequate description of the data according to [44], [43], [39].
We consider forecasts from 8 different models. Similar to [39, 44], our benchmark model is an AR(4) model: $\hat{y}^{12}_{t+12} = \hat{\alpha}_0 + \sum_{i=0}^{3} \hat{\alpha}_i y_{t-i}$, where $y_t = 1200 \times \log(CPI_t/CPI_{t-1})$ when forecasting CPI, and $y_t = 1200 \times \log(PPI_t/PPI_{t-1})$ when forecasting PPI. For comparison, we also consider an AR(4) model augmented with 4 factors, where $\hat{F}_t$ are four factors estimated by taking the first four principal components of the 131 predictors along with three of their lags. We also consider forecasts estimated by the Lasso and the adaptive Lasso. Lastly, we include forecasts estimated by the following two stage procedures: GLSS-Lasso, GLSS-adaptive Lasso, SIS-Lasso, and SIS-adaptive Lasso. In our forecasting equation for the penalized regression and two stage forecasts, $x_t$ consists of $y_t$ and three of its lags along with the other 131 predictors and three of their lags; additionally, we also include the first four estimated factors $\hat{F}_t$. Therefore $x_t$ consists of 532 covariates in total. For each of the two stage methods, we set $d_n = \lfloor n/\log(n) \rfloor = 73$ for the first stage screening procedure. For the second stage selection, and the standalone Lasso/adaptive Lasso models, we select the tuning parameters and initial estimators using the approach described in section 6. We utilize a rolling window scheme, where the first simulated out of sample forecast was for January 2000 (2000:1). To construct this forecast, we use the observations between 1960:6 and 1999:1 (the first five observations are used in forming lagged covariates and differencing) to estimate the factors and the coefficients. Therefore, for the models described above, t = 1960:6 to 1998:1. We then use the regressor values at t = 1999:1 to form our forecast for 2000:1. The next window then uses observations from 1960:7 to 1999:2 to forecast 2000:2. Using this scheme, in total we have 144 out of sample forecasts, and for each window we use n = 451 observations for each regression model. The set-up described above allows us to simulate real-time forecasting.
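The rolling window scheme can be sketched generically as follows. This is a simplified numpy sketch of direct h-step-ahead forecasting with a fixed-length estimation window; the OLS fit is a stand-in for whichever estimator (Lasso, adaptive Lasso, or a two stage method) is refit at each window.

```python
import numpy as np

def rolling_direct_forecasts(X, y, window, h):
    """Direct h-step-ahead forecasts from a rolling estimation window.
    At each origin t we regress y[s+h] on X[s] over the most recent
    `window` pairs with s + h <= t, then forecast y[t+h] from X[t]."""
    n = len(y)
    forecasts, targets = [], []
    for t in range(window + h - 1, n - h):
        # estimation sample: the `window` most recent usable pairs
        s = np.arange(t - window - h + 1, t - h + 1)
        Z = np.column_stack([np.ones(len(s)), X[s]])
        coef, *_ = np.linalg.lstsq(Z, y[s + h], rcond=None)
        forecasts.append(np.concatenate([[1.0], X[t]]) @ coef)
        targets.append(y[t + h])
    return np.array(forecasts), np.array(targets)
```

Relative MSE against a benchmark is then computed as the ratio of the mean squared forecast errors of the two methods over the same set of out of sample dates, as reported in Table 6.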
Table 6 shows the mean squared error (MSE) and the mean absolute error (MAE) of the resulting forecasts relative to the MSE and MAE of the baseline AR(4) forecasts. We observe that the two stage GLSS methods clearly outperform the benchmark AR(4) model, and appear to have the best forecasting performance overall for both CPI and PPI, with the difference being more substantial when comparing by MSE. Furthermore, GLSS-Lasso and GLSS-adaptive Lasso do noticeably better than their SIS based counterparts, with the differences being greater when forecasting PPI. We also note that the widely used factor augmented autoregressions do worse than the benchmark AR(4) model.

Discussion
In this paper we have analyzed the sure screening properties of SIS in the presence of dependence and heavy tails in the covariate and error processes. In addition, we have proposed a generalized least squares screening (GLSS) procedure, which utilizes the serial correlation present in the data when estimating our marginal effects. Lastly, we analyzed the theoretical properties of the two stage screening and adaptive Lasso estimator, using the Lasso as our initial estimator. These results will allow practitioners to apply these techniques to many real world applications where the assumption of light tails and independent observations fails. There are plenty of avenues for further research: for example, extending the theory of model-free screening methods such as distance correlation, or robust measures of dependence such as rank correlation, to the setting where we have heavy tails and dependent observations. Other possibilities include extending the theory in this work, or developing new methodology, for long range dependent processes or certain classes of non-stationary processes. Long range dependence is a property which is prominent in a number of fields such as physics, telecommunications, econometrics, and finance (see [40] and references therein). If we assume the error process $(\epsilon_i)$ is long range dependent, then by the proof of Theorem 1 in [52] we have $\Delta_{0,q}(\epsilon) = \infty$. A similar result holds for the covariate process; therefore, we may need to use a new dependence framework when dealing with long range dependent processes. Lastly, developing new methodology which aims to utilize the unique qualities of time series data, such as serial dependence and the presence of lagged covariates, would be a particularly fruitful area of future research.

Proofs of Results
Proof of Theorem 1.
We first prove part (i). We start by obtaining a bound on: Therefore: For the RHS of (30), we obtain: Therefore it suffices to focus on terms (31), (32). For (31), recall that $T_2 = \sum_{t=1}^{n} X_{tj}(x_t^T\beta + \epsilon_t)/n = \sum_{t=1}^{n} X_{tj}(\sum_{k=1}^{p_n} X_{tk}\beta_k + \epsilon_t)/n$. Now we let: By Condition B, $E(X_{tj}\epsilon_t) = 0$, therefore: Recall that $\sum_{k=1}^{p_n} \mathbf{1}_{|\beta_k|>0} = s_n$, thus: Using this, we compute the cumulative functional dependence measure of $X_{tk}X_{tj}$ as: Therefore we obtain: $\sup_m (m+1)^{\alpha_x} \sum_{t=m}^{\infty} \|X_{tj}X_{tk} - X^*_{tj}X^*_{tk}\|_{r/2} \le 2K^2_{x,r}$. Combining this with (36), and Theorem 2 in [53], yields: Similarly for $X_{tj}\epsilon_t$, by using Holder's inequality we obtain: Using Theorem 2 in [53], we obtain: For (32), assuming $E(X_{ij}^2) = O(1)$ for all $j \le p_n$, and $\max_{j \le p_n} E(X_{tj}Y_t) < L < \infty$, we obtain: We then have: We can then bound the above two equations similarly to (38). By combining (33), (35), (38), (40), (41), along with the union bound, we obtain: To prove part (ii), we follow the steps in the proof of Theorem 2 in [36]. Let: On the set $A_n$, by Condition A, we have: Hence, by our choice of $\gamma_n$, we obtain $P(M_* \subset \hat{M}_{\gamma_n}) > P(A_n)$. By applying part (i), the result follows.
For part (iii), we follow the steps in the proof of Theorem 3 in [36]. Using $Var(Y_t), Var(X_{tj}) = O(1)$ for $j \le p_n$, along with Condition B, we obtain: The result then follows from part (i).
Proof of Theorem 2.
We follow the steps from the proof of Theorem 1. Let $T = (T_1, \ldots, T_n)$ where $T_i = X_{ij}X_{ik}$, and let $R = (R_1, \ldots, R_n)$ where $R_i = X_{ij}\epsilon_i$. We need to bound the sums: By Theorem 1 in [49], $\Theta_q(T) \le \Delta_{0,q}(T)$, and from Section 2 in [53]: $\|X_{ij}\|_q \le \Delta_{0,q}(X_j) \le \Phi_{0,q}(x)$. Additionally, by Holder's inequality we have (45). Using these, along with Condition D, we obtain: Combining the above and using Theorem 3 in [53], we obtain: Similarly, using the same procedure, we obtain: Now using the above bounds and following the steps in the proof of Theorem 1, we obtain the results.
Proof of Lemma 1.
By the proof of Theorem 2 in [52], we have (59), and we set $M > \max_{k \le p_n} \max_{i \le l_n} 2|E(\epsilon_{t,k}X_{t+|i|,k})| + \epsilon$, for some $\epsilon > 0$. Similarly we have: where we set $M_1 < \min_{j \le p_n} E(X_{ij}^2) - \epsilon$, for $\epsilon > 0$. The same method we used for (53) can be applied to (56), (57). Using the techniques in the proof of Theorem 1, and (52), we obtain the result. For (ii), we follow the same procedure as in (i), and apply the methods seen in the proof of Theorem 2.
Proof of Theorem 3.
For (i), as before, we start with a bound on: Using Condition E, we can write: After combining this with (14), it suffices to obtain a bound for: Following the steps in the proof of Theorem 1, it suffices to focus on the terms: We then have: We first deal with the term $X_k^T \Sigma_k^{-1} \epsilon_k/n$. We can rewrite this term as $\tilde{X}_k^T \tilde{\epsilon}_k/n$, where $\tilde{X}_k = V_k X_k$, $\tilde{\epsilon}_k = V_k \epsilon_k$, and $V_k$ is a lower triangular matrix and the square root of $\Sigma_k^{-1}$. Ignoring the first $L_k$ observations, we can express: where $(\alpha_{1,k}, \ldots, \alpha_{L_k,k})$ are the autoregressive coefficients of the process $\epsilon_{t,k}$. We compute the cumulative functional dependence measure of $\tilde{X}_{t,k}\tilde{\epsilon}_{t,k}$ as: and by our assumptions $\|\tilde{\epsilon}_{l,k} - \tilde{\epsilon}^*_{l,k}\|_q = 0$ for $l > 0$, from which we obtain: Using Theorem 2 in [53]: For the term $|X_k^T(\hat{\Sigma}^{-1}_{k,l_n} - \Sigma_k^{-1})\epsilon_k/n|$, using the Cauchy-Schwarz inequality: Using (69), we obtain: where the right hand side of (70) is: Let $M = M_1 M_2$, where $M_1 \ge \max_{k \le p_n} E(X_{i,k}^2) + \epsilon$, and $M_2 = \max_{k \le p_n} E(\epsilon_{i,k}^2) + \epsilon$, for some $\epsilon > 0$. The second term of (71) is: We can bound the above using the same techniques as in the previous proofs.
By Condition E, the spectral density of the process $\epsilon_{t,k}$, for all $k \le p_n$, is bounded away from zero and infinity; therefore, the eigenvalues of $\Sigma_k$ are bounded away from zero and infinity [52]. We then use: Let $a_1 \ge a_2 \ge \ldots \ge a_n$ be the ordered eigenvalues of $\Sigma$, and let $a_j = \arg\min_{a_i} |a_i^{-1}|$; using this and (73), (74) we obtain: where $M_3 \in (0, 1-\epsilon)$ for $\epsilon > 0$. We then have: Combining the above with (75) and Lemma 1, we obtain: By (64), (68), (70), (72), (76), we obtain a bound for $P(|T_2 - E(T_2)| > Cn^{-\kappa})$. For the term $P(|T_1 - E(T_1)| > Cn^{-\kappa})$, we proceed in a similar fashion: We can then obtain a bound on the above terms by following a similar procedure as before. Combining these gives us the result for (i). For (ii), using the result from (i), we follow a similar procedure to the proof of Theorem 1. For (iii) and (iv), we follow the same procedure as (i) and (ii), and apply the methods seen in the proof of Theorem 2; we omit the details.
Proof of Corollary 4.
Recall that: Therefore, by our assumption, we have: We obtain the result by following the procedure in the proof of Theorem 1 and using the results from Theorem 3.
Proof of Theorem 5.
For simplicity we only prove part (i); the proof for part (ii) follows similarly. We will work on the following set $D_n = A_n \cap B_n \cap C_n$, where: On the set $A_n$, if we apply screening as a first stage procedure, by our choice of $\gamma_n$ we obtain: Next we need to use Lemmas 7 and 8 in [39]; specifically, we need to show our reduced model satisfies conditions DGP 3, DESIGN, and WEIGHTS in [39]. On the set $B_n$, by Lemma 1 in [39], we have $\phi_{\Sigma_{\hat{M}_{\gamma_n}}}/2 = \phi_{\Sigma_{M_{\gamma_n/2}}} = \phi_0$. Therefore, we have: Using this along with Lemma 1 in [39] and Condition J, we have that DESIGN 3a is satisfied with $\phi_{\min} = \phi_0/16$, where $\inf_{v^T v = 1} v^T \Sigma_{11} v > 2\phi_{\min} > 0$, and $\Sigma_{11}$ is the covariance matrix of the relevant predictors. On the set $D_n$, by Conditions K and L in our work, and Lemma 2 and Proposition 1 in [39], assumption WEIGHTS is satisfied. On the set $A_n \cap B_n$, DGP 3 and DESIGN 3b are satisfied, while DESIGN 2 is satisfied by Condition L. Now, by Proposition 2 and Lemmas 7 and 8 in [39], we obtain: $P(\mathrm{sgn}(\hat{\beta}_{\hat{M}_{\gamma_n}}) = \mathrm{sgn}(\beta)) \ge P(A_n \cap B_n \cap C_n) \ge 1 - P(A_n^c) - P(B_n^c) - P(C_n^c)$ (79). $P(A_n^c)$ is given in Theorem 1 part (i). For $P(B_n^c)$, using the method in the proof of Theorem 1, we obtain: And for $P(C_n^c)$: To prove part (ii), we follow the same steps as part (i). We obtain $P(A_n^c)$, $P(B_n^c)$, $P(C_n^c)$ by following the method in the proof of Theorem 2, and using Theorem 3 in [53].

Asymptotic Distribution of GLS estimator
By the proof of Theorem 3, both these conditions are satisfied; therefore $\hat{\beta}_{M_k}$ and $\tilde{\beta}_{M_k}$ have the same asymptotic distribution. We use the above lemma, and rely on the asymptotic distribution of $\tilde{\beta}_{M_k}$, to provide an explanation for the superior performance of GLSS, and its robustness to increasing levels of serial correlation in $\epsilon_{t,k}$. We deal with three cases, and we assume an AR(1) process for the errors for simplicity and ease of presentation. The results can be generalized to AR(p) processes by using the moving average representation of $\epsilon_{t,k}$:

Case 1:
We start with the setting used in Figure 1: assume $x_{t,k}$ is iid and $\epsilon_{t,k} = \alpha \epsilon_{t-1,k} + e_t$, with $x_{t,k}$ and $\epsilon_{t,k}$ being independent for all $t$. Using Gordin's central limit theorem [28], we calculate the asymptotic distribution of the OLS estimator, $\sqrt{n}(\hat{\beta}_{M_k} - \beta_{M_k}) \to N(0, J)$, where $J = \sigma_e^2/((1-\alpha^2)\,\mathrm{Var}(x_{t,k}))$. Therefore the variance of the OLS estimator increases without bound as $\alpha$ increases towards 1, whereas the variance of the GLS estimator actually decreases as $\alpha$ increases.
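This comparison is easy to check numerically. Below is a small Monte Carlo sketch of Case 1 (iid $x_{t,k}$, AR(1) errors), comparing the sampling variance of the OLS slope with that of the infeasible GLS slope computed from the $\alpha$-differenced data; the particular parameter values (unit innovation variances, $\alpha = 0.9$) are illustrative.

```python
import numpy as np

def one_rep(rng, n=200, alpha=0.9, beta=1.0):
    """One replication: OLS and infeasible GLS slopes under AR(1) errors."""
    x = rng.standard_normal(n)
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = alpha * e[t - 1] + rng.standard_normal()
    y = beta * x + e
    b_ols = (x @ y) / (x @ x)
    # infeasible GLS: quasi-difference both sides with the true alpha,
    # which whitens the AR(1) error process
    xt = x[1:] - alpha * x[:-1]
    yt = y[1:] - alpha * y[:-1]
    b_gls = (xt @ yt) / (xt @ xt)
    return b_ols, b_gls

rng = np.random.default_rng(0)
reps = np.array([one_rep(rng) for _ in range(500)])
var_ols, var_gls = reps.var(axis=0)
```

With $\alpha = 0.9$, the empirical OLS variance comes out roughly an order of magnitude larger than the GLS variance, in line with the asymptotic comparison above.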

Case 2:
We expand this to the case where $x_{t,k}$ is temporally dependent; for simplicity, we let $x_{t,k} = \phi x_{t-1,k} + \eta_t$. We still assume $x_{j,k}$ and $\epsilon_t$ are independent for all $j, t$, and $\epsilon_{t,k} = \alpha \epsilon_{t-1,k} + e_t$. This is the setting for the first model in the simulations section. Using Gordin's central limit theorem, and elementary calculations, $\sqrt{n}(\hat{\beta}_{M_k} - \beta_{M_k}) \to N(0, J)$. We clearly see that for fixed $\phi$, the GLS estimate is robust to increasing $\alpha$, whereas the variance of the OLS estimator increases without bound as $\alpha$ increases towards 1. This sensitivity to $\alpha$ provides an explanation for the results seen in case 1 of the simulations, which show the performance of SIS severely deteriorates for high levels of serial correlation in $\epsilon_{t,k}$.

Case 3:

In both the previous cases, it is easy to see the GLS estimator is asymptotically efficient relative to the OLS estimator. For the case where $X_k = (x_{t,k}, t = 1, \ldots, n)$ and $\epsilon_k = (\epsilon_{t,k}, t = 1, \ldots, n)$ are dependent on each other, it is more complicated. In this setting, it is likely the case that $\rho_k \ne \beta_{M_k}$. Assume $\epsilon_{t,k} = \alpha \epsilon_{t-1,k} + e_t$, let $\tilde{x}_{t,k} = x_{t,k} - \alpha x_{t-1,k}$, and let $W_1 = \sum_{i=-\infty}^{\infty} \gamma(i)$, where $\gamma(i) = \mathrm{cov}(\tilde{x}_{t,k} e_t, \tilde{x}_{t-i,k} e_{t-i})$. We start by examining the asymptotic distribution $\sqrt{n}(\hat{\beta}_{M_k} - \beta_{M_k}) \to N(0, J)$, where $J = W_1/(\mathrm{Var}(\tilde{x}_{t,k}))^2$. By the proof of Theorem 1 in [52], and from these results, we see that the asymptotic variance of the GLS estimator is bounded when $\alpha$ increases towards 1, and is largely robust to increasing levels of serial correlation in $\epsilon_{t,k}$. This result seems to provide an explanation for GLSS being robust to increasing levels of serial correlation in our simulations.
For the OLS estimator we obtain $\sqrt{n}(\hat{\rho}_k - \rho_k) \to N(0, V)$, where $V = W_2/(\mathrm{Var}(x_{t,k}))^2$. We see the above bound is very sensitive to increasing serial correlation in $\epsilon_{t,k}$. Although this is an upper bound to the asymptotic variance, it seems to explain the deterioration in performance of SIS when increasing the serial correlation of $\epsilon_{t,k}$ in our simulations.
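Returning to Case 2, the omitted variance for the OLS estimator can be recovered from Gordin's central limit theorem. A sketch of the computation, under the stated independence of $(x_{t,k})$ and $(\epsilon_{t,k})$, and writing $\sigma_\eta^2$ and $\sigma_e^2$ for the innovation variances (our notation, not the paper's display):

```latex
\begin{align*}
\sqrt{n}(\hat{\beta}_{M_k} - \beta_{M_k}) &\to N(0, J), \qquad
J = \frac{\sum_{i=-\infty}^{\infty} \mathrm{cov}(x_{t,k}\epsilon_{t,k},\, x_{t-i,k}\epsilon_{t-i,k})}{(\mathrm{Var}(x_{t,k}))^2}
  = \frac{\sum_{i=-\infty}^{\infty} \gamma_x(i)\,\gamma_\epsilon(i)}{(\mathrm{Var}(x_{t,k}))^2}, \\
\gamma_x(i) &= \frac{\phi^{|i|}\sigma_\eta^2}{1-\phi^2}, \qquad
\gamma_\epsilon(i) = \frac{\alpha^{|i|}\sigma_e^2}{1-\alpha^2}, \qquad
\sum_{i=-\infty}^{\infty} (\phi\alpha)^{|i|} = \frac{1+\phi\alpha}{1-\phi\alpha}, \\
J &= \frac{\sigma_e^2\,(1-\phi^2)(1+\phi\alpha)}{\sigma_\eta^2\,(1-\alpha^2)(1-\phi\alpha)}
\to \infty \quad \text{as } \alpha \to 1, \text{ for fixed } \phi.
\end{align*}
```

The factorization $\mathrm{cov}(x_{t,k}\epsilon_{t,k}, x_{t-i,k}\epsilon_{t-i,k}) = \gamma_x(i)\gamma_\epsilon(i)$ uses the independence of the two mean-zero processes, and the $(1-\alpha^2)$ factor in the denominator is what drives the unbounded growth of the OLS variance as $\alpha \to 1$.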

Fig 1. GLS vs OLS error comparison for values of ρ between .5 and .95, incrementing by .05. Absolute error averaged over 200 replications.

Lemma 2. Assume Conditions E, F, G, H hold; then $\sqrt{n}(\hat{\beta}_{M_k} - \beta_{M_k})$ and $\sqrt{n}(\tilde{\beta}_{M_k} - \beta_{M_k})$ have the same asymptotic distribution.

Proof of Lemma 2. It is clear that sufficient conditions for the feasible GLS estimator $\hat{\beta}_{M_k}$ and $\tilde{\beta}_{M_k}$ to have the same asymptotic distribution are [15]:

Table 2. Case 2: Scenario A

Table 4. Case 3: Scenario A