Spectral Analysis of High-dimensional Time Series

A useful approach for analysing multiple time series is to characterise their spectral density matrix, the frequency domain analog of the covariance matrix. When the dimension of the time series is large compared to their length, regularisation-based methods can overcome the curse of dimensionality, but the existing ones lack theoretical justification. This paper develops the first non-asymptotic result characterising the difference between the sample and population versions of the spectral density matrix, allowing one to justify a range of high-dimensional models for analysing time series. As a concrete example, we apply this result to establish the convergence of the smoothed periodogram estimators and of sparse estimators of the inverses of spectral density matrices, namely precision matrices. These results, novel in frequency-domain time series analysis, are corroborated by simulations and an analysis of the Google Flu Trends data.


Introduction
Spectral density matrices play a central role in characterising the second order properties of multivariate time series. The spectral density matrix is the frequency domain analog of the covariance matrix, and describes the variance in each dimension, or the covariance between dimensions, that can be attributed to oscillations in the data within certain frequencies. Just as partial correlations between the dimensions can be extracted as a function of the inverse of a covariance matrix, conditional relationships attributable to variations in the oscillations of the data can be obtained from the inverse of the spectral density matrix [12]. Thus, it is necessary to obtain a positive-definite estimate of the spectral density matrix, but this can be challenging whenever the dimensionality of the time series is relatively large compared to the length of the time series.
The utility of estimating the inverse of the spectral density matrix is in estimating the frequency domain analog of the partial correlation structure, known as partial coherence, of the multivariate time series. This has been useful in many different contexts, including intensive care monitoring [19], brain connectivity in neuroimaging studies [36], signal processing [27], hydrology [24], and cardiology [34]. High-dimensional time series arise in various scientific contexts, such as in recordings of brain activity from numerous regions of the brain [36] or financial time series obtained from a large number of assets in a portfolio [14]. However, estimating the conditional relationships between the dimensions of the time series using partial coherence can be challenging in finite samples.
There have only been a few papers dedicated to developing rigorous theory in the context of high-dimensional time series. For instance, [13] and [21] both developed methods to give sparse estimates of the parameters of a vector autoregressive (VAR) model. [2] studied the theoretical properties of regularised estimates of the parameters of a broad class of time series models. [38] considered the estimation of large covariance and precision matrices from high-dimensional sub-Gaussian or heavier-tailed observations with slowly decaying temporal dependence. [11] studied a Dantzig-selector type regularised estimator for linear functionals of high-dimensional linear processes. These recent works, however, focused primarily on time series models in the time domain.
There have been developments in frequency domain methodologies, though the proposed methodologies lack theoretical justification. For instance, [17], [18], and [37] developed variations of a shrinkage framework developed by [4] for data-driven $\ell_2$-penalised estimation of the spectral density matrix, and illustrated their utility for connectivity analysis in neuroimaging data; [25] developed a graphical lasso approach for estimating a graphical model for high-dimensional time series data in the spectral domain; [1] utilised a dynamic factor model to study the network structure of high-dimensional financial time series. These methodological papers have deep roots in application areas where there is clear demand for estimating high-dimensional spectral density matrices and their inverses, the latter being necessary to quantify the conditional dependency structure of the data. However, there remains a critical gap in theoretical investigations of frequency domain methodologies for estimating the inverse of high-dimensional spectral density matrices.
The aim of this paper is to study the theoretical behaviours of estimators of the spectral density matrix and its inverse in high dimension. We summarise the main contributions of this paper.
First, arguably the most important ingredient in high-dimensional statistical inference, in contrast with classical settings, is fixed-sample results. To be specific, in order to allow for high dimensions, a common practice is to exploit concentration inequalities to obtain fixed-sample control of the differences between the sample and population versions, and then to use union bound arguments to derive the desired results. To the best of our knowledge, this paper is the first to show such fixed-sample results on the error control of smoothed periodogram matrices, in Theorem 3.1. This is a challenging task, and the main difficulty is that textbook results on frequency domain time series are limited to asymptotic results only [5, 6].
Second, once the fixed-sample results are established, a wide range of high-dimensional statistical methods are ready to be justified, including estimation, prediction and inference tools. In this paper, we use the sparse precision matrix estimation problem as an example, and demonstrate the theoretical (see Proposition 3.2) and numerical performance of applying the constrained $\ell_1$-minimisation for inverse matrix estimation [clime, 8] to spectral analysis of time series data. We emphasise that the possible applications of Theorem 3.1 extend well beyond Proposition 3.2; we use sparse precision matrix estimation merely as an example.
The rest of this paper is organised as follows. In Section 2, we explain the methodology used in this paper. The theoretical results are collected in Section 3, including two main theorems. The technical details thereof can be found in the Appendix. In Section 4, we demonstrate the numerical performances of our proposed methods, via simulations and real data analysis.

Framework and notation
In order to study the theoretical performances, we adopt the functional dependence framework of [45]. Let $X_t = (X_{t,1}, \ldots, X_{t,p})^\top \in \mathbb{R}^p$ be centred random vectors satisfying
$$X_t = g(\mathcal{F}_t), \quad t \in \mathbb{Z}, \eqno(2.1)$$
where the $e_t$'s are i.i.d. random vectors, $\mathcal{F}_t = (\ldots, e_{t-1}, e_t)$ and $g(\cdot)$ is a measurable function. Let $\mathcal{F}_t'$ be a coupled version of $\mathcal{F}_t$ in which $e_0$ is replaced by an i.i.d. copy $e_0'$, and let $X_t' = g(\mathcal{F}_t')$. For $q \geq 1$, define
$$\delta_{t,q,i} = \left(\mathbb{E}\left|X_{t,i} - X_{t,i}'\right|^q\right)^{1/q}, \eqno(2.2)$$
which is used as a dependence measure. It has been pointed out in [45] that a large family of common time series models, including linear processes, Volterra series, nonlinear transforms of linear processes and some nonlinear time series models, can be characterised by imposing proper conditions on (2.2). As a price to pay for this generality, the functional dependence assumption does exclude long-range memory models, which can be handled by the temporal dependence structure proposed in [38]. For the rest of this paper, for any vector $v = (v_1, \ldots, v_m)^\top \in \mathbb{C}^m$, let $\|v\|_q := \left(\sum_{i=1}^m |v_i|^q\right)^{1/q}$ be the $\ell_q$-norm of $v$; for any matrix $A = (A_{ij})_{i,j=1}^p \in \mathbb{C}^{p \times p}$, let $\|A\|_w := \sup_{v : \|v\|_w \leq 1} \|Av\|_w$. We use the sparsity definition in [9] to characterise the sparsity of precision matrices, i.e. let the parameter space be
$$\mathcal{G}_q(c_{n,p}, M_{n,p}) = \Big\{\Theta : \Theta \succ 0, \; \|\Theta\|_1 \leq M_{n,p}, \; \max_{1 \leq i \leq p} \sum_{j=1}^p |\Theta_{ij}|^q \leq c_{n,p}\Big\},$$
where $0 \leq q < 1$, and $c_{n,p}$ and $M_{n,p}$ potentially diverge as $n$ and $p$ grow.
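To make the notation concrete, the following NumPy sketch (our own illustration, not part of the paper's methodology) computes the vector $\ell_q$-norm and the induced operator norm for $w = 1$, which for a matrix reduces to the maximum absolute column sum:

```python
import numpy as np

def vector_q_norm(v, q):
    """The l_q-(quasi-)norm (sum_i |v_i|^q)^(1/q); a norm for q >= 1."""
    return np.sum(np.abs(v) ** q) ** (1.0 / q)

def operator_one_norm(A):
    """Operator norm induced by the l_1 vector norm: max absolute column sum."""
    return np.abs(A).sum(axis=0).max()

v = np.array([3.0, 4.0])
print(vector_q_norm(v, 2))        # Euclidean norm: 5.0

A = np.array([[1.0, -2.0],
              [0.0,  3.0]])
print(operator_one_norm(A))       # max column sum: |-2| + |3| = 5.0
```

The induced $1$-norm equals the maximum absolute column sum because the supremum over the $\ell_1$ unit ball is attained at a standard basis vector.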

The sparse inverse periodogram estimator
In the following sections, we define the spectral density matrix, introduce estimators thereof, and propose a method to estimate the inverse of the high-dimensional spectral density matrix at any arbitrary frequency. We convert the time domain data into the frequency domain using the discrete Fourier transform, which results in complex-valued vectors. Motivated by the properties of the complex-valued normal distribution, we separate the real and imaginary parts of the transformed data and double the dimension of the vectors. At each frequency point, we adopt a moving window and construct an estimator of the inverse of the periodogram, based on the clime estimator proposed in [8]. The detailed algorithm is given in Algorithm 1. We first state our algorithm, and then explain the details regarding the smoothed periodogram and its inverse in Sections 2.3 and 2.4, respectively.
In Algorithm 1, $\Re(\cdot)$ and $\Im(\cdot)$ denote the real and imaginary parts of an object, respectively, and preserve the format of the object. In our case, the input is $D_{\mathbb{C}} \in \mathbb{C}^{n \times p}$, and therefore $\Re(D_{\mathbb{C}}), \Im(D_{\mathbb{C}}) \in \mathbb{R}^{n \times p}$. As for the algorithm clime, see Section 2.4 and [8] for details.
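The front end of this pipeline, i.e. the discrete Fourier transform of the data followed by the separation into real and imaginary parts, can be sketched as follows (a minimal illustration assuming the normalisation $n^{-1/2}$; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.standard_normal((n, p))          # time series: one column per dimension

# Discrete Fourier transform of each coordinate series; D_C is complex-valued.
D_C = np.fft.fft(X, axis=0) / np.sqrt(n)

# Separate real and imaginary parts, preserving the n x p format.
D_re, D_im = D_C.real, D_C.imag
assert D_re.shape == (n, p) and D_im.shape == (n, p)
```

Each row of `D_C` corresponds to one Fourier frequency $\omega_j = j/n$; the real and imaginary parts are then processed jointly, doubling the dimension.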

Real-valued smoothed periodogram estimators
Let $\{X_t\}_{t \in \mathbb{Z}}$ be a $p$-variate, mean zero, stationary, real-valued time series with autocovariance matrix function $\Gamma(h) = \mathrm{Cov}(X_t, X_{t+h}) = \mathbb{E}(X_t X_{t+h}^\top)$, for $h \in \mathbb{Z}$. Under these conditions, $\{X_t\}$ has a continuous spectral density matrix given by
$$f(\omega) = \sum_{h \in \mathbb{Z}} \Gamma(h) e^{-2\pi \imath \omega h}, \quad \omega \in [-1/2, 1/2].$$
Given an interval of the whole time series, namely $\{X_t\}_{t = 1, \ldots, n}$, the periodogram is defined at the Fourier frequencies $\omega_j = j/n$, $j = -\lfloor (n-1)/2 \rfloor, \ldots, \lfloor n/2 \rfloor$, as
$$P_n(\omega_j) = d(\omega_j) d(\omega_j)^*, \quad \text{where } d(\omega_j) = n^{-1/2} \sum_{t=1}^n X_t e^{-2\pi \imath \omega_j t},$$
and for any complex-valued vector $v$, $v^* = \bar{v}^\top$ denotes the conjugate transpose of $v$.
When $p = 1$, it is known that $\mathbb{E}(P_n(\omega))$ converges uniformly to $f(\omega)$ on $[-1/2, 1/2]$ [e.g. 6, Proposition 10.3.1], but $P_n(\omega)$ does not converge in probability to $f(\omega)$ as $n \to \infty$ [e.g. 6, Theorem 10.3.2]. A common remedy is to use the smoothed periodogram, given by
$$\hat{f}_n(\omega_j) = \frac{1}{2M_n + 1} \sum_{k = -M_n}^{M_n} P_n(\omega_{j+k}),$$
where $M_n$ is the smoothing window size. When $p = 1$, it can be shown that $\hat{f}_n(\omega)$ is consistent for $f(\omega)$ if $M_n \to \infty$ and $M_n / n \to 0$ as $n \to \infty$. When $p \to \infty$ as $n \to \infty$, we are interested in the conditional dependence structures of the pairs of coordinates: defining $\Theta(\omega) = f(\omega)^{-1}$, our goal is to provide a sparse estimator of $\Theta(\omega)$ with desirable large-sample properties. Note that both $f(\omega)$ and $\Theta(\omega)$ are complex-valued matrices. To ease the following discussion, we first transform them into real-valued matrices.
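The periodogram and its smoothed version translate directly into code. Below is a minimal sketch (our own illustration), assuming the convention $P_n(\omega_j) = d(\omega_j) d(\omega_j)^*$ with $d(\omega_j) = n^{-1/2} \sum_t X_t e^{-2\pi \imath \omega_j t}$, which `np.fft.fft` implements up to the $n^{-1/2}$ normalisation; the smoothing window wraps around the frequency grid:

```python
import numpy as np

def periodogram(X):
    """P_n(w_j) = d(w_j) d(w_j)^* at the Fourier frequencies w_j = j/n."""
    n, p = X.shape
    d = np.fft.fft(X, axis=0) / np.sqrt(n)        # rows: frequencies
    # Stack of n rank-one p x p matrices: F[j, i, k] = d[j, i] * conj(d[j, k]).
    return np.einsum('ji,jk->jik', d, d.conj())

def smoothed_periodogram(X, M):
    """Average P_n over the 2M + 1 neighbouring frequencies (wrapping around)."""
    P = periodogram(X)
    n = P.shape[0]
    idx = (np.arange(-M, M + 1)[:, None] + np.arange(n)[None, :]) % n
    return P[idx].mean(axis=0)                     # n stacked smoothed matrices

rng = np.random.default_rng(1)
X = rng.standard_normal((256, 3))
F = smoothed_periodogram(X, M=8)
# Each smoothed matrix is Hermitian by construction (a sum of d d^* terms).
assert np.allclose(F, F.conj().transpose(0, 2, 1))
```

Because each smoothed matrix is an average of rank-one Hermitian positive semi-definite terms, it is positive semi-definite, and positive definite (hence invertible) only when $2M_n + 1 \geq p$, which motivates regularisation in high dimensions.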
For any $j = -\lfloor (n-1)/2 \rfloor, \ldots, \lfloor n/2 \rfloor$ and $\omega_j = j/n$, write $f(\omega_j) = C_j + \imath D_j$ with real matrices $C_j$ and $D_j$. It follows from Lemma A.1 in the Appendix that $\Theta(\omega_j)$ has the form $A_j + \imath B_j$, where $A_j$ and $B_j$ satisfy
$$C_j A_j - D_j B_j = I_p \quad \text{and} \quad C_j B_j + D_j A_j = 0.$$
Therefore, our problem is transformed to finding the inverse of the real-valued $2p \times 2p$ matrix
$$\Sigma_j = \begin{pmatrix} C_j & -D_j \\ D_j & C_j \end{pmatrix}, \quad \text{with} \quad \Sigma_j^{-1} = \begin{pmatrix} A_j & -B_j \\ B_j & A_j \end{pmatrix}.$$
Hence, for any $j = -\lfloor (n-1)/2 \rfloor, \ldots, \lfloor n/2 \rfloor$ and $\omega_j = j/n$, instead of directly studying $\hat{f}_n(\omega_j)$, our targets are now $\Sigma_j$ and its sample version $\hat{\Sigma}_j$, obtained by replacing $f(\omega_j)$ with $\hat{f}_n(\omega_j)$.
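The fact that inverting the real-valued $2p \times 2p$ embedding of a complex matrix recovers the real and imaginary parts of its complex inverse can be checked numerically. In the sketch below (our own; an arbitrary Hermitian positive-definite matrix stands in for $f(\omega_j)$), the embedding maps $Z = C + \imath D$ to the block matrix $[[C, -D], [D, C]]$:

```python
import numpy as np

def realify(Z):
    """Embed a complex p x p matrix Z = C + iD as the real [[C, -D], [D, C]]."""
    C, D = Z.real, Z.imag
    return np.block([[C, -D], [D, C]])

rng = np.random.default_rng(2)
p = 3
# A Hermitian positive-definite matrix standing in for a spectral density.
G = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
f = G @ G.conj().T + p * np.eye(p)

# Inverting the real 2p x 2p embedding recovers Theta = f^{-1}:
# the top-left block is Re(Theta), the bottom-left block is Im(Theta).
S_inv = np.linalg.inv(realify(f))
Theta = np.linalg.inv(f)
assert np.allclose(S_inv[:p, :p], Theta.real)
assert np.allclose(S_inv[p:, :p], Theta.imag)
```

This works because `realify` is a ring homomorphism from complex $p \times p$ matrices to real $2p \times 2p$ matrices, so the embedding of the inverse equals the inverse of the embedding.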

Penalised precision matrices at every frequency point
Now we have a sequence of expanded but real-valued smoothed periodogram matrices at every frequency point, i.e. $\hat{\Sigma}_j$, $j = -\lfloor (n-1)/2 \rfloor, \ldots, \lfloor n/2 \rfloor$. For each one, our goal is to obtain a sparse inverse matrix. In the last decade, a number of statistical methods have been proposed to achieve this goal, including the graphical Lasso [e.g. 48], node-wise regression [e.g. 30], constrained $\ell_1$-minimisation for inverse matrix estimation [clime, 8], adaptive clime [9] and the innovated scalable efficient estimation [15], among others.
In this paper, we do not intend to compare different sparse precision matrix estimation methods; we apply the clime method for the sake of simplicity in the technical details, and to provide an example of consistent sparse precision matrix estimation in the high-dimensional frequency domain time series context. For details of the clime method, we refer readers to [8], which studies the inverse of covariance matrices, and in which the sparse precision matrix estimators are obtained from the sample covariance matrices of i.i.d. random vectors. For completeness, we include the definition of the estimators. For each $j$, let
$$\hat{\Theta}_j = \arg\min_{\Theta} \|\Theta\|_1 \quad \text{subject to} \quad \big\|\hat{\Sigma}_j \Theta - I_{2p}\big\|_\infty \leq \lambda,$$
where $\lambda > 0$ is a tuning parameter and $\|\cdot\|_\infty$ denotes the entry-wise supremum norm. In practice, one can also symmetrise the estimator and obtain $\hat{\Theta}_j^s = (\hat{\theta}^s_{kl})$ with
$$\hat{\theta}^s_{kl} = \hat{\theta}_{kl} \mathbf{1}\big\{|\hat{\theta}_{kl}| \leq |\hat{\theta}_{lk}|\big\} + \hat{\theta}_{lk} \mathbf{1}\big\{|\hat{\theta}_{kl}| > |\hat{\theta}_{lk}|\big\},$$
i.e. keeping, entry-wise, the estimate of smaller magnitude.
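The clime optimisation decouples into one linear programme per column: minimise the $\ell_1$-norm of the column subject to the entry-wise constraint. The following sketch (our own, using `scipy.optimize.linprog`; a small symmetric matrix stands in for $\hat{\Sigma}_j$, and the magnitude-based symmetrisation is included) illustrates the idea via the standard split $b = u - v$ with $u, v \geq 0$:

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(Sigma, i, lam):
    """One clime column: min ||b||_1 s.t. ||Sigma b - e_i||_inf <= lam."""
    p = Sigma.shape[0]
    e = np.zeros(p); e[i] = 1.0
    c = np.ones(2 * p)                           # objective: sum(u) + sum(v)
    A = np.vstack([np.hstack([Sigma, -Sigma]),   #  Sigma(u - v) <= e + lam
                   np.hstack([-Sigma, Sigma])])  # -Sigma(u - v) <= lam - e
    b_ub = np.concatenate([e + lam, lam - e])
    res = linprog(c, A_ub=A, b_ub=b_ub, bounds=[(0, None)] * (2 * p))
    u, v = res.x[:p], res.x[p:]
    return u - v

def clime(Sigma, lam):
    p = Sigma.shape[0]
    B = np.column_stack([clime_column(Sigma, i, lam) for i in range(p)])
    # Symmetrise by keeping, entry-wise, the estimate of smaller magnitude.
    return np.where(np.abs(B) <= np.abs(B.T), B, B.T)

Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
Theta_hat = clime(Sigma, lam=0.05)
assert np.allclose(Theta_hat, Theta_hat.T)
```

In the paper's setting, `Sigma` would be the $2p \times 2p$ real-valued smoothed periodogram matrix $\hat{\Sigma}_j$ at each frequency point.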

Theory
In Theorem 3.1, we provide fixed-sample results for the spectral density matrix of a high-dimensional time series, in the form of an entry-wise error control between the smoothed periodogram estimator and the spectral density matrix. This is a fundamental step in analysing many different types of high-dimensional statistical problems. To theoretically justify the sparse precision matrix estimator proposed in Section 2, and, more importantly, to demonstrate the power of Theorem 3.1, we show in Proposition 3.2 the uniform consistency of the sequence of precision matrix estimators $\{\hat{\Theta}_j\}$. As pointed out in Section 2.1, in order to provide the desired results, we use the functional dependence framework described by (2.1) and (2.2). To further characterise the dependence, we introduce Assumption 1, which is also used in [11]; we refer interested readers there for examples.
The assumption involves constants $\kappa > 0$ and $C_0 > 0$ that control the moments and the decay of the dependence measure (2.2). Note that the fixed-sample result holds for any dimensionality, but in order to achieve the desired consistency results, we need extra conditions on the dimensionality of the data, detailed in Assumption 2. We can in fact handle a super-polynomial rate of $n$ for $p$, but to be specific, we assume that $p$ grows at most at a polynomial rate in $n$, as described in Assumption 2.

Assumption 2. Assume:
• there exists a constant $c > 0$ such that $p \leq c n^r$ for some $r > 0$.

Assumption 3 is only used to achieve the consistency of the sparse precision matrix estimators in Proposition 3.2. Under Assumption 2, Equation (3.1) holds even when the $\ell_1$- and $\ell_q$-norms of $\Theta_j$, $j = 1, \ldots, n$, diverge as $n$ grows unbounded. Therefore, Assumption 3 is a reasonably weak condition.

Assumption 3. Recall the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$ defined in Section 2.1. Assume that, for some $q \in [0, 1)$, $\Theta_j \in \mathcal{G}_q(c_{n,p}, M_{n,p})$ for every $j$.

In Theorem 3.1, $C_2 > 0$ is a constant with respect to $n$, depending on $C_1$ and $H$; if we further assume Assumption 2, then the uniform consistency result (3.1) follows. The fixed-sample result in (3.2) holds for any choice of the sample size $n$, the dimensionality $p$ and the smoothing window size $M_n$. It holds in the functional dependence framework detailed in Assumption 1, and provides an entry-wise error control between the smoothed periodogram and the spectral density matrix. We adopt a union bound argument to handle the dimensionality and to provide a uniform result across the sampled frequency points.
It is worth mentioning that the probability upper bound allows for any $H > 0$, which allows the dimensionality to diverge at an arbitrary polynomial rate as the sample size diverges. This is made explicit in Assumption 2.
The detailed proof of Theorem 3.1 is in the Appendix; here we briefly sketch it. We start with a fixed frequency point and a fixed entry of the matrix. In order to bound the error between the smoothed periodogram matrix $\hat{f}_n$ and the spectral density matrix $f$, we introduce a series of instrumental quantities, including an $m$-dependent series constructed via conditional expectations, its truncated version, truncated in magnitude at $(\log(n))^2$, and a centred version obtained by subtracting the unconditional expectations. The majority of the effort is therefore dedicated to bounding the differences between all of these quantities. Applying the triangle inequality yields the desired result for a fixed frequency point and a fixed matrix entry. Finally, we apply a union bound argument to obtain (3.2).
In (3.3), $C_2 > 0$ is a constant with respect to $n$, depending on $C_1$ and $H$; if we further assume Assumptions 2 and 3, then consistency follows. Proposition 3.2 is an application of Theorem 3.1 to sparse precision matrix estimation. The proof is in fact straightforward, based on (3.2) and the proof techniques developed in [8]. Since it builds upon Theorem 3.1, we retain the same flexibility: in (3.3), $H$ is allowed to be any positive value, and therefore the dimensionality $p$ is allowed to grow at an arbitrary polynomial rate of the sample size $n$. In practice, the window width $M_n$ and the penalty level $\lambda$ are chosen via data-driven methods, elaborated in Section 4.

Remark 1. For the interest of the periodogram itself, Theorem 3.1 can be interpreted as an error control of the smoothed periodogram estimators in terms of the entry-wise sup-norm. In addition, for a general matrix $w$-norm, one can obtain an error control of the inverse, $\hat{\Theta}(\omega_j)$, by exploiting Theorem 2.5 in [39]; the case $w = 2$ in particular has been pointed out in [3].

Simulations
In this section, we verify our proposed methodology using simulated data. We consider multivariate time series of dimension p = 10 or 50 with sample size n = 200 or 400. These are challenging scenarios for spectral analysis, because the amount of data available to estimate the spectral density matrix and its inverse is governed by the smoothing span 2M_n + 1 used to smooth the periodogram matrix, not by the length of the time series. In our simulations, we choose M_n using the generalised cross-validation (GCV) criterion developed by [33], and the penalty level in clime by the Stability Approach to Regularization Selection (StARS) method [28]. We construct the smoothed periodogram matrix $\hat{f}_n(\omega)$, calculate its inverse (whenever possible), and use these estimators to assess relative performance.
We investigated multiple scenarios in this study: we simulated from (1) a p-variate Gaussian white noise model, (2) a p-variate first-order vector autoregressive (VAR(1)) model, whose parameters we give below, (3) a p-variate VAR(1) model whose conditional dependence structure between the dimensions is driven by a sparse precision matrix of the innovations, and (4) a p-variate vector autoregressive moving average (VARMA) model, whose parameters we give below.
Setting (1) allows us to see how our methodology performs relative to the smoothed periodogram matrix in a very simple scenario where the spectral density matrix and its inverse do not change across frequencies, which allows us to evaluate relative performance only as a function of dimensionality. Setting (2) allows us to see how our methodology performs when the data exhibit some degree of autocorrelation and lagged cross-correlation. To construct the VAR(1) model, we set the p × p coefficient matrix to be a banded matrix with diagonal entries equal to 0.5, and for the jth row, j ∈ {1, . . . , p − 2}, we set the (j+1)th column to be −0.3 and the (j+2)th column to be 0.2. We use the identity matrix as the covariance matrix for the innovations in the model. Setting (3) creates heterogeneity in the marginal variances, and hence in the diagonal elements of the spectral density matrix, but the truth has a sparse conditional dependence structure. In particular, we let the VAR(1) coefficient matrix be a diagonal matrix with entries randomly selected from the interval (0.25, 0.75), with a random sign. The precision matrix for the innovations vector is sparse, with off-diagonal elements equal to 0 or 0.5 with probability 0.5. Setting (4) allows us to see how our methodology performs when the data generating mechanism has a relatively more complicated dependence structure. We borrow the parameter settings for the VARMA(2,2) model from [44]. Namely, for the autoregressive part of the model, we use diagonal coefficient matrices such that the diagonal element of the k-th coefficient matrix is equal to 0.4/k. For the moving average part of the model, the k-th coefficient matrix is block diagonal with 5 × 5 lower triangular blocks with value 0.3/k on the diagonals and 0.3/(2k) on the off-diagonals. We use the identity matrix as the covariance matrix for the innovations in the model.
Thus, the dependencies between dimensions is null between the blocks and is non-null within the blocks.
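The VAR(1) design of Setting (2) can be reproduced with a short simulation (a sketch under the stated parameters; function names are ours). Since the coefficient matrix is upper triangular, its eigenvalues all equal 0.5, so the process is stationary:

```python
import numpy as np

def banded_var1_coefficients(p):
    """Setting (2): 0.5 on the diagonal; for rows j = 1, ..., p - 2,
    -0.3 in column j + 1 and 0.2 in column j + 2 (1-indexed)."""
    A = 0.5 * np.eye(p)
    for j in range(p - 2):
        A[j, j + 1] = -0.3
        A[j, j + 2] = 0.2
    return A

def simulate_var1(A, n, burn=200, seed=0):
    """X_t = A X_{t-1} + e_t with identity innovation covariance."""
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    x = np.zeros(p)
    out = np.empty((n, p))
    for t in range(burn + n):
        x = A @ x + rng.standard_normal(p)
        if t >= burn:
            out[t - burn] = x
    return out

A = banded_var1_coefficients(10)
# Stationarity: the spectral radius of A must be strictly below 1.
assert np.max(np.abs(np.linalg.eigvals(A))) < 1
X = simulate_var1(A, n=200)
assert X.shape == (200, 10)
```

The burn-in period (an assumption on our part) discards early observations so that the sampled stretch is approximately stationary.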
We evaluate performance in the following ways. First, we use the mean integrated squared error (MISE), defined by
$$\mathrm{MISE} = \frac{1}{\lfloor n/2 \rfloor} \sum_{j=1}^{\lfloor n/2 \rfloor} \big\| \hat{\Theta}(\omega_j) - \Theta(\omega_j) \big\|_*^2,$$
where $\{\omega_j\}_{j=1}^{\lfloor n/2 \rfloor}$ denote the Fourier frequencies in the interval $(0, 0.5)$, and $\|\cdot\|_*$ denotes the Frobenius norm of a matrix discarding the diagonal entries, i.e. for a matrix $A = (A_{ij}) \in \mathbb{R}^{p \times p}$, $\|A\|_* = \big(\sum_{i \neq j} A_{ij}^2\big)^{1/2}$. The reason we discard the diagonal entries is that we are mainly interested in the off-diagonal entries, and the penalisation deployed in obtaining the sparse precision matrix estimators inevitably introduces bias, especially in the diagonal entries. If one would like a better estimator of the diagonal entries, one can adopt an optional second step updating the diagonal entries only, by forcing the product of the smoothed periodogram matrix and the sparse precision matrix to be the identity. Due to the lack of theoretical guarantees, we omit this optional step in this paper.

We compare our estimator (SIPE) to the naïve inverse of the smoothed periodogram matrix (Naïve), smoothed with the modified Daniell kernel with bandwidth picked using the GCV criterion, and to the shrinkage estimator (Shrinkage) of [4]. We collect the numerical results, averaged over 50 repetitions in each setting, in Table 1. Each cell of the table is of the form mean (standard deviation). Since the Naïve and Shrinkage estimators do not produce sparse estimates, we only report the support recovery evaluations for SIPE. We define the true positive proportion (TPP) and the true negative proportion (TNP) as
$$\mathrm{TPP} = \frac{\#\{\text{non-zero off-diagonal entries correctly identified in the estimator}\}}{\#\{\text{non-zero off-diagonal entries in the truth}\}}, \qquad \mathrm{TNP} = \frac{\#\{\text{zero off-diagonal entries correctly identified in the estimator}\}}{\#\{\text{zero off-diagonal entries in the truth}\}}.$$
The results reported are averaged across all frequencies.
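The evaluation metrics translate into a few lines of code. The sketch below is our own illustration; the zero-detection tolerance `tol` and the toy matrices are assumptions, not the paper's settings:

```python
import numpy as np

def off_diagonal_frobenius(A):
    """||A||_*: the Frobenius norm with the diagonal entries discarded."""
    B = A - np.diag(np.diag(A))
    return np.linalg.norm(B, 'fro')

def support_metrics(Theta_hat, Theta, tol=1e-10):
    """TPP and TNP over the off-diagonal entries."""
    off = ~np.eye(Theta.shape[0], dtype=bool)
    truth_nz = np.abs(Theta[off]) > tol
    est_nz = np.abs(Theta_hat[off]) > tol
    tpp = (est_nz & truth_nz).sum() / max(truth_nz.sum(), 1)
    tnp = (~est_nz & ~truth_nz).sum() / max((~truth_nz).sum(), 1)
    return tpp, tnp

Theta = np.array([[2.0, 0.5, 0.0],
                  [0.5, 2.0, 0.0],
                  [0.0, 0.0, 2.0]])
Theta_hat = np.array([[2.1, 0.4, 0.0],
                      [0.4, 1.9, 0.1],
                      [0.0, 0.1, 2.0]])
tpp, tnp = support_metrics(Theta_hat, Theta)
print(tpp, tnp)  # 1.0 0.5
```

In the toy example, both truly non-zero off-diagonal entries are recovered (TPP = 1.0), while two of the four truly zero off-diagonal entries are incorrectly estimated as non-zero (TNP = 0.5).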
First, looking across all simulation settings, we see that the smoothed periodogram matrix sometimes cannot be inverted, motivating the need for some type of regularisation. The spectral density matrix for the white noise (WN) model is the identity matrix across all frequencies. The Shrinkage estimator is biased towards a scaled identity matrix, hence its superior performance in this setting for all dimensionalities and sample sizes. When the time series data possess autocorrelation, as in the VAR(1), sparse VAR(1) (sVAR(1)) and VARMA models, SIPE is competitive with the Shrinkage estimator with respect to MISE, yet can reasonably estimate the zero and non-zero entries of the precision matrices. In contrast, the Shrinkage estimator behaves like a ridge estimator, and hence by construction cannot produce sparse estimates of the inverse spectral density matrix. We see that our estimator yields favourable estimates of the spectral precision matrix while giving relatively good estimates of which entries are truly zero or non-zero.

Analysis of the Google Flu Trends data
We give an empirical illustration of our proposed methodology by analysing the Google Flu Trends data set. Researchers at Google used select Google search terms to predict influenza activity [20]. The resulting data set consists of weekly predicted numbers of influenza-like-illness related visits out of every 100,000 random outpatient visits within select cities throughout the United States of America. The data set is further aggregated at the state level and the region level, where each region comprises several states. The version of the Google Flu Trends data set we used is the state-level aggregate of log-transformed weekly data from 1 January 2006 to 6 October 2013. The resulting time series thus has p = 50 dimensions and length n = 406. The goal of our analysis is to investigate the conditional dependencies of the time series across states. To this end, we need to estimate the partial coherence matrix, which is a function of the inverse of the spectral density matrix. The partial coherence matrix is the frequency domain analog of partial correlation, and can be interpreted as the correlation between two time series that have been bandpass filtered at frequency $\omega$, after removing the linear effects of the other time series. The $(j, k)$th element of the partial coherence matrix is $\rho_{jk}(\omega) = |\Theta_{jk}(\omega)|^2 / [\Theta_{jj}(\omega) \Theta_{kk}(\omega)]$, where $\Theta(\omega)$ is the inverse spectral density matrix. We use our methodology to obtain a sparse estimate of $\Theta(\omega)$, from which we then obtain estimates of the partial coherence. Since we are only interested in the partial coherence matrix, we centre each time series to have mean zero and standardise it to have unit variance.
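The partial coherence formula above translates directly into code. In this sketch (our own; the numbers standing in for an estimated $\Theta(\omega)$ are hypothetical), a zero off-diagonal entry of $\Theta(\omega)$ yields zero partial coherence, which is what makes sparse estimates of $\Theta(\omega)$ interpretable as conditional independence statements:

```python
import numpy as np

def partial_coherence(Theta):
    """rho_jk = |Theta_jk|^2 / (Theta_jj * Theta_kk) for one frequency,
    where Theta is the (Hermitian) inverse spectral density matrix."""
    d = np.real(np.diag(Theta))           # diagonal of a Hermitian matrix is real
    rho = np.abs(Theta) ** 2 / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return np.real(rho)

# Hypothetical inverse spectral density matrix at a single frequency.
Theta = np.array([[2.0,        0.6 + 0.2j, 0.0],
                  [0.6 - 0.2j, 1.5,        0.0],
                  [0.0,        0.0,        1.0]])
rho = partial_coherence(Theta)
assert np.isclose(rho[0, 1], 0.4 / 3.0)   # |0.6 + 0.2i|^2 / (2.0 * 1.5)
assert rho[0, 2] == 0.0                   # zero entry -> zero partial coherence
```

In the data analysis, this computation would be applied to the sparse estimate of $\Theta(\omega)$ at each frequency in the band of interest, and the results summarised across frequencies.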
To pick the parameters of our method, we choose M_n using the GCV criterion. Each of the fifty time series was driven by frequencies within the band (0, 0.10), as shown by the diagonal entries of $\hat{f}_n$ in Figure 1. Indeed, for each of the fifty time series, the variance attributed to each Fourier frequency outside this band is less than 5% of the overall variation. Thus, we estimate the partial coherence within this frequency band, and we further summarise our results by taking the median partial coherence within this band. We show our results in Figure 2.
Each of four geographically distinct states (California, New York, Minnesota and Mississippi) yields different conditional dependencies as estimated by partial coherence. First, we see a local spatial structure. For instance, we see conditional dependencies between Minnesota and its neighbouring Midwest states, and between Mississippi and Alabama. Note that the spatial structure of the data was not included in the analysis, yet the association between the time series of neighbouring states remains relatively strong even after removing the effects of the other states. Second, we also see conditional dependencies between geographically distant states, e.g., New York with Washington and Wisconsin. Altogether, the partial coherence gives us information on conditional dependencies that can be attributed to the frequencies driving the variation in the data. Our analysis shows both spatially localised and spatially distant conditional dependence structures in the level of influenza activity in the United States, consistent with the results of Davis et al. [13].

Appendix
In this section, we collect all the necessary technical details.
Proof. It follows from the fact $ZZ^{-1} = I_p$ that
$$(C + \imath D)(A + \imath B) = I_p,$$
which is equivalent to
$$CA - DB = I_p \quad \text{and} \quad CB + DA = 0.$$
Therefore,
$$\begin{pmatrix} C & -D \\ D & C \end{pmatrix} \begin{pmatrix} A & -B \\ B & A \end{pmatrix} = I_{2p}.$$

Proof of Theorem 3.1. This proof starts with proving the result for any fixed $\omega$. For any $(i, j) \in \{1, \ldots, p\}^{\otimes 2}$, note that the $(i, j)$ entry of the periodogram $P_n(\omega)$ can be written as
$$P_{n, ij}(\omega) = P^{(1)}_{n, ij}(\omega) + P^{(2)}_{n, ij}(\omega) + P^{(3)}_{n, ij}(\omega). \eqno(\mathrm{A.1})$$
Next, we bound the three terms on the right-hand side of (A.1) separately. As for the term (I), we approximate it by a similar quantity built from $m$-dependent random variables. The last identity in (A.2) follows from the trigonometric identity that, for any $\alpha$ which is not a multiple of $2\pi$, and any positive integer $m \in \mathbb{N}_+$,
$$\sum_{k=0}^{m} \cos(\phi + k\alpha) = \frac{\sin((m+1)\alpha/2) \cos(\phi + m\alpha/2)}{\sin(\alpha/2)}.$$
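The trigonometric identity used above can be verified numerically as a quick sanity check (our own, not part of the proof):

```python
import numpy as np

# Numerical check of the Lagrange-type identity:
#   sum_{k=0}^{m} cos(phi + k*alpha)
#     = sin((m + 1)*alpha/2) * cos(phi + m*alpha/2) / sin(alpha/2),
# valid whenever alpha is not a multiple of 2*pi.
rng = np.random.default_rng(3)
for _ in range(100):
    phi = rng.uniform(-np.pi, np.pi)
    alpha = rng.uniform(0.1, 3.0)           # bounded away from multiples of 2*pi
    m = int(rng.integers(1, 50))
    lhs = np.sum(np.cos(phi + np.arange(m + 1) * alpha))
    rhs = np.sin((m + 1) * alpha / 2) * np.cos(phi + m * alpha / 2) / np.sin(alpha / 2)
    assert np.isclose(lhs, rhs)
print("identity verified")
```

The identity follows from taking the real part of the geometric sum $\sum_{k=0}^{m} e^{\imath(\phi + k\alpha)}$.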