Confidence regions for entries of a large precision matrix

Precision matrices play an important role in many practical applications. Motivated by temporally dependent multivariate data in modern social and scientific studies, we consider statistical inference for precision matrices based on high-dimensional, time-dependent observations. Specifically, we propose a data-driven procedure to construct a class of simultaneous confidence regions for the precision coefficients within an index set of interest. The confidence regions can be applied to test for specific structures of a precision matrix and to recover its nonzero components. We first construct an estimator of the underlying precision matrix via penalized node-wise regressions, and then develop Gaussian approximation results for the maximal difference between the estimated and true precision matrices. A computationally feasible parametric bootstrap algorithm is developed to implement the proposed procedure. Theoretical results indicate that the proposed procedure works well without second-order cross-time stationarity of the data or sparse structure conditions on the long-run covariance of the estimates. Simulation studies and a real example on S&P 500 stock return data confirm the performance of the proposed approach.


Introduction
With an ever-increasing capacity to collect and store data, industry, business and government offices all encounter the task of analyzing data of unprecedented size arising from various practical problems, such as panel studies of economic, social and natural (e.g., weather) phenomena, financial market analysis, genetic studies and communications engineering. A significant feature of these data is that the number of variables recorded in each observation is extremely large. Meanwhile, most real data exhibit evidence that the observations on the same unit over time are temporally dependent. Therefore, many well-developed statistical inference methods for independent and identically distributed (i.i.d.) data may not work here. These features of modern data bring both opportunities and challenges to statisticians and econometricians.
Covariance matrices provide a measure for the marginal linear dependence among components of a random vector. There is a collection of recent works on estimation and hypothesis testing of high-dimensional covariances with i.i.d. data; see Bickel and Levina (2008a,b), Qiu and Chen (2012), Cai, Liu and Xia (2013), Chang et al. (2017b) and references therein. Compared with the marginal dependence, the conditional dependence captures the direct "link" between two variables when the other variables are conditioned on, which reflects the interaction between variables. The Gaussian graphical model (GGM) is a widely used tool to model and analyze the conditional dependence among components of a random vector. Under the Gaussian assumption, the precision matrix, defined as the inverse of the covariance matrix, provides an equivalent representation of the conditional dependence. Therefore, analyzing a high-dimensional GGM reduces to investigating the structure of the associated precision matrix. Beyond the Gaussian assumption, the bijection between the conditional dependence and the precision matrix might not hold; nevertheless, the precision matrix still plays an important role in many statistical applications, including the analysis of linear regression models, the Kalman recursions in state-space models, and partial correlation graphs. See Examples 1-3 in Section 2 for more details.
Several methods have been proposed in recent years to estimate a high-dimensional precision matrix Ω with i.i.d. data. Graphical Lasso (Yuan and Lin, 2007; Friedman, Hastie and Tibshirani, 2008) is a penalized likelihood estimation approach with an L_1 penalty on the entries of Ω. Meinshausen and Bühlmann (2006) introduced a neighborhood selection approach, which estimates Ω by finding the nonzero regression coefficients of each variable on all other variables via the Lasso (Tibshirani, 1996) or the Dantzig selector (Candes and Tao, 2007). See also Cai, Liu and Luo (2011), Xue and Zou (2012) and Sun and Zhang (2013) for other penalized methods. Chen, Xu and Wu (2013) investigated the theoretical properties of graphical Lasso with time-dependent data. Even though the above-mentioned methods can estimate Ω consistently, they cannot provide statistical inference for Ω due to a non-negligible bias term incurred by the penalization. Recent progress has been made to overcome this issue. Under the Gaussian assumption, Ren et al. (2015) proposed a novel estimator for each entry of Ω based on pairwise L_1 penalized regressions, and showed that their procedure does not incur a bias term. Although their estimator has good theoretical properties, it requires fitting p(p−1)/2 high-dimensional regressions, which is computationally intensive when p is large. Liu (2013) proposed a bias-corrected estimator for Ω based on only p node-wise regressions with Gaussian data. However, neither of these two approaches considers time-dependent observations or global-level inference for a target sub-region of Ω. Such global inference is essential to understanding the structure of the precision matrix. To this end, a precise asymptotic expansion of the estimator is required, which is missing in the previous literature.
Let S be a given index set of interest, and n be the sample size. The main goal of this paper is to propose a data-driven procedure to determine a class of confidence regions (C_{S,α})_{0<α<1} for Ω_S with high-dimensional time-dependent data such that sup_{0<α<1} |P(Ω_S ∈ C_{S,α}) − α| → 0 as n → ∞, where Ω_S is a vector whose components are the elements of Ω indexed by S. Such confidence regions are of great practical importance in many statistical applications. For example, they can be used to test for specific structures of Ω, to detect and recover the nonzero components of Ω consistently, and to construct simultaneous confidence intervals for the elements of Ω_S. We first propose a bias-corrected estimator Ω̂_S for Ω_S via penalized node-wise regressions, and then investigate its asymptotic expansion without requiring either Gaussianity or stationarity. Based on this asymptotic expansion, the leading term of n^{1/2}(Ω̂_S − Ω_S) takes the form of a partial sum of an unobservable process. For any matrix A = (a_{j1,j2}), denote by |A|_∞ = max_{j1,j2} |a_{j1,j2}| the elementwise L_∞-norm of A. Inspired by the Gaussian approximation technique developed in Chernozhukov, Chetverikov and Kato (2013, 2014), we approximate the distribution of n^{1/2}|Ω̂_S − Ω_S|_∞ by that of the L_∞-norm of a high-dimensional normally distributed random vector with mean zero and covariance given by an estimate of the long-run covariance of the unobservable process. In practice, we apply a parametric bootstrap procedure to generate a set of independent samples from this multivariate normal distribution, and then use the empirical distribution of the L_∞-norms of the generated random vectors to characterize the probabilistic behavior of n^{1/2}|Ω̂_S − Ω_S|_∞.
Our analysis shows that the kernel-type estimator initially suggested by Andrews (1991) for the fixed-dimensional long-run covariance still works in the proposed procedure in the high-dimensional scenario, without imposing any additional stringent structural assumptions on the long-run covariance. Owing to the form of the kernel-type estimator, we construct a computationally feasible algorithm to implement the proposed procedure. As we will discuss in Section 3, our procedure has four significant advantages: (i) it is fully data-driven; (ii) it is adaptive to any structure of the long-run covariance; (iii) it significantly reduces the computational and storage costs in generating bootstrap samples; and (iv) in finite samples it approximates the distribution of n^{1/2}|Ω̂_S − Ω_S|_∞ better than the limiting distribution does. Most importantly, there is no guarantee of the existence of limiting distributions for such L_∞-type statistics in general. The existence of such limiting distributions usually requires additional stringent assumptions that are restrictive and difficult to verify in practice. As an advantage, our procedure is free of those assumptions. Both theoretical and numerical results demonstrate the promising properties of the proposed procedure.
The rest of the paper is organized as follows. Section 2 introduces the problem to be solved and its background. The proposed procedure and its theoretical properties are given in Section 3. Section 4 discusses the applications of our results. Simulation studies and a real data analysis are reported in Sections 5 and 6, respectively. All technical proofs are relegated to the Appendix. We conclude this section by introducing the notation used throughout the paper. For two real numbers a and b, the notation a ≍ b means that a and b are of the same order, that is, 0 < c_1 < |a/b| < c_2 < ∞ for two positive constants c_1 and c_2. For a collection of random sequences {b_{n,j}}_{n=1}^∞ with j ∈ J, and a real sequence {a_n}_{n=1}^∞, b_{n,j} = o_p(a_n) uniformly for j ∈ J means that max_{j∈J} |b_{n,j}/a_n| → 0 in probability as n → ∞. Let |B| be the cardinality of a set B. For an s_1 × s_2 matrix A = (a_{j1,j2}) and an index set A ⊂ {1, . . . , s_1} × {1, . . . , s_2} with q = |A|, we denote by A_A the q-dimensional vector whose components are the a_{j1,j2}'s indexed by A. As defined above, |A|_∞ = max_{j1,j2} |a_{j1,j2}| denotes the elementwise L_∞-norm of A. When A = a = (a_1, . . . , a_{s_1})^T is an s_1-dimensional vector, we define a_A and |a|_∞ in the same way with s_2 = 1. Let I(·) be the indicator function. Let |a|_0 = Σ_{j=1}^{s_1} I(a_j ≠ 0) and |a|_1 = Σ_{j=1}^{s_1} |a_j| be the L_0- and L_1-norms of a, respectively.

Preliminaries
Let y = (y_1, . . . , y_{p_n})^T be a p_n-variate random vector with mean µ_n and covariance Σ_n, and let Ω_n = Σ_n^{−1} be the precision matrix. Without loss of generality, we assume µ_n = 0 in the following analysis. Assume that Y_n = {y_1, . . . , y_n} is an observed sample of size n from an R^{p_n}-valued time series, where y_t = (y_{1,t}, . . . , y_{p_n,t})^T and each y_t has the same first two moments as y, i.e. E(y_t) = µ_n and Cov(y_t) = Σ_n for each t. Here we allow the dimension p_n to grow as the sample size n increases. The parameters µ_n and Σ_n related to the distribution of y may also change as n increases; the subscript n indicates their dependence on the sample size, and we will drop it whenever there is no confusion. We assume the temporal dependence among {y_t} satisfies the β-mixing condition, i.e. β_k → 0 as k → ∞, where β_k = sup_t E[ sup_{A ∈ F^∞_{t+k}} |P(A | F^t_{−∞}) − P(A)| ], and F^t_{−∞} and F^∞_{t+k} are the σ-fields generated respectively by {y_u}_{u≤t} and {y_u}_{u≥t+k}. The β-mixing condition is a mild assumption in the time series literature. It is well known that causal ARMA processes with continuous innovation distributions, stationary Markov chains under some mild conditions, and stationary GARCH models with finite second moments and continuous innovation distributions all satisfy the β-mixing condition. We refer to Section 2.6 of Fan and Yao (2003) for a detailed discussion of β-mixing.
Given a precision matrix Ω = (ω_{j1,j2})_{p×p} and an index set S ⊂ {1, . . . , p}^2 with r = |S|, we are interested in constructing a class of confidence regions (C_{S,α})_{0<α<1} for the components of Ω_S such that sup_{0<α<1} |P(Ω_S ∈ C_{S,α}) − α| → 0 as n → ∞ in the high-dimensional scenario. As we will discuss in Section 4, such confidence regions can be employed to (i) test for specific structures of Ω, and (ii) detect and recover the nonzero components of Ω consistently. Before introducing the methodology to determine the C_{S,α} specified in (1), we first illustrate the motivation of our study through the following examples.
Example 1. (High-dimensional linear regression) Let z_t and x_t = (x_{1,t}, . . . , x_{m,t})^T be, respectively, the response variable and the explanatory variables with zero mean. Suppose that z_t and x_t are linearly related as z_t = x_t^T γ + ε_t, where γ = (γ_1, . . . , γ_m)^T is the vector of true regression coefficients and ε_t is an error uncorrelated with x_t. The coefficient γ_l stands for the lth covariate effect on the response variable. In high-dimensional regression analysis, one goal is to separate the zero covariate effects from the nonzero ones. Consider testing H_0: γ_l = 0 for all l ∈ A versus H_1: γ_l ≠ 0 for some l ∈ A, (2) where A ⊂ {1, . . . , m} is a given index set. This testing problem can be formulated in the framework studied in this paper. Write y_t = (z_t, x_t^T)^T and denote by Ω = (ω_{j1,j2})_{p×p} the precision matrix of y_t. It can be shown that (ω_{1,2}, . . . , ω_{1,m+1})^T = −ω_{1,1} γ, so testing (2) is equivalent to testing H_0: ω_{1,l+1} = 0 for all l ∈ A versus H_1: ω_{1,l+1} ≠ 0 for some l ∈ A, i.e. our problem with S = {(1, l+1) : l ∈ A}.
Example 2. (Kalman recursions in the state-space model) In recent years, state-space representations and the associated Kalman recursions have had a profound impact on time series analysis and many related areas. Precision matrices are important in analyzing multivariate state-space models. Denote by WN(0, K) a white noise vector with mean 0 and covariance K. A state-space model for a multivariate time series {y_t}_{t=1}^∞ consists of two equations: an observation equation and a state equation. The sequence of matrices J_t is involved in the three fundamental problems associated with the state-space model: prediction, filtering and smoothing. Under the steady-state assumption that A_t is independent of t, we can compute J_t (t ≥ 2) from J_1. Notice that J_1 = Var(y_1), so J_1^{−1} is essentially the precision matrix Ω. A good estimate of Ω will lead to good performance of the Kalman recursions in high-dimensional settings.
To obtain a good estimate of Ω, prior structural information about Ω is helpful. Bandedness is one special structure of precision matrices. In many practical problems, the variables have a natural ordering and are related mainly to those in their neighborhood. For instance, sales, prices and weather indices depend more on those at nearby locations, and the information from farther locations may become redundant given that from the neighbors. See, for example, Can and Megbolugbe (1997) for a house price data set which exhibits such a dependence structure. In those cases, there exists a constant k > 0 such that ω_{j1,j2} = 0 for any |j1 − j2| > k. To explore and confirm such a structure, we consider the hypothesis testing problem
H_0: ω_{j1,j2} = 0 for all |j1 − j2| > k versus H_1: ω_{j1,j2} ≠ 0 for some |j1 − j2| > k, (4)
for some positive integer k which may diverge to infinity as n, p → ∞. Bickel and Levina (2008a) proposed banding the Cholesky decomposition factor matrix to estimate a high-dimensional bandable Ω. Testing the hypothesis (4) provides a practical guideline to confirm whether Ω is within the bandable class, so that the banding estimator can be applied: if the banded structure is confirmed by the test, the banding estimators may be employed.
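As a small illustration (the helper name is ours, not from the paper), the index set S of off-band entries restricted under the null hypothesis in (4) can be built directly:

```python
# Hypothetical helper: the index set S = {(j1, j2) : |j1 - j2| > k} of
# off-band entries tested under H0 in (4), using 0-based indices.
def banded_test_index_set(p, k):
    return [(j1, j2) for j1 in range(p) for j2 in range(p) if abs(j1 - j2) > k]

S = banded_test_index_set(5, 2)
```

For p = 5 and k = 2, this set contains the six entries at distance 3 or 4 from the diagonal.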
Example 3. (Partial correlation network) Given a precision matrix Ω = (ω_{j1,j2})_{p×p}, we can define an undirected network G = (V, E), where the vertex set V = {1, . . . , p} represents the p components of y and the edge set E = {(j1, j2) ∈ V × V : ω_{j1,j2} ≠ 0} consists of the pairs of variables with nonzero precision coefficients. Let ρ_{j1,j2} = Corr(ε_{j1}, ε_{j2}) be the partial correlation between the j1-th and the j2-th components of y for any j1 ≠ j2, where ε_{j1} and ε_{j2} are the errors of the best linear predictors of y_{j1} and y_{j2} given y_{−(j1,j2)} = {y_k : k ≠ j1, j2}, respectively. It is known that ρ_{j1,j2} = −ω_{j1,j2}/(ω_{j1,j1} ω_{j2,j2})^{1/2}. Therefore, the network G = (V, E) also represents the partial correlation graph of y: (j1, j2) ∉ E if and only if y_{j1} and y_{j2} are partially uncorrelated. Additionally, if y follows a multivariate normal distribution, such a network G is a Gaussian graphical model, which encodes the conditional dependence structure among the components of y.
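The identity ρ_{j1,j2} = −ω_{j1,j2}/(ω_{j1,j1} ω_{j2,j2})^{1/2} can be checked numerically. The following sketch (function names are ours) converts a toy precision matrix into partial correlations and reads off the edge set E:

```python
import numpy as np

# Sketch: map a precision matrix Omega to partial correlations
# rho_{j1,j2} = -omega_{j1,j2} / sqrt(omega_{j1,j1} * omega_{j2,j2}).
def partial_correlations(Omega):
    d = np.sqrt(np.diag(Omega))
    rho = -Omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

# Toy 3-variable example: omega_{1,3} = 0, so no edge between vertices 0 and 2.
Omega = np.array([[2.0, 0.5, 0.0],
                  [0.5, 2.0, 0.5],
                  [0.0, 0.5, 2.0]])
rho = partial_correlations(Omega)
edges = {(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(Omega[i, j]) > 1e-12}
```

Here the edge set is {(0, 1), (1, 2)}, matching the zero pattern of Omega as the text describes.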
Neighborhood and community are two basic features of a network. The neighborhood of the jth vertex, denoted by N_j, is the set of all vertices directly connected to it. For most spatial data, it is believed that the partial correlation neighborhood is related to the spatial neighborhood. Let N_j(k) be the set of the k closest vertices to the jth vertex in the spatial domain. To investigate such a relationship, we consider testing H_0: N_j = N_j(k) versus H_1: N_j ≠ N_j(k) for some pre-specified positive constant k. A community in a network is a group of vertices with heavier connectivity within the group than outside it. The community structure builds the fundamental architecture of a network. Empirical evidence in the analysis of stock data indicates that the returns of stocks from the same sector are highly correlated (Chan, Karceski and Lakonishok, 1999; Kenett et al., 2010). Intuitively, the partial correlation between the returns of two stocks from different sectors should be small. Hence, the sectors can be regarded as different communities in the stock return network. In brain connectivity studies, it has been shown in the literature that the brain anatomical regions have dense connections within each lobe, which can be regarded as a community, while the between-lobe connections are much fewer (Huang et al., 2010; Power et al., 2011). However, those between-lobe connections are still important in brain operation and information processing. It has been found that some neurodegenerative diseases, such as Alzheimer's Disease and Autism Spectrum Disorder, reduce the between-lobe connections relative to normal healthy individuals (Huang et al., 2010). Therefore, it is of practical importance to explore the connectivity between different communities. Assume the p components of y are decomposed into K disjoint communities V_1, . . . , V_K.
We are interested in recovering D = {(k_1, k_2) : ω_{j1,j2} ≠ 0 for some j_1 ∈ V_{k_1} and j_2 ∈ V_{k_2}}.

Estimation of Ω
To state our methodology, we first revisit the relationship between the precision matrix and node-wise regressions. For a random vector y = (y_1, . . . , y_p)^T with mean 0 and covariance Σ, we consider the p node-wise regressions
y_{j1} = Σ_{j2 ≠ j1} α_{j1,j2} y_{j2} + ε_{j1} (j1 = 1, . . . , p). (5)
Let y_{−j1} = {y_{j2} : j2 ≠ j1}. The regression error ε_{j1} is uncorrelated with y_{−j1} if and only if α_{j1,j2} = −ω_{j1,j2}/ω_{j1,j1} for any j2 ≠ j1. For such specified regression coefficients, it can be shown that the covariance V = (v_{j1,j2})_{p×p} of the error vector ε = (ε_1, . . . , ε_p)^T satisfies v_{j1,j2} = ω_{j1,j2}/(ω_{j1,j1} ω_{j2,j2}); see Peng et al. (2009) for this result. This relationship between Ω and V provides a way to learn Ω from the regression errors in (5).
Since the error vector in (5) is unobservable in practice, its "proxy", the residuals of the node-wise regressions, can be used to estimate V. Let α_j = (α_{j,1}, . . . , α_{j,j−1}, −1, α_{j,j+1}, . . . , α_{j,p})^T. For each j = 1, . . . , p, we fit the high-dimensional linear regression
y_{j,t} = Σ_{k≠j} α_{j,k} y_{k,t} + ε_{j,t} (t = 1, . . . , n) (6)
by the Lasso (Tibshirani, 1996), the Dantzig selector (Candes and Tao, 2007) or the scaled Lasso (Sun and Zhang, 2012). For the case µ ≠ 0, the regression (6) is conducted on the centered data y_t − ȳ, where ȳ = n^{−1} Σ_{t=1}^n y_t is the sample mean. For simplicity, we present our results for the Lasso estimator; other estimators can be treated similarly. Let α̂_j be the Lasso estimator of α_j defined as
α̂_j = arg min_{γ ∈ Θ_j} { (2n)^{−1} Σ_{t=1}^n (γ^T y_t)^2 + λ_j Σ_{k≠j} |γ_k| }, (7)
where Θ_j = {γ = (γ_1, . . . , γ_p)^T ∈ R^p : γ_j = −1} and λ_j is a tuning parameter. For each t, the residual ε̂_{j,t} = y_{j,t} − Σ_{k≠j} α̂_{j,k} y_{k,t} provides an estimate of ε_{j,t}. Write ε̂_t = (ε̂_{1,t}, . . . , ε̂_{p,t})^T and let V̂ = (v̂_{j1,j2})_{p×p} be the sample covariance of {ε̂_t}_{t=1}^n, where v̂_{j1,j2} = n^{−1} Σ_{t=1}^n ε̂_{j1,t} ε̂_{j2,t}. It is well known that n^{−1} Σ_{t=1}^n ε_{j1,t} ε_{j2,t} is an unbiased estimator of v_{j1,j2}; however, replacing ε_{j1,t} by ε̂_{j1,t} incurs a bias term. Specifically, Lemma 3 in the Appendix gives an asymptotic expansion (9) for v̂_{j1,j2} whose higher-order term o_p{(n log p)^{−1/2}} holds uniformly over j1 and j2. Since n^{−1} Σ_{t=1}^n ε²_{j,t} is n^{1/2}-consistent for v_{j,j}, (9) implies that v̂_{j,j} is also n^{1/2}-consistent for v_{j,j}. However, for any j1 ≠ j2, due to the slow convergence rates of the Lasso estimators α̂_{j1,j2} and α̂_{j2,j1}, v̂_{j1,j2} is no longer n^{1/2}-consistent for v_{j1,j2}. To eliminate the bias, we employ a bias-corrected estimator ṽ_{j1,j2} for v_{j1,j2}, given in (10). Noting that ω_{j1,j2} = v_{j1,j2}/(v_{j1,j1} v_{j2,j2}), we obtain the corresponding estimator ω̂_{j1,j2} for any j1 and j2. To study the theoretical properties of this estimator ω̂_{j1,j2}, we need the following regularity conditions.

Condition 1. There exist constants K_1 > 0, K_2 > 1, 0 < γ_1 ≤ 2 and 0 < γ_2 ≤ 2, independent of p and n, such that for each t = 1, . . . , n, max_{1≤j≤p} E{exp(K_1 |y_{j,t}|^{γ_1})} ≤ K_2 and max_{1≤j≤p} E{exp(K_1 |ε_{j,t}|^{γ_2})} ≤ K_2.

Condition 2.
The smallest eigenvalues of Σ and Ω are uniformly bounded away from zero.
Condition 3. There exist constants K_3 > 0 and γ_3 > 0, independent of p and n, such that β_k ≤ exp(−K_3 k^{γ_3}) for any positive integer k.
Condition 1 implies max_{1≤j≤p} P(|y_{j,t}| ≥ x) ≤ K_2 exp(−K_1 x^{γ_1}) and max_{1≤j≤p} P(|ε_{j,t}| ≥ x) ≤ K_2 exp(−K_1 x^{γ_2}) for any x > 0 and t = 1, . . . , n. It ensures exponential-type upper bounds for the tail probabilities of the statistics concerned (see, for example, Lemma 1 in the Appendix), which makes our procedure work for p diverging at some exponential rate of n. Condition 2 implies bounded eigenvalues of Σ and Ω. Such a regularity condition on the eigenvalues is commonly assumed in the high-dimensional literature. Condition 3 on the β-mixing coefficients of {y_t} is mild. Causal autoregressive moving average processes with continuous innovation distributions are β-mixing with exponentially decaying β_k, and so are stationary Markov chains satisfying certain conditions; see Section 2.6.1 of Fan and Yao (2003) and the references therein. In fact, stationary GARCH models with finite second moments and continuous innovation distributions are also β-mixing with exponentially decaying β_k; see Proposition 12 of Carrasco and Chen (2002). If we only require sup_t max_{1≤j≤p} P(|y_{j,t}| > x) = O{x^{−2(ν+ι)}} and sup_t max_{1≤j≤p} P(|ε_{j,t}| > x) = O{x^{−2(ν+ι)}} for any x > 0 in Condition 1, and β_k = O{k^{−ν(ν+ι)/(2ι)}} in Condition 3 for some ν > 2 and ι > 0, we can apply Fuk-Nagaev-type inequalities to construct the upper bounds for the tail probabilities of the statistics when p diverges at some polynomial rate of n. We refer to Section 3.2 of Chang, Guo and Yao (2014) for the implementation of Fuk-Nagaev-type inequalities in such a scenario. The β-mixing condition can also be replaced by the α-mixing condition, under which we can justify the proposed method for p diverging at some polynomial rate of n by using Fuk-Nagaev-type inequalities. However, it remains an open problem to establish the relevant properties under α-mixing for p diverging at some exponential rate of n. The following proposition gives the asymptotic expansion of ω̂_{j1,j2}.
We see from Proposition 1 that ω̂_{j1,j2} is centered at the true parameter ω_{j1,j2} with a standard deviation of order n^{−1/2}. Under the condition that the sparsity level s grows more slowly than n^{1/2}, a commonly used assumption in the high-dimensional literature, Proposition 1 holds when log p = o(n^c) for some positive constant c. This means that Proposition 1 is valid even when the dimension p grows at some exponential rate of the sample size n. Ren et al. (2015) proposed an estimator for ω_{j1,j2} via pairwise regression. Specifically, they fitted each pair of variables (y_{j1}, y_{j2}) on all other variables y_{−(j1,j2)} via the scaled Lasso, which requires p(p−1)/2 high-dimensional regressions. In comparison, our proposed estimator (10) needs only p regressions, which dramatically reduces the computational burden when p is large. Liu (2013) used the product of ṽ_{j1,j2} in (10) and a specified scale c_{j1,j2} as the statistic to detect whether ω_{j1,j2} = 0 or not. He showed that an appropriately centered version of −n^{1/2} c_{j1,j2} ṽ_{j1,j2}, involving a random variable b_{j1,j2}, is asymptotically normally distributed, which yields the asymptotic normality of −n^{1/2} c_{j1,j2} ṽ_{j1,j2} itself only when ω_{j1,j2} = 0. However, such a result under ω_{j1,j2} = 0 is not sufficient for constructing confidence regions for the nonzero ω_{j1,j2}'s. The asymptotic expansion of ω̂_{j1,j2} in Proposition 1 is a more delicate result, which is necessary for our analysis.
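The node-wise construction can be illustrated end-to-end on simulated i.i.d. data. The sketch below is ours, not the paper's implementation: it uses a plain coordinate-descent Lasso, the identities v_{j,j} = 1/ω_{j,j} and α_{j1,j2} = −ω_{j1,j2}/ω_{j1,j1}, and a crude symmetrization in place of the bias correction in (10):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent Lasso: argmin_b (2n)^{-1} ||y - X b||^2 + lam |b|_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).mean(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # residual excluding feature j
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

def nodewise_precision(Y, lam=0.05):
    """Regress each y_j on the rest; map residual variances and coefficients
    back to Omega via omega_{j,j} = 1/v_{j,j} and omega_{j,k} = -alpha_{j,k}/v_{j,j}."""
    n, p = Y.shape
    Omega = np.zeros((p, p))
    for j in range(p):
        idx = [k for k in range(p) if k != j]
        alpha = lasso_cd(Y[:, idx], Y[:, j], lam)
        resid = Y[:, j] - Y[:, idx] @ alpha
        v_jj = resid @ resid / n
        Omega[j, j] = 1.0 / v_jj
        Omega[j, idx] = -alpha / v_jj
    return 0.5 * (Omega + Omega.T)               # crude symmetrization

# Toy check: tridiagonal precision matrix, i.i.d. Gaussian sample.
rng = np.random.default_rng(0)
Omega_true = np.diag(np.full(4, 2.0)) + np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
L = np.linalg.cholesky(np.linalg.inv(Omega_true))
Y = rng.standard_normal((2000, 4)) @ L.T
Omega_hat = nodewise_precision(Y, lam=0.02)
```

With n = 2000 and p = 4, the estimate recovers the tridiagonal pattern up to the Lasso shrinkage bias that the paper's correction in (10) is designed to remove.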

Confidence regions
It follows from Proposition 1 that n^{1/2}(Ω̂ − Ω) admits an asymptotic expansion; restricted to a given index set S with r = |S|, we have the expansion (12) for n^{1/2}(Ω̂_S − Ω_S). Based on (12), we consider two choices for the confidence regions C_{S,α} specified in (1), namely C_{S,α,1} and C_{S,α,2} defined in (13), where D̂ is an r × r diagonal matrix, specified later in Remark 5, whose elements are the estimated standard deviations of the r components of n^{1/2}(Ω̂_S − Ω_S). Here q_{S,α,1} and q_{S,α,2} are two critical values to be determined. C_{S,α,1} and C_{S,α,2} represent the so-called "non-Studentized-type" and "Studentized-type" confidence regions for Ω_S, respectively. As commented by Chang et al. (2017a), the Studentized-type confidence regions perform better than the non-Studentized-type ones in the presence of heteroscedasticity, whereas the performance of the non-Studentized-type regions is more stable when the sample size n is fairly small.
In the sequel, we mainly focus on estimating the critical value q_{S,α,1} in (13); q_{S,α,2} can be estimated similarly, as will be discussed in Remark 5. To determine q_{S,α,1}, we first need to characterize the probabilistic behavior of n^{1/2}|Ω̂_S − Ω_S|_∞. Since Υ_S is a higher-order term, n^{1/2}|Ω̂_S − Ω_S|_∞ behaves similarly to n^{1/2}|Π_S|_∞ when n is large. For each t, let ς_t be the r-dimensional vector whose jth element is defined in terms of the bijective mapping χ(·) = {χ_1(·), χ_2(·)} from {1, . . . , r} to S such that Ω_S = {ω_{χ(1)}, . . . , ω_{χ(r)}}^T; then Π_S is the sample average of {ς_t}_{t=1}^n. Denote by W the long-run covariance of {ς_t}_{t=1}^n, defined in (14). The long-run covariance in (14) can be written as W = H Ξ H, (15) where H is a matrix depending only on the v_{j,j}'s and Ξ is the long-run covariance of a sequence {η_t}_{t=1}^n. To study the asymptotic distribution of the average of the temporally dependent sequence {ς_t}_{t=1}^n and its long-run covariance W, we impose the following condition on {η_t}_{t=1}^n.
Condition 4. There exist constants K_4 > 0 and ι > 0 such that lim sup_{b→∞} sup_{1≤ℓ≤n+1−b} max_{1≤j≤r} E(|b^{−1/2} Σ_{t=ℓ}^{ℓ+b−1} η_{j,t}|^{2+ι}) < K_4.

Condition 4 is a technical assumption for the validity of the Gaussian approximation for dependent data; see Chernozhukov, Chetverikov and Kato (2014). The condition is mild. Based on Condition 1 and following Lemma 2 of Chang, Tang and Wu (2013), we have sup_t max_{1≤j≤r} P(|η_{j,t}| > x) ≤ C_1 exp(−C_2 x^{γ_2/2}) for any x > 0, where C_1 and C_2 are two positive constants depending only on K_1 and K_2 specified in Condition 1. Together with Condition 3, a corresponding moment bound holds for any j and ℓ. Furthermore, it can be shown that an exponential-type tail bound holds for the partial sums for any x > 0, where the constants C_3, C_4 and γ involved are positive and depend only on the uniform constants specified in Conditions 1 and 3. Therefore, lim sup_{b→∞} sup_{1≤ℓ≤n+1−b} E(|b^{−1/2} Σ_{t=ℓ}^{ℓ+b−1} η_{j,t}|^{2+ι}) < K_4 holds automatically for some positive constant K_4, provided that Conditions 1 and 3 hold. Let σ²_{j,ℓ}(b) = Var(b^{−1/2} Σ_{t=ℓ}^{ℓ+b−1} η_{j,t}) for any j, ℓ and b. If the σ²_{j,ℓ}(b)'s are uniformly bounded away from zero for all j and ℓ, then by Jensen's inequality the corresponding normalized moment bound holds for some positive constant K_4. If {η_{j,t}}_{t≥1} is stationary, Conditions 1 and 3 imply the required bound directly. The next theorem shows that the probabilistic behavior of n^{1/2}|Ω̂_S − Ω_S|_∞ can be approximated by that of |ξ|_∞ for ξ ∼ N(0, W).

Theorem 1. Let ξ ∼ N(0, W) with W defined in (14). Under the conditions of Proposition 1 and Condition 4, we have sup_{x≥0} |P(n^{1/2}|Ω̂_S − Ω_S|_∞ ≤ x) − P(|ξ|_∞ ≤ x)| = O(n^{−C}) as n → ∞, where C > 0 is a constant specified in the proof of this theorem in the Appendix.
Remark 1. Theorem 1 shows that the Kolmogorov distance between the distributions of n^{1/2}|Ω̂_S − Ω_S|_∞ and |ξ|_∞ converges to zero in the high-dimensional scenario. More specifically, as shown in the proof of Theorem 1 in the Appendix, the convergence rate is O(n^{−C}) for some constant C > 0. Under certain conditions, the L_∞-type statistic n^{1/2}|Ω̂_S − Ω_S|_∞ converges weakly to some extreme-value distribution. However, such convergence usually requires stringent assumptions on the structure of the underlying covariance W and suffers from low accuracy, as the rate of convergence to the extreme-value distribution is fairly slow. Taking the type I extreme-value distribution as an example, the convergence rate is of order O{log(log n)/log n}. These defects may cause poor finite-sample performance of the limiting distribution calibration for the L_∞-type statistic. In contrast, our approximation strategy does not require such stringent assumptions on W and enjoys a faster convergence rate.
Theorem 1 provides a guideline for approximating the distribution of n^{1/2}|Ω̂_S − Ω_S|_∞. To implement it in practice, we need an estimator of W. Denote by Ξ the matrix sandwiched by the H's on the right-hand side of (15), which is the long-run covariance of {η_t}_{t=1}^n. Since ṽ_{j,j} defined in (10) is n^{1/2}-consistent for v_{j,j}, we can estimate H by the plug-in estimator Ĥ given in (16). Let Γ̂_k denote the kth sample autocovariance of the estimated sequence {η̂_t}_{t=1}^n. Based on the Γ̂_k's, we propose the kernel-type estimator suggested by Andrews (1991) for Ξ,
Ξ̂ = Σ_{k=−n+1}^{n−1} K(k/S_n) Γ̂_k, (17)
where S_n is the bandwidth and K(·) is a symmetric kernel function that is continuous at 0 and satisfies K(0) = 1, |K(u)| ≤ 1 for any u ∈ R, and ∫_{−∞}^{∞} K²(u) du < ∞. Given Ĥ and Ξ̂ defined respectively in (16) and (17), an estimator for W is given by Ŵ = Ĥ Ξ̂ Ĥ in (18). Theorem 2 below shows that, in finite samples, we can approximate the distribution of n^{1/2}|Ω̂_S − Ω_S|_∞ by that of |ξ̂|_∞ with ξ̂ | Y_n ∼ N(0, Ŵ).

Remark 2. Andrews (1991) systematically investigated the theoretical properties of this class of estimators in the fixed-dimensional framework, and showed that the Quadratic Spectral kernel is optimal for estimating the long-run covariance in the sense of minimizing the asymptotic truncated mean squared error. A data-driven bandwidth selection procedure for the Quadratic Spectral kernel was studied in Section 6 of Andrews (1991), and it can be computed efficiently.
We will adopt the Quadratic Spectral kernel K_QS(·) with its data-driven bandwidth selection procedure in our numerical studies. Both our theoretical and simulation results show that this kernel estimator Ξ̂ performs very well even in high-dimensional scenarios. There also exist various other estimation methods for long-run covariances, including estimators utilizing moving block bootstraps (Lahiri, 2003; Nordman and Lahiri, 2005); see also Den Han and Levin (1997) and Kiefer, Vogelsang and Bunzel (2000). Compared to those methods, the kernel-type estimator applied in our procedure has two advantages. First, it does not require a stationarity assumption on {η_t}_{t=1}^n. Second, the form of the kernel-type estimator in (17) can be exploited to simplify the data-generating mechanism for a high-dimensional Gaussian random vector with covariance Ŵ, and therefore can significantly improve the computational efficiency when p is large; see Remark 4 below for a more detailed discussion of the computational issues.

Theorem 2. Let ξ̂ | Y_n ∼ N(0, Ŵ) with Ŵ defined in (18). Assume the kernel function K(·) satisfies |K(x)| ≍ |x|^{−τ} as x → ∞ for some τ > 1, and the bandwidth S_n ≍ n^ρ for some 0 < ρ < min{(τ−1)/(3τ), γ_3/(2γ_3+1)}, with γ_3 given in Condition 3. Under the conditions of Theorem 1, it holds that sup_{x≥0} |P(n^{1/2}|Ω̂_S − Ω_S|_∞ ≤ x) − P(|ξ̂|_∞ ≤ x | Y_n)| = O_p(n^{−C}), where C is a positive constant specified in the proof of this theorem in the Appendix, and Y_n = {y_1, . . . , y_n}.
Remark 3. From the theoretical properties of the Gaussian approximation technique proposed by Chernozhukov, Chetverikov and Kato (2013), we know that the result stated in Theorem 2 is valid for any Ŵ satisfying |Ŵ − W|_∞ = o_p(1). The proposed procedure requires the estimation of a high-dimensional covariance matrix W. However, unlike the widely used estimators in the high-dimensional covariance estimation literature, our procedure does not need specific structural assumptions, such as sparsity or bandedness, imposed on W. This advantage broadens the applicability of the proposed method.
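As a concrete illustration of the kernel-type estimator (17) with the Quadratic Spectral kernel, the following sketch (our own, with hypothetical function names) computes Ξ̂ from a centered series; the bandwidth is fixed here rather than chosen by the data-driven rule of Andrews (1991):

```python
import numpy as np

def qs_kernel(x):
    """Quadratic Spectral kernel of Andrews (1991); K(0) = 1, |K(u)| <= 1."""
    x = np.asarray(x, dtype=float)
    out = np.ones_like(x)
    nz = x != 0
    z = 6.0 * np.pi * x[nz] / 5.0
    out[nz] = 25.0 / (12.0 * np.pi ** 2 * x[nz] ** 2) * (np.sin(z) / z - np.cos(z))
    return out

def kernel_lrcov(eta, bandwidth):
    """Kernel long-run covariance: Xi_hat = sum_{|k|<n} K(k/S_n) Gamma_hat_k,
    with Gamma_hat_k = n^{-1} sum_t eta_{t+k} eta_t^T (eta assumed centered)."""
    n, r = eta.shape
    Xi = np.zeros((r, r))
    for k in range(-(n - 1), n):
        w = qs_kernel(np.array([k / bandwidth]))[0]
        if k >= 0:
            G = eta[k:].T @ eta[: n - k] / n
        else:
            G = (eta[-k:].T @ eta[: n + k] / n).T   # Gamma_{-k} = Gamma_k^T
        Xi += w * G
    return Xi

# For i.i.d. data the long-run covariance equals the marginal covariance,
# so Xi_hat should be close to the identity here.
rng = np.random.default_rng(1)
eta = rng.standard_normal((2000, 2))
Xi_hat = kernel_lrcov(eta, bandwidth=3.0)
```

The symmetric weights and the relation Γ̂_{−k} = Γ̂_k^T make Ξ̂ symmetric by construction, and the QS kernel keeps it positive semidefinite.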
In finite samples, we can use Monte Carlo simulation to approximate the distribution of |ξ̂|_∞ given the data Y_n. Specifically, let ξ̂_1, . . . , ξ̂_M be i.i.d. r-dimensional random vectors drawn from N(0, Ŵ). The conditional distribution of |ξ̂|_∞ given Y_n can then be approximated by the empirical distribution of {|ξ̂_m|_∞}_{m=1}^M, and q_{S,α,1} specified in (13) can be estimated by the corresponding empirical quantile. To improve computational efficiency, we propose the following Kernel-based Multiplier Bootstrap (KMB) procedure to generate ξ̂: Step 1. Let A be an n × n matrix whose (i, j)th element is K(|i − j|/S_n), and generate an n-dimensional Gaussian random vector g = (g_1, . . . , g_n)^T with mean 0 and covariance A.
Step 2. Compute ξ̂ = n^{−1/2} Σ_{t=1}^{n} g_t η̂_t.

It is easy to see that ξ̂ ∼ N(0, Ŵ) given Y_n. Note that this KMB procedure only requires generating an n-dimensional Gaussian random vector in each bootstrap sample.
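As a quick illustration, the empirical-quantile step can be sketched as follows (the function and argument names are ours; `draws` stands for the M bootstrap vectors ξ̂^{(1)}, . . . , ξ̂^{(M)} stacked row-wise):

```python
import numpy as np

def kmb_critical_value(draws, alpha):
    """Estimate q_{S,alpha,1} by the empirical (1 - alpha)-quantile of the
    bootstrap sup-norms |xi^(m)|_inf, m = 1, ..., M.  draws has shape (M, r)."""
    sup_norms = np.max(np.abs(draws), axis=1)
    return np.quantile(sup_norms, 1.0 - alpha)
```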
Remark 4. The classical approach to drawing a random vector ξ̂ ∼ N(0, Ŵ) consists of three steps: (i) perform the Cholesky decomposition of the r × r matrix Ŵ = L^T L, (ii) generate r independent standard normal random variables z = (z_1, . . . , z_r)^T, and (iii) apply the transformation ξ̂ = L^T z. This requires storing the matrix Ŵ and {η̂_t}_{t=1}^n, which amounts to storage costs of O(r^2) and O(rn), respectively. The computational complexity is O(r^2 n + r^3), mainly due to computing Ŵ and the Cholesky decomposition. When r = O(p^2) and p is large, the classical approach requires intensive computation and high storage. The new data generating mechanism KMB, however, makes our procedure practically feasible even when p is large. The proposed KMB procedure only needs to store {η̂_t}_{t=1}^n and A, and to draw an n-dimensional random vector g ∼ N(0, A) in each bootstrap sample, which amounts to a total storage cost of O(rn + n^2). More significantly, the computational complexity of the KMB procedure is only O(n^3), which is independent of r and p.
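To make the comparison concrete, here is a minimal sketch of one KMB draw, assuming the bootstrap vector is formed as ξ̂ = n^{−1/2} Σ_t g_t η̂_t (so that its conditional covariance is the kernel-type estimator Ŵ); the function and argument names are ours:

```python
import numpy as np

def kmb_draw(eta_hat, S_n, kernel, rng):
    """One KMB bootstrap draw.

    eta_hat : (n, r) array stacking the estimated vectors eta_hat_t row-wise
              (assumed layout); S_n is the bandwidth; kernel is K(.).
    """
    n = eta_hat.shape[0]
    idx = np.arange(n)
    # A_{ij} = K(|i - j| / S_n); only an n x n matrix is formed, never r x r.
    A = kernel(np.abs(idx[:, None] - idx[None, :]) / S_n)
    # Step 1: g ~ N(0, A) -- an n-dimensional draw, cost O(n^3) at worst.
    g = rng.multivariate_normal(np.zeros(n), A)
    # Step 2 (assumed form): xi = n^{-1/2} * sum_t g_t * eta_hat_t.
    return (g @ eta_hat) / np.sqrt(n)
```

The cost per draw is dominated by sampling g, which does not grow with r or p, in line with the O(n^3) complexity claimed above.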
Corollary 1. Assume the conditions of Theorem 2 hold. Then: (i) under H_0, the size of the test Ψ_α converges to the nominal level α; and (ii) if max_{(j_1,j_2)∈S} |ω_{j_1,j_2} − c_{j_1,j_2}| ≥ C(n^{−1} log p)^{1/2} max_{1≤j≤r} w_{j,j}^{1/2}, where w_{j,j} is the j-th diagonal element of W defined in (14) and C is a constant larger than √2, then the power of Ψ_α converges to one.

From Corollary 1, we see that the empirical size of the proposed testing procedure Ψ_α converges to its nominal level α under H_0. Compared with an L_∞-type test that calibrates the critical value q_{S,1−α,1} from the limiting distribution, the latter suffers from size distortion due to the slow convergence of n^{1/2}|Ω̂_S − c|_∞ to its limiting distribution (see Remark 1). Since the KMB procedure enjoys a much faster convergence rate, the proposed test Ψ_α has more accurate size.
The quantity max_{(j_1,j_2)∈S} |ω_{j_1,j_2} − c_{j_1,j_2}| specifies the maximal deviation of the precision matrix from the null hypothesis H_0: ω_{j_1,j_2} = c_{j_1,j_2} for all (j_1, j_2) ∈ S. The power of the proposed test Ψ_α approaches 1 if this maximal signal strength is larger than C_6(n^{−1} log p)^{1/2} for some positive constant C_6. Such a condition on signal strength is commonly assumed when studying the power of L_∞-type tests; see Jiang (2011), Cai, Liu and Xia (2013) and Chang et al. (2017a,b). A "Studentized-type" test can be constructed analogously by replacing n^{1/2}|Ω̂_S − c|_∞ and q_{S,1−α,1} in (13) with n^{1/2}|D̂^{−1}(Ω̂_S − c)|_∞ and q_{S,1−α,2}, respectively.
In our context, a false positive means declaring a zero ω_{j_1,j_2} to be non-zero. Let FP be the number of false positives committed by the estimated signal set M̂_{n,α}. Let the family-wise error rate (FWER) be the probability of committing any false positive, namely FWER = P(FP > 0); see Hochberg and Tamhane (2009) for various types of error rates in multiple testing procedures. Notice that P(FP > 0) ≤ P(Ω_S ∉ Ĉ_{S,1−α,1}) = α{1 + o(1)}. This shows that the proposed method is able to control the family-wise error rate at level α for any α ∈ (0, 1). The following corollary further establishes the consistency of M̂_{n,α}.
Corollary 2. Assume the conditions of Theorem 2 hold, and the signals satisfy min_{(j_1,j_2)∈M_0} |ω_{j_1,j_2}| ≥ C(n^{−1} log p)^{1/2} max_{1≤j≤r} w_{j,j}^{1/2}, where w_{j,j} is the j-th diagonal element of W defined in (14) and C is a constant larger than √2. Selecting α → 0 such that 1/α = o(p), M̂_{n,α} then recovers M_0 with probability approaching one.

From Corollary 2, we see that the selected set M̂_{n,α} identifies the true set M_0 consistently if the minimum signal strength min_{(j_1,j_2)∈M_0} |ω_{j_1,j_2}| is larger than C_7(n^{−1} log p)^{1/2} for some positive constant C_7. Notice from Corollary 1 that only the maximum signal is required in the power analysis of the proposed testing procedure. Compared to signal detection, signal identification is a more challenging problem: full support recovery of Ω requires all non-zero |ω_{l_1,l_2}| to exceed a specific level. Similarly, we can also define M̂_{n,α} by replacing Ĉ_{S,1−α,1} in (13) with its "Studentized-type" analogue Ĉ_{S,1−α,2}.
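Under the natural reading of M̂_{n,α} (an entry is declared non-zero when zero falls outside the confidence region), the recovery rule can be sketched as follows; the thresholding form n^{1/2}|ω̂_{j_1,j_2}| > q̂ and the names below are our assumptions:

```python
import numpy as np

def recover_support(omega_hat_S, q_hat, n):
    """Estimated signal set: indices (j1, j2) in S whose fitted coefficient
    satisfies n^{1/2} |omega_hat_{j1,j2}| > q_hat, the bootstrap critical value."""
    flagged = np.sqrt(n) * np.abs(omega_hat_S) > q_hat
    return {tuple(ix) for ix in np.argwhere(flagged)}
```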
For each of the cases above, we examined the accuracy of the proposed KMB and SKMB approximations to the distributions of the non-Studentized statistic n^{1/2}|Ω̂_S − Ω_S|_∞ and the Studentized statistic n^{1/2}|D̂^{−1}(Ω̂_S − Ω_S)|_∞, respectively. More specifically, we first drew 1000 independent samples, each of size n, generated by the data generating mechanism discussed above. We then computed n^{1/2}|Ω̂_S − Ω_S|_∞ and n^{1/2}|D̂^{−1}(Ω̂_S − Ω_S)|_∞ in each sample and employed their empirical distributions as benchmarks for the true distributions. For α = 0.075, 0.050 and 0.025, we applied the KMB and SKMB procedures in each sample to estimate the 100(1 − α)% quantiles of n^{1/2}|Ω̂_S − Ω_S|_∞ and n^{1/2}|D̂^{−1}(Ω̂_S − Ω_S)|_∞, respectively, with M = 3000 bootstrap replications for each procedure. Based on the benchmark distributions described above, we computed the associated empirical coverage for the estimated quantiles in each sample. We report the averages and standard deviations of the resulting 1000 empirical coverages in Tables 1-3, corresponding to p = 100, p = 500 and p = 1500, respectively.
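The coverage computation in this design can be sketched as follows (a simplified illustration with our own names): given the benchmark statistics from the 1000 samples and a quantile estimate q̂ from one sample, the empirical coverage is the fraction of benchmark statistics not exceeding q̂:

```python
import numpy as np

def empirical_coverage(benchmark_stats, q_hat):
    """Fraction of benchmark statistics (e.g. n^{1/2}|Omega_hat_S - Omega_S|_inf
    over the Monte Carlo samples) that fall at or below the estimated quantile."""
    return np.mean(np.asarray(benchmark_stats) <= q_hat)
```

Averaging this quantity over the 1000 per-sample quantile estimates gives the coverage rates reported in Tables 1-3.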
It is worth noting that, in order to carry out the statistical computing for large p efficiently under the R environment, we programmed the generation of random numbers and most loops as C functions and used the ".C()" routine to call them from R. However, the computation of the two types of statistics involves fitting the p node-wise regressions, so the simulation for large p still requires a large amount of computation time: for instance, out of 1000 simulations, a single simulation took more than 6 hours when p = 1500 and n = 300. To overcome this, the computation in this numerical study was undertaken with the assistance of the supercomputer Raijin at the NCI National Facility systems supported by the Australian Government. Raijin comprises 57,864 cores, which allowed us to run a large number of simulations in parallel.
From Tables 1-3, we observe that, for both the KMB and SKMB procedures, the overall differences between the empirical coverage rates and the corresponding nominal levels are small, which demonstrates that the KMB and SKMB procedures provide accurate approximations to the distributions of n^{1/2}|Ω̂_S − Ω_S|_∞ and n^{1/2}|D̂^{−1}(Ω̂_S − Ω_S)|_∞, respectively. The coverage rates also improve as n increases. Moreover, the results are robust to the temporal dependence parameter ρ, which indicates that the proposed procedures adapt well to time dependent observations.
Comparing the simulation results for KMB and SKMB in the category S = {(j_1, j_2) : j_1 = j_2} of Tables 1-3, when the dimension is smaller than the sample size (p = 100, n = 150, 300), the SKMB procedure is more accurate than the KMB procedure when heteroscedasticity is present. This also holds when the dimension exceeds the sample size and both are large (n = 300, p = 1500). For the homogeneous case S = {(j_1, j_2) : ω_{j_1,j_2} = 0}, the KMB procedure is more accurate than the SKMB procedure when the sample size is small (n = 150). However, when the sample size becomes larger (n = 300), the accuracy of the SKMB procedure improves significantly and it outperforms the KMB procedure. The fact that the SKMB procedure sometimes cannot beat the KMB procedure may be caused by the estimated standard deviations of the ω̂_{j_1,j_2}'s in the denominator of the Studentized statistic, which suffer from high variability when the sample size is small. The simulation results suggest that: (i) when the dimension is smaller than the sample size, or both the dimension and the sample size are very large, the SKMB procedure should be used to construct the confidence regions of Ω_S if heteroscedasticity is present; (ii) if the sample size is small and there is prior information that no heteroscedasticity is present, the KMB procedure should be used. Even in the homogeneous case, however, the SKMB procedure should still be employed when the sample size is large. In practice, if the dimension is smaller than the sample size, or both the dimension and the sample size are very large, we may select the SKMB procedure; otherwise, we may select the KMB procedure.

Real data analysis
In this section, we follow Example 3 in Section 2 to study the partial correlation networks of the Standard and Poor's (S&P) 500 Component Stocks in 2005 (252 trading days, preceding the crisis) and in 2008 (253 trading days, during the crisis), respectively. The reason for analyzing these two periods is to understand the structure and dynamics of financial networks affected by the global financial crisis (Schweitzer et al., 2009). Aït-Sahalia and Xiu (2015) also analyzed the 2005 and 2008 data to investigate the influence of the financial crisis.
The S&P 500 companies are 500 large-capitalization companies whose stocks are traded on American stock exchanges and cover about 75% of the American equity market by capitalization. We analyze data from http://quote.yahoo.com/ obtained via the R package tseries, which contains the daily closing prices of S&P 500 stocks; the R command get.hist.quote can be used to acquire the data. We keep the 402 stocks whose closing prices could be downloaded by this command and had no missing values during 2005 and 2008. Let y_{j,t} be the j-th stock price at day t. We consider the log return of each stock, defined by log(y_{j,t}) − log(y_{j,t−1}). Standardizing the log return of each stock by its mean and standard deviation, we obtain the standardized log returns R_t = (R_{1,t}, . . . , R_{402,t})^T of all 402 assets at day t. It is widely acknowledged that a stock's return is influenced by its past performance; thus, the S&P 500 data are time dependent.
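As a sketch of this preprocessing step (using pandas rather than R; the function and variable names are ours), the standardized log returns can be computed as:

```python
import numpy as np
import pandas as pd

def standardized_log_returns(prices):
    """prices: DataFrame of daily closing prices, one column per stock.
    Computes r_{j,t} = log(y_{j,t}) - log(y_{j,t-1}) and standardizes each
    column by its sample mean and standard deviation."""
    r = np.log(prices).diff().dropna()
    return (r - r.mean()) / r.std()
```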
Let Ω = (ω_{j_1,j_2})_{p×p} be the precision matrix of R_t. By the relationship between partial correlation and the precision matrix, the partial correlation network can be constructed from the non-zero precision coefficients ω_{j_1,j_2}, as demonstrated in Example 3 in Section 2. Under the Gaussian graphical model, a zero precision coefficient represents the conditional independence of two assets given all the other assets. To learn the structure of Ω, we focus on the Global Industry Classification Standard (GICS) sectors and their sub industries for the S&P 500 companies, and aim to discover the sub-blocks of Ω that are nonzero. These blocks help us build the partial correlation networks of the sectors and sub industries for the S&P 500 stocks in 2005 and 2008, respectively.
The advantage of investigating the complex financial network system via partial correlation is that it overcomes the issue that the marginal correlation between two stocks might be a result of their correlations with other mediating stocks (Kenett et al., 2010). For example, if two stocks R_{j_1,t} and R_{j_2,t} are both correlated with some stocks in the set R_{−(j_1,j_2),t} = {R_{j,t} : j ≠ j_1, j_2}, the partial correlation can suitably remove the linear effect of R_{−(j_1,j_2),t} on R_{j_1,t} and R_{j_2,t}; hence, it measures a "direct" relationship between j_1 and j_2 (De La Fuente et al., 2004). Partial correlation analysis is widely used in the study of financial networks (Shapira, Kenett and Ben-Jacob, 2009; Kenett et al., 2010), as well as gene networks (De La Fuente et al., 2004; Reverter and Chan, 2008; Chen and Zheng, 2009).
Based on information from Bloomberg and the "List of S&P 500 companies" on Wikipedia, we identify 10 major sectors with 54 sub industries among the S&P 500 companies (see Tables 4 and 5 for the detailed categories). The 10 sectors are Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Telecommunication Services and Utilities. One company has an unidentified sector and eight companies have unidentified sub industries due to acquisitions or ticker changes (represented by "NA" in Tables 4 and 5).
We construct the partial correlation networks based on the significant blocks from the above multiple testing procedure. The estimated networks are shown in Figures 1 and 2, corresponding to 2005 and 2008, respectively. The left panels of Figures 1 and 2 give the partial correlation networks of the sectors, where the nodes represent the 10 sectors, and two nodes (sectors) h_1 and h_2 are connected if and only if the block I_{h_1} × I_{h_2} of the precision matrix Ω is significantly non-zero. The right panels are the partial correlation networks of the 54 sub industries, labeled by the numbers 1 to 54; the corresponding names of the sub industries can be found in Tables 4 and 5. The shaded areas with different colors represent the 10 major sectors. This network provides more detailed connectivity information within and between sectors.

Figure 1: Partial correlation networks of S&P 500 sectors and sub industries in 2005 (preceding the crisis). The detailed information on the sub industries represented by numbers 1-54 in the right panel can be found in Tables 4 and 5.

Figure 2: Partial correlation networks of S&P 500 sectors and sub industries in 2008 (during the crisis). The detailed information on the sub industries represented by numbers 1-54 in the right panel can be found in Tables 4 and 5.
We observe from the left panel of Figure 1 that, preceding the crisis in 2005, the Industrials sector appears to be a hub connecting to 5 other sectors: Consumer Discretionary, Energy, Health Care, Utilities and Materials. It is the most influential sector with the largest degree, i.e., the total number of links connecting to the Industrials sector in the network. However, during the crisis in 2008, the most influential sector shifts to Consumer Discretionary, as shown by the left panel of Figure 2. The Financials sector is separated from all other sectors except Consumer Discretionary in 2008, in contrast with the network connectivity in 2005. The network during the crisis also has fewer edges. A similar pattern appears in the partial correlation networks of the S&P 500 sub industries shown in the right panels of Figures 1 and 2. More specifically, the numbers of edges both within and between sectors for the network of S&P 500 sub industries in 2008 are significantly smaller than those in 2005 (see Table 6 for details), which indicates that the market fear in the crisis broke the connections among stock sectors and sub industries. From the perspective of financial network studies, the above analysis confirms that fear froze the market in the 2008 crisis (Reavis, 2012).

Acknowledgments
The authors are grateful to the Co-Editor, an Associate Editor and two anonymous referees for constructive comments and suggestions. This research was undertaken with the assistance of resources provided at the NCI National Facility systems at the Australian National University through the ANU Allocation Scheme supported by the Australian Government. Jinyuan Chang was supported in part by the Fundamental Research Funds for the Central Universities.
By the Bonferroni inequality, we have the displayed bound. Letting x = A_1(n^{−1} log p)^{1/2}, we obtain the first conclusion. Following the same arguments, we can establish the other inequalities.
For a suitable choice of A_4, we have P(T^c) → 0 as n → ∞. Thus, from (21), the stated bound follows. On the other hand, by Condition 2, Lemma 1, (22) and (23), the remaining bound follows. Hence, we complete the proof.
Lemma 3. Assume the conditions of Lemmas 1 and 2 hold. Then the stated expansion holds, where the remainder term o_p{(n log p)^{−1/2}} is uniform in j_1 and j_2.
Proof: Notice that ε_{j,t} = −α_j^T y_t and ε̂_{j,t} = −α̂_j^T y_t for any t. Then Condition 2 and Lemmas 1 and 2 imply the first bound. Meanwhile, by Lemma 1, we have max_{1≤j≤p} max_{k≠j} |n^{−1} Σ_{t=1}^n ε_{j,t} y_{k,t}| = O_p{(n^{−1} log p)^{1/2}}, which implies the next expansion, with the remainder term uniform in j_1 and j_2. On the other hand, n^{−1} Σ_{t=1}^n y_{j,t} ε_{j,t} = n^{−1} Σ_{t=1}^n ε_{j,t}^2 + n^{−1} Σ_{t=1}^n α_{j,−j}^T y_{−j,t} ε_{j,t}. By Lemma 1, it follows that n^{−1} Σ_{t=1}^n y_{j,t} ε_{j,t} = n^{−1} Σ_{t=1}^n ε_{j,t}^2 + O_p{(n^{−1} log p)^{1/2}}, with the remainder term uniform in j. Together with (24), we obtain the stated expansion, with the remainder term again uniform in j_1 and j_2. Hence, we complete the proof.
Analogously, to show (29), it suffices to show sup_{z∈ℝ} |P(max_{1≤j≤r} u_j > z) − P(max_{1≤j≤r} ξ_j > z)| → 0. Write W̃ for the covariance of u, and recall that W denotes the covariance of ξ. Lemma 3.1 of Chernozhukov, Chetverikov and Kato (2013) reduces this to bounding |W̃ − W|_∞, whose convergence rate we now specify. Notice that, for any 1 ≤ j_1, j_2 ≤ r, the displayed decomposition holds; with the triangle inequality, it yields the stated bound. For each ℓ = 1, . . . , m, the stated identities hold. Together with the triangle inequality, the Davydov inequality and the Cauchy-Schwarz inequality, we obtain the displayed bound. By the proof of Lemma 2 in Chang, Chen and Chen (2015), we can prove the claimed rate. Specifically, setting I_{m+1} = ∅, the Cauchy-Schwarz and Davydov inequalities give the displayed bound, and the other terms in (36) can be bounded similarly. Therefore, (35) holds, which implies the stated rate for |W̃ − W|_∞. For b, h and D_n specified above, (34) implies sup_{z∈ℝ} |P(max_{1≤j≤r} u_j > z) − P(max_{1≤j≤r} ξ_j > z)| → 0, which establishes (29). Hence, we complete the proof of Theorem 1.
Proof of Theorem 2: Similar to the proof of (29), it suffices to prove |Ŵ − W|_∞ = o_p(1). By Lemmas 1 and 3, we have max_{1≤j≤p} |v̂_{j,j} − v_{j,j}| = O_p{(n^{−1} log p)^{1/2}}. Since the v_{j,j}'s are uniformly bounded away from zero, the v̂_{j,j}^{−1}'s are uniformly bounded away from infinity with probability approaching one. We now specify the convergence rates of |Ξ̂ − Ξ̃|_∞ and |Ξ̃ − Ξ|_∞, respectively. For any k ≥ 0, the displayed identity holds. We will show that the |·|_∞-norms of the last three terms on its right-hand side are O_p{sS_n(n^{−1} log p)^{1/2}}; we only show this rate for one of them, since the proofs for the other two are similar. For any j and t, the stated expansion holds, where the term O_p{(n^{−1} log p)^{1/2}} is uniform in j and t. Then the (j_1, j_2)-th component admits the stated bound, where the term O_p{S_n(n^{−1} log p)^{1/2}} is uniform in j_1 and j_2. Following the same arguments, we obtain the analogous bound uniformly over 1 ≤ j_1, j_2 ≤ p. Therefore, the (j_1, j_2)-th component of the remaining sum satisfies the stated bound, where the last identity is based on (23). Therefore, from (48) and Lemma 4, we obtain the stated rate; analogously, the same result holds for |Σ_{k=−n+1}^{−1} K(k/S_n)(Γ̂_k − Γ_k)|_∞. Therefore, |Ξ̂ − Ξ̃|_∞ = O_p[{log(pn)}^{4/γ_2} n^{−f(α_0)/2}] + O_p{sS_n(n^{−1} log p)^{1/2}}. Repeating the proof of Proposition 1(b) in Andrews (1991), the convergence there is uniform over the components of Ξ̃ − Ξ; thus |Ξ̃ − Ξ|_∞ = o(1), and hence |Ξ̂ − Ξ|_∞ = o_p(1). Similar to (34), we complete the proof.