A fast and consistent variable selection method for high-dimensional multivariate linear regression with a large number of explanatory variables

Abstract: We put forward a variable selection method for selecting explanatory variables in a normality-assumed multivariate linear regression. It is cumbersome to calculate variable selection criteria for all subsets of explanatory variables when the number of explanatory variables is large. Therefore, we propose a fast and consistent variable selection method based on a generalized $C_p$ criterion. The consistency of the method is established under a high-dimensional asymptotic framework in which the sample size tends to infinity while the sum of the dimension of the response vector and the number of explanatory variables, divided by the sample size, tends to a constant less than one. Through numerical simulations, it is shown that the proposed method has a high probability of selecting the true subset of explanatory variables and is fast under a moderate sample size even when the number of dimensions is large.


Introduction
Multivariate linear regression is a widely known method of inferential analysis. It features in many theoretical and applied textbooks (see, e.g., [21, chap. 9], [24, chap. 4]) and is used by researchers in many fields. Let $Y$ be an $n \times p$ observation matrix of $p$ response variables and $X$ an $n \times k$ observation matrix of $k$ non-stochastic explanatory variables, where $n$ is the sample size, and $p$ and $k$ are the numbers of response variables and explanatory variables, respectively. Let $N = n - p - k + 1$; in proposing our method we assume that $\mathrm{rank}(X) = k < n$ and that $(n, p, k)$ satisfies $N - 4 > 0$.
In actual empirical contexts, it is important to identify the factors affecting response variables. In multivariate linear regression, this is regarded as the problem of selecting a subset of explanatory variables. Suppose that $j$ denotes a subset of the full set $\omega = \{1, \ldots, k\}$ containing $k_j$ elements, and $X_j$ denotes the $n \times k_j$ matrix consisting of the columns of $X$ indexed by the elements of $j$, where $k_A$ denotes the number of elements in a set $A$, i.e., $k_A = \#(A)$. Thus, $j$ indexes a candidate subset of explanatory variables; for example, if $j = \{1, 2, 4\}$, then $X_j$ consists of the first, second and fourth column vectors of $X$. Using the notation $j$, the candidate model with $k_j$ explanatory variables is expressed as follows:

$$Y \sim N_{n \times p}(X_j \Theta_j, \Sigma_j \otimes I_n), \tag{1.1}$$

where $\Theta_j$ is a $k_j \times p$ unknown matrix of regression coefficients and $\Sigma_j$ is a $p \times p$ unknown covariance matrix. In particular, the total number of explanatory variables $k_\omega$ and the explanatory matrix $X_\omega$ in the full model $\omega$ are $k$ and $X$, respectively. Herein, we assume that the data are generated from the following true model with $k_{j_*}$ explanatory variables:

$$Y \sim N_{n \times p}(X_{j_*} \Theta_*, \Sigma_* \otimes I_n), \tag{1.2}$$

where $\Theta_*$ is a $k_{j_*} \times p$ true unknown matrix of regression coefficients and $\Sigma_*$ is a $p \times p$ true unknown covariance matrix assumed to be positive definite. Without loss of generality, we sort the column vectors of $X$ as $X = (X_{j_*}, X_{j_*^c})$, where $A^c$ denotes the complement of a set $A$. For expository purposes, we write $k_*$ and $X_*$ for $k_{j_*}$ and $X_{j_*}$, respectively.
To systematize and optimize the configuration of models, variable selection criteria have been widely used. The $C_p$ criterion was proposed by [13, 14]. In this paper, we focus on a generalized variable selection criterion based on the $C_p$ criterion, termed the generalized $C_p$ ($\mathrm{GC}_p$) criterion. The $\mathrm{GC}_p$ criterion for a linear regression with a single response was proposed by [1], and the counterpart for a multivariate linear regression with multiple responses was proposed by [15]. The $\mathrm{GC}_p$ criterion can express a wide variety of variable selection criteria, e.g., the $C_p$ criterion for multivariate contexts proposed by [20] and the modified $C_p$ ($\mathrm{MC}_p$) criterion proposed by [3].
The best subset chosen by a variable selection criterion is usually defined as the subset of explanatory variables that minimizes the value of that criterion among all candidate subsets. The basic approach to identifying the best subset involves searching over all candidate subsets. We call this method the "full search method". Under the full search method, variable selection criteria for $2^k - 1$ subsets need to be calculated. Recently, increasing attention has been paid to statistical methods for high-dimensional data, in which the dimension of the response vectors $p$ or the number of explanatory variables $k$ is large. However, in high-dimensional contexts, particularly where $k$ is large, it may be impossible to apply the full search method because the total number of subsets of explanatory variables increases exponentially with $k$. For example, if $k = 40$ and the time taken to calculate a variable selection criterion for one subset is 0.01 seconds, then the time required to implement the full search method will be $(2^{40} - 1) \times 0.01$ seconds, i.e., about 350 years. Thus, for practical reasons, we need another search method when $k$ is large. A practicable selection method for large $k$ was proposed by [17, 31]. This method is based on the behavior of variable selection criteria for the subsets obtained by removing one variable from the full set $\omega$. In that selection method, the best subset $\hat{j}$ is determined as follows: for each explanatory variable, if the criterion for the subset obtained by removing that variable from $\omega$ is greater than the criterion for the full set $\omega$, then the removed variable is regarded as an element of the best subset. Since this method needs to calculate variable selection criteria for only those $k$ subsets and $\omega$, we expect it to be faster than the full search method and practical for high-dimensional contexts. We call this method the "ZKB selection method" and consider it using a class of the $\mathrm{GC}_p$ criterion, where "ZKB" is formed from the initial letters of the authors of [31].
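To make the comparison concrete, here is a small Python sketch of the arithmetic above (the 0.01-second figure is the illustrative value used in the text, not a measurement):

```python
# Cost of the full search versus the ZKB selection method for k = 40,
# assuming 0.01 seconds per criterion evaluation (illustrative value only).
k = 40
per_eval = 0.01  # seconds per subset

full_search_sec = (2**k - 1) * per_eval   # all 2^k - 1 nonempty subsets
zkb_sec = (k + 1) * per_eval              # k leave-one-out subsets plus omega

print(f"full search: {full_search_sec / (365.25 * 24 * 3600):,.0f} years")  # ~348 years
print(f"ZKB search:  {zkb_sec:.2f} seconds")                                # 0.41 seconds
```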
An important property of a variable selection criterion is its consistency. Consistency is achieved when the probability of selecting the true subset $j_*$ converges to 1, i.e., $P(\hat{j} = j_*) \to 1$. Since we do not know the true subset $j_*$, we often hope to identify $j_*$ by variable selection, and we should therefore use a variable selection criterion that maximizes the probability of selecting the true subset. A consistent variable selection criterion can be expected to have a high probability of selecting the true subset $j_*$ because, in general, the probability of selecting the true subset is approximated by the asymptotic probability. To this end, let LS, LR, LE and LTE denote the large-sample (LS), large-response-vector (LR), large-explanatory-vector (LE) and large-true-explanatory-vector (LTE) asymptotic frameworks, in which only $n$, $p$, $k$ and $k_*$ tend to infinity, respectively; that is, LS: $n \to \infty$, LR: $p \to \infty$, LE: $k \to \infty$ and LTE: $k_* \to \infty$. LS was used by [16, 17, 19, 31] under the ZKB selection method. However, LS is not appropriate for high-dimensional data because approximation accuracy under LS deteriorates as $p$ or $k$ becomes large. Hence, the criteria used by [16, 17, 19, 31] may not be consistent under the ZKB selection method when $p$ or $k$ tends to infinity. In the context of the consistency of variable selection criteria under the full search method, the following asymptotic frameworks with $(p + k)/n \to c \in [0, 1)$ were used: [4]: LS and LR; [27]: LS or (LS and LR); [28]: (LS and LR) or (LS and LR and LE) or (LS and LR and LTE), with $k/n \to 0$.
As described above, [4, 27, 28] used asymptotic frameworks in which not only $n$ but also $p$, $k$ or $k_*$ tends to $\infty$, so the probabilities of selecting the true subset will be high for high-dimensional data suited to those frameworks. However, the probabilities may become low for high-dimensional data not suited to them. Moreover, it is hard to judge whether $p$, $k$ and $k_*$ are large or not, and so we do not know which asymptotic framework suits given data. Hence, to ensure consistency, it is desirable to use an asymptotic framework that works regardless of the sizes of $p$, $k$ and $k_*$.
In this paper, we consider the consistency of the $\mathrm{GC}_p$ criterion under the ZKB selection method and propose a new ZKB selection method that is consistent even in high-dimensional contexts. Moreover, we also propose a selection method that can perform group selection. To achieve this, we use the following high-dimensional (HD) asymptotic framework:

$$\mathrm{HD}:\ n \to \infty,\ (p + k)/n \to c \in [0, 1).$$

Importantly, the HD asymptotic framework can be rewritten as HD: LS or (LS and LR) or (LS and LE) or (LS and LTE) or (LS and LR and LE) or (LS and LR and LTE) as $(p + k)/n \to c \in [0, 1)$.
This means that $n$ always tends to infinity, while $p$, $k$ and $k_*$ may tend to infinity subject to $(p + k)/n \to c \in [0, 1)$. Hence, our proposed method is expected to have a high probability of selecting the true subset when $n$ is large, regardless of the sizes of $p$, $k$ and $k_*$. Moreover, even when $k$ is large under $N - 4 > 0$, our proposed method remains very fast, whereas the full search methods used in, e.g., [4, 27, 28] are computationally infeasible. In recent years, regularization methods have often been used for estimating regression coefficients. The lasso is a famous method that estimates the regression coefficients and selects explanatory variables simultaneously. In multivariate linear regression, explanatory variables can be selected by the group lasso proposed by [29], and several papers (e.g., [11, 18, 26, 30]) proposed regularization methods extending the group lasso to the multivariate linear regression case. Moreover, a generalized adaptive elastic-net was proposed and its consistency properties were obtained by [26]. Those consistency properties were provided under asymptotic frameworks such that $\log k / \log n \to \nu \in [0, 1)$ or $\log k = o(n^{1 - 2\kappa})$ for some $\kappa \in (0, 1/2)$. However, the properties are not ensured as $p$ tends to infinity. Our method is consistent even when $p$ tends to infinity, as long as $N \to \infty$. Further, our method is faster than an adaptive group lasso even when $p$ or $k$ is large. The remainder of the paper is organized as follows. In Section 2, we present the notation and assumptions needed to ensure consistency of our method. In Section 3, we put forward the proposed method, establish its consistency, and present a fast algorithm; we also propose an extended ZKB selection method. In Section 4, we conduct numerical experiments for verification purposes. Technical details are relegated to the Appendix.

Preliminaries
First, we present the $\mathrm{GC}_p$ criterion. Let $S_j$ be the unbiased estimator of $\Sigma_j$ in model (1.1), which is defined by

$$S_j = \frac{1}{n - k_j} Y'(I_n - P_j)Y,$$

where $P_j$ is the projection matrix onto the subspace spanned by the columns of $X_j$, i.e., $P_j = X_j(X_j'X_j)^{-1}X_j'$. Then, the $\mathrm{GC}_p$ criterion in model (1.1) is defined by

$$\mathrm{GC}_p(j) = \mathrm{tr}\{Y'(I_n - P_j)Y S_\omega^{-1}\} + \alpha p k_j, \tag{2.1}$$

where $\alpha$ is a positive constant. The first and second terms in (2.1) express, respectively, the residual sum of squares weighted by the matrix $S_\omega^{-1}$ and $\alpha$ times the number of elements of $\Theta_j$ in model (1.1), the latter being the strength of the penalty. Next, we present notation and assumptions to ensure consistency of our method. For a subset $j \subset \omega$, let a $p \times p$ non-centrality matrix and a non-centrality parameter be denoted by

$$\Delta_j = \Sigma_*^{-1/2} \Theta_*' X_*' (P_\omega - P_{\omega_j}) X_* \Theta_* \Sigma_*^{-1/2}, \tag{2.2}$$

$$\delta_j = \mathrm{tr}(\Delta_j), \tag{2.3}$$

where $\omega_j = j^c$ and $j^c$ denotes $\omega \setminus j$. It should be emphasized that $\Delta_j = O_{p,p}$ and $\delta_j = 0$ hold if and only if $j \subset j_*^c$, where $O_{p,p}$ is a $p \times p$ matrix of zeros. To ensure the consistency of our method, the following three assumptions are prepared:

Assumption A1. The true subset $j_*$ is included in the full set $\omega$, i.e., $j_* \subset \omega$.

Assumption A2. There exists a positive constant $c_1$ such that $\min_{\ell \in j_*} n^{-1} x_{\{\ell\}}'(I_n - P_{\omega_{\{\ell\}}}) x_{\{\ell\}} \geq c_1$, where $x_{\{\ell\}}$ is the $\ell$-th column vector of $X$.

Assumption A3. There exists a constant $c_A \in (0, 1]$ such that $n^{1 - c_A} \min_{\ell \in j_*} \theta_{\{\ell\}}' \Sigma_*^{-1} \theta_{\{\ell\}} \to \infty$, where $\theta_{\{\ell\}}'$ is the row vector of $\Theta_*$ corresponding to the $\ell$-th explanatory variable.
Assumption A1 is needed for consistency because the probability of selecting the true subset becomes 0 if it does not hold. Assumption A2 means that the minimum value among the sample variances of the residuals resulting from the linear regression of $x_{\{\ell\}}$ on the remaining $X_{\omega_{\{\ell\}}}$, over $\ell \in j_*$, is always positive and does not converge to 0. One often sees an assumption on the explanatory variables of the form $n^{-1}\lambda_{\min}(X'X) \geq c_1$, where $\lambda_{\min}(A)$ is the minimum eigenvalue of a square matrix $A$. Assumption A2 is weaker than this assumption because the inequality $\min_{\ell \in j_*} n^{-1} x_{\{\ell\}}'(I_n - P_{\omega_{\{\ell\}}}) x_{\{\ell\}} \geq n^{-1}\lambda_{\min}(X'X)$ holds. Assumption A3 is a weak assumption on the true regression coefficients and the true covariance matrix. If $c_A < 1$, Assumption A3 allows $\min_{\ell \in j_*} \theta_{\{\ell\}}' \Sigma_*^{-1} \theta_{\{\ell\}}$ to converge to 0. Moreover, for all $\ell = 1, \ldots, k_*$, the following inequality holds (the proof is given in Appendix A.1):

$$\theta_{\{\ell\}}' \Sigma_*^{-1} \theta_{\{\ell\}} \geq \max_{a = 1, \ldots, p} \frac{\theta_{*\ell a}^2}{\sigma_{*a}^2}, \tag{2.4}$$

where $\theta_{*\ell a}$ is the $(\ell, a)$-th element of $\Theta_*$ and $\sigma_{*a}^2$ is the $a$-th diagonal element of $\Sigma_*$. From (2.4), Assumption A3 can be rewritten as an assumption that does not rely on the correlations of the response variables, by replacing $\theta_{\{\ell\}}' \Sigma_*^{-1} \theta_{\{\ell\}}$ with the right-hand side of (2.4). If Assumptions A1-A3 hold, the following holds (the proof is given in Appendix A.1):

$$n^{-c_A} \delta_{\min} \to \infty, \tag{2.5}$$

where $\delta_{\min} = \min_{\ell \in j_*} \delta_{\{\ell\}}$. Condition (2.5) restricts the divergence order of the non-centrality parameter $\delta_{\{\ell\}}$. If $k$ is fixed and $c_A = 1$, (2.5) is as per what was put forward in [27]. Finally, we identify an upper bound for the rank of the non-centrality matrix $\Delta_j$, which is used to ensure consistency. For a subset $j \subset \omega$ ($j \neq \omega$), let $m_j$ and $d_j$ be the number of elements of $j$ and the rank of $\Delta_j$, i.e.,

$$m_j = \#(j), \qquad d_j = \mathrm{rank}(\Delta_j). \tag{2.6}$$

In accordance with [28], it follows from Assumption A1 that the rank of $X_*'(P_\omega - P_{\omega_j})X_*$ is $\#(j \cap j_*)$. Since $\#(j \cap j_*) \leq m_j$ and $m_j \leq k_*$ holds when $j \subset j_*$, the following inequality can be derived:

$$d_j \leq \min\{m_j, k_*\}. \tag{2.7}$$

Proposed selection method
We define a class of the $\mathrm{GC}_p$ criterion, termed the high-dimensionality-adjusted consistent generalized $C_p$ ($\mathrm{HCGC}_p$) criterion:

Definition 3.1. The $\mathrm{HCGC}_p$ criterion is the $\mathrm{GC}_p$ criterion (2.1) whose constant $\alpha$ satisfies the conditions stated in (3.1), which are formulated in terms of a natural number $r$ with $N - 4r > 0$.

We now introduce the ZKB selection method using a variable selection criterion (SC). The best subset chosen by the ZKB selection method using an SC is written as

$$\hat{j}_{\mathrm{SC}} = \{\ell \in \omega \mid \mathrm{SC}(\omega_{\{\ell\}}) > \mathrm{SC}(\omega)\}.$$

The ZKB selection method is based on the idea that the value of the SC for the subset obtained by removing a true variable from $\omega$ will asymptotically be greater than that for $\omega$. We define the following best subset chosen by the ZKB selection method using the $\mathrm{HCGC}_p$ criterion:

Definition 3.2. The best subset chosen by the ZKB selection method using the $\mathrm{HCGC}_p$ criterion is defined by

$$\hat{j} = \{\ell \in \omega \mid \mathrm{HCGC}_p(\omega_{\{\ell\}}) > \mathrm{HCGC}_p(\omega)\}. \tag{3.2}$$
Next, to use this method in actual empirical contexts we have to decide the value of $\alpha$, because the $\mathrm{HCGC}_p$ criterion is a class of criteria. Hence, we adopt the value $\hat{\alpha}$ given in (3.3), which is based on [27]. It is straightforward to observe that the corresponding $\hat{\beta}$ satisfies the conditions in (3.1), so that with $\hat{\beta}$ the criterion belongs to the class of the $\mathrm{HCGC}_p$ criterion; the constant term plays a role in stabilizing the behavior of the criterion. Since the ZKB selection method using the $\mathrm{GC}_p$ criterion only necessitates calculating the differences $\mathrm{GC}_p(\omega_{\{\ell\}}) - \mathrm{GC}_p(\omega)$ for $\ell = 1, \ldots, k$, the calculation time associated with this method can be expected to be shorter than that for the full search method. Note, however, that computing $\mathrm{GC}_p(\omega_{\{\ell\}})$ naively requires $(X_{\omega_{\{\ell\}}}'X_{\omega_{\{\ell\}}})^{-1}$, and the calculation time of an inverse matrix scales with the cube of the size of the matrix. Hence, it is not advisable to calculate $(X_{\omega_{\{\ell\}}}'X_{\omega_{\{\ell\}}})^{-1}$ for each $\ell$ when $k$ is large. To overcome this problem, we offer an efficient calculation of $\mathrm{GC}_p(\omega_{\{\ell\}}) - \mathrm{GC}_p(\omega)$. Let $r_{\ell\ell}$ and $z_\ell$ be the $(\ell, \ell)$-th element of $(X'X)^{-1}$ and the $\ell$-th column vector of $X(X'X)^{-1}$, respectively. Then, using $r_{\ell\ell}$ and $z_\ell$, we can express $P_\omega - P_{\omega_{\{\ell\}}}$ as follows (the proof of (3.4) is given in Appendix A.2):

$$P_\omega - P_{\omega_{\{\ell\}}} = \frac{1}{r_{\ell\ell}} z_\ell z_\ell'. \tag{3.4}$$

From (3.4), the difference can be calculated as

$$\mathrm{GC}_p(\omega_{\{\ell\}}) - \mathrm{GC}_p(\omega) = \frac{1}{r_{\ell\ell}} z_\ell' Y S_\omega^{-1} Y' z_\ell - \alpha p. \tag{3.5}$$

Moreover, the calculation cost of each product $Y'z_\ell$ depends on $n$. Hence, we also present an efficient calculation of $z_\ell' Y S_\omega^{-1} Y' z_\ell$ when $p$ is small. Let $t_\ell$ be the $\ell$-th column vector of $S_\omega^{-1/2} Y' X(X'X)^{-1}$. Then, the following equation can be derived:

$$z_\ell' Y S_\omega^{-1} Y' z_\ell = t_\ell' t_\ell. \tag{3.6}$$

Since $t_\ell$ is a $p$-dimensional vector, the calculation cost of $t_\ell' t_\ell$ does not depend on $n$. Therefore, we propose to use (3.5) (and also (3.6) when $p$ is small) to perform the ZKB selection method using the $\mathrm{GC}_p$ criterion. Note that the proposed selection method is calculable when $N - 4 > 0$. When $k > n$, we can formally combine the proposed selection method with the screening methods of [8, 10, 12], which can be applied to screening explanatory variables in a multivariate linear regression with multiple responses. However, we should use these methods with care because their screening properties are ensured when $p$ or $k_*$ is fixed, whereas the consistency of the proposed selection method is ensured even when $p$ and $k_*$ may diverge. On the other hand, a multivariate linear regression can be regarded as a perfunctory linear regression on a single response with the explanatory matrix $(I_p \otimes X)$. However, notice that in general we cannot directly apply screening methods for a linear regression with a single response (e.g., [2]) to our variable selection problem. This is because our selection problem can be regarded as a group selection problem for explanatory variables corresponding to the $p$-dimensional regression coefficient vectors.
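As an illustration, the following Python sketch implements the ZKB selection rule with the $\mathrm{GC}_p$ criterion via (3.4)-(3.6). It is our own minimal sketch, not the authors' MATLAB code: the function name is ours, a Cholesky factor is used as one valid choice of $S_\omega^{-1/2}$, and $\alpha$ is passed in explicitly because the definition of $\hat{\alpha}$ in (3.3) is not reproduced here.

```python
import numpy as np

def zkb_select(Y, X, alpha):
    """ZKB selection with the GC_p criterion: keep variable l iff
    GC_p(omega_{l}) - GC_p(omega) = z_l' Y S_w^{-1} Y' z_l / r_ll - alpha * p > 0,
    which is (3.5); the quadratic form is computed via (3.6)."""
    n, p = Y.shape
    k = X.shape[1]
    G = np.linalg.inv(X.T @ X)              # (X'X)^{-1}; its diagonal gives r_ll
    Z = X @ G                               # columns z_1, ..., z_k of X(X'X)^{-1}
    E = Y - X @ (G @ (X.T @ Y))             # full-model residuals (I_n - P_w) Y
    S = E.T @ E / (n - k)                   # S_w, nonsingular when N - 4 > 0 holds
    C = np.linalg.cholesky(S)               # one choice of square root of S_w
    T = np.linalg.solve(C, Y.T @ Z)         # p x k; the l-th column is t_l as in (3.6)
    diffs = np.sum(T**2, axis=0) / np.diag(G) - alpha * p   # (3.5) for l = 1, ..., k
    return np.flatnonzero(diffs > 0)        # l with GC_p(omega_{l}) > GC_p(omega)
```

Everything that depends on $n$ ($(X'X)^{-1}$, the residuals and $T$) is computed once, so all $k$ differences cost little more than a single full-model fit; this is precisely the saving that (3.5) and (3.6) deliver.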

Consistency of proposed selection method
We now establish the consistency of the ZKB selection method using the $\mathrm{HCGC}_p$ criterion (3.2). To do so, we present a lemma giving sufficient conditions for consistency (the proof is given in Appendix A.3). Importantly, Lemma 3.1 does not rely on a specific asymptotic framework; any such framework could be applied here.
Lemma 3.1. Suppose that Assumption A1 and the following equations hold:

$$\sum_{\ell \in j_*^c} P\left(\mathrm{SC}(\omega_{\{\ell\}}) > \mathrm{SC}(\omega)\right) \to 0, \tag{3.7}$$

$$\sum_{\ell \in j_*} P\left(\mathrm{SC}(\omega_{\{\ell\}}) \leq \mathrm{SC}(\omega)\right) \to 0. \tag{3.8}$$

Then, the ZKB selection method using the SC is consistent.

By showing that the sufficient conditions (3.7) and (3.8) in Lemma 3.1 hold, the consistency of the ZKB selection method using the $\mathrm{HCGC}_p$ criterion (3.2) can be obtained as follows (the proof is given in Appendix A.4):

Theorem 3.1. Suppose that Assumptions A1-A3 hold. Then, the ZKB selection method using the $\mathrm{HCGC}_p$ criterion (3.2) is consistent as $n \to \infty$, $(p + k)/n \to c \in [0, 1)$.

From Theorem 3.1, the ZKB selection method using the $\mathrm{HCGC}_p$ criterion with $\alpha = \hat{\alpha}$ given by (3.3) is also consistent under Assumptions A1, A2 and Assumption A3 with $3/4 < c_A \leq 1$.

Extension of the ZKB selection method
In the previous subsections, we proposed the ZKB selection method using the $\mathrm{HCGC}_p$ criterion (3.2). However, when the full model $\omega$ includes explanatory variables such as multinomial dummy variables, it is not appropriate to use the ZKB selection method, because whether such explanatory variables should be chosen must be decided simultaneously. To overcome this problem, we extend the ZKB selection method. Let $J$ be a family of sets of explanatory variables denoted by $J = \{j_1, \ldots, j_q\}$, where $q$ is the number of these sets. Since we suppose dummy or non-dummy variables as explanatory variables, we assume that $m_{j_a}$ is finite, that each $j_a$ satisfies $j_a \subset j_*$ or $j_a \subset j_*^c$, and that $j_a \cap j_b = \emptyset$ for $a \neq b$. For example, singleton sets express ordinary variables, while sets such as $\{5\}$ and $\{6, 7\}$ express the subsets of dummy variables for binomial and trinomial variables, respectively. Using this notation, we consider the following best subset chosen by the extended ZKB (EZKB) selection method using an SC:

$$\hat{j}_J = \bigcup_{j \in J:\ \mathrm{SC}(\omega_j) > \mathrm{SC}(\omega)} j.$$

We observe that the EZKB selection method is equivalent to the ZKB selection method (3.2) when $m_j = 1$ ($\forall j \in J$) or $q = k$. The EZKB selection method can accommodate the selection of grouped explanatory variables. We define the following best subset chosen by the EZKB selection method using the $\mathrm{HCGC}_p$ criterion:

Definition 3.3. The best subset chosen by the EZKB selection method using the $\mathrm{HCGC}_p$ criterion is defined by

$$\hat{j}_J = \bigcup_{j \in J:\ \mathrm{HCGC}_p(\omega_j) > \mathrm{HCGC}_p(\omega)} j. \tag{3.9}$$

Next, we ensure the consistency of the EZKB selection method using the $\mathrm{HCGC}_p$ criterion (3.9). Let $J_+ = \{j \in J \mid j \subset j_*\}$ and $J_- = \{j \in J \mid j \subset j_*^c\}$. Then, as with Lemma 3.1, we present the following lemma about sufficient conditions for consistency (the proof is given in Appendix A.5).

Lemma 3.2. Suppose that Assumption A1 and the following equations hold:

$$\sum_{j \in J_-} P\left(\mathrm{HCGC}_p(\omega_j) > \mathrm{HCGC}_p(\omega)\right) \to 0,$$

$$\sum_{j \in J_+} P\left(\mathrm{HCGC}_p(\omega_j) \leq \mathrm{HCGC}_p(\omega)\right) \to 0.$$
Then, the EZKB selection method using the $\mathrm{HCGC}_p$ criterion (3.9) is consistent.
Using Lemma 3.2, the consistency of the EZKB selection method using the $\mathrm{HCGC}_p$ criterion (3.9) can be obtained as follows (the proof is given in Appendix A.6):

Theorem 3.2. Suppose that Assumptions A1-A3 hold. Then, the EZKB selection method using the $\mathrm{HCGC}_p$ criterion (3.9) is consistent as $n \to \infty$, $(p + k)/n \to c \in [0, 1)$.

From Theorem 3.2, we observe that the EZKB selection method using the $\mathrm{HCGC}_p$ criterion is consistent, as is the ZKB selection method (3.2). Hence, as an example of a consistent EZKB selection method under $3/4 < c_A \leq 1$ in Assumption A3, we can use the method with the $\mathrm{HCGC}_p$ criterion with $\alpha = \hat{\alpha}$ in (3.3).
Finally, we provide an efficient calculation of $\mathrm{GC}_p(\omega_j) - \mathrm{GC}_p(\omega)$. Let $R_j$ and $Z_j$ be the $m_j \times m_j$ and $n \times m_j$ matrices consisting of the rows and columns of $(X'X)^{-1}$ and the column vectors of $X(X'X)^{-1}$ indexed by the elements of $j$, respectively. For example, if $j = \{2, 5\}$, then $R_j$ and $Z_j$ are expressed as

$$R_j = \begin{pmatrix} \tilde{x}_{22} & \tilde{x}_{25} \\ \tilde{x}_{52} & \tilde{x}_{55} \end{pmatrix}, \qquad Z_j = (\tilde{z}_2, \tilde{z}_5),$$

where $\tilde{x}_{ab}$ is the $(a, b)$-th element of $(X'X)^{-1}$ and $\tilde{z}_a$ is the $a$-th column vector of $X(X'X)^{-1}$. Then, using $R_j$ and $Z_j$, $\mathrm{GC}_p(\omega_j) - \mathrm{GC}_p(\omega)$ can be expressed as

$$\mathrm{GC}_p(\omega_j) - \mathrm{GC}_p(\omega) = \mathrm{tr}\left(R_j^{-1} Z_j' Y S_\omega^{-1} Y' Z_j\right) - \alpha p m_j. \tag{3.10}$$

The proof of the above equation is omitted because it essentially mimics that of (3.4).
Although (3.10) requires the calculation of the inverse matrix $R_j^{-1}$, this is not computationally onerous because the size $m_j$ is bounded.
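A sketch of the EZKB computation via (3.10), in the same hedged spirit as before (the function and variable names are ours, and `groups` plays the role of the family $J$):

```python
import numpy as np

def ezkb_select(Y, X, groups, alpha):
    """EZKB selection (3.9): group j is kept iff
    GC_p(omega_j) - GC_p(omega) = tr{R_j^{-1} Z_j' Y S_w^{-1} Y' Z_j} - alpha*p*m_j > 0."""
    n, p = Y.shape
    k = X.shape[1]
    G = np.linalg.inv(X.T @ X)                 # (X'X)^{-1}
    Z = X @ G                                  # X(X'X)^{-1}
    E = Y - X @ (G @ (X.T @ Y))                # (I_n - P_w) Y
    S_inv = np.linalg.inv(E.T @ E / (n - k))   # S_w^{-1}
    M = Z.T @ Y @ S_inv @ Y.T @ Z              # k x k matrix of quadratic forms
    selected = []
    for j in groups:                           # each j is a list of column indices
        idx = np.ix_(j, j)
        diff = np.trace(np.linalg.solve(G[idx], M[idx])) - alpha * p * len(j)
        if diff > 0:                           # GC_p(omega_j) > GC_p(omega)
            selected.extend(j)
    return np.array(sorted(selected))
```

The $k \times k$ matrix $M$ is computed once; each group then costs only an $m_j \times m_j$ solve, which matches the remark above that inverting $R_j$ is cheap.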

Numerical studies
We present numerical results to explore the validity of our claims, based on Monte Carlo simulations with 1,000 iterations executed in MATLAB 9.3.0 on a Panasonic CF-SV7UFKVS with an Intel(R) Core(TM) i7-8650U CPU @ 1.90 GHz and 16 GB of RAM. The probabilities of selecting the true subset and the CPU times are presented for the ZKB selection methods using the $\mathrm{HCGC}_p$ criterion with $\alpha = \hat{\alpha}$ given in (3.3) and the three $\mathrm{GC}_p$ criteria with $\alpha = 2$, $2\log\log n$ and $\log n$ (named $\mathrm{GC}_p^1$, $\mathrm{GC}_p^2$ and $\mathrm{GC}_p^3$). The calculations were performed using (3.5) (and (3.6) if $p < 100$ and $k \geq p$). The explanatory matrix $X$, the true coefficient matrix $\Theta_*$ and the true covariance matrix $\Sigma_*$ were constructed from the $k \times k$ autoregressive matrix $\Psi$ with correlation $\psi$, i.e., $(\Psi)_{ab} = \psi^{|a-b|}$, the $p$-dimensional vector of ones $1_p$, and constants $\xi_1$ and $\xi_2$; we set $\psi = 0.5$, $\xi_1 = 0.4$ and $\xi_2 = 0.8$. Although Theorems 3.1 and 3.2 were obtained by assuming that $Y$ is distributed according to the multivariate normal distribution under the true model, we also examine the probabilities under non-normality in this simulation. Let $\mathcal{E} = (\varepsilon_1, \ldots, \varepsilon_n)'$ be an $n \times p$ random matrix, where $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed according to the multivariate $t$-distribution with 10 degrees of freedom, mean $0_p$ and covariance matrix $(5/4)\Sigma_*$. Then, we constructed two true models: one with normal errors and one whose error matrix is $\mathcal{E}$. For comparison, we also calculated the probabilities of selecting the true subset and the CPU times using the adaptive group lasso (AGL) proposed by [25]. The estimator $\hat{\Theta}_\tau$ of $\Theta$ by the AGL minimizes the squared-error loss plus the penalty $\tau \sum_a w_a \|\theta_a\|$ as in (4.1), where $\tau$ is a tuning parameter, $w_a$ is the weight for the norm $\|\theta_a\| = (\theta_a'\theta_a)^{1/2}$, and $\theta_a$ is the $a$-th column vector of $\Theta'$, i.e., the coefficient vector of the $a$-th explanatory variable. Each column vector of $Y$ and $X$ in (4.1) is centered and standardized. To optimize (4.1), we used a coordinate descent algorithm based on [6], run over 100 candidate values $\tau_t = \exp\{t \log(\tau_{\max} + 1)/(100 - 1)\} - 1$ ($t \in \{0, 1, 2, \ldots, 99\}$) of the tuning parameter, iterating the coordinate-wise updates for each $\tau_t$ until convergence.
In our setting, we used $\varepsilon_{\mathrm{AGL}} = 0.01$, and $w_a$ was given by $\|\hat{\theta}_a^{\mathrm{LSE}}\|^{-1}$, where $\hat{\theta}_a^{\mathrm{LSE}}$ denotes the least-squares estimator of $\theta_a$. The tuning parameter was chosen by minimizing an information criterion $\mathrm{IC}(\tau_t \mid \alpha_i)$, where $|A_t|$ is the number of non-zero row vectors of $\hat{\Theta}_{\tau_t}$, and $\alpha_1 = 2$, $\alpha_2 = 2\log\log n$ and $\alpha_3 = \log n$. We name the AGL using $\mathrm{IC}(\tau_t \mid \alpha_i)$ ($i = 1, 2, 3$) as $\mathrm{AGL}_1$, $\mathrm{AGL}_2$ and $\mathrm{AGL}_3$, respectively. Table 1 shows the probabilities of selecting the true subset by the ZKB selection methods using the $\mathrm{HCGC}_p$ and $\mathrm{GC}_p^i$ ($i = 1, 2, 3$) criteria, denoted by $\mathrm{HCGC}_p$ and $\mathrm{GC}_p^i$ ($i = 1, 2, 3$), and by $\mathrm{AGL}_i$ ($i = 1, 2, 3$), when $Y$ is distributed according to the multivariate normal distribution under the true model. From Table 1, we observe that the selection method using the $\mathrm{HCGC}_p$ criterion always exhibits high probabilities of selecting the true subset for all combinations of $n$, $p$, $k$ and $k_*$ in Table 1. Although the probabilities for the method using the $\mathrm{GC}_p^3$ criterion also reach 100%, the method using the $\mathrm{HCGC}_p$ criterion performs better than the $\mathrm{GC}_p^3$ criterion when the sample size is moderate. On the other hand, the probabilities for $\mathrm{AGL}_1$ remain low as the sample size increases in many cases. The probabilities for $\mathrm{AGL}_2$ reach 100% only when the sample size is large and the dimensions are small. The probabilities for $\mathrm{AGL}_3$ seem to increase slowly in some cases, but are low when $k_*$ is large. Table 2 shows the probabilities when $Y$ is distributed according to the multivariate $t$-distribution under the true model; the probabilities in Table 2 are about the same as those in Table 1. Hence, it is expected that our results may hold even under non-normality. The proofs of Theorems 3.1 and 3.2 in this paper require calculating the moments of $\mathrm{GC}_p(\omega_{\{\ell\}}) - \mathrm{GC}_p(\omega)$ and $\mathrm{GC}_p(\omega_j) - \mathrm{GC}_p(\omega)$, and we calculated them by assuming that $Y$ is distributed according to the multivariate normal distribution under the true model. However, our results could be shown even under non-normality if another approach to evaluating these moments, which involve the inverse matrix $S_\omega^{-1}$, were available. Table 3 shows the CPU times for the ZKB selection method using the $\mathrm{HCGC}_p$ criterion and for $\mathrm{AGL}_3$ when $Y$ is distributed according to the multivariate normal distribution under the true model; the former is faster than the latter, and the difference is particularly clear when the dimensions are large. In sum, the ZKB selection method using the $\mathrm{HCGC}_p$ criterion with $\alpha = \hat{\alpha}$ exhibits the highest probabilities of selecting the true subset and is faster than the AGLs.

A.1. Proofs of (2.4) and (2.5)

First, we show (2.5). For an arbitrary $\ell \in j_*$, since $(P_\omega - P_{\omega_{\{\ell\}}})x_{\{m\}} = 0_n$ for all $m \in \omega_{\{\ell\}}$, we have the following equation:

$$(P_\omega - P_{\omega_{\{\ell\}}}) X_* \Theta_* = (I_n - P_{\omega_{\{\ell\}}}) x_{\{\ell\}} \theta_{\{\ell\}}'.$$

From the above equation, $\Delta_{\{\ell\}}$, which is defined in (2.2), can be expressed as follows:

$$\Delta_{\{\ell\}} = x_{\{\ell\}}'(I_n - P_{\omega_{\{\ell\}}}) x_{\{\ell\}} \cdot \Sigma_*^{-1/2} \theta_{\{\ell\}} \theta_{\{\ell\}}' \Sigma_*^{-1/2}.$$

Hence, we have

$$\delta_{\{\ell\}} = x_{\{\ell\}}'(I_n - P_{\omega_{\{\ell\}}}) x_{\{\ell\}} \cdot \theta_{\{\ell\}}' \Sigma_*^{-1} \theta_{\{\ell\}}.$$

The above equation leads to the following inequality: under Assumptions A2 and A3,

$$n^{-c_A} \delta_{\min} \geq c_1 n^{1 - c_A} \min_{\ell \in j_*} \theta_{\{\ell\}}' \Sigma_*^{-1} \theta_{\{\ell\}} \to \infty,$$

which is (2.5). Next, we show (2.4). For any $p$-dimensional unit vector $e$, the Cauchy-Schwarz inequality gives

$$\theta_{\{\ell\}}' \Sigma_*^{-1} \theta_{\{\ell\}} = \|\Sigma_*^{-1/2} \theta_{\{\ell\}}\|^2 \geq (e' \Sigma_*^{-1/2} \theta_{\{\ell\}})^2.$$

By using $(e_a' \Sigma_* e_a)^{-1/2} \Sigma_*^{1/2} e_a$ as $e$, we have

$$\theta_{\{\ell\}}' \Sigma_*^{-1} \theta_{\{\ell\}} \geq \frac{(e_a' \theta_{\{\ell\}})^2}{e_a' \Sigma_* e_a} = \frac{\theta_{*\ell a}^2}{\sigma_{*a}^2},$$

where $e_a$ is the $p$-dimensional vector whose $a$-th element is one and whose other elements are zero. Taking the maximum over $a = 1, \ldots, p$ completes the proof of (2.4).
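Inequality (2.4) is also easy to sanity-check numerically. The snippet below (ours, purely illustrative) draws a random positive definite $\Sigma_*$ and a coefficient vector and verifies the bound:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)               # a positive definite Sigma_*
theta = rng.standard_normal(p)                # one row of Theta_* as a vector

lhs = theta @ np.linalg.solve(Sigma, theta)   # theta' Sigma_*^{-1} theta
rhs = np.max(theta**2 / np.diag(Sigma))       # max_a theta_{*la}^2 / sigma_{*a}^2
assert lhs >= rhs                             # inequality (2.4)
print(lhs, rhs)
```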

A.2. Proof of equation (3.4)
Without loss of generality, let $X = (X_{\omega_{\{\ell\}}}, x_{\{\ell\}})$ for an $\ell \in \omega$. Further, let $R_\ell$, $r_\ell$ and $r_{\ell\ell}$ satisfy

$$(X'X)^{-1} = \begin{pmatrix} R_\ell & r_\ell \\ r_\ell' & r_{\ell\ell} \end{pmatrix}.$$

Then, using the general formula for the inverse of a block matrix (e.g., [7, Theorem 8.5.11]), $P_\omega$ and $P_{\omega_{\{\ell\}}}$ can be expressed in terms of these blocks, and we have

$$P_\omega - P_{\omega_{\{\ell\}}} = \frac{1}{r_{\ell\ell}}\, X \begin{pmatrix} r_\ell \\ r_{\ell\ell} \end{pmatrix} \begin{pmatrix} r_\ell \\ r_{\ell\ell} \end{pmatrix}' X'.$$

Note that $X(r_\ell', r_{\ell\ell})'$ is the $k$-th column vector of $X(X'X)^{-1}$, i.e., $z_\ell$ under the ordering above. Therefore, (3.4) can be derived.
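Identity (3.4) can likewise be verified directly on random data; the following check (ours) compares the two sides:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, l = 50, 8, 3                        # l: index of the removed column
X = rng.standard_normal((n, k))
G = np.linalg.inv(X.T @ X)
P = X @ G @ X.T                           # P_omega
X_l = np.delete(X, l, axis=1)             # X_{omega_{l}}
P_l = X_l @ np.linalg.solve(X_l.T @ X_l, X_l.T)
z = (X @ G)[:, l]                         # z_l, the l-th column of X(X'X)^{-1}
assert np.allclose(P - P_l, np.outer(z, z) / G[l, l])   # identity (3.4)
```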

A.3. Proof of Lemma 3.1
We can express $P(\hat{j} = j_*)$ as follows:

$$P(\hat{j} = j_*) = P\left(\bigcap_{\ell \in j_*} \{\mathrm{SC}(\omega_{\{\ell\}}) > \mathrm{SC}(\omega)\} \cap \bigcap_{\ell \in j_*^c} \{\mathrm{SC}(\omega_{\{\ell\}}) \leq \mathrm{SC}(\omega)\}\right).$$

Then, the following lower bound of $P(\hat{j} = j_*)$ can be derived:

$$P(\hat{j} = j_*) \geq 1 - \sum_{\ell \in j_*} P\left(\mathrm{SC}(\omega_{\{\ell\}}) \leq \mathrm{SC}(\omega)\right) - \sum_{\ell \in j_*^c} P\left(\mathrm{SC}(\omega_{\{\ell\}}) > \mathrm{SC}(\omega)\right).$$

Hence, if (3.7) and (3.8) hold, $P(\hat{j} = j_*) \to 1$. This completes the proof of Lemma 3.1.

A.4. Proof of Theorem 3.1
We first describe two lemmas. The first lemma gives another expression of $\mathrm{GC}_p(\omega_j) - \mathrm{GC}_p(\omega)$ for $j \subset \omega$ ($j \neq \omega$) (the proof is given in Appendix A.7):

Lemma A.1. Let $u_i$, $u_{j,i}$ and $v_i$ be random variables distributed according to $u_i \sim \chi^2(p)$, $u_{j,i} \sim \chi^2(p; \delta_{j,i})$ and $v_i \sim \chi^2(n - p - k + 1)$ ($i = 1, \ldots, m_j$), where $u_i$ and $u_{j,i}$ are independent of $v_i$ for each $i$, and $\delta_{j,1}, \ldots, \delta_{j,m_j}$ are nonnegative constants satisfying $\sum_{i=1}^{m_j} \delta_{j,i} = \mathrm{tr}(\Delta_j)$ and $\max_i \delta_{j,i} \leq \lambda_{\max}(\Delta_j)$, where $\Delta_j$ and $m_j$ are defined by (2.2) and (2.6), and $\lambda_{\max}(\Delta_j)$ is the maximum eigenvalue of $\Delta_j$. Then, under Assumption A1, we have

$$\mathrm{GC}_p(\omega_j) - \mathrm{GC}_p(\omega) \overset{d}{=} (n - k) \sum_{i=1}^{m_j} \frac{u_{j,i}}{v_i} - \alpha p m_j,$$

and, in particular, when $j \subset j_*^c$,

$$\mathrm{GC}_p(\omega_j) - \mathrm{GC}_p(\omega) \overset{d}{=} (n - k) \sum_{i=1}^{m_j} \frac{u_i}{v_i} - \alpha p m_j.$$

The following lemma is needed to evaluate the divergence orders of the moments of $\mathrm{GC}_p(\omega_j) - \mathrm{GC}_p(\omega)$ (the proof is given in Appendix A.8).
Lemma A.2. Let $\delta$ be a positive constant, and let $u_1$, $u_2$ and $v$ be random variables distributed according to $\chi^2(p)$, $\chi^2(p; \delta)$ and $\chi^2(N)$, respectively, where $u_1$ and $u_2$ are independent of $v$, and $N = n - p - k + 1$. Then, for $N - 4r > 0$ ($r \in \mathbb{N}$), the $2r$-th central moments of $u_1/v$ and $u_2/v$ are of orders $O(\max\{p^r n^{-2r}, p^{2r} n^{-3r}\})$ and $O(\max\{(p + \delta)^r n^{-2r}, (p + \delta)^{2r} n^{-3r}\})$, respectively.

Applying the results of Lemma A.1 with $m_j = 1$ to $\mathrm{HCGC}_p(\omega_{\{\ell\}}) - \mathrm{HCGC}_p(\omega)$, we obtain the distributional expressions (A.3) and (A.4), where $u_\ell \sim \chi^2(p)$ for $\ell \in j_*^c$ and $u_\ell \sim \chi^2(p; \delta_{\{\ell\}})$ for $\ell \in j_*$, and $u_\ell$ is independent of $v_\ell \sim \chi^2(N)$. Applying Markov's inequality to (A.3) and (A.4), upper bounds for the probabilities appearing in (3.7) and (3.8) can be derived in terms of the central moments in Lemma A.2, where $r$ is the natural number defined by (3.1) and $\tilde{r}$ is any natural number. From these bounds and Lemma A.2, the sum in (3.7) converges to 0. Moreover, for sufficiently large $\tilde{r}$, the bound for the sum in (3.8) is, up to constants, of order $k_* p^{\tilde{r}} n^{-2\tilde{r} c_A}$, which converges to 0 by (2.5). These equations and Lemma 3.1 complete the proof of Theorem 3.1.
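The distributional reduction in Lemma A.1 can be checked by simulation in the null case $m_j = 1$, $\Theta_* = O$ (so that every variable is a null variable). The sketch below is our own verification: it compares the Monte Carlo mean of $z_\ell' Y S_\omega^{-1} Y' z_\ell / r_{\ell\ell}$ with that of $(n - k)u/v$ for independent $u \sim \chi^2(p)$ and $v \sim \chi^2(n - p - k + 1)$; both should be close to $(n - k)p/(n - p - k - 1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k, l = 60, 4, 6, 5                  # null case: Theta_* = O, Sigma_* = I_p
reps = 20000

stat = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, k))
    Y = rng.standard_normal((n, p))       # true model with no signal
    G = np.linalg.inv(X.T @ X)
    z = (X @ G)[:, l]
    E = Y - X @ (G @ (X.T @ Y))
    S = E.T @ E / (n - k)
    stat[r] = z @ Y @ np.linalg.solve(S, Y.T @ z) / G[l, l]

ratio = (n - k) * rng.chisquare(p, reps) / rng.chisquare(n - p - k + 1, reps)
print(stat.mean(), ratio.mean())          # both approx (n-k)*p/(n-p-k-1) = 4.41
```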

A.5. Proof of Lemma 3.2
We can express $P(\hat{j}_J = j_*)$ as follows:

$$P(\hat{j}_J = j_*) = P\left(\bigcap_{j \in J_+} \{\mathrm{HCGC}_p(\omega_j) > \mathrm{HCGC}_p(\omega)\} \cap \bigcap_{j \in J_-} \{\mathrm{HCGC}_p(\omega_j) \leq \mathrm{HCGC}_p(\omega)\}\right).$$

Then, the following lower bound of $P(\hat{j}_J = j_*)$ can be derived:

$$P(\hat{j}_J = j_*) \geq 1 - \sum_{j \in J_+} P\left(\mathrm{HCGC}_p(\omega_j) \leq \mathrm{HCGC}_p(\omega)\right) - \sum_{j \in J_-} P\left(\mathrm{HCGC}_p(\omega_j) > \mathrm{HCGC}_p(\omega)\right).$$

Therefore, Lemma 3.2 can be derived.

A.6. Proof of Theorem 3.2
We can apply the results of Lemma A.1 to this proof; i.e., $\mathrm{HCGC}_p(\omega_j) - \mathrm{HCGC}_p(\omega)$ has the distributional form given in Lemma A.1, where $u_i$ and $u_{j,i}$ are independent of $v_i$, and $\delta_{j,i}$ ($i = 1, \ldots, m_j$) are constants satisfying

$$\sum_{i=1}^{m_j} \delta_{j,i} = \mathrm{tr}(\Delta_j), \qquad \max_i \delta_{j,i} \leq \lambda_{\max}(\Delta_j),$$

where $\Delta_j$ is given by (2.2). When $j \in J_+$, let $\ell$ be an element of $j$, i.e., $\ell \in j$. Then, since $I_n - P_{\omega_{\{\ell\}}}$ and $P_{\omega_{\{\ell\}}} - P_{\omega_j}$ are positive semi-definite, the following inequality can be derived:

$$\mathrm{tr}(\Delta_j) \geq \delta_{\{\ell\}}. \tag{A.5}$$

In addition, let $d_j = \mathrm{rank}(\Delta_j)$, which is defined by (2.6). From (2.7), we observe that $d_j$ is bounded. Since $d_j \lambda_{\max}(\Delta_j) \geq \mathrm{tr}(\Delta_j)$ holds, the following inequality is obtained:

$$\delta_{j,i} \geq \frac{\delta_{\{\ell\}}}{m_j d_j}. \tag{A.6}$$

Now, we derive the divergence orders of $\sum_{j \in J_-} P(\mathrm{HCGC}_p(\omega_j) > \mathrm{HCGC}_p(\omega))$ and $\sum_{j \in J_+} P(\mathrm{HCGC}_p(\omega_j) \leq \mathrm{HCGC}_p(\omega))$. From (2.5), (3.1), (A.5) and (A.6), when $N$ is sufficiently large, we obtain bounds (A.7) and (A.8) on these probabilities, where $\rho = \{p/(n - k)\}\beta$. Then, by applying Markov's inequality to (A.7) and (A.8), upper bounds can be derived, where $\tilde{r}$ is any natural number. Hence, from these bounds and Lemma A.2, the required convergences follow. Note that $m_j$ is bounded and $\#(J_+) \leq k_*$, and it follows from (A.6) that $\delta_{j,i}^{-1} \leq m_j d_j \delta_{\{\ell\}}^{-1}$. Hence, for sufficiently large $\tilde{r}$, we have

$$O\left((p + \delta_{j,i})^{\tilde{r}} \delta_{j,i}^{-2\tilde{r}} + (p + \delta_{j,i})^{2\tilde{r}} \delta_{j,i}^{-2\tilde{r}} n^{-\tilde{r}}\right) = o(1).$$
Therefore, from Lemma 3.2, Theorem 3.2 can be shown.

A.7. Proof of Lemma A.1
First, we derive results for the case of $j \subset j_*^c$. Denote the elements of $j$ as $a_1, \ldots, a_{m_j}$ with $a_s \neq a_t$ ($s \neq t$), i.e., $j = \{a_1, \ldots, a_{m_j}\}$. Further, let $j_{-,0} = \omega_j$ and $j_{-,i} = j_{-,i-1} \cup \{a_i\}$ ($i = 1, \ldots, m_j$). Then, it holds that $j_{-,m_j} = \omega$, and we can express $\mathrm{GC}_p(\omega_j) - \mathrm{GC}_p(\omega)$ as follows:

$$\mathrm{GC}_p(\omega_j) - \mathrm{GC}_p(\omega) = (n - k) \sum_{i=1}^{m_j} \mathrm{tr}(W_{j,i} W^{-1}) - \alpha p m_j,$$

where $W = Y'(I_n - P_\omega)Y$ and $W_{j,i} = Y'(P_{j_{-,i}} - P_{j_{-,i-1}})Y$. Note that $P_{j_{-,i}} - P_{j_{-,i-1}}$ and $I_n - P_\omega$ are symmetric idempotent matrices, and it holds that $(P_{j_{-,i}} - P_{j_{-,i-1}})(I_n - P_\omega) = O_{n,n}$ and $(P_{j_{-,i}} - P_{j_{-,i-1}})X_* = (I_n - P_\omega)X_* = O_{n,k_*}$. Then, from a property of the Wishart distribution and Cochran's theorem, $W$ and the $W_{j,i}$ are mutually independent Wishart matrices, and each $W_{j,i}$ has one degree of freedom. From a property of the Wishart distribution, $W_{j,i}$ can be expressed as $W_{j,i} = z_i z_i'$, where $z_i$ is independent of $W$ and $z_i \sim N_p(0_p, I_p)$ (we may take $\Sigma_* = I_p$ without loss of generality, since $\mathrm{tr}(W_{j,i} W^{-1})$ is invariant under simultaneous congruence). Then, we express $z_i' W^{-1} z_i$ as

$$z_i' W^{-1} z_i = \frac{u_i}{v_i}, \qquad u_i = z_i' z_i, \qquad v_i = \frac{z_i' z_i}{z_i' W^{-1} z_i}.$$

Then, from a property of the Wishart distribution, we can state that $u_i$ and $v_i$ are independent, and $u_i \sim \chi^2(p)$ and $v_i \sim \chi^2(n - p - k + 1)$. Therefore, $\mathrm{tr}(W_{j,i} W^{-1})$ is expressed as $u_i/v_i$, which gives the result for $j \subset j_*^c$; the general case follows in the same way with non-central chi-square variables.

A.8. Proof of Lemma A.2

From the multinomial theorem, the central moments of $u_2/v$ can be expanded into sums of products of moments of $u_2$ and $1/v$. Therefore, we can derive the divergence orders as follows:

$$E\left[\left\{\frac{u_2}{v} - E\left[\frac{u_2}{v}\right]\right\}^{2r}\right] = O\left(\max\{(p + \delta)^r n^{-2r}, (p + \delta)^{2r} n^{-3r}\}\right).$$

A.9. Proof of Lemma A.3
We elaborate only on the case of $h \geq 2$ because the cases $h = 0, 1$ are straightforward. First, we derive (A.13) and (A.14). Let $h_1, \ldots, h_d$ be natural numbers satisfying $\sum_{i=1}^{d} h_i = h$ and $2 \leq h_1, \ldots, h_d$. From [22], we can state that $h$-th central moments can be expressed as linear combinations of products of $h_1, \ldots, h_d$-th cumulants. From [9, 23], the $h$-th cumulants of $X_1 - t$ and $X_2 - (t + \psi)$ can, respectively, be expressed in closed form. Then, we observe that the maximum-order term of each $h$-th central moment is $\kappa_{2,i}^{h/2}$ if $h$ is even and $\kappa_{2,i}^{(h-3)/2} \kappa_{3,i}$ if $h$ is odd ($i = 1, 2$). This completes (A.13) and (A.14).
Next, we derive (A.15). From the multinomial theorem, we have