A consistency property of the AIC for multivariate linear models when the dimension and the sample size are large

Abstract: It is common knowledge that Akaike's information criterion (AIC) is not a consistent model selection criterion, while the Bayesian information criterion (BIC) is. These facts have been confirmed from asymptotic selection probabilities evaluated under a large-sample framework. However, when a high-dimensional asymptotic framework, in which the dimension of the response variables and the sample size both approach ∞, is used for evaluating the selection probability, there are cases in which the AIC for selecting variables in multivariate linear models is consistent but the BIC is not. The AIC and BIC belong to a family of information criteria defined by adding a penalty term expressing the complexity of the model to a negative twofold maximum log-likelihood. By clarifying the condition on the penalty term that ensures consistency, we derive conditions for the consistency of the AIC, BIC and other information criteria under the high-dimensional asymptotic framework.


Introduction
Let Y be an n × p observation matrix of p response variables, and let X be an n × k observation matrix of k nonstochastic explanatory variables, where n is the sample size, and it is assumed that n − p − k − 1 > 0. In order to ensure the possibility of estimating the model, we also assume that rank(X) = k (< n). Suppose that j denotes a subset of ω = {1, . . . , k} containing k_j elements, and X_j denotes the n × k_j matrix consisting of the columns of X indexed by the elements of j. For example, if j = {1, 2, 4}, then X_j consists of the first, second, and fourth columns of X. Of course, it holds that X_ω = X and k_ω = k. Also, we let k_A denote the number of elements of a set A, i.e., k_A = #(A). Then the following multivariate linear regression model with k_j explanatory variables is considered as the candidate model:

Y ~ N_{n×p}(X_j Θ_j, Σ_j ⊗ I_n),   (1.1)

where Θ_j is a k_j × p unknown matrix of regression coefficients, and Σ_j is a p × p unknown covariance matrix. Here, A ⊗ B denotes the Kronecker product of an m × n matrix A and a p × q matrix B, which is an mp × nq matrix defined by

A ⊗ B = (a_{ij} B),

where a_{ij} is the (i, j)th element of A (see, e.g., [18, chap. 16]). In particular, the model with X_ω (namely X) is called the full model. We will assume that the data are generated from the following true model:

Y ~ N_{n×p}(X_{j_*} Θ_{j_*}, Σ_{j_*} ⊗ I_n),

where j_* is a set of integers indicating the subset of explanatory variables in the true model. Henceforth, for simplicity, we represent X_{j_*} and k_{j_*} as X_* and k_*, respectively, and likewise Θ_{j_*} and Σ_{j_*} as Θ_* and Σ_*.
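For concreteness, the setup above can be mimicked in a few lines of NumPy. This is a minimal hypothetical sketch: the index set j_star, the uniform design matrix, and the (0.8)^{|a−b|} covariance (borrowed from the numerical study in Section 4) are our illustrative choices, not part of the model definition.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, k = 50, 5, 10        # sample size, response dimension, candidate regressors
j_star = [0, 1, 2]         # 0-based index set of the true subset (illustrative)

# X is nonstochastic in the paper; a fixed uniform draw stands in for it here.
X = rng.uniform(-1.0, 1.0, size=(n, k))

# True k_* x p coefficient matrix and a p x p true covariance matrix.
Theta_star = rng.normal(size=(len(j_star), p))
Sigma_star = 0.8 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

def X_sub(X, j):
    """X_j: the columns of X indexed by the elements of j."""
    return X[:, list(j)]

# Y ~ N_{n x p}(X_* Theta_*, Sigma_* (x) I_n): the rows of Y are independent
# p-variate normal vectors with covariance Sigma_*.
E = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
Y = X_sub(X, j_star) @ Theta_star + E
```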
The multivariate linear regression model of (1.1) is one of the basic models of multivariate analysis. This model is introduced in many multivariate statistical textbooks (see, e.g., [31, chap. 9], [33, chap. 4]), and even now it is widely used in chemometrics, engineering, econometrics, psychometrics, and many other fields for the prediction of multiple responses from a set of explanatory variables (see, e.g., [10,24,25,40]). Since it is important in regression analysis to specify the factors affecting the response variables, searching for the optimal subset j is essential.
Akaike's information criterion (AIC), proposed by [1,2], is widely used for selecting the best model. The AIC was proposed as an asymptotically unbiased estimator of the risk function assessed by the expected Kullback-Leibler (KL) loss [20] under the assumption that the candidate model includes the true model. One purpose of a model selection method based on the AIC is to choose a model that makes the risk function small. For that purpose, using the AIC for model selection will be asymptotically efficient when the true model is infinite-dimensional (see, e.g., [27,29,39]). The Bayesian information criterion (BIC) proposed by [26] and the consistent AIC (CAIC) proposed by [6] are also widely used for model selection purposes. It is a well-known fact that, when the true model is included in the set of candidate models, these two criteria are consistent in model selection, i.e., the probability of selecting the true model goes to 1 asymptotically, although the AIC is not. When the AIC is used for model selection, this inconsistency sometimes becomes a target for criticism, although the purpose of the AIC is not to choose the true model. The inconsistency of the AIC is confirmed from the asymptotic probability of selecting the model, which is evaluated under the following asymptotic framework representing the ordinary asymptotic procedure [11,12,22,28]:

• A large-sample (LS) asymptotic framework: the sample size approaches ∞ under a fixed number of parameters. In this paper, lim_{n→∞} means a limit as n → ∞ under the condition that the number of parameters is fixed.
In the case of multivariate linear models, although there are many bias-corrected AICs for the risk function (see, e.g., [3,14,17,37,38]), such a bias-corrected AIC is still not consistent for model selection.
In recent years, high-dimensional data analysis has been attracting the attention of many researchers. It is known that the LS asymptotic framework gives a poor approximation when the dimension is large. However, the following asymptotic framework gives a better approximation than the LS asymptotic framework when the dimension and the sample size are large, and sometimes even when the dimension is not so large [13,15,16]:

• A high-dimensional (HD) asymptotic framework: the sample size and the dimension of the response variables simultaneously approach ∞ under the condition that c_{n,p} = p/n → c_0 ∈ [0, 1).

For simplicity, we will write "(n, p) → ∞ simultaneously under the condition that c_{n,p} → c_0" as "c_{n,p} → c_0", and lim_{c_{n,p}→c_0} means a limit under the HD asymptotic framework. It should be emphasized that we assume that p always goes to ∞ in the HD asymptotic framework. Hence, the notation c_{n,p} → 0 does not mean the LS asymptotic framework.
When the HD asymptotic framework is used for evaluating the asymptotic probability of selecting the true model, there is a possibility that the AIC becomes consistent. In fact, in this paper, we will prove that a variable selection method based on the AIC becomes consistent in multivariate linear models under the HD asymptotic framework. The AIC is included in a family of information criteria defined by adding a penalty term expressing the complexity of the model to a negative twofold maximum log-likelihood. By clarifying the condition on the penalty term that ensures the consistency property, we will also prove that a variable selection method based on the bias-corrected AIC (AIC_c), as proposed by [3], becomes consistent in a less restrictive situation than one based on the AIC, and that those based on the BIC and the CAIC are not necessarily consistent when c_0 ∈ (0, 1). Additionally, we derive a sufficient condition for the consistency of the family of information criteria under an asymptotic framework in which the number of candidate models may approach ∞.
In this paper, o(x), O(x), o_p(x), and O_p(x) used for a vector or matrix of finite dimension or size mean that the orders of all the elements of that vector or matrix are o(x), O(x), o_p(x), and O_p(x), respectively. Furthermore, the Landau notations indicate the orders as n → ∞ under a fixed number of parameters when the LS asymptotic framework is considered, and the orders as c_{n,p} → c_0 when the HD asymptotic framework is considered. As stated already, we deal not with strong consistency but with weak consistency. Hence, throughout the paper, the word "consistency" means weak consistency.
The remainder of the paper is organized as follows. In Section 2, we present the notation necessary for evaluating an asymptotic selection probability. In Section 3, the asymptotic probability of selecting the true model is calculated under the HD asymptotic framework. In Section 4, we compare variable selection methods based on the AIC, AIC_c, BIC and CAIC by conducting numerical experiments. In Section 5, we discuss our conclusions. Technical details are provided in the Appendix.

Preliminaries
In this section, we present and discuss the notation used for evaluating the asymptotic selection probability. First, we describe several classes of the set j. Let J be a set of candidate models denoted by J = {j_1, . . . , j_K}, where K is the number of candidate models. We then separate J into two sets: the set of overspecified models, i.e., candidate models that include the true model, J_+ = {j ∈ J | j_* ⊆ j}, and the set of underspecified models, i.e., those candidate models that are not overspecified, J_− = J_+^c ∩ J. We use the same terminology, "overspecified model" and "underspecified model", as was used by [14].
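The partition of J can be written directly; this is a trivial sketch (the function name split_candidates is ours), treating each model as a collection of column indices.

```python
def split_candidates(J, j_star):
    """Partition the candidate set J into overspecified models J_+
    (those containing the true model j_*) and underspecified models J_-."""
    J_plus = [j for j in J if set(j_star) <= set(j)]
    J_minus = [j for j in J if not set(j_star) <= set(j)]
    return J_plus, J_minus
```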
Estimation of the unknown parameters Θ_j and Σ_j in the model (1.1) is carried out by the maximum likelihood method, i.e., Θ_j and Σ_j are estimated by

Θ̂_j = (X_j'X_j)^{-1}X_j'Y,  Σ̂_j = n^{-1}Y'(I_n − P_j)Y,

where P_j is the projection matrix onto the subspace spanned by the columns of X_j, i.e., P_j = X_j(X_j'X_j)^{-1}X_j'. A family of information criteria for the model (1.1) is then defined by

IC_m(j) = L(j) + m(j),   (2.1)

where L(j) = n log det(Σ̂_j) and m(j) is a positive constant expressing a penalty for the complexity of the model (1.1). An information criterion included in this family is specified by an individual penalty term m(j). This family contains the AIC, AIC_c, BIC and CAIC as special cases, with penalty terms

m_AIC(j) = 2{pk_j + p(p + 1)/2},
m_AICc(j) = 2n{pk_j + p(p + 1)/2}/(n − p − k_j − 1),
m_BIC(j) = (log n){pk_j + p(p + 1)/2},
m_CAIC(j) = (1 + log n){pk_j + p(p + 1)/2}.   (2.2)
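As an illustration, the criterion family (2.1)-(2.2) can be evaluated numerically as follows. This is a minimal sketch under the penalties reconstructed above; the model-independent constant np(log 2π + 1) is omitted from L(j) since it does not affect the minimizer, and the function name ic_m is ours.

```python
import numpy as np

def ic_m(Y, X, j, penalty="aic"):
    """IC_m(j) = L(j) + m(j) with L(j) = n log det(Sigma_hat_j); the constant
    np(log 2*pi + 1), common to all candidates, is omitted."""
    n, p = Y.shape
    Xj = X[:, list(j)]
    kj = Xj.shape[1]
    Pj = Xj @ np.linalg.solve(Xj.T @ Xj, Xj.T)      # projection onto span(X_j)
    Sigma_hat = Y.T @ (np.eye(n) - Pj) @ Y / n      # MLE of Sigma_j
    _, logdet = np.linalg.slogdet(Sigma_hat)
    npar = p * kj + p * (p + 1) / 2                 # free parameters in (Theta_j, Sigma_j)
    if penalty == "aic":
        m = 2.0 * npar
    elif penalty == "aicc":
        m = 2.0 * n * npar / (n - p - kj - 1)       # requires n - p - k_j - 1 > 0
    elif penalty == "bic":
        m = np.log(n) * npar
    else:                                           # "caic"
        m = (1.0 + np.log(n)) * npar
    return n * logdet + m

# The best subset minimizes IC_m over the candidate set J, e.g.:
# best = min(candidates, key=lambda j: ic_m(Y, X, j, penalty="aicc"))
```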
When p = 1, the AIC_c coincides with the bias-corrected AIC proposed by [32]. [9] showed that Sugiura's bias-corrected AIC is a uniformly minimum-variance unbiased estimator (UMVUE) of the risk function consisting of the expected KL loss when the candidate model includes the true model. By extending that result to the multivariate case, this property can be proved even when p > 1. The detailed proof is omitted because it can be obtained from the Lehmann-Scheffé theorem and the fact that Θ̂_j and Σ̂_j are complete sufficient statistics. Completeness and sufficiency of Θ̂_j and Σ̂_j can be derived by slightly modifying the results of [30, pp. 18-20]. This property indicates that, for all the overspecified models, the AIC_c is better than the AIC at estimating the risk function. The best subset of ω is chosen by minimizing IC_m(j), i.e., it is presented as

ĵ_m = arg min_{j∈J} IC_m(j).

Next, we describe a noncentrality matrix that plays a critical role in proving consistency. In fact, the asymptotic behavior of the elements or eigenvalues of the noncentrality matrix is one of the important factors determining whether an information criterion is consistent or not. The noncentrality matrix is defined by

Σ_*^{-1/2} Θ_*' X_*' (I_n − P_j) X_* Θ_* Σ_*^{-1/2}.

In order to decompose the noncentrality matrix, the minimum overspecified model including j is prepared as

j_+ = j ∪ j_*.   (2.3)

If j_* is arranged as j_* = {{j_* ∩ j}, {j_* ∩ j^c}}, then (I_n − P_j)X_* = (O_{n,k_{j_*∩j}}, (I_n − P_j)X_{j_*∩j^c}) is satisfied, where O_{k,p} is a k × p matrix of zeros. It is easy to see that X_{j_*∩j^c} is a full-column-rank matrix because X is assumed to have full column rank. Hence, the rank of X_*'(I_n − P_j)X_* is calculated as

rank{X_*'(I_n − P_j)X_*} = k_* − k_{j_*∩j} = k_{j_+} − k_j.

This indicates that the rank of X_*'(I_n − P_j)X_* is independent of p if k_* is independent of p. Let the rank of the noncentrality matrix be denoted by γ_j. It follows from the inequality rank(Θ_*Σ_*^{-1}Θ_*') ≤ min{p, k_*} and elementary linear algebra that

γ_j ≤ min{k_{j_+} − k_j, p}.

Note that γ_j = k_{j_+} − k_j if Θ_* is a full-row-rank matrix. Since the noncentrality matrix is a positive semidefinite matrix of rank γ_j, it can be decomposed as

Σ_*^{-1/2} Θ_*' X_*' (I_n − P_j) X_* Θ_* Σ_*^{-1/2} = Γ_j Γ_j',   (2.4)

where Γ_j is a p × γ_j matrix; Γ_j is a full-column-rank matrix for large p, at least when p ≥ k_*. Let λ_{j,1} ≥ ··· ≥ λ_{j,γ_j} (> 0) denote the eigenvalues of Γ_j'Γ_j. If we assume that the orders of the elements of X'X are O(n) and that the elements of Θ_* and Σ_* are independent of n, which are common assumptions in papers dealing with asymptotic theory for regression models [14,17], then all the elements of Γ_jΓ_j' are O(n). Hence, if we assume that all the elements of Γ_jΓ_j' are O(n), that the elements of Γ_j'Γ_j have uniformly the same order, and that γ_j is constant, then all the elements of Γ_j'Γ_j are O(np). From this fact and the inequality λ_{j,1} ≤ tr(Γ_j'Γ_j), all the eigenvalues of Γ_j'Γ_j are O(np). In order to evaluate the probability of selecting the model j by the IC_m, we introduce the following assumptions:

Assumption 1. The true model is included in the set of candidate models, i.e., j_* ∈ J.

Assumption 2. lim_{n→∞} n^{-1}X'X = R_0 exists and is positive definite, and lim_{n→∞} n^{-1}Γ_jΓ_j' = Ψ_{j,0} exists and is not the zero matrix for all j ∈ J_−.

Assumption 3. For all j ∈ J_−, γ_j is constant, and lim sup_{c_{n,p}→c_0} (np)^{-1}λ_{j,1} < ∞ and lim inf_{c_{n,p}→c_0} (np)^{-1}λ_{j,γ_j} > 0.
For R_0 in assumption 2, we write the limiting value of n^{-1}X_j'X_ℓ as R_{j,ℓ,0} for j, ℓ ∈ J. It is clear that R_{j,ℓ,0} is a submatrix of R_0, and hence R_{j,ℓ,0} exists whenever R_0 exists. Moreover, note that Ψ_{j,0} still depends on p because Ψ_{j,0} is a limit taken under the LS asymptotic framework.
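To make the quantities of this section concrete, the following sketch computes the noncentrality matrix and its eigenvalues λ_{j,a} for a candidate j; the function name noncentrality_eigvals and the use of plain NumPy are our illustrative choices.

```python
import numpy as np

def noncentrality_eigvals(X, j, j_star, Theta_star, Sigma_star):
    """Eigenvalues lambda_{j,1} >= ... of the noncentrality matrix
    Sigma_*^{-1/2} Theta_*' X_*' (I_n - P_j) X_* Theta_* Sigma_*^{-1/2};
    its nonzero spectrum coincides with that of Gamma_j' Gamma_j in (2.4)."""
    n = X.shape[0]
    Xj, Xs = X[:, list(j)], X[:, list(j_star)]
    Pj = Xj @ np.linalg.solve(Xj.T @ Xj, Xj.T)
    M = Theta_star.T @ Xs.T @ (np.eye(n) - Pj) @ Xs @ Theta_star
    # Symmetric inverse square root of Sigma_* via its spectral decomposition.
    w, V = np.linalg.eigh(Sigma_star)
    S_inv_half = V @ np.diag(w ** -0.5) @ V.T
    lam = np.linalg.eigvalsh(S_inv_half @ M @ S_inv_half)
    return lam[::-1]    # descending; gamma_j = number of nonzero eigenvalues

# Under assumption 3, (np)^{-1} lambda_{j,1} stays bounded and
# (np)^{-1} lambda_{j,gamma_j} stays away from zero as c_{n,p} -> c_0.
```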

Main results
In this section, we evaluate the asymptotic probability of selecting a model by the IC_m in (2.1). First, we describe the asymptotic probability of selecting the true model j_* under the ordinary asymptotic framework, i.e., the LS asymptotic framework. Using the ideas of [11,12,22,28], we obtain the following Theorem 3.1 (the proof is given in Appendix A.1):

Theorem 3.1. Suppose that assumptions 1 and 2 hold. A variable selection method based on the IC_m is consistent when n → ∞ if the following conditions are satisfied simultaneously:

C1-1. lim_{n→∞} {m(j) − m(j_*)}/n = 0 for all j ∈ J_−;

C1-2. lim_{n→∞} {m(j) − m(j_*)} = ∞ for all j ∈ J_+\{j_*}.

If one of the above two conditions is not satisfied, a variable selection method based on the IC_m is not consistent when n → ∞. Additionally, when m(j) = O(1) as n → ∞ and lim_{n→∞} {m(j) − m(ℓ)} = pm_0(k_j − k_ℓ) for all j, ℓ ∈ J_+, the asymptotic probability of selecting each overspecified model j by the IC_m converges to a positive constant determined by m_0; in particular, the probability of selecting the true model does not go to 1.

These results include the results of [22,35] etc. as special cases for p = 1. Theorem 3.1 reflects the well-known fact that, when n → ∞, the AIC and the AIC_c are not consistent while the BIC and the CAIC are consistent in model selection, as the worked check below illustrates.
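As a worked example, using the penalty terms (2.2) and the two conditions as reconstructed above, with p fixed:

```latex
\begin{align*}
\text{AIC } (j \in J_-):&\quad \frac{m(j)-m(j_*)}{n} = \frac{2p(k_j-k_*)}{n} \to 0
  && \text{(C1-1 holds)},\\
\text{AIC } (j \in J_+\setminus\{j_*\}):&\quad m(j)-m(j_*) = 2p(k_j-k_*) = O(1) \not\to \infty
  && \text{(C1-2 fails)},\\
\text{BIC } (j \in J_-):&\quad \frac{(\log n)\,p(k_j-k_*)}{n} \to 0
  && \text{(C1-1 holds)},\\
\text{BIC } (j \in J_+\setminus\{j_*\}):&\quad (\log n)\,p(k_j-k_*) \to \infty
  && \text{(C1-2 holds)}.
\end{align*}
```

The AIC_c behaves like the AIC here, since m_AICc(j) − m_AIC(j) → 0 as n → ∞ with p fixed, while the CAIC behaves like the BIC.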
However, when the behavior of the information criteria is evaluated under the HD framework, we obtain new properties, as in Theorem 3.2 (the proof is given in Appendix A.2).

Theorem 3.2. Suppose that assumptions 1 and 3 are satisfied. Then, a variable selection method based on the IC_m is consistent when c_{n,p} → c_0 if the following two conditions, C2-1 (concerning the penalty differences m(j) − m(j_*) for j ∈ J_−) and C2-2 (concerning those for j ∈ J_+\{j_*}), are satisfied simultaneously. If the sign ">" becomes "<" in one of the two conditions, a variable selection method based on the IC_m is not consistent when c_{n,p} → c_0.
Note that lim_{c→0} c^{-1} log(1 − c) = −1 and that c^{-1} log(1 − c) is a monotonically decreasing function on 0 ≤ c < 1. From Theorem 3.2, the consistency properties of specific criteria are clarified in the following corollary (the proof is given in Appendix A.4). Corollary 3.1 shows that there is no restriction on c_0 in the condition for consistency of the AIC_c, although c_0 is restricted for the AIC. This indicates that the bias correction to the risk function can have a positive effect on selection of the true model. Moreover, Corollary 3.1 indicates that the BIC and the CAIC are not always consistent in variable selection when c_{n,p} → c_0. If Θ_* is a full-row-rank matrix, γ_j becomes k_{j_+} − k_j ≥ k_* − k_j; since c_0 < 1, it then follows that γ_j > c_0(k_* − k_j). In contrast, if c_0 = 0, then γ_j > c_0(k_* − k_j) = 0 is trivially satisfied. Therefore, variable selection methods based on the BIC and the CAIC are consistent as c_{n,p} → c_0 if Θ_* is a full-row-rank matrix or if c_{n,p} converges to 0. However, if Θ_* is not a full-row-rank matrix and c_0 ∈ (0, 1), we cannot determine whether variable selection methods based on the BIC and the CAIC are consistent as c_{n,p} → c_0.
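As a small worked illustration of how the condition γ_j > c_0(k_* − k_j) restricts c_0, take the numbers from the design used later in Section 4, where γ_j = 1 for all j ∈ J_− and max_{j∈J_−}(k_* − k_j) = 4:

```latex
\begin{align*}
\gamma_j = 1,\quad \max_{j \in J_-}(k_* - k_j) = 4
\;\Longrightarrow\;
\gamma_j > c_0(k_* - k_j) \text{ for all } j \in J_-
\iff c_0 < \tfrac{1}{4}.
\end{align*}
```

So in that design the BIC and the CAIC can fail to be consistent once c_0 exceeds 1/4, even though both are consistent for every fixed p under the LS framework.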
In order to clarify the condition ensuring inconsistency, assumption 3 is imposed in Theorem 3.2; that is, we assume that the eigenvalues of Γ_j'Γ_j are uniformly of the same order and that γ_j is independent of n and p. If the aim is only to derive a sufficient condition for consistency, such a strong assumption is unnecessary. In fact, for evaluating consistency, there is no need to assume the same order for all the eigenvalues of Γ_j'Γ_j; whether an information criterion is consistent depends strongly on the divergence speeds of several eigenvalues. Hence, we clarify the conditions on the divergence speeds of the eigenvalues of Γ_j'Γ_j that ensure consistency. In addition, many researchers have recently been paying close attention to "big data analysis", and hence to the theory of variable selection when the number of candidate models approaches ∞ (see, e.g., [19]). Hence, we derive a sufficient condition for consistency by using the following asymptotic framework:

• A high-dimensional and large-model (HD-LM) asymptotic framework: the HD asymptotic framework under additional conditions restricting the divergence speed of the number of candidate models K.

In the HD-LM asymptotic framework, p always goes to ∞, and it makes no difference whether K is constant or K goes to ∞. This indicates that the HD asymptotic framework is a special case of the HD-LM asymptotic framework. For simplicity, we will write "(n, p) → ∞ simultaneously under the HD-LM asymptotic framework" as "c_{n,p} → c_0 under LM", and lim_{c_{n,p}→c_0,LM} and lim inf_{c_{n,p}→c_0,LM} mean a limit and a limit inferior under the HD-LM asymptotic framework, respectively. Theorem 3.3 states that a variable selection method based on the IC_m is consistent when c_{n,p} → c_0 under LM if the following two conditions hold simultaneously; the proof is given in Appendix A.5.

C3-1. For sufficiently large n, there exist positive constants δ_1, δ_2 such that, for all j ∈ J_−, there exists an integer q ∈ [1, γ_j] for which λ_{j,q}/q² > n^{δ_1} and a second inequality, involving δ_2, the penalty difference m(j) − m(j_*), and β_{j,q}, holds, where j_+ is given by (2.3), γ_j = rank(Γ_j), and β_{j,q} is the geometric mean of the largest q eigenvalues of Γ_jΓ_j', i.e.,

β_{j,q} = (λ_{j,1} ··· λ_{j,q})^{1/q}.

C3-2. For sufficiently large n, there exists a positive constant δ such that, for all j ∈ J_+\{j_*}, the penalty difference m(j) − m(j_*) is bounded below by a quantity involving δ.

Roughly speaking, the existence of δ_2 in condition C3-1 is related to the order of the noncentrality matrix. Assumption 3 is equivalent to the condition that all the eigenvalues of Γ_j'Γ_j are of order O(np); the condition λ_{j,q}/q² > n^{δ_1}, by contrast, does not require the eigenvalues of Γ_j'Γ_j to be uniformly of the same order. Moreover, in Theorem 3.3, k_*, γ_j and K do not have to be bounded. Hence, Theorem 3.3 can be applied to less restrictive situations than Theorem 3.2.
Although we have derived sufficient conditions for consistency in Theorem 3.3, it is hard to check from these conditions whether a given information criterion is consistent. Hence, in order to establish an easier-to-understand formula, we rewrite the conditions by using a limit inferior. Besides, by taking q = 1, we simplify condition C3-1, although the resulting sufficient conditions become more restrictive.
The resulting conditions, C3-1′ and C3-2′, together constitute Corollary 3.2; the proof is given in Appendix A.9. Note that "inf" rather than "min" is used in conditions C3-1′ and C3-2′ because the number of candidate models may go to ∞. Although we cannot check from actual data whether condition C3-1′ is satisfied, we can derive the order of the divergence speed of the maximum eigenvalue of Γ_j'Γ_j that ensures consistency. If that order is small, the possibility that the information criterion is consistent can be regarded as high. Hence, the size of the order helps to assess the quality of an information criterion in the sense of its possibility of having consistency.
Even if k_* is not bounded, Theorem 3.3 and Corollary 3.2 hold. However, in order to state the sufficient conditions for consistency more transparently, we consider the simple case in which k_* is bounded. Then, the conditions under which specific criteria are consistent simplify as in the following corollary (the proof is given in Appendix A.10).

Corollary 3.3. Suppose that assumption 1 is satisfied, and k_* is bounded. (i) A variable selection method based on the AIC is consistent when c_0 < c_a and lim_{c_{n,p}→c_0,LM} log(λ_{j,1}/n) = ∞ for all j ∈ J_−, where c_a is the constant given by (3.2). (ii) A variable selection method based on the AIC_c is consistent when lim_{c_{n,p}→c_0,LM} log(λ_{j,1}/n) = ∞ for all j ∈ J_−. (iii) Variable selection methods based on the BIC and the CAIC are consistent when lim_{c_{n,p}→c_0,LM} log(λ_{j,1}/n)/log n = ∞ for all j ∈ J_−. An example of the noncentrality matrix is shown in Appendix A.11.

From Corollary 3.3, we can see that the AIC_c is consistent if lim_{c_{n,p}→c_0,LM} log(λ_{j,1}/n) = ∞, and the BIC and the CAIC are consistent if lim_{c_{n,p}→c_0,LM} log(λ_{j,1}/n)/log n = ∞. Moreover, the AIC is consistent if c_0 < c_a and lim_{c_{n,p}→c_0,LM} log(λ_{j,1}/n) = ∞. Hence, the AIC and the AIC_c have an advantage over the BIC and the CAIC in the sense of the possibility of having consistency. Moreover, although the AIC is consistent under the restriction c_0 < c_a, there is no such restriction for the AIC_c. Consequently, we can judge that the AIC_c has an advantage over the AIC, the BIC, and the CAIC in the sense of the possibility of having consistency.
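As a worked example of these rates, suppose the standard order λ_{j,1} ≍ np from Section 2 holds (an assumption for illustration, not part of the corollary):

```latex
\begin{align*}
\lambda_{j,1} \asymp np
&\;\Longrightarrow\; \log(\lambda_{j,1}/n) \asymp \log p \to \infty
  && \text{(AIC$_c$ condition holds; the AIC also needs } c_0 < c_a\text{)},\\
p \asymp c\,n \ (c > 0)
&\;\Longrightarrow\; \frac{\log(\lambda_{j,1}/n)}{\log n} \to 1 \neq \infty
  && \text{(BIC/CAIC sufficient condition fails)}.
\end{align*}
```

This matches the possible inconsistency of the BIC and the CAIC when c_0 ∈ (0, 1).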

Numerical study
In this section, we compare the probabilities of selecting the true model by the AIC, AIC_c, BIC and CAIC in (2.2), evaluated by Monte Carlo simulations based on 10,000 replications under several different values of n and p. A set of candidate models was J = {j_1, . . . , j_k}, where j_α = {1, . . . , α} (α = 1, . . . , k). A 1000 × 156 matrix MΦ(156)^{1/2} was generated, where each element of M was independently and identically chosen from U(−1, 1), and Φ(q) is a q × q symmetric matrix whose (a, b)th element is (0.8)^{|a−b|}. Using this matrix, we constructed an n × k matrix of explanatory variables X.

In our numerical study, γ_j = 1 and max_{j∈J_−}(k_* − k_j) = 4 hold. This implies that, when c_0 > 1/4, the inequality γ_j > c_0(k_* − k_j) was not always satisfied for all j ∈ J_−. Thus, the BIC and the CAIC were not consistent in variable selection when c_0 > 1/4 under fixed k.

Table 1 shows the probability of selecting the true model by the AIC, AIC_c, BIC, and CAIC. For n = ∞ or p = ∞, we list the theoretical values obtained from Theorems 3.1, 3.2 and 3.3. The symbol "-" means that the theoretical value is unclear because the sufficient condition for consistency does not hold. In the table, Cases 1, 3, and 5 are the results when n → ∞ under fixed p and k = 10, and Cases 2, 4, 6, 7, and 8 are the results when (n, p) → ∞ under fixed k = 10 and with c_0 = 0.02, 0.1, 0.3, 0.0, and 0.0. Moreover, Cases 9, 10, 11 and 12 are the results when (n, p) → ∞ with c_0 = 0.1 and 0.3, and k = 10 + [n^{1/2} − 10^{1/2}] and k = 10 + [n^{3/4} − 10^{3/4}], where [·] is the Gauss symbol.

From the table, we can see that, in the cases of the AIC and the AIC_c, the greater the dimension and sample size, the greater the probabilities became. Comparing the two, the probabilities for the AIC_c tended to be higher than those for the AIC when n was not small. In the cases of the BIC and the CAIC, the greater the dimension and sample size, the higher the selection probabilities became, with the exception of Case 6. This was because the variable selection methods based on the BIC and the CAIC were not consistent in Case 6. Additionally, when n was small and p was large, the selection probabilities of the BIC and the CAIC were both very low. However, when the BIC and the CAIC were consistent in variable selection, these probabilities became high as n and p increased. Moreover, the above tendencies held even when the number of explanatory variables became large.
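For reference, the following sketch shows how such selection probabilities can be estimated by Monte Carlo. It reuses the hypothetical ic_m function sketched in Section 2, and the uniform design matrix is a stand-in for the paper's construction of X; the function name selection_probability is ours.

```python
import numpy as np

def selection_probability(n, p, k, j_star, Theta_star, Sigma_star,
                          penalty="aic", n_rep=1000, seed=1):
    """Monte Carlo estimate of P(j_hat = j_*) for the nested candidates
    j_alpha = {1, ..., alpha}, alpha = 1, ..., k, as in the numerical study;
    ic_m is the criterion function sketched in Section 2."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, k))   # stand-in for the paper's X
    candidates = [list(range(a + 1)) for a in range(k)]
    hits = 0
    for _ in range(n_rep):
        E = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
        Y = X[:, list(j_star)] @ Theta_star + E
        ics = [ic_m(Y, X, j, penalty) for j in candidates]
        hits += candidates[int(np.argmin(ics))] == list(j_star)
    return hits / n_rep
```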
We simulated several other models and obtained similar results. Since the theoretical difference between using the AIC and the AIC_c occurs when c_{n,p} > 0.8, we should list the numerical results for such a case. However, when c_{n,p} is close to 1, the convergence of the selection probabilities was extremely slow. Thus, we do not show simulation results for dimensions close to the sample size.

Conclusion and discussion
In this paper, we demonstrated that there are cases in which the AIC for the multivariate linear regression model is consistent in variable selection when we approximate the probability of selecting the true model using the HD asymptotic framework. Although the AIC becomes consistent under the restriction c_0 < c_a, the AIC_c becomes consistent without any restriction on c_0. This indicates that correcting the bias of the estimator of the risk function may have a positive effect on the selection of the true model. It is a well-known fact that variable selection methods based on the BIC and the CAIC are consistent if we approximate the probability of selecting the true model using the LS asymptotic framework. However, we found that the BIC and the CAIC can become inconsistent if we approximate the probability of selecting the true model using the HD asymptotic framework.
It is known that the LS asymptotic theory gives a poor approximation when the dimension is large. The HD asymptotic theory gives a better approximation than the LS asymptotic theory when the sample size and the dimension are large, and sometimes even when the dimension is not so large. Hence, the consistency property of the AIC that we demonstrated will be useful for high-dimensional data analysis. Usually, the HD asymptotic theory is used to improve the approximations of the distributions of statistics. However, the results in this paper suggest that new insight can be provided by applying the HD asymptotic theory to high-dimensional data.
From the simulation study, we found that the larger the dimension and sample size, the higher the selection probabilities became. This numerical result implies that using multiple response variables simultaneously in model selection can increase the probability of selecting the true model. In other words, we should not select variables using each response variable separately. That is a strong reason to apply the model selection procedure based on the multivariate linear regression model to high-dimensional data.
In this paper, we considered the case of n > p because Σ̂_j becomes singular when p > n. Unfortunately, n > p is not always satisfied for actual data. If our results can be extended to the case of n ≤ p, we can clarify the conditions for consistency in many infinite-dimensional settings, e.g., time series analysis (see [4,5]), spatio-temporal geostatistical analysis (see [7,8]) and functional data analysis (see [23]). The singularity of Σ̂_j can be avoided by using a ridge-type estimator of the covariance matrix, as demonstrated by [36]. We can expect that an AIC based on such a ridge-type estimator will be consistent in model selection.

Appendix

A.1. The proof of Theorem 3.1

Recall that the LS asymptotic framework is used for proving Theorem 3.1. We can see that Σ̂_j →p Σ_* as n → ∞ holds when j ∈ J_+, and Σ̂_j →p Σ_*^{1/2}Ψ_{j,0}Σ_*^{1/2} + Σ_* as n → ∞ holds when j ∈ J_−, where Ψ_{j,0} = lim_{n→∞} n^{-1}Γ_jΓ_j' and Γ_j is given by (2.4). Notice that Ψ_{j,0} is a positive semidefinite matrix. When lim_{n→∞} {m(j) − m(j_*)}/n = 0 for all j ∈ J_−, we have

n^{-1}{IC_m(j) − IC_m(j_*)} →p log det(I_p + Ψ_{j,0}) > 0.

This result implies that lim_{n→∞} P(IC_m(j_*) > IC_m(j)) = 0 for any j ∈ J_−. Thus, we obtain

lim_{n→∞} P(min_{j∈J_−} IC_m(j) > IC_m(j_*)) = 1.

From here to the end of the proof, we assume j, ℓ ∈ J_+. Let V and Z_j be the p × p and k_j × p matrices defined by

V = n^{-1/2}(E'E − nI_p),  Z_j = (X_j'X_j)^{-1/2}X_j'E,  where E = (Y − X_*Θ_*)Σ_*^{-1/2}.   (A.2)

It is well known that V is asymptotically normal as n → ∞, and Z_j ∼ N_{k_j×p}(O_{k_j,p}, I_{k_j p}). From these matrices, the first term of IC_m(j) can be expanded in terms of V and Z_j. Let z_j be a k_j p-dimensional random vector defined by z_j = vec(Z_j), where vec(A) is an operator that transforms a matrix into a vector by stacking the first to the last columns of A, i.e., vec(A) = (a_1', . . . , a_m')' when A = (a_1, . . . , a_m) (see, e.g., [18, chap. 16.2]). Then, it follows from the expansion that IC_m(j) − IC_m(j_*) − {m(j) − m(j_*)} is bounded in probability. Hence, when lim_{n→∞} {m(j) − m(j_*)} = ∞ holds for all j ∈ J_+\{j_*}, we derive

lim_{n→∞} P(IC_m(j) > IC_m(j_*)) = 1.

A.2. The proof of Theorem 3.2

First, we state a lemma that is used for proving Theorems 3.2 and 3.3 (the proof of the lemma is given after this subsection).
Lemma A.1. Let T = −l(pq)^{-1} log Λ, where Λ is distributed according to the Wilks' lambda distribution Λ_q(p, l + q), and let κ_T^{(s)} be the sth-order cumulant of T. Suppose that p/l → α (a constant) and q/l → 0. If α > 0, then each κ_T^{(s)} converges to a finite limit depending only on α. If α = 0, then κ_T^{(1)} → 1 and κ_T^{(s)} → 0 for s ≥ 2.

Recall that the HD asymptotic framework is used for proving Theorem 3.2. First, we consider the case of j ∈ J_−. Let

A_j = (I_n − P_j)X_*Θ_*Σ_*^{-1/2}.

Notice that rank(A_j) = γ_j because A_j'A_j = Γ_jΓ_j' and rank(Γ_j) = γ_j, where Γ_j is given by (2.4). By using a singular value decomposition, A_j can be rewritten as

A_j = H_j L_j^{1/2} G_j',

where H_j and G_j are n × γ_j and p × γ_j matrices satisfying H_j'H_j = I_{γ_j} and G_j'G_j = I_{γ_j}, respectively, and L_j is a γ_j × γ_j diagonal matrix whose diagonal elements are the squared singular values of A_j. By using A_j and E given by (A.2), the relevant sum-of-squares matrix is decomposed into W_1 + W_2 + W_3. It follows from the equations P_{j_*}X_* = X_* and P_jP_{j_+} = P_j, together with A_j'P_j = O_{p,n}, that H_j'P_j = O_{γ_j,n} and H_j'P_{j_+} = H_j'. From these results and the multivariate version of the Cochran theorem (see, e.g., [30, chap. 2.8]), W_1, W_2, and W_3 are p × p mutually independent random matrices distributed according to Wishart or noncentral Wishart distributions, and U_1, U_2, U_3, and U_4 are random matrices distributed according to Wishart or noncentral Wishart distributions, where U_1 and U_2 are mutually independent, and U_3 and U_4 are also mutually independent. From the definition of the noncentral Wishart distribution, a different expression of U_2 can be given in terms of a standard normal matrix Z. Recall that lim sup_{c_{n,p}→c_0} np λ_{j,γ_j}^{-1} < ∞ is derived from assumption 3. It follows from this result and E[Z'Z] = pI_{γ_j} that equations (A.10)-(A.12) hold when c_{n,p} → c_0 ∈ [0, 1), from which we derive the convergence in probability in (A.13). Combining equations (A.9), (A.13) and (A.14) yields (A.15). Using these results on convergence in probability, the first and second terms in (A.7) are expanded as in (A.17), and using the same idea as in the derivation of (A.16), the expansion (A.18) can be shown. From the results (A.17) and (A.18), when condition C2-1 holds, the difference between the information criteria of the model j and of the true model j_* diverges to ∞ in probability.

A.5. The proof of Theorem 3.3

Recall that the HD-LM asymptotic framework is used for proving Theorem 3.3. First, we consider the case of j ∈ J_−. Let d_j = k_{j_+} − k_j. As in the proof of Theorem 3.2, we represent the relevant sum-of-squares matrix through W_1 and W_2, where W_1 and W_2 are independent, U_1 and U_2 are also independent, and Ω_j = diag(λ_{j,1}, . . . , λ_{j,d_j}). It should be kept in mind that we recycle some notation to denote random matrices different from those in the proof of Theorem 3.2. Let q be the integer associated with j in condition C3-1, and express U_2 in terms of the mutually independent random matrices Z and U_3. Here, Γ_{j,q} = (Ω_{j,q}^{1/2}, O_{q,d_j−q})' and Ω_{j,q} = diag(λ_{j,1}, . . . , λ_{j,q}). We can show that V_1 and Z are independent, and V_1 ∼ W_q(n − p − k_{j_+} + q, I_q) (see [16, p. 57, th. 3.2.4]). Let δ_3 = δ_2/2 and h = 1 − exp(−δ_2/2); then 0 < h < 1. Consider the event defined in (A.29).
Since tr{(Γ_{j,q}'Γ_{j,q})^{-1/2}Γ_{j,q}'ZZ'Γ_{j,q}(Γ_{j,q}'Γ_{j,q})^{-1/2}} is distributed according to the chi-square distribution with q² degrees of freedom, using Lemma A.2 and condition C3-1 we obtain the required tail bound for sufficiently large n. Using Lemma A.3, we bound P(log{n − p − k_{j_+} + (q + 1)/2} + δ_3 < q^{-1} log det(V_1)). Next, we consider the case of j ∈ J_+\{j_*}. Let Λ be a random variable distributed according to Λ_{r_j}(p, n − k_* − p), which is given in (A.20). The required bound is expressed through the gamma function Γ(x); taking the minimum with respect to r, we get the first inequality. The second inequality can be obtained from the fact that 1 + log x < x.

A.7. The proof of Lemma A.3
First, we state a lemma that is used for proving Lemma A.3 (the proof of the lemma is given after this subsection).
Using the fact that det(V) ∼ ∏_{i=1}^q χ²_{n−q+i}, where χ²_{n−q+1}, . . . , χ²_{n−q+q} are independent random variables and χ²_{n−q+i} is distributed according to the chi-square distribution with n − q + i degrees of freedom (see [21, p. 100, th. 3.2.15]), the moment generating function of log det(V) can be written in terms of gamma functions, and the cumulants of log det(V) are expressed through ψ^{(s)}(a), the sth-order derivative of ψ(a). In particular, the first-order cumulant of log det(V) is given by

κ^{(1)} = ∑_{i=1}^q {ψ((n − q + i)/2) + log 2}.

Using Lemma A.4 and the fact that log x is an increasing and concave function of x,

q log(n − q) < ∑_{i=1}^q log(n − q + i − 1) < κ^{(1)} < ∑_{i=1}^q log(n − q + i) < q log{n − q + (q + 1)/2}.
Since f(x, y) = 2^s(n − q + x + 2y)^{−s} is a decreasing and convex function of x and y, the sums can be bounded by integrals; calculating the integrals and taking the limits, we obtain (A.24). The 2lth-order central moment is the sum of products of cumulants κ_T^{(s_1)} ··· κ_T^{(s_k)} such that s_1 + ··· + s_k = 2l. Hence the order of the 2lth-order central moment is equal to the order of (κ_T^{(2)})^l.