Estimation of covariance and precision matrices under scale-invariant quadratic loss in high dimension

Abstract: The problem of estimating the covariance and precision matrices of multivariate normal distributions is addressed when both the sample size and the dimension of the variables are large. Estimation of the precision matrix is important in various statistical inference procedures, including the Fisher linear discriminant analysis, confidence regions based on the Mahalanobis distance, and others. A standard estimator is the inverse of the sample covariance matrix, but it may be unstable, or cannot be defined, in high dimensions. Ridge-type (and adaptive ridge-type) estimators are alternative procedures which are useful and stable for large dimensions. However, we are faced with questions about how to choose the ridge parameters and their estimators, and how to set up the asymptotic order of the ridge functions in high-dimensional cases. In this paper, we consider general types of ridge estimators for the covariance and precision matrices, and derive asymptotic expansions of their risk functions. We then suggest ridge functions such that the second-order terms of the risks of the ridge estimators are smaller than those of the risks of the standard estimators.


Introduction
Statistical inference in high dimensions has received much attention in recent years and has been actively studied from both theoretical and practical aspects in the literature. In particular, an estimate of the precision matrix is required in many multivariate inference procedures, including the Fisher linear discriminant analysis, confidence intervals based on the Mahalanobis distance, and the weighted least squares estimator in multivariate linear regression models. A standard estimator of the precision matrix based on the sample covariance matrix is likely to be unstable when the dimension p is large and close to the sample size N, even if N > p. In the case of p > N, the inverse of the sample covariance matrix cannot be defined, and an estimator based on the Moore-Penrose generalized inverse of the sample covariance matrix has been used in Srivastava [11]. Another useful and stable estimator of the precision matrix is a ridge estimator, and various variants of it have been used in the literature; see, for example, Ledoit and Wolf [8], [9], Fisher and Sun [3] and Bai and Shi [1]. However, the superiority of ridge-type estimators over the standard estimators has not been studied, except by Kubokawa and Srivastava [6], who obtained exact conditions for ridge-type estimators to have uniformly smaller risks than the standard estimator; their results are limited to specific ridge functions and special loss functions.
To specify the problem considered here, let y_1, . . . , y_N be independently and identically distributed (i.i.d.) as a multivariate normal distribution with mean vector µ and p × p positive definite covariance matrix Σ, denoted by N_p(µ, Σ). Define V = Σ_{i=1}^N (y_i − ȳ)(y_i − ȳ)^t for ȳ = N^{-1} Σ_{i=1}^N y_i, and let n = N − 1. Then, in the case of n ≥ p, V has a Wishart distribution with mean nΣ and n degrees of freedom, denoted by W_p(Σ, n). When n < p, it is called a singular Wishart distribution, whose distribution has been recently studied by Srivastava [10]. In many inference procedures, an estimate of the precision matrix Σ^{-1} is required. In the case that n > p, the standard estimator of Σ^{-1} is Σ̂^{-1}_0 = cV^{-1} for a positive constant c, but it may not be stable when p is large and close to n. In the case of p > n, the estimator cV^{-1} cannot be defined. Srivastava [11] used the estimator cV^+ based on the Moore-Penrose inverse V^+ of V.
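The quantities just introduced are easy to set up numerically. The following NumPy sketch computes the scatter matrix V from a sample and forms cV^{-1} for n > p, or cV^+ when V is singular; the particular value c = 1 is purely illustrative, since the paper leaves c generic at this point.

```python
import numpy as np

def scatter_matrix(Y):
    """V = sum_i (y_i - ybar)(y_i - ybar)^t for the N x p data matrix Y."""
    Yc = Y - Y.mean(axis=0)
    return Yc.T @ Yc

def standard_precision(V, c, n, p):
    """c V^{-1} when V is invertible (n > p); otherwise c V^+ with the
    Moore-Penrose generalized inverse, as in Srivastava [11]."""
    return c * (np.linalg.inv(V) if n > p else np.linalg.pinv(V))

rng = np.random.default_rng(0)
N, p = 30, 8
Y = rng.standard_normal((N, p))      # toy sample with Sigma = I
n = N - 1
V = scatter_matrix(Y)
P = standard_precision(V, c=1.0, n=n, p=p)   # c = 1 purely for illustration
```

For p > n the same call falls back to `np.linalg.pinv`, which is the V^+ route used later in Section 3.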
In this paper, we address the problems of estimating both the covariance matrix Σ and the precision matrix Σ^{-1}, and consider general ridge-type estimators obtained by replacing V with the ridge modification V + dΛ̂ and scaling by a positive constant c, where c and d are positive constants based on (n, p), and Λ̂ is a p × p positive definite matrix based on V. Examples of the ridge function Λ̂ include Λ̂ = λ̂I, Λ̂ = diag(λ̂_1, . . . , λ̂_p) and others, where λ̂, λ̂_1, . . . , λ̂_p are functions of V. We evaluate the difference of the risk functions of the ridge-type and the standard estimators asymptotically for large n and p, where the risk functions are measured with respect to scale-invariant quadratic loss functions. We then derive conditions on d and Λ̂ under which the ridge-type estimators improve on the standard estimators asymptotically. The paper is organized as follows. Section 2 treats estimation of the covariance matrix Σ, and gives asymptotic evaluations of the risks of the ridge estimators when (n, p) → ∞. The estimation of the precision matrix Σ^{-1} is dealt with in Section 3. For estimation of the covariance matrix relative to the scale-invariant quadratic loss, we can handle both cases n > p and p > n in a unified framework. For the precision matrix, however, the ridge-type estimator has different properties in the two cases, and the standard estimator is cV^+ in the case of p > n, so we need to treat the two cases separately. Some examples of ridge functions Λ̂ are given in Section 4. The risk performances of the ridge-type estimators are investigated by simulation in Section 5. Concluding remarks are given in Section 6. Some technical tools and proofs are given in the Appendix.

A unified result in estimation of covariance
Let X = (x_1, . . . , x_n) be a p × n random matrix such that x_i ∼ N_p(0, Σ) for i = 1, . . . , n, where Σ is an unknown positive definite matrix. Let V = XX′. In the case of n ≥ p, V is distributed as a Wishart distribution W_p(Σ, n) with n degrees of freedom. We first consider the estimation of the covariance matrix Σ in terms of the risk function R(Σ, Σ̂) = E[L(Σ, Σ̂)] for the scale-invariant quadratic loss L(Σ, Σ̂) = tr[(Σ̂Σ^{-1} − I_p)^2]. The loss function is invariant under the scale transformations Σ → AΣA′ and Σ̂ → AΣ̂A′ for any nonsingular matrix A. A standard estimator is of the form cV for c ∈ R_+, where R_+ is the set of positive real numbers; the optimal c in terms of the risk is c_1 = 1/(n + p + 1), and the risk of the estimator Σ̂_0 = c_1 V is R(Σ, Σ̂_0) = p(p + 1)/(n + p + 1). This can easily be seen for n ≥ p, and it follows from Konno [5] for p > n. To improve on the estimator c_1 V, we consider the class of estimators Σ̂_Λ = c_1(V + dΛ̂), where Λ̂ is a p × p positive definite matrix based on V, and d is a positive constant. The risk difference between the two estimators Σ̂_Λ and Σ̂_0 is denoted by ∆ = R(Σ, Σ̂_Λ) − R(Σ, Σ̂_0). To evaluate ∆ asymptotically, we assume the following conditions: (A1) Assume that (n, p) → ∞. Throughout the paper, δ given in the following is a constant satisfying 0 < δ ≤ 1. Assume either (A1-1) or (A1-2) for the order of (n, p). Some examples of statistics Λ̂ satisfying condition (A3) will be given in Section 4.
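As a concrete illustration of this class, the following sketch forms Σ̂_Λ = c_1(V + dΛ̂) with the simple scalar choice Λ̂ = λ̂I, where λ̂ = tr(V)/(np) estimates tr(Σ)/p; this particular λ̂ is an assumption made here for illustration, not necessarily one of the paper's Section 4 examples.

```python
import numpy as np

def ridge_cov(V, n, p, d):
    """Ridge-type covariance estimator c1 (V + d Lam_hat) with c1 = 1/(n+p+1).
    Lam_hat = lam_hat * I with lam_hat = tr(V)/(n p) is one simple scalar choice
    (an estimator of tr(Sigma)/p); the paper allows general Lam_hat based on V."""
    c1 = 1.0 / (n + p + 1)
    lam_hat = np.trace(V) / (n * p)
    return c1 * (V + d * lam_hat * np.eye(p))

rng = np.random.default_rng(1)
n, p = 20, 50                       # p > n: V itself is singular here
X = rng.standard_normal((p, n))
V = X @ X.T
S = ridge_cov(V, n, p, d=p)
# the ridge term makes S positive definite even though rank(V) = n < p
```

Unlike c_1 V, the ridge estimate is nonsingular in the p > n case, which is precisely the appeal of this class.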
In this paper, we use the following notation.

Theorem 2.1. Assume conditions (A1)-(A3). Then, the risk difference of the estimators Σ̂_Λ = c_1(V + dΛ̂) and Σ̂_0 is approximated as

∆ = { d^2 tr[(ΛΣ^{-1})^2] − 2dp tr[ΛΣ^{-1}] } / (n + p)^2 + R_{n,p},   (2.2)

where R_{n,p} is a remainder term. Proof. The risk difference of the estimators Σ̂_Λ = c_1(V + dΛ̂) and Σ̂_0 is written as

We shall evaluate each term on the r.h.s. of the last equality in (2.3). The first term is evaluated, with the help of the Cauchy-Schwarz inequality, as

−(2dp/(n + p)^2) tr[ΛΣ^{-1}] + O(dp√(p/n)/(n + p)^3).

Under condition (A3), the same evaluation remains valid when the fixed Λ is replaced with the statistic Λ̂. In the case of n > p, a consistent estimator of Λ is given by Λ̂_3 of Example 4.3 in Section 4. However, it is not easy to derive a consistent estimator of Λ when p > n; that is, we could not provide an estimator which minimizes the leading term in (2.2) when d = p and p > n. The approximation given in Theorem 2.1 shows that the leading term in (2.2) is negative when the order of d is less than p.
Corollary 2.1. Assume conditions (A1)-(A3) and d = o(p). Then Σ̂_Λ = c_1(V + dΛ̂) improves on Σ̂_0 in terms of the second-order approximation of the risk for any estimator Λ̂ satisfying condition (A3). For instance, take d = 1 and d = max{√n, √p} ≡ d_{n,p}. Then, from (2.2), it follows that for d = 1,

∆ = L_{n,p} + R_{n,p},  for L_{n,p} = −(2p/(n + p)^2) tr[ΛΣ^{-1}],   (2.12)

where L_{n,p} = O(n^{2(δ−1)}), R_{n,p} = O(n^{−2+3δ/2}) for p = O(n^δ), and L_{n,p} = O(1), R_{n,p} = O(p^{−δ/2}) for n = O(p^δ), for 0 < δ ≤ 1. Also, for d = d_{n,p},

∆ = L_{n,p} + R_{n,p},  for L_{n,p} = −(2p d_{n,p}/(n + p)^2) tr[ΛΣ^{-1}],   (2.13)

where L_{n,p} = O(n^{2δ−3/2}), R_{n,p} = O(n^{−1+δ}) for p = O(n^δ) and 1/2 < δ ≤ 1, and L_{n,p} = O(p^{1/2}), R_{n,p} = O(p^{(1−δ)/2}) for n = O(p^δ) and 0 < δ ≤ 1. When c = 1/n, V/n is an unbiased estimator of Σ, and the corresponding risk difference is minimized at d = 0. This means that the unbiased estimator cannot be improved on by the ridge-type estimator under the quadratic loss. The optimal c among the estimators cV is c_1 = 1/(n + p + 1), and the second term of the corresponding risk expansion corresponds to the leading term in (2.2); Theorem 2.1 guarantees that the risk of c_1(V + dΛ̂) with an estimator Λ̂ of Λ can be approximated by the risk of c_1(V + dΛ) for fixed Λ under conditions (A1)-(A3). Noting that c_1 V shrinks n^{-1}V toward zero, we can see from Theorem 2.1 that there is room to improve on c_1 V by expanding it with the ridge term dΛ̂.
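The improvement asserted by the corollary can be seen in a small Monte Carlo experiment. The sketch below compares the empirical loss of Σ̂_0 = c_1 V and the ridge estimator with d = d_{n,p}, using the loss tr[(Σ̂Σ^{-1} − I)^2] with Σ = I as a toy case; the scalar ridge function λ̂ = tr(V)/(np) is again an illustrative assumption rather than one of the paper's specific choices.

```python
import numpy as np

def loss(Sig_hat, Sig_inv):
    """Scale-invariant quadratic loss tr[(Sig_hat Sig^{-1} - I_p)^2]."""
    M = Sig_hat @ Sig_inv - np.eye(Sig_inv.shape[0])
    return np.trace(M @ M)

rng = np.random.default_rng(2)
n, p, reps = 40, 10, 500
Sig_inv = np.eye(p)                       # true Sigma = I for the toy comparison
c1 = 1.0 / (n + p + 1)
d = max(np.sqrt(n), np.sqrt(p))           # d_{n,p} of the corollary
r0 = rr = 0.0
for _ in range(reps):
    X = rng.standard_normal((p, n))
    V = X @ X.T
    lam = np.trace(V) / (n * p)           # illustrative scalar ridge function
    r0 += loss(c1 * V, Sig_inv)
    rr += loss(c1 * (V + d * lam * np.eye(p)), Sig_inv)
# paired over the same draws of V, the ridge average rr/reps typically sits
# below the standard average r0/reps, matching the sign of the leading term
```

Because both losses are computed on the same draws, the comparison is paired and the negative leading term shows up clearly even with a few hundred replications.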

Estimation of precision
In this section, we consider the estimation of the precision matrix Σ^{-1}. For estimation of the covariance matrix, we treated both cases n > p and p > n in a unified framework. For the precision matrix, however, the ridge-type estimator has different properties in the two cases, so we need to treat them separately.

Case of n > p
We begin by considering the case of n > p. The estimation of the precision matrix Σ^{-1} is addressed in terms of the risk function R*(Σ, Σ̂^{-1}) = E[tr{(Σ̂^{-1}Σ − I_p)^2}], which is invariant under the scale transformation. A standard estimator is of the form Σ̂^{-1}_0 = cV^{-1} for c ∈ R_+. A drawback of Σ̂^{-1}_0 is that it may be unstable when p is large and n − p is small. To modify the estimator Σ̂^{-1}_0, we consider a class of ridge-type estimators Σ̂^{-1}_Λ based on V + dΛ̂, where Λ̂ is a p × p positive definite matrix based on V satisfying condition (A3). We then investigate whether Σ̂^{-1}_Λ improves on Σ̂^{-1}_0. Assume the following condition (A4). In the case of δ < 1, we can show the following theorem, which will be proved in the appendix.
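The display defining this class did not survive extraction; a form consistent with (3.10) below is Σ̂^{-1}_Λ = c(V + dΛ̂)^{-1}, which the following sketch implements with the constant c = m(m − 3)/(n − 1), m = n − p, that Section 5 uses for n > p. Both the exact form of the estimator and this pairing of c with it are assumptions made here for illustration.

```python
import numpy as np

def ridge_precision(V, n, p, d, Lam):
    """Assumed ridge-type precision estimator c (V + d Lam)^{-1}, with
    c = m(m-3)/(n-1), m = n - p, the constant quoted in Section 5 for n > p."""
    m = n - p
    c = m * (m - 3) / (n - 1.0)
    return c * np.linalg.inv(V + d * Lam)

rng = np.random.default_rng(3)
n, p = 60, 10
X = rng.standard_normal((p, n))
V = X @ X.T
P = ridge_precision(V, n, p, d=1.0, Lam=np.eye(p))
```

The ridge term dΛ̂ keeps the matrix being inverted well conditioned even when n − p is small, which is the instability the section is addressing.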
In the case of δ = 1, we assume the condition given by Then, we can get the following result which will be shown in the appendix.
When d = p, this expression cannot be approximated further without an additional condition, under which the risk difference of Σ̂^{-1}_Λ and Σ̂^{-1}_0 can be further approximated. Since the leading term in (3.3) is a quadratic function of d, it is minimized at d = p tr[ΛΣ^{-1}]/tr[(ΛΣ^{-1})^2], which is the same as in the estimation of Σ. This implies that the optimal d is of order p under condition (A4). When Λ is of the form Λ = λI for a positive parameter λ, the minimizing dλ is dλ = p tr[Σ^{-1}]/tr[Σ^{-2}], which is estimated by the consistent estimator Λ̂_3 of Example 4.3 in Section 4 when n > p.
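The minimizing value dλ = p tr[Σ^{-1}]/tr[Σ^{-2}] is an oracle quantity, since it involves the unknown Σ; in practice the paper replaces it by the consistent estimator Λ̂_3 of Section 4 (not reproduced here). A direct computation of the oracle value:

```python
import numpy as np

def oracle_d_lambda(Sig_inv):
    """Oracle minimizer d*lambda = p tr(Sigma^{-1}) / tr(Sigma^{-2})
    for the scalar ridge function Lam = lambda I."""
    p = Sig_inv.shape[0]
    return p * np.trace(Sig_inv) / np.trace(Sig_inv @ Sig_inv)

# worked example with Sigma = diag(1, 2, 4):
Sig_inv = np.diag(1.0 / np.array([1.0, 2.0, 4.0]))
d = oracle_d_lambda(Sig_inv)   # 3 * 1.75 / 1.3125 = 4.0
```

Note that dλ is of order p when the eigenvalues of Σ stay bounded away from 0 and ∞, consistent with the statement that the optimal d is of order p.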
The approximations given in Theorems 3.1 and 3.2 show that the leading terms in (3.3) and (3.5) are negative when d = o(p).

Case of p > n
We next consider the case of p > n in the estimation of the precision matrix Σ^{-1}. In this case, V is singular, and the inverse of V does not exist. A feasible estimator is the ridge-type estimator given in (3.6). Relative to this loss function, an approximation of the risk is provided under the following conditions. Also assume that Λ̂ satisfies condition (A6). Then, the risk of the estimator Σ̂^{-1}_Λ given in (3.6) is approximated as in (3.8). The scale-invariant quadratic loss of Σ̂^{-1}_Λ is expanded so that the second term on the r.h.s. of the last equality is of order O_p(n) from condition (3.7). The third term is evaluated similarly. We next evaluate the first term on the r.h.s. of (3.9), using condition (A6). This shows (3.8).
Concerning condition (3.7), it is seen that if the maximum eigenvalue Ch_max(Λ̂) is suitably bounded for large (n, p) satisfying (A1-2), then condition (3.7) is satisfied. As an example of Λ̂, we consider the case of Λ̂ = λ̂I_p for a positive scalar function λ̂ of V; namely, the estimator given in (3.6) with c = p is (3.10). Conditions on λ̂ for the approximation of the risk given in (3.8) are provided by Theorem 3.3. For the ridge estimator (3.10), we can also use an approach based on the eigenvalue decomposition. Since this approach is useful for understanding ridge estimators in the case of p > n, we here describe the expressions based on the eigenvalue decomposition for the ridge-type estimator and the risk.
where ℓ_1 ≥ · · · ≥ ℓ_n, and H_1 is a p × n matrix satisfying H′_1 H_1 = I_n. Then, the estimator (3.10) is expressed through this decomposition, and the risk function is written accordingly. The risk expression (3.13) then provides conditions on λ̂ for the approximation of the risk given in (3.14), where a_1 = tr[Σ]/p and a_2 = tr[Σ^2]/p.
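The displayed expressions (3.11)-(3.13) did not survive extraction, but the identity underlying them is standard and can be checked directly: with V = H_1 diag(ℓ) H′_1 and p > n, one has (V + λI)^{-1} = H_1 diag(1/(ℓ_i + λ)) H′_1 + λ^{-1}(I − H_1 H′_1). A numerical verification of that identity:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 5, 12, 2.5
X = rng.standard_normal((p, n))
V = X @ X.T                          # rank n < p, so V itself is singular

# eigenvalue decomposition V = H1 diag(l) H1', with H1: p x n, H1'H1 = I_n
w, H = np.linalg.eigh(V)
l, H1 = w[-n:], H[:, -n:]            # keep the n nonzero eigenvalues

# (V + lam I)^{-1} = H1 diag(1/(l+lam)) H1' + (1/lam)(I - H1 H1')
inv_a = H1 @ np.diag(1.0 / (l + lam)) @ H1.T + (np.eye(p) - H1 @ H1.T) / lam
inv_b = np.linalg.inv(V + lam * np.eye(p))
```

The decomposition makes explicit that the ridge estimator shrinks the sample eigenvalues ℓ_i toward λ and acts as λ^{-1} on the null space of V, which is why it remains well defined for p > n.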
The approach based on the eigenvalue decomposition enables us to approximate the risk of the estimator Σ̂^{-1}_0 = pV^+, where V^+ is the Moore-Penrose generalized inverse of V. Using the decomposition (3.11), we can rewrite pV^+ as
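In the same notation, the Moore-Penrose inverse inverts only the nonzero part of the spectrum, so pV^+ = p H_1 diag(1/ℓ_i) H′_1; this can be checked against a generic pseudo-inverse routine:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 5, 12
X = rng.standard_normal((p, n))
V = X @ X.T                          # singular: rank n < p

w, H = np.linalg.eigh(V)
l, H1 = w[-n:], H[:, -n:]            # the n nonzero eigenvalues and their vectors

# Moore-Penrose inverse inverts only the nonzero part of the spectrum
Vplus = H1 @ np.diag(1.0 / l) @ H1.T
```

Comparing with the ridge expression above, pV^+ corresponds to the λ → 0 limit on the range of V while simply annihilating its null space, instead of the λ^{-1} inflation that the ridge estimator applies there.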

The risk function of Σ̂^{-1}_0 can then be evaluated. It follows from Lemma A.1 that the two remainder terms are both of order O(p^{δ−1}) when n = O(p^δ) for 0 < δ < 1. Hence, we get the following proposition.
The leading term in (3.16) is minimized at λ = tr[Σ^2]/tr[Σ] = a_2/a_1. A consistent estimator of Λ = λI is given by Λ̂_2 of Example 4.2 in the next section. Kubokawa and Srivastava [6] showed that pV^+ is the best estimator among aV^+ for a ∈ R_+ relative to the loss L_KS(Σ, Σ̂^{-1}), and established the exact dominance result that pV^+ is dominated by a ridge-type estimator relative to that loss, while the corresponding estimator has a larger risk than Σ̂^{-1}_0 under the loss used here.
Examples of ridge functions

Based on these statistics, we consider the statistic Λ̂_3. Similarly to (4.3), we can see that Λ̂_3 satisfies condition (A3). It is noted that pΛ̂_3 provides an optimal estimator which minimizes the leading terms in (2.2) and (3.3) when n > p. Next, consider Λ̂_4 based on the diagonal of V, where v_ii is the i-th diagonal element of V. Then, Λ̂_4 is an unbiased estimator of Λ = diag(σ_11, . . . , σ_pp). We shall verify conditions (A3) and (A6). Note that v_ii/σ_ii ∼ χ²_n and that E[(v_ii/n − σ_ii)²] = 2σ²_ii/n. For (A3), the required moment conditions are seen to hold. Hence, condition (A6) holds for Λ̂_4. For condition (3.7), it is noted that it is satisfied if the trace of Λ̂_4 is suitably bounded.

Simulation studies
We now investigate the numerical performances of the risk functions of the ridge-type estimators through simulation.
As a structure of the covariance matrix, we consider a matrix built from a constant ρ on the interval (−1, 1) and σ_i = 3 + 0.2(−1)^{i−1}(p − i + 1)/p. As a model for the simulation experiments, we treat the following three cases for the random variables x_i, i = 1, . . . , n, where x_i is generated as x_i = Σ^{1/2} z_i for z_i = (z_{1i}, . . . , z_{pi})′ with z_{1i}, . . . , z_{pi} mutually independent.
(Case 1) z_ij ∼ N(0, 1); (Case 2) z_ij = (u_ij − 2)/2 for u_ij ∼ χ²_2; (Case 3) z_ij = (u_ij − 8)/4 for u_ij ∼ χ²_8. The last two cases treat non-normal distributions. Since the skewness and kurtosis (K_4 + 3) of χ²_m are, respectively, (8/m)^{1/2} and 3 + 12/m, it is noted that χ²_2 has higher skewness and kurtosis than χ²_8. Let V = XX′ for X = (x_1, . . . , x_n). Then, for estimation of Σ, we calculate the four kinds of ridge estimators Σ̂_{Λ,i} = c(V + dΛ̂_i) for the Λ̂_i given in (4.1), (4.2), (4.5) and (4.6), which are denoted by Rid_1, Rid_2, Rid_3 and Rid_4. In the estimation of Σ, c is given by c = 1/(n + p + 1). As values of d, we treat three cases: d = 1, p and d_{n,p} for d_{n,p} = max{√n, √p}. We use the same notation for estimation of Σ^{-1}, where the constant c is c = m(m − 3)/(n − 1) with m = n − p in the case of n > p, and c = p for p > n. It is noted that Λ̂_3, i.e. Rid_3, is not available for p > n.
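The data-generating step of the experiment can be sketched as follows. The σ_i and the standardized χ² cases follow the description above; the correlation structure C_ij = ρ^{|i−j|} is assumed here purely for illustration, since the paper's display for Σ is not fully reproduced above.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, rho = 50, 100, 0.2
# sigma_i = 3 + 0.2(-1)^{i-1}(p - i + 1)/p, i = 1..p
sig = 3.0 + 0.2 * (-1.0) ** np.arange(p) * (p - np.arange(p)) / p
# AR(1)-type correlation C_ij = rho^{|i-j|}: an illustrative assumption
C = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Sig = np.outer(sig, sig) * C
root = np.linalg.cholesky(Sig)       # a matrix square root of Sigma

def draw_z(case):
    """z_ij: N(0,1) (Case 1) or standardized chi^2_m with m = 2 or 8 (Cases 2, 3)."""
    if case == 1:
        return rng.standard_normal((p, n))
    m = 2 if case == 2 else 8
    return (rng.chisquare(m, size=(p, n)) - m) / np.sqrt(2 * m)

X = root @ draw_z(3)                 # columns x_i = Sigma^{1/2} z_i
V = X @ X.T
```

Standardizing χ²_m by its mean m and standard deviation √(2m) keeps all three cases on the same covariance scale, so the comparison isolates the effect of skewness and kurtosis.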
The simulation experiments are carried out under the above model for (n, p) = (200, 20), (100, 20), (50, 100) and (80, 100) and ρ = 0.2. Based on 10,000 replications, we calculate averages of the Relative Risk Gain of the ridge estimators over the standard estimator. The simulation results for estimation of Σ are reported in Table 1. In the cases of d = 1 and d = d_{n,p}, the ridge estimators are better than the standard estimator. This agrees with the analytical results given in Corollary 2.1, although the improvements for d = 1 are quite small. Table 1 also reveals that the ridge estimator Rid_1 performs better than the others. Thus, the ridge estimator Rid_1 with d = p is recommended for estimating Σ. The simulation results for estimation of Σ^{-1} are reported in Table 2 for n > p and Table 3 for p > n. In the case of n > p, the improvements of the ridge estimators with d = 1 over the standard estimator are small, while the performances of the ridge estimators with d = d_{n,p} are good. In the case of p > n, the ridge estimator Rid_1 with d = p has a slightly better performance. Thus, we can use the ridge estimator Rid_1 with d = d_{n,p} for n > p and d = p for p > n.
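The defining display of the Relative Risk Gain did not survive extraction; a natural reading is the proportional reduction of estimated risk relative to the standard estimator, and that assumed definition is what the following helper computes.

```python
import numpy as np

def relative_risk_gain(losses_std, losses_ridge):
    """RRG = 1 - (average ridge loss)/(average standard loss).
    This definition is an assumption; the paper's display is not reproduced above."""
    return 1.0 - np.mean(losses_ridge) / np.mean(losses_std)

# e.g. with per-replication losses collected in two lists:
g = relative_risk_gain([2.0, 2.2], [1.5, 1.7])   # positive: the ridge estimator wins
```

Under this convention a positive value means the ridge estimator has the smaller estimated risk, matching the way the tables are read in the text.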

Concluding remarks
In this paper, we have considered estimation of the covariance and precision matrices by the ridge-type estimators, and have derived asymptotic expansions of their risk functions relative to the scale-invariant quadratic loss functions when the sample size and the dimension are very large. These expansions clarify the conditions for the ridge-type estimators to have smaller risks than the standard estimators in terms of the second-order terms.
The conditions for the improvement depend on the choice of the ridge function Λ̂ and the order of d. Namely, in estimation of the covariance matrix, if inequality (6.1) holds, then the ridge-type estimators improve on the standard estimators asymptotically relative to the scale-invariant quadratic loss function in both cases n > p and p > n. It is interesting to note that in estimation of the precision matrix, under the same condition (6.1), the ridge-type estimator improves on the standard estimator asymptotically in the case of n > p. However, the condition for the improvement in estimation of the precision matrix in the case of p > n is slightly different from (6.1). Although condition (6.1) always holds asymptotically when d = o(p), it depends on ΛΣ^{-1} in the case of d = p. Various variants of the ridge-type estimators have been investigated through the performances of their risk functions by simulation.
We would like to mention the loss functions used for measuring the estimation errors of the estimators. In this paper, we have used the scale-invariant quadratic loss functions for estimating Σ and Σ^{-1}, respectively. Other scale-invariant loss functions are the Stein loss functions. Finally, it is noted that the validity of the asymptotic expansions is not discussed here. All the results in this paper are based on major terms obtained by Taylor series expansions. Although this paper provides the second-order approximations without their validity, more conditions and many more steps would be needed to establish the validity of the second-order approximations.

Acknowledgments. This research was supported in part by Grant-in-Aid for Scientific Research Nos. 21540114 and 23243039 from the Japan Society for the Promotion of Science.

A.1. Identities useful for evaluation of moments
The following identity, derived by Konno [5], is useful. It is related to the Stein-Haff identity given by Stein [12] and Haff [4] for n > p, but it can be used in both cases n > p and n ≤ p. Let X = (x_1, . . . , x_n) be a p × n random matrix such that V = XX′ and x_1, . . . , x_n are i.i.d. as N_p(0, Σ). Let G(V) be a p × p matrix of functions of V. Then, Konno [5] derived the identity, where ∇_X = (∂/∂X_ij) for X = (X_ij).
In the case of n > p, we can use the Stein-Haff identity to evaluate higher moments of W = Σ^{-1/2} V Σ^{-1/2}. Let G(W) be a p × p matrix such that the (i, j) element g_ij(W) is a differentiable function of W = (w_ij), and denote {D_W G(W)}_ac = Σ_b d_ab g_bc(W), where d_ab = 2^{-1}(1 + δ_ab)∂/∂w_ab with δ_ab = 1 for a = b and δ_ab = 0 for a ≠ b. In the case of n > p, Stein [12] and Haff [4] derived the Stein-Haff identity. In the case of p > n, the corresponding identity was provided by Kubokawa and Srivastava [6]; this identity was also derived from (A.1) by Konno [5]. Let H = (H_1, H_2) be a p × p orthogonal matrix such that V = H_1 L H′_1 for L = diag(ℓ_1, . . . , ℓ_n), where ℓ_1 ≥ · · · ≥ ℓ_n, and H_1 is a p × n matrix satisfying H′_1 H_1 = I_n. Let ℓ = (ℓ_1, . . . , ℓ_n)′, and Φ(ℓ) = diag(φ_1(ℓ), . . . , φ_n(ℓ)). In the case of p > n, Kubokawa and Srivastava [6] derived the corresponding Stein-Haff identity (A.4). Using (A.4), we can evaluate the moments of tr[L^{-1}] and tr[L^{-2}] from above.
Lemma A.1. In the case of p > n, the following inequalities hold. The bound obtained above, together with (A.5), provides the second inequality in Lemma A.1.

A.2. Evaluations of moments
Let W = Σ^{-1/2} V Σ^{-1/2}; then W has the Wishart distribution W_p(I, n) for n > p. The following lemma provides exact moments of the inverted Wishart matrix W^{-1}. For the proof, see Kubokawa, Hyodo and Srivastava [7].