Optimal Bayesian Minimax Rates for Unconstrained Large Covariance Matrices

We obtain the optimal Bayesian minimax rate for the unconstrained large covariance matrix of a multivariate normal sample with mean zero, when both the sample size, n, and the dimension, p, of the covariance matrix tend to infinity. Traditionally the posterior convergence rate is used to compare the frequentist asymptotic performance of priors, but defining optimality with it is elusive. We propose a new decision theoretic framework for prior selection and define the Bayesian minimax rate. Under the proposed framework, we obtain the optimal Bayesian minimax rate for the spectral norm for all rates of p. We also consider the Frobenius norm, Bregman divergence and squared log-determinant loss and obtain the optimal Bayesian minimax rate under certain rate conditions on p. A simulation study is conducted to support the theoretical results.


Introduction
Estimating the covariance matrix plays a fundamental role in multivariate data analysis. Many statistical methods in multivariate data analysis, such as principal component analysis, canonical correlation analysis, and linear and quadratic discriminant analysis, require the estimated covariance matrix as the starting point of the analysis. In risk management and longitudinal data analysis, covariance matrix estimation is a crucial part of the analysis. The log-determinant of the covariance matrix is used for constructing hypothesis tests or for quadratic discriminant analysis [2].
We assume the zero mean and focus on the covariance matrix.
With advances in technology, data arising from areas such as climate prediction, image processing, gene association studies, and proteomics are often high dimensional. In such high dimensional settings, it is often natural to assume that the dimension of the variable p tends to infinity as the sample size n gets larger, i.e. $p = p_n \to \infty$ as $n \to \infty$. This assumption can be justified as follows. First, when p is large in comparison with n, the limiting scenario with p tending to infinity often approximates reality more closely than that with p fixed. Second, in many cases we can postulate that reality is infinitely complex and involves infinitely many variables, and that with limited resources and time we can collect only a portion of the variables and observations.
If we have more resources to collect more data, it is natural to collect more observations as well as more variables, i.e. to increase both n and p.
When p tends to infinity as $n \to \infty$, the traditional covariance estimator is not optimal [32]. Sparsity or bandability assumptions on large matrices have been used frequently in the literature, and many researchers have studied large sample properties under such restricted matrix classes. [6] considered the bandable covariance/precision classes and studied the convergence rate of the banding estimator on those classes. [44] derived the convergence rate for precision matrices via sparse Cholesky factors and showed that it is the minimax rate under the Frobenius norm.
In addition, the minimax convergence rates for the sparse or bandable covariance matrices were established by [11], [12,13] and [45]. For a comprehensive review on the convergence rate for the covariance and precision matrices, see [10].
The posterior convergence rate has been investigated by [36], [4], and [21]. [36] showed that their continuous shrinkage priors are optimal for sparse covariance estimation under the spectral norm in the sense that the posterior convergence rate is quite close to the frequentist minimax rate. They achieved a nearly minimax rate, up to a $\sqrt{\log n}$ term, under the spectral norm and the sparsity assumption even when n = o(p). [4] considered Bayesian banded precision matrix estimation using graphical models. They obtained the posterior convergence rate of the precision matrix under the matrix $\infty$ norm when log p = o(n). [21] developed a prior distribution for sparse PCA and showed that it achieves the minimax rate under the Frobenius norm.
They also derived the posterior convergence rate under the spectral norm.
Most previous work on Bayesian estimation of large covariance matrices concentrates on constrained covariance or precision matrices. To the best of our knowledge, only [22] considered asymptotic results for the large unconstrained covariance matrix under the "large p and large n" setting. However, they obtained Bernstein-von Mises theorems under somewhat restrictive assumptions on the dimension p.
In this paper, we fill this gap in the literature. First, we propose a new decision theoretic framework to define the Bayesian minimax rate. The posterior convergence rate is the primary concept when asymptotic optimality is studied in the Bayesian sense, but it is not completely satisfactory. The following is a quote from [24], written just after they define the posterior convergence rate. 'We defined "a" rather than the rate of contraction, and hence logically any rate slower than a contraction rate is also a contraction rate. Naturally we are interested in a fastest decreasing sequence $\epsilon_n$, but in general this may not exist or may be hard to establish. Thus our rate is an upper bound for a targeted rate, and generally we are happy if our rate is equal to or close to an "optimal" rate. With an abuse of terminology we often make statements like "$\epsilon_n$ is the rate of contraction."' In the proposed decision theoretic framework, a probability measure on the parameter space is an action, and a prior is a decision rule because it gives a probability measure (the posterior) for a given data set. In this setup, we define the convergence rate and the Bayesian minimax rate.
We investigate the Bayesian minimax rates for the unconstrained large covariance matrix. We consider four losses for covariance inference: spectral norm, Frobenius norm, Bregman divergence and squared log-determinant loss. For the spectral norm, we have the complete result: the Bayesian minimax rate is $\min(p/n, 1)$ for all rates of p. For the Frobenius norm and Bregman divergence, we show that the Bayesian minimax lower bound is $p \cdot \min(p, \sqrt{n})/n$ for all rates of p, but obtain the upper bound only under the constraint $p \le \sqrt{n}$. Thus, under the condition $p \le \sqrt{n}$, the Bayesian minimax rate is $p^2/n$. We also show that the Bayesian minimax rate under the squared log-determinant loss is $p/n$ when $p = o(n)$.
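To make the stated rates concrete, the following minimal sketch evaluates them for a few (n, p) pairs. The function names are hypothetical and constants are ignored; the formulas are only the rates quoted in the paragraph above.

```python
import math


def spectral_rate(n: int, p: int) -> float:
    """Rate min(p/n, 1) stated for the spectral norm."""
    return min(p / n, 1.0)


def frobenius_lower_rate(n: int, p: int) -> float:
    """Lower-bound rate p * min(p, sqrt(n)) / n stated for the Frobenius norm."""
    return p * min(p, math.sqrt(n)) / n


def logdet_rate(n: int, p: int) -> float:
    """Rate p/n stated for the squared log-determinant loss (needs p = o(n))."""
    return p / n


# Illustrative values only: p below, equal to, and above sqrt(n).
for n, p in [(10_000, 25), (10_000, 100), (10_000, 400)]:
    print(n, p, spectral_rate(n, p), frobenius_lower_rate(n, p), logdet_rate(n, p))
```

When p is at most sqrt(n), the Frobenius lower-bound rate reduces to p^2/n, which is the regime in which the matching upper bound is available.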
The rest of the paper is organized as follows. In section 2, we define the model and the covariance classes we consider, and introduce some notation. We also propose the new decision theoretic framework and define the Bayesian minimax rate. The Bayesian minimax rates under the spectral norm, the Frobenius norm, the Bregman matrix divergence, and the squared log-determinant loss are presented in section 3. A simulation study is given in section 4. The discussion is given in section 5, and the proofs are given in the Supplementary Material ([35]).

The Model and the Inverse-Wishart Prior
Suppose we observe a random sample from the p-dimensional normal distribution
$$X_1, \ldots, X_n \mid \Sigma_n \overset{iid}{\sim} N_p(0, \Sigma_n), \qquad (1)$$
where $\Sigma_n$ is a p × p positive definite matrix, and p is a function of n such that $p = p_n \to \infty$ as $n \to \infty$. The true value of the covariance matrix is denoted by $\Sigma_0$ or $\Sigma_{0n}$, which is dependent on n.
For the prior of the covariance matrix $\Sigma_n$ in model (1), we consider the inverse-Wishart prior
$$\Sigma_n \sim IW_p(\nu_n, A_n), \qquad (2)$$
where $\nu_n > p - 1$ and $A_n$ is a p × p positive definite matrix, for a proper prior. The mean of $\Sigma_n$ is $A_n/(\nu_n - p - 1)$. The condition $\nu_n > p - 1$ is needed for the distribution to have a density in the space of p × p positive definite matrices. If $\nu_n$ is an integer with $\nu_n \le p - 1$, (2) defines a singular distribution on the space of p × p positive semidefinite matrices [43].
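As an illustration of how this prior combines with the zero-mean normal model, the sketch below computes the posterior mean of the covariance. It assumes the standard conjugate update in the parametrization where the prior mean is A_n/(ν_n − p − 1), so that the posterior is inverse-Wishart with parameters (ν_n + n, A_n + n S_n); the function name is hypothetical.

```python
import numpy as np


def iw_posterior_mean(X: np.ndarray, nu: float, A: np.ndarray) -> np.ndarray:
    """Posterior mean of Sigma for zero-mean normal rows of X under IW(nu, A).

    Assumption: the posterior is IW(nu + n, A + n * S) with S = X^T X / n,
    whose mean is (A + n * S) / (nu + n - p - 1) when nu + n > p + 1.
    """
    n, p = X.shape
    S = X.T @ X / n  # sample covariance with known zero mean
    return (A + n * S) / (nu + n - p - 1)


rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
S = X.T @ X / n

# With nu = p + 1 and a zero scale matrix, the posterior mean is S itself.
post_mean = iw_posterior_mean(X, nu=p + 1, A=np.zeros((p, p)))
print(np.max(np.abs(post_mean - S)))
```

Under this assumed parametrization, the choice ν_n = p + 1 with a zero scale matrix reproduces the sample covariance, which is a convenient sanity check.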
We also consider the truncated inverse-Wishart prior. The inverse-Wishart prior with parameters ν and A whose eigenvalues are restricted to $[K_1, K_2]$ with $0 < K_1 < K_2$ is denoted by $IW_p(\nu, A, K_1, K_2)$.

Matrix Norms and Notations
We define the spectral norm (or matrix 2 norm) for matrices by
$$\|A\| := \sup_{\|x\|_2 = 1} \|Ax\|_2,$$
where $\|\cdot\|_2$ denotes the vector 2 norm defined by $\|x\|_2 := (\sum_{i=1}^p x_i^2)^{1/2}$. The Frobenius norm is defined by
$$\|A\|_F := \Big(\sum_{i,j} a_{ij}^2\Big)^{1/2}.$$
The Bregman divergence [7] is originally defined for vectors, but it can be extended to real symmetric matrices. Let $\phi$ be a differentiable and strictly convex function that maps real symmetric p × p matrices to $\mathbb{R}$. The Bregman divergence with $\phi$ between two real symmetric matrices is defined as
$$D_\phi(A, B) := \phi(A) - \phi(B) - \mathrm{tr}\big[(\nabla\phi(B))^T (A - B)\big],$$
where A and B are real symmetric matrices and $\nabla\phi$ is the gradient of $\phi$, i.e., $\nabla\phi(B) = (\partial\phi(B)/\partial B_{i,j})$.
In this paper, we consider a class of $\phi$ such that $\phi(X) = \sum_{i=1}^p \varphi(\lambda_i)$, where $\varphi$ is a differentiable and strictly convex real-valued function and the $\lambda_i$'s are the eigenvalues of X. Furthermore, we assume that $\varphi$ satisfies the following properties for some constant $\tau_1 > 0$: (i) $\varphi$ is a twice differentiable and strictly convex function over $\lambda \in (\tau_1, \infty)$; (ii) there exist some constants C > 0 and $r \in \mathbb{R}$ such that $|\varphi(\lambda)| \le C\lambda^r$ for all $\lambda \in (\tau_1, \infty)$; and (iii) for any positive constant $\tau > \tau_1$, there exist some positive constants $M_L$ and $M_U$ such that $M_L \le \varphi''(\lambda) \le M_U$ for all $\lambda \in (\tau_1, \tau)$. The above class of Bregman matrix divergences includes the squared Frobenius norm, the von Neumann divergence and Stein's loss. For their use in statistics and mathematics, see [13], [18] and [34].
The Stein's loss is the Kullback-Leibler divergence between two multivariate normal distributions with means zero and covariance matrices A and B, respectively.
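The losses in this subsection can be computed numerically as in the following sketch. It assumes the spectral form φ(X) = Σ_i ϕ(λ_i), for which the gradient at B is U diag(ϕ'(λ)) U^T with B = U diag(λ) U^T; all function names are hypothetical. Taking ϕ(λ) = λ² should recover the squared Frobenius norm and ϕ(λ) = −log λ should recover Stein's loss.

```python
import numpy as np


def spectral_norm(A: np.ndarray) -> float:
    """Largest singular value of A."""
    return float(np.linalg.norm(A, 2))


def frobenius_norm(A: np.ndarray) -> float:
    """Square root of the sum of squared entries of A."""
    return float(np.linalg.norm(A, "fro"))


def bregman_divergence(A, B, phi, dphi) -> float:
    """D(A, B) = Phi(A) - Phi(B) - tr(grad Phi(B) (A - B)).

    Phi(X) is the sum of phi over the eigenvalues of X; dphi is phi'.
    A and B are assumed real symmetric (positive definite where phi needs it).
    """
    lam_a = np.linalg.eigvalsh(A)
    lam_b, U = np.linalg.eigh(B)
    grad_b = (U * dphi(lam_b)) @ U.T  # U diag(phi'(lambda)) U^T
    return float(phi(lam_a).sum() - phi(lam_b).sum() - np.trace(grad_b @ (A - B)))


def squared_frobenius_via_bregman(A, B) -> float:
    return bregman_divergence(A, B, lambda lam: lam ** 2, lambda lam: 2.0 * lam)


def stein_loss_via_bregman(A, B) -> float:
    return bregman_divergence(A, B, lambda lam: -np.log(lam), lambda lam: -1.0 / lam)


rng = np.random.default_rng(1)
M1, M2 = rng.standard_normal((2, 4, 4))
A = M1 @ M1.T + np.eye(4)  # two positive definite test matrices
B = M2 @ M2.T + np.eye(4)
print(squared_frobenius_via_bregman(A, B), frobenius_norm(A - B) ** 2)
print(stein_loss_via_bregman(A, B))
```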
Finally, we introduce some notations for asymptotic analysis which will be used subsequently.
For any positive sequences $a_n$ and $b_n$, we say $a_n \asymp b_n$ if there exist positive constants c and C such that $c \le a_n/b_n \le C$ for all sufficiently large n. We write $a_n = o(b_n)$ if $a_n/b_n \to 0$ as $n \to \infty$, and $a_n = O(b_n)$ if there exist positive constants N and M such that $|a_n| \le M|b_n|$ for all $n \ge N$. For any random variables $X_n$ and X, $X_n \overset{d}{\to} X$ means convergence in distribution.
For any real symmetric matrix A, A > 0 (A ≥ 0) means that the matrix A is positive definite (nonnegative definite). We denote by $\delta_A$ the Dirac measure at A.

A Class of Covariance Matrices
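Let $C_p$ denote the set of all p × p covariance matrices. For any positive constants $\tau$, $\tau_1$ and $\tau_2$, define the classes of covariance matrices
$$C(\tau) := \{\Sigma \in C_p : \|\Sigma\| \le \tau\}, \qquad C(\tau_1, \tau_2) := \{\Sigma \in C_p : \tau_1 \le \lambda_{\min}(\Sigma) \le \|\Sigma\| \le \tau_2\},$$
where $\lambda_{\min}(\Sigma)$ is the smallest eigenvalue of Σ. Throughout the paper, we consider the model (1) and assume that the true covariance matrix belongs to $C(\tau)$ or $C(\tau_1, \tau_2)$.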
Often the subgaussian property is used to relax the Gaussian distribution assumption. The distribution of a random vector X has the subgaussian property with variance factor τ > 0, if for all t > 0 and $\|v\|_2 = 1$. The subgaussian property with variance factor τ implies Var(X) ≤ 2τ. In the literature, the subgaussian distribution is frequently used as a basic assumption; see, for example, [11], [12,13] and [45]. If X follows a multivariate normal distribution, $\|\Sigma\| \le \tau$ is a sufficient condition for X to have the subgaussian property.

Decision Theoretic Prior Selection
Let $d(\Sigma, \Sigma')$ be a pseudo-metric that measures the discrepancy between two covariance matrices $\Sigma$ and $\Sigma'$. A sequence $\epsilon_n \to 0$ is called a posterior convergence rate at the true parameter $\Sigma_0$ if for any $M_n \to \infty$,
$$\pi\big(d(\Sigma_n, \Sigma_0) \ge M_n \epsilon_n \mid \mathbf{X}_n\big) \to 0$$
in $P_{\Sigma_0}$-probability as $n \to \infty$. The convergence rate is measured by the rate of $\epsilon_n$ for which the posterior probability above converges to zero in $P_{\Sigma_0}$-probability, where $P_{\Sigma_0}$ is the distribution of the random sample $(X_1, \ldots, X_n) \overset{iid}{\sim} N_p(0, \Sigma_0)$. In the literature, the posterior is said to achieve the minimax rate if its convergence rate is the same as the frequentist minimax rate ([36]; [21]; [29]). Since the posterior convergence rate cannot be faster than the frequentist minimax rate ([28]), it is often called the optimal rate of posterior convergence ([40]; [38]).
However, its definition is elusive as the quote from [24] indicates.
As an alternative framework for the evaluation of the prior and the posterior, we take a frequentist decision theoretical approach. For each n, the parameter space is C p and the action space is the set of all probability measures on C p . After the data X n is collected, the posterior π(·|X n ) is computed for the given prior π and the posterior takes a value in the action space.
In this setup, the prior can be considered as a decision rule, because the prior and observations together produce the posterior. A probability measure in the action space will be used as a posterior for the inference, but it does not have to be generated from a prior. We define the loss and risk function of the parameter $\Sigma_0$ and the prior π as
$$L(\Sigma_0, \pi(\cdot \mid \mathbf{X}_n)) := E^\pi\big(d(\Sigma, \Sigma_0) \mid \mathbf{X}_n\big),$$
$$R(\Sigma_0, \pi) := E_{\Sigma_0} L(\Sigma_0, \pi(\cdot \mid \mathbf{X}_n)) = E_{\Sigma_0} E^\pi\big(d(\Sigma, \Sigma_0) \mid \mathbf{X}_n\big).$$
Note that the risk function measures the performance of the prior π. To distinguish them from the usual loss and risk, we call the above loss and risk the posterior loss (P-loss) and posterior risk (P-risk). The P-risk itself is not new. For example, the P-risk was also used in [14] for density estimation on the unit interval.
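For an inverse-Wishart prior, the P-loss can be approximated by averaging the loss over posterior draws. The sketch below is one way to do this under stated assumptions: the conjugate posterior is taken to be inverse-Wishart with parameters (ν + n, A + X^T X), the degrees of freedom are an integer so that Wishart draws can be built from normal vectors, and d is the squared Frobenius distance. The function names are hypothetical.

```python
import numpy as np


def sample_inverse_wishart(df: int, scale: np.ndarray, size: int, rng) -> np.ndarray:
    """Draw `size` matrices from IW(df, scale) for an integer df >= p.

    If W ~ Wishart(df, scale^{-1}) then W^{-1} ~ IW(df, scale); W is built as
    a sum of df outer products of N(0, scale^{-1}) vectors.
    """
    p = scale.shape[0]
    chol = np.linalg.cholesky(np.linalg.inv(scale))  # chol @ chol.T = scale^{-1}
    Z = rng.standard_normal((size, df, p)) @ chol.T  # rows ~ N(0, scale^{-1})
    W = np.transpose(Z, (0, 2, 1)) @ Z               # Wishart draws, shape (size, p, p)
    return np.linalg.inv(W)


def p_loss_squared_frobenius(X, nu: int, A, Sigma0, draws: int = 1000, seed: int = 0):
    """Monte Carlo estimate of E^pi(||Sigma - Sigma0||_F^2 | X) and the draw average."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    posterior_draws = sample_inverse_wishart(nu + n, A + X.T @ X, draws, rng)
    sq_dist = ((posterior_draws - Sigma0) ** 2).sum(axis=(1, 2))
    return float(sq_dist.mean()), posterior_draws.mean(axis=0)


rng = np.random.default_rng(42)
p, n = 4, 100
Sigma0 = np.eye(p)
X = rng.standard_normal((n, p))
p_loss, post_mean_mc = p_loss_squared_frobenius(X, nu=p, A=np.eye(p), Sigma0=Sigma0)
plug_in_loss = float(((post_mean_mc - Sigma0) ** 2).sum())
print(p_loss, plug_in_loss)
```

By Jensen's inequality the averaged loss is never smaller than the loss of the averaged draw, so the P-loss of a prior bounds the loss of its posterior mean from above. Averaging `p_loss` over repeated data sets would give a Monte Carlo estimate of the P-risk.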
There are a couple of benefits of the proposed decision theoretic prior selection. First, the decision theoretic prior selection makes the definition of the minimax rate of the posterior mathematically concrete. Although the minimax rate of the posterior is used frequently, it has been used without a rigorous definition. The frequentist minimax rate is used as a proxy of the desired concept. Second, in the study of the posterior convergence rate, the scale of the loss function needs to be carefully chosen so that the posterior consistency holds. But in the proposed decision theoretic prior selection, the inconsistent priors can be compared without any conceptual difficulty. Thus, the scale of the loss function does not need to be chosen.
We now define the minimax rate and convergence rate for P-loss. Let $\Pi_n$ be the class of all priors on $\Sigma_n$. A sequence $r_n$ is said to be the minimax rate for P-loss (P-loss minimax rate) or simply the Bayesian minimax rate for the class $C_p^* \subset C_p$ and the space of the prior distributions $\Pi_n^* \subset \Pi_n$ if
$$\inf_{\pi \in \Pi_n^*} \sup_{\Sigma_0 \in C_p^*} E_{\Sigma_0} L(\Sigma_0, \pi(\cdot \mid \mathbf{X}_n)) \asymp r_n.$$
A prior $\pi^*$ is said to have a convergence rate for P-loss (P-loss convergence rate) or convergence rate $a_n$ if
$$\sup_{\Sigma_0 \in C_p^*} E_{\Sigma_0} L(\Sigma_0, \pi^*(\cdot \mid \mathbf{X}_n)) = O(a_n),$$
and, if $a_n \asymp r_n$ where $r_n$ is the minimax rate for P-loss, $\pi^*$ is said to attain the minimax rate for P-loss or the Bayesian minimax rate. If it is clear from context, we will drop P-loss and refer to them as the minimax rate and the convergence rate. For a given inference problem, we wish to find a prior $\pi^*$ which attains the minimax rate for P-loss.
Remark The P-loss convergence rate implies the posterior convergence rate by Proposition A.1 in Supplementary Material ( [35]). By obtaining the P-loss convergence rate, we also get the traditional posterior convergence rate. The converse may not be true, because for certain loss functions, the P-loss may not even converge to 0 while the posterior convergence rate converges to 0.
Remark The P-loss convergence rate is slower than or equal to the frequentist minimax rate by Proposition A.2 in Supplementary Material ( [35]). To obtain a P-loss minimax lower bound, the mathematical tools for frequentist minimax lower bound can be used.
Remark If we assume that the prior class $\Pi_n$ includes data-dependent priors, the P-loss minimax rate is the same as the frequentist minimax rate. Take $\pi = \delta_{\hat\Sigma^*}$ where $\hat\Sigma^*$ is an estimator attaining the frequentist minimax rate. Then, π attains the frequentist minimax rate and thus attains the Bayesian minimax rate. However, a data-dependent prior is not acceptable for legitimate Bayesian analysis unless the prior depends only on ancillary statistics. Even if $\Pi_n$ does not contain data-dependent priors, in most cases the frequentist and P-loss minimax rates are the same.
However, if we consider a restricted class of priors, the P-loss minimax rate might differ from the usual frequentist minimax rate. In such cases, the frequentist minimax rate will not be a natural concept to study the asymptotic properties of the posterior. See Remark in subsection 3.2.

Bayesian Minimax Rates under Various Matrix Loss Functions

Bayesian Minimax Rate under Spectral Norm
In this subsection, we show that the Bayesian minimax rate for the covariance matrix under the spectral norm is $\min(p/n, 1)$. We also show that the prior
$$\pi(\Sigma_n) = IW_p(\Sigma_n \mid \nu_n, A_n)\, I(p \le n/2) + \delta_{I_p}(\Sigma_n)\, I(p > n/2) \qquad (3)$$
attains the Bayesian minimax rate for the class $C(\tau_1, \tau_2)$ under the spectral norm, where $IW_p(\Sigma \mid \nu_n, A_n)$ is the inverse-Wishart distribution, $\nu_n > p - 1$ and $A_n$ is a p × p positive definite matrix.
The result is complete: the Bayesian minimax rate holds for any n and p, regardless of their relationship. The number 1/2 in the prior (3) can be replaced by any number in (0, 1) and the prior still attains the minimax rate.
The main result of the section is given in Theorem 3.1, whose proof is given in the Supplementary Material ([35]). We divide the proof into two parts: the lower bound and the upper bound. First, we show that the lower bound of the frequentist minimax rate is $\min(p/n, 1)$, which may be of interest in its own right, and it in turn implies that $\min(p/n, 1)$ is a Bayesian minimax lower bound. After that, the P-loss convergence rate with the prior (3) is derived, which is the same as the Bayesian minimax lower bound when $\nu_n^2 = O(np)$ and $A_n = S_n$. Consequently, we obtain the following theorem by combining these two results. Throughout the paper, $\Pi_n$ is the class of all priors on $\Sigma_n \in C_p$ as we have defined in subsection 2.4.
Remark The proof for the lower bound holds even for $\tau_1$ and $\tau_2$ depending on n, and possibly for $\tau_1 \to 0$ and $\tau_2 \to \infty$ as $n \to \infty$. In such cases, the rate of the minimax lower bound is $\tau_2^2 \cdot \min(p/n, 1)$. For details, see Theorem B.1 in the Supplementary Material ([35]). Note that $\tau_2$ affects the minimax lower bound, while $\tau_1$ does not. A similar phenomenon occurs for estimation of sparse spiked covariance matrices. See Theorem 4 of [10].
The results above impose no condition on the rates of p and n: for any given rate of p, we obtain the Bayesian minimax rate under the spectral norm. When p grows at the same rate as n, the above theorem shows that estimating the covariance under the spectral norm is hopeless. Indeed, this can be seen from the form of the prior (3). When p ≥ n/2, the point mass prior $\delta_{I_p}$ gives the Bayesian minimax rate. In other words, one cannot do better than the useless point mass prior $\delta_{I_p}$.
Applying techniques used in the proof of the upper bound, one can show that the prior (3) also gives the same P-loss convergence rate for precision matrix. For any positive constants τ 1 < τ 2 , for all sufficiently large n and some constant c > 0.
We remark here that [22] derived a posterior convergence rate for the unconstrained covariance matrix under the spectral norm when p = o(n). In this paper, we obtain a P-loss convergence rate, which implies a stronger form of convergence than a posterior convergence rate, for any n and p. [22] also obtained a posterior convergence rate for the precision matrix under $p^2 = o(n)$. In this paper, Corollary 3.2 gives a P-loss convergence rate for any n and p.

Bayesian Minimax Rate under Frobenius Norm
Throughout this subsection, τ > 0 can depend on n and possibly $\tau \to \infty$ as $n \to \infty$. In this subsection, we show that the rate of the Bayesian minimax lower bound for the covariance matrix under the Frobenius norm is $\tau^2 \cdot \min(p, \sqrt{n}) \cdot p/n$ for the class $C(\tau)$, and that the inverse-Wishart prior attains the Bayesian minimax lower bound when $p \le \sqrt{n}$.
The following theorem gives the Bayesian minimax lower bound. The proof of Theorem 3.3 is given in the Supplementary Material ([35]). In the proof of the theorem, we prove as a by-product that the lower bound of the frequentist minimax rate is $\tau^2 \cdot \min(p, \sqrt{n}) \cdot p/n$.
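Theorem 3.3 Consider the model (1). For any τ > 0,
$$\inf_{\pi \in \Pi_n} \sup_{\Sigma_0 \in C(\tau)} E_{\Sigma_0} E^\pi\big(\|\Sigma_n - \Sigma_0\|_F^2 \mid \mathbf{X}_n\big) \ge c \cdot \tau^2 \cdot \min(p, \sqrt{n}) \cdot \frac{p}{n}$$
for all sufficiently large n and some constant c > 0.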
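Theorem 3.4 Consider the model (1) and prior (2) with $\nu_n > 0$ and $A_n > 0$ for all n. If $\nu_n = p$ and $\|A_n\|^2 = O(n)$, for any τ > 0,
$$\sup_{\Sigma_0 \in C(\tau)} E_{\Sigma_0} E^\pi\big(\|\Sigma_n - \Sigma_0\|_F^2 \mid \mathbf{X}_n\big) \le c \cdot \tau^2 \cdot \frac{p^2}{n}$$
for some constant c > 0 and all sufficiently large n. Furthermore, if $p \le \sqrt{n}$, then $\nu_n^2 = O(np)$ and $\|A_n\|^2 = O(np)$ is the necessary and sufficient condition for achieving the rate $p^2/n$.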
Note that if τ > 0 is a fixed constant, from the relationship between the spectral norm and the Frobenius norm, one can obtain a P-loss convergence rate $\min(p, n) \cdot p/n$ instead of $p^2/n$ in Theorem 3.4. However, in this case, one should restrict the parameter space to $C(\tau_1, \tau_2)$ instead of the more general parameter space $C(\tau)$.
In practice, we recommend using $\nu_n = p$ and a small $A_n$ such as $A_n = O_p$ or $A_n = I_p$, where $O_p$ denotes the p × p zero matrix, because this guarantees the rate $p^2/n$ regardless of the relation between n and p. Note that the Jeffreys prior [31] and the prior proposed by [23], $\pi(\Sigma_n) \propto \det(\Sigma_n)^{-p}$, satisfy the above conditions. They can be viewed as inverse-Wishart priors, $IW(\nu_n, A_n)$, with parameters $(1, O_p)$, $(0, O_p)$ and $(p - 1, O_p)$, respectively. Furthermore, the $IW(p + 1, S_n)$ prior, whose mean is $S_n$, also satisfies the conditions in Theorem 3.4.
By Theorem 3.4 and Theorem 3.3, we have the Bayesian minimax rate $\tau^2 \cdot p^2/n$ for the covariance matrix under the Frobenius norm when $p \le \sqrt{n}$. Thus, with the inverse-Wishart prior, we attain the Bayesian minimax rate under the Frobenius norm.
Furthermore, $\nu_n^2 = O(np)$ and $\|A_n\|^2 = O(np)$ is the necessary and sufficient condition for the prior (2) to achieve the Bayesian minimax rate when $p \le \sqrt{n}$.
Remark In section 2.4, we said that the Bayesian minimax rate can differ from the frequentist minimax rate when a restricted prior class is considered, and that the frequentist minimax rate will not be a natural concept to address the asymptotic properties of the posteriors from a restricted prior class. We give an example here. Consider a prior class $\Pi_n^* = \{\pi \in IW_p(\nu_n, A_n) : \nu_n \ge n, A_n \in C_p\}$ and assume $p \le \sqrt{n}$. It is easy to check that from the proof of Theorem 3.4. Note that the obtained P-loss minimax rate differs from the usual frequentist minimax rate, $\tau^2 \cdot p^2/n$.

Bayesian Minimax Rate under Bregman Matrix Divergence
In this subsection, we obtain the Bayesian minimax rate under a certain class of Bregman matrix divergences. Let Φ be the class of differentiable and strictly convex real-valued functions satisfying conditions (i)-(iii) in subsection 2.2, and let $D_\Phi$ be the class of Bregman matrix divergences
$$D_\Phi := \Big\{D_\phi : \phi(X) = \sum_{i=1}^p \varphi(\lambda_i),\ \varphi \in \Phi\Big\}.$$
To achieve the Bayesian minimax convergence rate for Bregman matrix divergences, we use the truncated inverse-Wishart distribution $IW_p(\nu_n, A_n, K_1, K_2)$ whose eigenvalues are all in $[K_1, K_2]$ for some positive constants $K_1 < K_2$. The density function of $IW_p(\nu_n, A_n, K_1, K_2)$ is given by
$$\pi(\Sigma_n) \propto \det(\Sigma_n)^{-(\nu_n + p + 1)/2} \exp\{-\mathrm{tr}(\Sigma_n^{-1} A_n)/2\}\, I\big(\Sigma_n \in C(K_1, K_2)\big),$$
where $\nu_n > p - 1$ and $A_n$ is a p × p positive definite matrix.
Theorem 3.6 Consider the model (1). If $p \le \sqrt{n}$, for any positive constants
To extend the minimax result for the squared Frobenius norm to the Bregman matrix divergence, the posterior distribution for $\Sigma_n$ and the true covariance $\Sigma_0$ should be included in the classes $C(K_1, K_2)$ and $C(\tau_1, \tau_2)$, respectively, for some positive constants $K_1 < \tau_1$ and $K_2 > \tau_2$.
The truncated inverse-Wishart prior was needed to restrict the posterior distribution for Σ n within the class C(K 1 , K 2 ). In practice, we recommend using sufficiently small K 1 and large K 2 .
According to the above theorem, the minimax convergence rate for the class D Φ is equivalent to that for the Frobenius norm if we consider the parameter space C(τ 1 , τ 2 ). Moreover, the truncated inverse-Wishart prior IW p (ν n , A n , K 1 , K 2 ) achieves the Bayesian minimax rate. The proof of the theorem is given in Supplementary Material ( [35]).

Bayesian Minimax Rate of Log Determinant of Covariance Matrix
In this subsection, we establish the Bayesian minimax rate for the log-determinant of the covariance matrix under squared error loss. The frequentist minimax lower bound was derived by [9].
We prove that the inverse-Wishart prior achieves the Bayesian minimax rate when p = o(n).
The estimator of the log-determinant of the covariance matrix can be used as a basic ingredient for constructing hypothesis tests or for quadratic discriminant analysis [2]. The log-determinant of the covariance matrix is needed to compute the quadratic discriminant function for the multivariate normal distribution,
$$-\frac{1}{2}\log\det\Sigma - \frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu),$$
where x is a random sample from $N_p(\mu, \Sigma)$. Furthermore, the differential entropy of $N_p(\mu, \Sigma)$ is given by
$$\frac{p}{2} + \frac{p\log(2\pi)}{2} + \frac{\log\det\Sigma}{2},$$
so estimation of the differential entropy is equivalent to estimation of the log-determinant of the covariance matrix when we consider the multivariate normal distribution. The differential entropy has various applications including independent component analysis (ICA), spectroscopy, image analysis, and information theory. See [5], [19], [30] and [16]. [9] showed that the minimax rate for the log-determinant of the covariance matrix under squared error loss is p/n and that their estimator achieves this optimal rate when p = o(n).
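The two uses of the log-determinant mentioned above can be sketched numerically as follows. The code assumes the standard textbook expressions for the Gaussian differential entropy and for the quadratic discriminant score (up to class-prior terms); the function names are hypothetical.

```python
import math

import numpy as np


def log_det(Sigma: np.ndarray) -> float:
    """Numerically stable log-determinant of a positive definite matrix."""
    sign, logabsdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    return float(logabsdet)


def gaussian_entropy(Sigma: np.ndarray) -> float:
    """Differential entropy of N_p(mu, Sigma): p/2 * (1 + log(2 pi)) + log det(Sigma) / 2."""
    p = Sigma.shape[0]
    return 0.5 * p * (1.0 + math.log(2.0 * math.pi)) + 0.5 * log_det(Sigma)


def qda_score(x: np.ndarray, mu: np.ndarray, Sigma: np.ndarray) -> float:
    """Quadratic discriminant score: -log det(Sigma)/2 - (x-mu)^T Sigma^{-1} (x-mu)/2."""
    diff = x - mu
    return float(-0.5 * log_det(Sigma) - 0.5 * diff @ np.linalg.solve(Sigma, diff))


Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(gaussian_entropy(Sigma))
print(qda_score(np.array([1.0, -1.0]), np.zeros(2), Sigma))
```

Because the entropy is an affine function of log det Σ, any estimator of the log-determinant translates directly into an entropy estimator with the same squared-error rate.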
On the Bayesian side, [41] and [26]
Furthermore, prior (2) with $\nu_n^2 = O(n/p)$ and $A_n = O_p$ attains the Bayesian minimax rate.
Remark One can also show that the optimal minimax convergence rate is achieved by using the prior (2) with $\nu_n^2 = O(n/p)$, $A_n = c_n S_n$ and $c_n^2 = O(n/p)$.
Remark [22] showed the Bernstein-von Mises result for the log-determinant of the covariance, which implies a posterior convergence rate. However, they considered a restrictive parameter space $C(\tau_1, \tau_2)$ and the stronger condition $p^3 = o(n)$. In this paper, the more general parameter space $C_p$ and the weaker condition p = o(n) are sufficient for the stronger result, a P-loss convergence rate.

Simulation Study
In this section, we support our theoretical results by a simulation study. The simulations for three loss functions, spectral norm, square of scaled Frobenius norm and squared log-determinant loss, were conducted. We compare the performance of the minimax priors with those of some frequentist estimators.
We choose the posterior mean as the Bayesian estimator. The posterior mean obtained from the minimax prior attains the minimax rate in Theorem B.2, Theorem 3.4 and Theorem 3.7 by Jensen's inequality.
We generated a data set $X_1, \ldots, X_n$ from $N_p(0, \Sigma_0)$, where the true covariance matrix $\Sigma_0$ was either a diagonal or a full covariance matrix. A full covariance matrix is a covariance matrix with no restriction on its elements such as sparsity or banding. In the diagonal covariance setting, the true covariance is $\Sigma_0 = \mathrm{diag}(\sigma_{0,ii})$ where $\sigma_{0,ii} \overset{iid}{\sim} Unif(0, 5)$. In the full covariance setting, we made the true covariance iid ∼ $N(0, 5/p)$. In the simulation study, the dimensions of the true covariance matrices are 25, 50, 100 and 200, and the sample size is either $n = p^2$ or $n = p^{3/2}$. For each setting, we generated a true covariance once, for which we generated 100 data sets and calculated estimators of the covariance.
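The diagonal setting described above can be reproduced along the following lines. This is a minimal sketch rather than the original simulation code: the function name and seed are hypothetical, and only the diagonal design is covered.

```python
import numpy as np


def simulate_diagonal_setting(p: int, n: int, rng):
    """Draw a diagonal true covariance with Unif(0, 5) entries and n zero-mean normal rows."""
    sigma_diag = rng.uniform(0.0, 5.0, size=p)
    Sigma0 = np.diag(sigma_diag)
    # Independent coordinates: scale standard normals by the standard deviations.
    X = rng.standard_normal((n, p)) * np.sqrt(sigma_diag)
    S = X.T @ X / n  # sample covariance with known zero mean
    return Sigma0, X, S


rng = np.random.default_rng(2024)
p = 25
n = p ** 2  # the n = p^2 regime; the other regime would use n close to p^(3/2)
Sigma0, X, S = simulate_diagonal_setting(p, n, rng)
print(np.linalg.norm(S - Sigma0, 2) ** 2, np.linalg.norm(S - Sigma0, "fro") ** 2 / p)
```

Repeating the last three lines over 100 data sets for a fixed `Sigma0`, and averaging the printed losses, would mirror the replication scheme described in the text.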
For the spectral norm and the square of the scaled Frobenius norm loss, we computed the posterior mean of the inverse-Wishart prior, $IW(\nu_n, A_n)$, for comparison. We chose $\nu_n = 2$, $n/p$, $p$ and $n$ to see the effect of $\nu_n$, but fixed $A_n = O_p$ to remove the prior effect on the structure of the covariance estimate. By Theorems B.2 and 3.4, when $n = p^2$, the inverse-Wishart priors with $\nu_n = 2$, $n/p$ and $p$ are minimax priors, while that with $\nu_n = n$ is not. We also computed the sample covariance $S_n$ and the tapering estimator $\hat\Sigma_k$ [11] for comparison. As mentioned before, the sample covariance matrix is a Bayesian estimator using the inverse-Wishart prior with $\nu_n = p + 1$ and $A_n = O_p$, which satisfies the conditions in Theorem 3.4. We used $k = \sqrt{n}$ as the threshold of the tapering estimator. It corresponds to α = 0 in [11], which gives the minimal sparse constraint for the covariance matrix in their class.
Figure 1 show the results when the true covariance matrix is a diagonal and full covariance, respectively; the left and right columns are the results when $n = p^2$ and $n = p^{3/2}$, respectively.
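For readers unfamiliar with tapering, the sketch below applies trapezoidal tapering weights to a sample covariance. The weight form (equal to 1 within k/2 of the diagonal, decaying linearly to 0 at distance k) is an assumption based on the usual description of the estimator in [11]; the exact weights used in the simulation are not given here, and the function names are hypothetical.

```python
import numpy as np


def tapering_weights(p: int, k: float) -> np.ndarray:
    """Weights w_ij = 1 if |i-j| <= k/2, 2 - |i-j|/(k/2) if k/2 < |i-j| < k, else 0."""
    idx = np.arange(p)
    dist = np.abs(idx[:, None] - idx[None, :])
    half = k / 2.0
    return np.clip(2.0 - dist / half, 0.0, 1.0)


def tapering_estimator(S: np.ndarray, k: float) -> np.ndarray:
    """Entrywise product of the sample covariance with the tapering weights."""
    return tapering_weights(S.shape[0], k) * S


rng = np.random.default_rng(3)
n, p = 400, 20
X = rng.standard_normal((n, p))
S = X.T @ X / n
Sigma_k = tapering_estimator(S, k=np.sqrt(n))  # k = sqrt(n), as in the text
print(np.count_nonzero(Sigma_k == 0.0))
```

Entries far from the diagonal are set exactly to zero and intermediate ones are shrunk, which helps when the truth is diagonal or banded and hurts when it is a full matrix.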
The inverse-Wishart prior with ν n = p and the sample covariance performed well in all cases.
They are either the best or comparable to the best. When $n = p^{3/2}$, the truncated inverse-Wishart prior with $\nu_n = n$ is not minimax, and the simulation results show that it performed the worst or the second worst. The inverse-Wishart priors with $\nu_n = 2$ and $n/p$ are minimax, and thus their risks decrease as $n \to \infty$ in all cases, but their performance is slightly worse than that with $\nu_n = p$. The tapering estimator $\hat\Sigma_k$ performed the best in the diagonal settings because it sets many of the off-diagonal elements to zero or shrinks them toward zero.
However, in the full covariance settings, it performed the worst or close to the worst for the same reason.
The UMVUE of $\log\det\Sigma$ is given by
$$\log\det(nS_n) - p\log 2 - \sum_{j=0}^{p-1} \psi\Big(\frac{n - j}{2}\Big),$$
where ψ is the digamma function, defined by $\psi(x) = \frac{d}{dz}\log\Gamma(z)\big|_{z=x}$, and Γ is the gamma function. See [1] for more details. We tried the same settings for the inverse-Wishart prior as before. Note that for $n = p^2$ and $n = p^{3/2}$, the choices $\nu_n = 2$ and $n/p$ satisfy the sufficient condition in Theorem 3.7 while $\nu_n = p$ and $n$ do not. The posterior mean of the log-determinant for the inverse-Wishart prior is
$$E^\pi(\log\det\Sigma_n \mid \mathbf{X}_n) = \log\det(nS_n + A_n) - p\log 2 - \sum_{j=0}^{p-1} \psi\Big(\frac{n + \nu_n - j}{2}\Big).$$
Thus, the UMVUE is the same as the Bayesian estimator using the inverse-Wishart prior with $\nu_n = 0$ and $A_n = O_p$, which satisfies the sufficient condition in Theorem 3.7.
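A numerical version of the log-determinant estimators might look as follows. The sketch assumes the conjugate posterior is inverse-Wishart with parameters (ν_n + n, A_n + n S_n) and uses the standard expectation of the log-determinant of a Wishart matrix; the digamma routine is a simple pure-Python approximation (recurrence plus asymptotic series), and all names are hypothetical.

```python
import math

import numpy as np


def digamma(x: float) -> float:
    """Digamma function for x > 0 via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ log x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


def logdet_posterior_mean(X: np.ndarray, nu: float, A: np.ndarray) -> float:
    """E(log det Sigma | X) under the assumed IW(nu + n, A + X^T X) posterior."""
    n, p = X.shape
    sign, logdet = np.linalg.slogdet(A + X.T @ X)
    if sign <= 0:
        raise ValueError("A + X^T X must be positive definite")
    psi_sum = sum(digamma((nu + n - j) / 2.0) for j in range(p))
    return float(logdet - p * math.log(2.0) - psi_sum)


rng = np.random.default_rng(7)
n, p = 400, 20
X = rng.standard_normal((n, p))  # true covariance is the identity, so log det = 0
# nu = 0 with a zero scale matrix corresponds to the unbiased estimator in the text.
print(logdet_posterior_mean(X, nu=0.0, A=np.zeros((p, p))))
print(logdet_posterior_mean(X, nu=float(p), A=np.zeros((p, p))))
```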

Discussion
In this paper, we develop a new framework for the Bayesian minimax theory, and introduce Bayesian minimax rate and P-loss convergence rate. The proposed decision theoretic framework gives an alternative way to distinguish the good priors from the inadequate ones and makes the definition of the minimax rate of the posterior clear. We obtain the Bayesian minimax rates for the normal covariance model under the various loss functions: spectral norm, the squared Frobenius norm, Bregman matrix divergence and squared log-determinant loss for large covariance estimation. We show that the inverse-Wishart prior or truncated inverse-Wishart prior attains the Bayesian minimax rate. The simulation results support the theory obtained.

A Basic properties of P-loss convergence rate
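A frequentist minimax lower bound is defined as a lower bound of
$$\inf_{\hat\Sigma} \sup_{\Sigma_0 \in C_p} E_{\Sigma_0} d(\hat\Sigma, \Sigma_0),$$
where $\hat\Sigma$ denotes an arbitrary estimator of $\Sigma_0$, and we say $r_n$ is the frequentist minimax rate for the class $C_p$ and the space of the estimators of $\Sigma_0$ if
$$\inf_{\hat\Sigma} \sup_{\Sigma_0 \in C_p} E_{\Sigma_0} d(\hat\Sigma, \Sigma_0) \asymp r_n.$$
Propositions A.1 and A.2 state two basic properties of the P-loss convergence rate and the Bayesian minimax rate.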
Proposition A.1 For any Σ 0 ∈ C p , a P-loss convergence rate at Σ 0 is a posterior convergence rate at Σ 0 .
Proof Suppose that the P-loss convergence rate at $\Sigma_0 \in C_p$ is $\epsilon_n$, i.e., For a sequence $M_n \to \infty$ and δ > 0, The first and second inequalities follow from the Markov inequality.
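As a sketch of the argument, the two applications of the Markov inequality can be written as follows, assuming that the P-loss convergence rate $\epsilon_n$ means $E_{\Sigma_0} E^\pi(d(\Sigma, \Sigma_0) \mid \mathbf{X}_n) \le C \epsilon_n$ for some constant C > 0 (a symbol introduced here for illustration) and all sufficiently large n:

```latex
\begin{align*}
P_{\Sigma_0}\Big( \pi\big(d(\Sigma,\Sigma_0) \ge M_n\epsilon_n \mid \mathbf{X}_n\big) > \delta \Big)
 &\le \frac{1}{\delta}\, E_{\Sigma_0}\, \pi\big(d(\Sigma,\Sigma_0) \ge M_n\epsilon_n \mid \mathbf{X}_n\big) \\
 &\le \frac{1}{\delta M_n \epsilon_n}\, E_{\Sigma_0} E^{\pi}\big(d(\Sigma,\Sigma_0) \mid \mathbf{X}_n\big) \\
 &\le \frac{C}{\delta M_n} \longrightarrow 0 \quad \text{as } n \to \infty .
\end{align*}
```

The first step applies the Markov inequality to the posterior probability viewed as a random variable under $P_{\Sigma_0}$, and the second applies it inside the posterior to $d(\Sigma, \Sigma_0)$.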
Proposition A.2 A frequentist minimax lower bound for $\Sigma_0$ is also a P-loss minimax lower bound for any loss function $d(\cdot, \Sigma_0)$, i.e., where $\hat\Sigma$ denotes an arbitrary estimator of $\Sigma_0$.
Proof Note that the P-risk is always greater than or equal to the posterior convergence rate by Markov's inequality, and the frequentist minimax rate is a lower bound for the posterior convergence rate ([28]). Thus, the frequentist minimax rate is also a lower bound for the P-loss minimax rate.
for all sufficiently large n and some constant c > 0. For any positive constants τ 1 < τ 2 , for all sufficiently large n and some constant c > 0.
Lemma B.5 Let $P_0, P_1 \in \mathcal{P}$, where $\mathcal{P}$ is the set of all probability measures on $\mathcal{X}$, and let $f_0$ and $f_1$ be their density functions, respectively. Define $\xi = \xi(P_0, P_1) := \int_{\mathcal{X}} f_1^2/f_0 \, dx$ and set $\theta_i = \theta(P_i)$, i = 0, 1, where θ is a functional defined on $\mathcal{P}$. Then where δ denotes any estimator of θ and $E_i$ represents the expectation with respect to $P_i$, i = 0, 1.
where $u, v \sim Unif(U)$. The fifth equality is derived from Lemma B.3. We will show that $\xi \le C$ for some constant C > 0 for all sufficiently large n. If p does not grow to infinity, i.e., p ≤ C for some constant C > 0, the last term is easily bounded above: $E(\exp(2n\epsilon^2 \langle u, v \rangle^2)) \le \exp(2c^2 p) \le \exp(2c^2 C)$. If p tends to infinity, by Lemma B.4, note that Bin(p, 1/2). Note also that we have by Theorem 1 of [33] for 0 < c < 1/2, $Z \sim N(0, 1)$. In our setting, consider $F_p$ as the distribution function of $p(2B/p - 1)^2$. Thus, we get the following by taking $\epsilon = c\sqrt{p/n}$ for some small c > 0 for some c > 0, which proves the lower bound when n ≥ p.
Now, assume $n < p$ and define The earlier result shows that for some $c > 0$.

B.2 Proof of Theorem B.2
Lemma B.6 Let $\Omega_n \sim W_p(\nu_n, \nu_n^{-1} A_n)$ with $\nu_n > p$ and positive definite matrix $A_n$, for all $n \ge 1$ and $\|A_n\| \le \tau_n$ for all sufficiently large $n$. Then, there exist positive constants $c_1$ and $c_2$ such that
$$P(\|\Omega_n - A_n\| \ge x) \le 5^p \left( e^{-c_1 \nu_n x^2/\tau_n^2} + e^{-c_2 \nu_n x/\tau_n} \right)$$
for all $x > 0$.
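To see why a deviation level of order $\sqrt{p/\nu_n}$ is the natural choice in this tail bound, the following sketch evaluates its logarithm. It assumes the bound is read as $5^p (e^{-c_1\nu_n x^2/\tau_n^2} + e^{-c_2\nu_n x/\tau_n})$; the constants `c1 = c2 = 1` and the function name are illustrative placeholders, since the lemma only asserts that such constants exist.

```python
from math import exp, log, sqrt

LOG5 = log(5.0)


def log_tail_bound(p, nu, x, tau=1.0, c1=1.0, c2=1.0):
    """log of 5^p * (exp(-c1*nu*x^2/tau^2) + exp(-c2*nu*x/tau)).

    Computed on the log scale so that large p and nu do not overflow.
    c1 and c2 are hypothetical values, not constants from the paper.
    """
    a = -c1 * nu * x ** 2 / tau ** 2
    b = -c2 * nu * x / tau
    m = max(a, b)
    return p * LOG5 + m + log(exp(a - m) + exp(b - m))


if __name__ == "__main__":
    p, nu = 50, 5000
    for big_c in (0.5, 1.0, 2.0, 4.0):
        x = big_c * sqrt(p / nu)
        print(big_c, log_tail_bound(p, nu, x))
```

With $x = C\sqrt{p/\nu_n}$ the first exponent equals $-c_1 C^2 p$, so the factor $5^p$ is cancelled once $c_1 C^2 > \log 5$; for smaller $C$ the bound exceeds one and carries no information.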
Proof There exist $v_j$ with $\|v_j\|_2 = 1$ for $j = 1, \ldots, 5^p$, such that $\|A\| \le 4 \sup_{j \le 5^p} |v_j^T A v_j|$ for any $p \times p$ symmetric matrix $A$ (page 2141 of [11]). Thus, we have
Proof It follows from Corollary 5.35 in [20], for any $t \ge 0$. If we choose $t = \sqrt{\nu_n}$ for (5), it gives the first inequality If we choose $t = \sqrt{\nu_n}\,(1 - \sqrt{p/\nu_n} - (1 - \sqrt{p/\nu_n})/2) > 0$ for (6), it gives the second inequality
Proof of Theorem B.2 We prove the upper bound for the case $p \le n/2$ first. Note that where $\hat\Sigma_n := (nS_n + A_n)/(n + \nu_n)$. Consider the first term of the right hand side (RHS) of (7).
for any constant $C_1$ and $C_2$. The integrand of (8) is bounded by To show that for any $x > 0$ and some positive constants $C_3$ and $C_4$ by Lemma B.6. If we choose $x = C_5 \sqrt{p/n}$ for some large $C_5 > 0$, the rate of (10) is $\sqrt{p/n}$. Note that by Hölder's inequality. One can easily show that $E^\pi(\|\Sigma_n\|^4 \mid X_n)$ is bounded above by $p^5$ up to some constant factor because $\|\Sigma_n\| \le \mathrm{tr}(\Sigma_n)$ and E π (tr(Σ n ) 4 | X n )I( Σ n ≤ C 1 and Σ −1 for some constant $C_6 \ge C_1 \cdot 4(1 - \sqrt{1/2})^{-2}$ by Lemma B.7. Thus, we have shown that the rate of (8) is smaller than $\sqrt{p/n}$. Now, we show that the rate of (9) is smaller than $\sqrt{p/n}$. Note that (9) is bounded by Since $\hat\Sigma_n = (nS_n + A_n)/(n + \nu_n)$ and $\Sigma_0 \in C(\tau_1, \tau_2)$, we have where $\tilde S_n := \Sigma_0^{-1/2} S_n \Sigma_0^{-1/2} \sim W_p(n, n^{-1} I_p)$. Then, (11) is bounded by $2e^{-n/2}$ for some constant $C_1 > 0$ by Lemma B.7. Similarly, for some constant $C_2$, by Lemma B.7. It is easy to show that By applying $t = \sqrt{n}\,(\tau_2^{-1}(u - \|A_n\|/(n + \nu_n)) - 1 - \sqrt{p/n})$ to the tail inequality (5), we have for some constant $C_7 > 0$. Also note that $2e^{-nC_{10}\sqrt{u}/2}\,du$ for some positive constants $C_8$, $C_9$ and $C_{10}$ by applying the tail inequality (5). Thus, we have shown that the rate of (9) is faster than $\sqrt{p/n}$.
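For the second term of the RHS of (7), note that Since $\nu_n^2 = O(np)$ and $\|A_n\|^2 = O(np)$, it is trivial that $\nu_n/(n + \nu_n) \lesssim \sqrt{p/n}$ and $\|A_n\|/(n + \nu_n) \lesssim \sqrt{p/n}$. One can show that $E_{\Sigma_0}\|S_n - \Sigma_0\| \le E_{\Sigma_0}\|\tilde S_n - I_p\| \cdot \|\Sigma_0\| \lesssim \sqrt{p/n}$ by Lemma B.6.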
Furthermore, it is easy to prove that $E_{\Sigma_0}\|S_n\| \lesssim 1$ because we have proved $E_{\Sigma_0}\|\hat\Sigma_n\|^2 \lesssim 1$. Thus, For the case $p > n/2$, we have which has the same rate as $\min(\sqrt{p/n}, 1)$.
Proof of Theorem 3.2 It suffices to consider the case p ≤ n/2 because the other part is trivial.
Note that For the term (12), we have by the argument (10) in the proof of Theorem B.2. For the term (13), note that by Lemma B.7. The last term (14) is bounded above by By the Woodbury formula, it is easy to show that

C Proof of Theorem 3.3
Before we prove Theorem 3.3, we define the total variation affinity and the L 1 -distance between measures.
$L_1$-distance Let $P$ and $Q$ be probability measures with density functions $p$ and $q$ with respect to a σ-finite measure $\nu$, respectively. Let $\|P \wedge Q\| := \int (p \wedge q)\, d\nu$ be the total variation affinity between $P$ and $Q$, and $\|P - Q\|_1 := \int |p - q|\, d\nu$ be the $L_1$-distance between $P$ and $Q$. For the proof of Assouad's lemma, see [3].
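For intuition, the two quantities can be computed for discrete distributions, where the integrals become sums. The sketch below assumes the standard definitions, affinity $\int \min(p, q)\, d\nu$ and $L_1$-distance $\int |p - q|\, d\nu$, and illustrates the identity $\|P \wedge Q\| = 1 - \tfrac{1}{2}\|P - Q\|_1$. The function names are hypothetical.

```python
def tv_affinity(p, q):
    """Total variation affinity: sum of the pointwise minimum of two pmfs."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def l1_distance(p, q):
    """L1-distance: sum of absolute differences of two pmfs."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Two illustrative pmfs on a common three-point support.
    p = [0.5, 0.5, 0.0]
    q = [0.25, 0.25, 0.5]
    print(tv_affinity(p, q), 1.0 - 0.5 * l1_distance(p, q))
```

The identity holds because $\min(a, b) = (a + b - |a - b|)/2$ and both densities integrate to one, so a lower bound on the affinity is equivalent to an upper bound on the $L_1$-distance.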
Lemma C.2 For any $p \times p$ symmetric matrix $B$ such that $I_p + tB$ is a positive definite matrix for any $t \in [0, 1]$ and $\|B\|_F$ is small,
Proof of lemma C.
Proof of Theorem 3.3 We follow closely the line of a proof in [11]. By Jensen's inequality, for any $A \subset C(\tau)$, where $\hat\Sigma_n = E^\pi(\Sigma_n \mid X_n)$. We show that for any $\tau > 0$ and $A \subset C(\tau)$, for some constant $c > 0$. Note that $\tau$ can depend on $n$ and possibly $\tau \to \infty$ as $n \to \infty$.
We have the upper bound for the P-loss convergence rate
$$\le \frac{c}{(n + \nu_n - p)^3} \left( n^2 p^2 \tau^2 + n p^2 \|A_n\| \tau + p^2 \|A_n\|^2 \right) + \frac{c}{(n + \nu_n - p)^2} \left( n p^2 \tau^2 + (\nu_n - p)^2 p \tau^2 + p \|A_n\|^2 \right) \quad (16)$$
for some constant $c > 0$. Now, we get the upper bound if we assume $\nu_n = p$ and $\|A_n\|^2 = O(n)$.
If we assume $p \le \sqrt{n}$, each term in (16) should be smaller than $\tau^2 \cdot p^2/n$ to obtain the minimax rate. Under this condition, $\nu_n^2 = O(np)$ and $\|A_n\|^2 = O(np)$ is the necessary and sufficient condition to attain the minimax rate $\tau^2 \cdot p^2/n$.
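The rate comparison can be checked numerically. The sketch below assumes that (16) has the form of two fractions with denominators $(n + \nu_n - p)^3$ and $(n + \nu_n - p)^2$, which is a reading of the flattened display and not a confirmed formula; the constant `c = 1` and the function names are placeholders. It evaluates the bound at $\nu_n = p$, $\|A_n\| = \sqrt{n}$ and divides by the target rate $\tau^2 p^2/n$.

```python
from math import isqrt, sqrt


def bound_16(n, p, nu, a_norm, tau, c=1.0):
    """Assumed form of the upper bound (16); c is a placeholder constant."""
    m = n + nu - p
    first = c / m ** 3 * (
        n ** 2 * p ** 2 * tau ** 2 + n * p ** 2 * a_norm * tau + p ** 2 * a_norm ** 2
    )
    second = c / m ** 2 * (
        n * p ** 2 * tau ** 2 + (nu - p) ** 2 * p * tau ** 2 + p * a_norm ** 2
    )
    return first + second


def ratio_to_minimax(n, p, tau=1.0):
    """Bound (16) at nu_n = p and ||A_n|| = sqrt(n), relative to tau^2 p^2 / n."""
    return bound_16(n, p, nu=p, a_norm=sqrt(n), tau=tau) / (tau ** 2 * p ** 2 / n)


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        p = isqrt(n)  # boundary case p = sqrt(n)
        print(n, p, ratio_to_minimax(n, p))
```

Under these assumptions the ratio stays bounded as $n$ grows with $p = \sqrt{n}$, which is consistent with the bound being of order $\tau^2 p^2/n$ for this choice of hyperparameters.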

E Proof of Theorem 3.6
To obtain the minimax posterior rate of the Bregman divergence, we need the following lemma from [13].