Bernstein-von Mises Theorems for Functionals of Covariance Matrix

We provide a general theoretical framework for deriving Bernstein-von Mises theorems for matrix functionals. The conditions on the functionals and the priors are explicit and easy to check. Results are obtained for various functionals, including entries of the covariance matrix, entries of the precision matrix, quadratic forms, the log-determinant, and eigenvalues, in the Bayesian Gaussian covariance/precision matrix estimation setting, as well as for Bayesian linear and quadratic discriminant analysis.


Introduction
The celebrated Bernstein-von Mises (BvM) theorem [20,3,29,21,27] justifies Bayesian methods from a frequentist point of view, bridging the gap between Bayesians and frequentists. Consider a parametric model $\{P_\theta : \theta \in \Theta\}$ and a prior distribution $\theta \sim \Pi$. Suppose we have i.i.d. observations $X^n = (X_1, \ldots, X_n)$ from the product measure $P^n_{\theta^*}$. Under some weak assumptions, the Bernstein-von Mises theorem shows that the conditional distribution of $\sqrt{n}(\theta - \hat\theta) \mid X^n$ is asymptotically $N(0, V^2)$ under the distribution $P^n_{\theta^*}$, with some centering $\hat\theta$ and covariance $V^2$, as $n \to \infty$. In a locally asymptotically normal (LAN) family, the centering $\hat\theta$ can be taken to be the maximum likelihood estimator (MLE) and $V^2$ the inverse of the Fisher information matrix. An immediate consequence of the Bernstein-von Mises theorem is that the distributions $\sqrt{n}(\theta - \hat\theta) \mid X^n$ and $\sqrt{n}(\hat\theta - \theta) \mid \theta = \theta^*$ are asymptotically the same, so that Bayesian credible sets are asymptotically frequentist confidence sets with the same coverage.

The paper is organized as follows. In Section 2, we state the general theoretical framework of our results. It is illustrated with two priors, one conjugate and one non-conjugate. Section 3 considers specific examples of matrix functionals and the associated BvM results. The extension to discriminant analysis is developed in Section 4. Finally, we devote Section 5 to a discussion of the assumptions and possible generalizations. Most of the proofs are gathered in Section 6.

Notation
Given a matrix $A$, we use $\|A\|$ to denote its spectral norm and $\|A\|_F$ to denote its Frobenius norm. The norm $\|\cdot\|$, when applied to a vector, is understood to be the usual vector norm. Let $S^{p-1}$ be the unit sphere in $\mathbb{R}^p$. For any $a, b \in \mathbb{R}$, we use the notation $a \vee b = \max(a,b)$ and $a \wedge b = \min(a,b)$. The probability $P_\Sigma$ stands for $N(0,\Sigma)$ and $P_{(\mu,\Omega)}$ stands for $N(\mu, \Omega^{-1})$. In most cases, we use $\Sigma$ to denote the covariance matrix and $\Omega$ to denote the precision matrix (including those with superscripts or subscripts). The notation $P$ is used for a generic probability measure whenever the distribution is clear from the context. We use $O_P(\cdot)$ and $o_P(\cdot)$ to denote stochastic orders under the sampling distribution of the data. Throughout the paper, $C$ denotes a generic constant that may differ from line to line. The sample covariance matrix is $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n X_i X_i^T$.

A General Framework
Given i.i.d. observations $X^n = (X_1, \ldots, X_n)$ from $N(0, \Sigma^*)$ with true precision matrix $\Omega^* = (\Sigma^*)^{-1}$, the log-likelihood of a precision matrix $\Omega$ is $\ell_n(\Omega) = \frac{n}{2}\log\det(\Omega) - \frac{n}{2}\mathrm{tr}(\Omega\hat\Sigma)$, and the posterior distribution induced by a prior $\Pi$ is $\Pi(B \mid X^n) = \int_B e^{\ell_n(\Omega)}\,d\Pi(\Omega) \big/ \int e^{\ell_n(\Omega)}\,d\Pi(\Omega)$. We deliberately omit the logarithmic normalizing constant in $\ell_n(\Omega)$ for simplicity; this does not affect the definition of the posterior distribution. Note that specifying a prior on the precision matrix $\Omega$ is equivalent to specifying a prior on the covariance matrix $\Omega^{-1}$. The goal of this work is to show that the asymptotic distribution of a functional $f(\Omega)$ under the posterior distribution is approximately normal, i.e.,
$$\Pi\Big(\sqrt{n}\,\frac{f(\Omega) - \hat f}{V} \le t \,\Big|\, X^n\Big) \to P(Z \le t) \quad \text{for every } t,$$
where $Z \sim N(0,1)$, as $(n,p) \to \infty$ jointly, with some appropriate centering $\hat f$ and variance $V^2$. In this paper, we choose the centering $\hat f$ to be the sample version of $f(\Omega) = f(\Sigma^{-1})$, in which $\Sigma$ is replaced by the sample covariance $\hat\Sigma$, and we compare the BvM results with the classical asymptotic normality of $\hat f$ in the frequentist sense. Other centerings $\hat f$, including bias corrections of the sample version, will be considered in future work. We first provide a framework for approximately linear functionals, and then use the general theory to derive results for specific examples of priors and functionals. For clarity of presentation, we consider the cases of functionals of $\Sigma$ and functionals of $\Omega$ separately. Though a functional of $\Sigma$ is also a functional of $\Omega$, we treat them separately, since some functionals may be "more linear" in $\Sigma$ than in $\Omega$, or the other way around.
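As a concrete illustration of this framework, the following sketch draws from a posterior over $\Omega$, evaluates a functional on each draw, and standardizes by the plug-in centering. It uses the conjugate Wishart posterior derived in Section 6 as the sampler; the sizes $n$, $p$, the prior parameter $b = 2$, and the choice of functional ($\sigma_{11}$) are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import wishart

# Monte Carlo sketch of the BvM statement: sample Omega from the
# posterior, evaluate f(Omega), and standardize by the plug-in centering
# f_hat.  The Wishart posterior W_p((n*S_hat + I)^{-1}, n + p + b - 1)
# is the conjugate form derived in Section 6; n, p, b are illustrative.
rng = np.random.default_rng(0)
n, p, b = 2000, 5, 2
X = rng.standard_normal((n, p))            # data with Sigma_star = I
S_hat = X.T @ X / n                        # sample covariance

draws = wishart.rvs(df=n + p + b - 1,
                    scale=np.linalg.inv(n * S_hat + np.eye(p)),
                    size=5000, random_state=rng)

f = lambda Omega: np.linalg.inv(Omega)[0, 0]   # functional: sigma_11
f_hat = S_hat[0, 0]                            # plug-in centering
post = np.sqrt(n) * (np.array([f(O) for O in draws]) - f_hat)
# BvM predicts an approximate N(0, V^2) law with V^2 = 2*sigma_11^2 = 2:
print(post.mean(), post.std())
```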

Functional of Covariance Matrix
Let us first consider a functional of $\Sigma$, $f = \varphi(\Sigma)$, which is approximately linear in a neighborhood of the truth. We assume there is a set $A_n$ satisfying
$$A_n \subset \big\{\|\Sigma - \Sigma^*\| \le \delta_n\big\} \qquad (1)$$
for some sequence $\delta_n = o(1)$, on which $\varphi(\Sigma)$ is approximately linear in the sense that there exists a symmetric matrix $\Phi$ such that
$$\sup_{A_n} \frac{\sqrt{n}\,\big|\varphi(\Sigma) - \varphi(\hat\Sigma) - \mathrm{tr}\big((\Sigma - \hat\Sigma)\Phi\big)\big|}{\|\Sigma^{*1/2}\Phi\Sigma^{*1/2}\|_F} = o(1). \qquad (2)$$
The main result is stated in the following theorem.
The theorem gives explicit conditions on both the prior and the functional. The first condition says that the posterior distribution concentrates on a neighborhood of the truth under the spectral norm, on which the functional is approximately linear. The second condition says that the bias caused by the shifted parameter can be absorbed by the posterior distribution. Under both conditions, Theorem 2.1 shows that the asymptotic posterior distribution of $\varphi(\Sigma)$ is $N\big(\varphi(\hat\Sigma),\, 2n^{-1}\|\Sigma^{*1/2}\Phi\Sigma^{*1/2}\|_F^2\big)$.

Functional of Precision Matrix
We state a corresponding theorem for functionals of the precision matrix in this section. The condition for the linear approximation is slightly different. Consider the functional $f = \psi(\Omega)$. Let $A_n$ be a set satisfying
$$A_n \subset \big\{\sqrt{p}\,\|\Sigma - \Sigma^*\| \le \delta_n\big\} \qquad (3)$$
for some sequence $\delta_n = o(1)$, and fix an integer $r > 0$. We assume the functional $\psi(\Omega)$ is approximately linear on $A_n$ in the sense that there exists a symmetric matrix $\Psi$ satisfying $\mathrm{rank}(\Psi) \le r$, such that
$$\sup_{A_n} \frac{\sqrt{n}\,\big|\psi(\Omega) - \psi(\hat\Omega) - \mathrm{tr}\big((\Omega - \hat\Omega)\Psi\big)\big|}{\|\Omega^{*1/2}\Psi\Omega^{*1/2}\|_F} = o(1). \qquad (4)$$
The main result is stated in the following theorem.
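The proofs (Section 6) relate the two linearizations through the matrix $\Phi = -\Omega^*\Psi\Omega^*$. The sketch below checks this correspondence numerically for $\psi_v(\Omega) = v^T\Omega v$, i.e. $\Psi = vv^T$; the dimensions and the perturbation size are arbitrary choices.

```python
import numpy as np

# Numerical check of the correspondence Phi = -Omega* Psi Omega* used in
# the proof of Theorem 2.2: a functional linear in Omega with matrix Psi
# is approximately linear in Sigma with matrix Phi.  Example: Psi = v v'.
rng = np.random.default_rng(1)
p = 6
A = rng.standard_normal((p, p))
Sigma_star = A @ A.T + np.eye(p)
Omega_star = np.linalg.inv(Sigma_star)
v = rng.standard_normal(p)
E = rng.standard_normal((p, p)); E = 1e-5 * (E + E.T)   # small symmetric perturbation

psi = lambda S: v @ np.linalg.inv(S) @ v        # psi viewed as a function of Sigma
exact = psi(Sigma_star + E) - psi(Sigma_star)
Phi = -Omega_star @ np.outer(v, v) @ Omega_star
linear = np.sum(E * Phi)                        # tr(E @ Phi) for symmetric E, Phi
print(exact, linear)                            # agree up to second order in ||E||
```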

Priors
In this section, we provide examples of priors; in particular, we consider both a conjugate prior and a non-conjugate prior. Note that the result for a conjugate prior can be derived by directly exploiting the explicit posterior form, without applying our general theory. However, the general framework provided in this paper handles both conjugate and non-conjugate priors in a unified way.

Wishart Prior
Consider the Wishart prior $W_p(I, p + b - 1)$ on $\Omega$, with density function $\pi(\Omega) \propto \det(\Omega)^{(b-2)/2}\exp\big(-\tfrac{1}{2}\mathrm{tr}(\Omega)\big)$ supported on the set of symmetric positive semi-definite matrices. Lemma 2.1 below, proved in Section 6, shows that both conditions of Theorem 2.1 and Theorem 2.2 hold for this prior with $A_n = \{\|\Sigma - \Sigma^*\| \le M\sqrt{p/n}\}$ for some $M > 0$.

Gaussian Prior
Consider the Gaussian prior on $\Omega$ that puts i.i.d. $N(0,1)$ entries on the upper triangle (including the diagonal) of the symmetric matrix $\Omega$, with density supported on the set
$$\big\{\Omega : \Omega^T = \Omega,\ \|\Omega\| \vee \|\Omega^{-1}\| \le 2\Lambda\big\} \qquad (6)$$
for some constant $\Lambda > 0$. Lemma 2.2 below shows that both conditions of Theorem 2.1 and Theorem 2.2 hold for this prior with $A_n = \{\|\Sigma - \Sigma^*\|_F \le M\sqrt{p^2\log n/n}\}$ for some $M > 0$.
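A minimal sketch of evaluating this prior's log-density follows, assuming the i.i.d. $N(0,1)$ upper-triangle specification and the spectral support set above; the default value of $\Lambda$ is an arbitrary illustration.

```python
import numpy as np

# Unnormalized log-density of the Gaussian prior on the precision matrix:
# i.i.d. N(0, 1) entries on the upper triangle (including the diagonal)
# of a symmetric Omega, restricted to the spectral support set above.
def gaussian_prior_logdensity(Omega, Lam=5.0):
    Omega = np.asarray(Omega)
    eig = np.linalg.eigvalsh(Omega)
    # support set: ||Omega|| and ||Omega^{-1}|| bounded by 2*Lambda
    if eig.max() > 2 * Lam or eig.min() < 1.0 / (2 * Lam):
        return -np.inf
    iu = np.triu_indices_from(Omega)
    return -0.5 * np.sum(Omega[iu] ** 2)   # up to the normalizing constant

print(gaussian_prior_logdensity(np.eye(3)))   # -1.5 (inside the support)
```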

Quadratic Form
Consider the functionals $\varphi_v(\Sigma) = v^T\Sigma v = \mathrm{tr}(\Sigma vv^T)$ and $\psi_v(\Omega) = v^T\Omega v = \mathrm{tr}(\Omega vv^T)$ for some $v \in \mathbb{R}^p$. The corresponding matrices $\Phi$ and $\Psi$ are both $vv^T$, and it is easy to see that $\mathrm{rank}(vv^T) = 1$. The asymptotic variances are $2(v^T\Sigma^* v)^2$ and $2(v^T\Omega^* v)^2$, respectively. Plugging these representations into the general framework yields the corresponding corollaries: for the Wishart prior, it suffices to additionally assume $p^2/n = o(1)$; for the Gaussian prior, it suffices to additionally assume $p^3\log n/n = o(1)$. Remark 3.1. The entry-wise functional and the quadratic form are both special cases of the functional $u^T\Sigma v$ for some $u, v \in \mathbb{R}^p$. It is direct to apply the general framework to this functional (with the symmetrized matrix $\Phi = (uv^T + vu^T)/2$) and obtain the corresponding result. Similarly, a result holds for the functional $u^T\Omega v$ for some $u, v \in \mathbb{R}^p$. Both results can be derived under the same conditions as Corollary 3.3 and Corollary 3.4.
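For the quadratic form, the general variance expression collapses to a closed form, since $\|\Sigma^{1/2}vv^T\Sigma^{1/2}\|_F^2 = (v^T\Sigma v)^2$ for the rank-one $\Phi$. A quick numerical confirmation (with arbitrary $\Sigma$ and $v$):

```python
import numpy as np

# Verify that 2 * ||Sigma^{1/2} v v' Sigma^{1/2}||_F^2, the general
# asymptotic-variance expression of Theorem 2.1 with Phi = v v',
# equals the closed form 2 * (v' Sigma v)^2.
rng = np.random.default_rng(2)
p = 6
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)                  # an arbitrary covariance matrix
v = rng.standard_normal(p)

L = np.linalg.cholesky(Sigma)                # Sigma = L L'; any root works here
M = L.T @ np.outer(v, v) @ L                 # same Frobenius norm as with the symmetric root
var_general = 2 * np.sum(M ** 2)
var_closed = 2 * (v @ Sigma @ v) ** 2
print(np.isclose(var_general, var_closed))   # True
```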

Log Determinant
In this section, we consider the log-determinant functional, that is, $\varphi(\Sigma) = \log\det(\Sigma)$. Unlike the entry-wise functional and the quadratic form, we do not need to consider $\log\det(\Omega)$ separately, because of the simple observation $\log\det(\Omega) = -\log\det(\Sigma)$.
The following lemma establishes the approximate linearity of log det(Σ).
By Lemma 3.1, the corresponding matrix $\Phi$ is $\Omega^*$, and the asymptotic variance is $2n^{-1}\|\Sigma^{*1/2}\Omega^*\Sigma^{*1/2}\|_F^2 = 2p/n$. The resulting BvM statement holds with centering $\log\det(\hat\Sigma)$, where $\hat\Sigma$ is the sample covariance matrix.
Proof. By Theorem 2.1 and Lemma 2.1, we only need to check the approximate linearity of the functional. According to the proof of Lemma 2.1, the choice of $A_n$ such that $\Pi(A_n \mid X^n) = 1 - o_P(1)$ is $\{\|\Sigma - \Sigma^*\| \le M\sqrt{p/n}\}$ for some $M > 0$. This implies $\|\Sigma - \Sigma^*\|_F \le M\sqrt{p^2/n}$ on $A_n$, and the approximate linearity follows from Lemma 3.1. The corresponding result also holds for the Gaussian prior, with centering $\log\det(\hat\Sigma)$, where $\hat\Sigma$ is the sample covariance matrix.
Proof. The proof of this corollary is the same as that of the previous one for the Wishart prior. The only difference is that the choice of $A_n$, according to the proof of Lemma 2.2, is $\{\|\Sigma - \Sigma^*\|_F \le M\sqrt{p^2\log n/n}\}$ for some $M > 0$. Therefore $\|\Sigma - \Sigma^*\|_F \le \delta_n$ for some $\delta_n = o(1)$ under the assumption, and the approximate linearity holds.
One immediate consequence of the result is the Bernstein-von Mises result for the entropy functional, defined as $H(\Sigma) = \frac{p}{2}\log(2\pi e) + \frac{1}{2}\log\det(\Sigma)$, the entropy of $N(0,\Sigma)$. It is then direct that $\sqrt{2n/p}\,\big(H(\Sigma) - H(\hat\Sigma)\big) \,\big|\, X^n \approx N(0,1)$.
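A simulation sketch of the log-determinant statement under the conjugate Wishart posterior; the sample sizes and the prior parameter $b = 2$ are illustrative, and a small bias is visible unless $p^3/n$ is very small.

```python
import numpy as np
from scipy.stats import wishart

# Monte Carlo sketch for the log-determinant: under the Wishart
# posterior, sqrt(n/(2p)) * (log det(Sigma) - log det(Sigma_hat))
# should be approximately standard normal (up to a bias term that
# vanishes when p^3/n is small).
rng = np.random.default_rng(3)
n, p, b = 5000, 10, 2
X = rng.standard_normal((n, p))            # Sigma_star = I
S_hat = X.T @ X / n

draws = wishart.rvs(df=n + p + b - 1,
                    scale=np.linalg.inv(n * S_hat + np.eye(p)),
                    size=4000, random_state=rng)
logdet_Sigma = -np.array([np.linalg.slogdet(O)[1] for O in draws])
z = np.sqrt(n / (2 * p)) * (logdet_Sigma - np.linalg.slogdet(S_hat)[1])
print(z.mean(), z.std())                   # roughly (0, 1)
```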

Eigenvalues
In this section, we consider the eigenvalue functional. In particular, let $\{\lambda_m(\Sigma)\}_{m=1}^p$ be the eigenvalues of the matrix $\Sigma$ in decreasing order. We investigate the posterior distribution of $\lambda_m(\Sigma)$ for each $m = 1, \ldots, p$. Define the eigengap
$$\delta = \big(\lambda_{m-1}(\Sigma^*) - \lambda_m(\Sigma^*)\big) \wedge \big(\lambda_m(\Sigma^*) - \lambda_{m+1}(\Sigma^*)\big),$$
with the convention $\lambda_0(\Sigma^*) = \infty$ and $\lambda_{p+1}(\Sigma^*) = -\infty$. The asymptotic order of $\delta$ plays an important role in the theory. The following lemma characterizes the approximate linearity of $\lambda_m(\Sigma)$.
where u * m is the m-th eigenvector of Σ * .
Lemma 3.2 implies that the corresponding $\Phi$ in the linear expansion of $\varphi(\Sigma)$ is $u^*_m u^{*T}_m$, and the asymptotic variance is $2n^{-1}\lambda_m(\Sigma^*)^2$. We also consider eigenvalues of the precision matrix. With a slight abuse of notation, we define the eigengap of $\lambda_m(\Omega^*)$ analogously, with $\Omega^*$ in place of $\Sigma^*$. The approximate linearity of $\lambda_m(\Omega)$ is established in the following lemma.
where u * m is the m-th eigenvector of Ω * .
Similarly, Lemma 3.3 implies that the corresponding $\Psi$ in the linear expansion of $\psi(\Omega)$ is $u^*_m u^{*T}_m$, and the asymptotic variance is $2n^{-1}\lambda_m(\Omega^*)^2$. Plugging the above lemmas into our general framework, we obtain the following corollaries.
Corollary 3.7. Consider the Wishart prior. Assume $\frac{p}{\delta\sqrt n} = o(1)$; then the BvM result holds for $\lambda_m(\Sigma)$ with centering $\lambda_m(\hat\Sigma)$, where $\hat\Sigma$ is the sample covariance matrix. If we instead assume $\frac{p}{\delta\sqrt n} = o(1)$ with $\delta$ being the eigengap of $\lambda_m(\Omega^*)$, then the corresponding result holds for $\lambda_m(\Omega)$.

Proof. We only need to check the approximate linearity. According to Lemma 2.1, the choice of $A_n$ is $\{\|\Sigma - \Sigma^*\| \le M\sqrt{p/n}\}$, so that $\|\Sigma - \Sigma^*\|^2 \le M^2 p/n$ on the set $A_n$. By Lemma 3.2 and Lemma 3.3, the approximate linearity holds under the assumption.

Corollary 3.8. Consider the Gaussian prior $\Pi$ in (6). Assume $\|\Sigma^*\| \vee \|\Omega^*\| \le \Lambda = O(1)$ and $\frac{p^2\log n}{\delta\sqrt n} = o(1)$; then the BvM result holds for $\lambda_m(\Sigma)$ with centering $\lambda_m(\hat\Sigma)$, where $\hat\Sigma$ is the sample covariance matrix. If we instead assume $\frac{p^2\log n}{\delta\sqrt n} = o(1)$ with $\delta$ being the eigengap of $\lambda_m(\Omega^*)$, then the corresponding result holds for $\lambda_m(\Omega)$.

Proof. We only need to check the approximate linearity. According to Lemma 2.2, the choice of $A_n$ is $\{\|\Sigma - \Sigma^*\|_F \le M\sqrt{p^2\log n/n}\}$ for some $M > 0$. The assumption $\frac{p^2\log n}{\delta\sqrt n} = o(1)$ gives the required bound on the set $A_n$. By Lemma 3.2 and Lemma 3.3, the approximate linearity holds.
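The linearization in Lemmas 3.2 and 3.3 is the classical first-order eigenvalue perturbation expansion; a small numerical check (with an arbitrary matrix and perturbation):

```python
import numpy as np

# Check of the linear expansion in Lemma 3.2: for a small symmetric
# perturbation E, lambda_m(Sigma* + E) is approximately
# lambda_m(Sigma*) + u_m' E u_m, with u_m the m-th eigenvector of Sigma*.
rng = np.random.default_rng(4)
p, m = 8, 0                                # m = 0: the largest eigenvalue
A = rng.standard_normal((p, p))
Sigma_star = A @ A.T + np.eye(p)
E = rng.standard_normal((p, p)); E = 1e-4 * (E + E.T)

vals, vecs = np.linalg.eigh(Sigma_star)    # eigh sorts ascending
lam, u = vals[-1 - m], vecs[:, -1 - m]

exact = np.linalg.eigvalsh(Sigma_star + E)[-1 - m] - lam
linear = u @ E @ u                         # tr(E u u')
print(exact, linear)                       # agree to second order in ||E||
```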

Discriminant Analysis
In this section, we generalize the theory of Section 2 to the BvM theorem in discriminant analysis. Let $X^n = (X_1, \ldots, X_n)$ and $Y^n = (Y_1, \ldots, Y_n)$ be i.i.d. training samples, where $X_i \sim N\big(\mu_X^*, (\Omega_X^*)^{-1}\big)$ and $Y_i \sim N\big(\mu_Y^*, (\Omega_Y^*)^{-1}\big)$. The discriminant analysis problem is to predict whether an independent new sample $z$ is from the $X$-class or the $Y$-class. For a given $(\mu_X, \mu_Y, \Omega_X, \Omega_Y)$, Fisher's QDA rule classifies $z$ into the $X$-class if and only if $\Delta_Q \ge 0$, where
$$\Delta_Q = \frac{1}{2}\log\frac{\det(\Omega_X)}{\det(\Omega_Y)} - \frac{1}{2}(z - \mu_X)^T\Omega_X(z - \mu_X) + \frac{1}{2}(z - \mu_Y)^T\Omega_Y(z - \mu_Y).$$
In this section, we are going to find the asymptotic posterior distribution of this functional, with some appropriate variance $V^2$ and some prior distribution. Since the result is conditional on the new observation $z$, we treat $z$ as a fixed (non-random) vector in this section without loss of generality. Note that when $\Omega_X = \Omega_Y$ is assumed, the QDA rule reduces to the LDA rule. We give general results for the Bernstein-von Mises theorem to hold in both cases.
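A direct transcription of this rule (a sketch, assuming the equal-prior-probability convention implicit in the display above):

```python
import numpy as np

# Fisher's QDA discriminant from the display above, written with
# precision matrices; classify z into the X-class iff the value is >= 0.
def qda_discriminant(z, mu_x, mu_y, omega_x, omega_y):
    rx, ry = z - mu_x, z - mu_y
    return (0.5 * (np.linalg.slogdet(omega_x)[1] - np.linalg.slogdet(omega_y)[1])
            - 0.5 * rx @ omega_x @ rx
            + 0.5 * ry @ omega_y @ ry)

# With omega_x == omega_y (the LDA case) the log-det term vanishes and
# the quadratic-in-z parts cancel, leaving a rule linear in z.
```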

Linear Discriminant Analysis
Assume $\Omega_X^* = \Omega_Y^*$. For a given prior $\Pi$, the posterior distribution for LDA is defined through the joint log-likelihood $\ell_n(\mu_X, \mu_Y, \Omega)$, which decomposes into the sum of the log-likelihoods of the $X$-sample and the $Y$-sample. Define the LDA functional $\Delta_L$ (the special case of $\Delta_Q$ with $\Omega_X = \Omega_Y = \Omega$) and its asymptotic variance $V^2$, denoted $V_L^2$ in (7) below. Assume $A_n$ is a set satisfying a concentration condition analogous to (1), with some $\delta_n = o(1)$. The main result for LDA is the following theorem.
Theorem 4.1 asserts that if, for a given prior $\Pi$, two conditions analogous to those of Theorem 2.1 are satisfied, then the BvM result holds for the LDA functional.
A curious condition in the above theorem is $V^{-1} = O(1)$. The following proposition shows that it is implied by the separation of the two classes.
Proof. The definition of $V^2$ yields a lower bound that is greater than a constant under the separation assumption.

Now we give examples of priors for LDA. We use independent priors, that is, $\mu_X \sim \Pi_X$, $\mu_Y \sim \Pi_Y$, and $\Omega \sim \Pi_\Omega$ independently, so that the prior for the whole parameter $(\Omega, \mu_X, \mu_Y)$ is the product measure $\Pi = \Pi_\Omega \times \Pi_X \times \Pi_Y$. Let $\Pi_\Omega$ be the Gaussian prior defined in (6), and let both $\Pi_X$ and $\Pi_Y$ be $N(0, I_{p\times p})$; a sketch of the resulting log prior density is given below.
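A minimal sketch of this product prior's (unnormalized) log-density, reusing the support-set convention assumed for the Gaussian prior of Section 2.3.2; $\Lambda$ and its default value are illustrative.

```python
import numpy as np

# Unnormalized log prior for the LDA parameter (Omega, mu_X, mu_Y):
# Gaussian prior on the precision matrix (i.i.d. N(0,1) upper-triangle
# entries on a spectral support set) times N(0, I_p) priors on the means.
def lda_log_prior(Omega, mu_x, mu_y, Lam=5.0):
    eig = np.linalg.eigvalsh(Omega)
    if eig.max() > 2 * Lam or eig.min() < 1.0 / (2 * Lam):
        return -np.inf                               # outside the support set
    iu = np.triu_indices_from(Omega)
    return (-0.5 * np.sum(Omega[iu] ** 2)            # Gaussian prior on Omega
            - 0.5 * (mu_x @ mu_x + mu_y @ mu_y))     # N(0, I) on the means

print(lda_log_prior(np.eye(3), np.zeros(3), np.ones(3)))   # -3.0
```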
Theorem 4.2 states that, under a condition on $(p, n)$ involving a $\log n$ factor, the prior defined above satisfies the two conditions in Theorem 4.1 for some appropriate $A_n$; thus, the Bernstein-von Mises result holds.

Quadratic Discriminant Analysis
For the general case, in which $\Omega_X^* = \Omega_Y^*$ may not hold, the posterior distribution for QDA is defined through the joint log-likelihood of the two samples. We define the corresponding linearization quantities and the asymptotic variance $V^2$, denoted $V_Q^2$ in (8) below, and assume $A_n$ is a set satisfying a concentration condition analogous to (1), with some $\delta_n = o(1)$. The main result for QDA is the following theorem.
If, for a given prior $\Pi$, two conditions analogous to those of Theorem 4.1 are satisfied, then $\sqrt{n}\,(\Delta_Q - \hat\Delta)/V \mid X^n, Y^n$ is asymptotically distributed as $Z \sim N(0,1)$, where the centering is the plug-in version $\hat\Delta = \Delta(\bar X, \bar Y, \hat\Sigma_X^{-1}, \hat\Sigma_Y^{-1})$. The condition $V^{-1} = O(1)$ is also implied by the separation condition $\|\mu_X^* - \mu_Y^*\| > c$, by applying the same argument as in Proposition 4.1.

Remark 4.2.
For an independent prior, in the sense that $\Pi = \Pi_{(\mu_X,\Omega_X)} \times \Pi_{(\mu_Y,\Omega_Y)}$, the posterior is also independent because of the decomposition of the likelihood. In this case, we have
$$\Pi(A_n \mid X^n, Y^n) = \Pi_X(A_{X,n} \mid X^n)\,\Pi_Y(A_{Y,n} \mid Y^n),$$
with $A_{X,n}$ and $A_{Y,n}$ being versions of $A_n$ involving only $(\mu_X, \Omega_X)$ and $(\mu_Y, \Omega_Y)$. In the same way, the second condition also factorizes. Hence, for the two conditions in Theorem 4.3, it is sufficient to check the conditions for the $X$ part and the corresponding conditions for the $Y$ part, when the prior has an independent structure.
The example of a prior we specify for QDA is similar to the one for LDA. We use independent priors $\mu_X \sim \Pi_X$, $\mu_Y \sim \Pi_Y$, $\Omega_X \sim \Pi_{\Omega_X}$, and $\Omega_Y \sim \Pi_{\Omega_Y}$, so that the prior for the whole parameter $(\Omega_X, \Omega_Y, \mu_X, \mu_Y)$ is the product measure. Let $\Pi_{\Omega_X}$ and $\Pi_{\Omega_Y}$ be the Gaussian prior defined in Section 2.3.2, and let both $\Pi_X$ and $\Pi_Y$ be $N(0, I_{p\times p})$.

Comparison with Frequentist Asymptotic Normality

In this section, we present the classical results for the asymptotic normality of the estimators $\varphi(\hat\Sigma)$ and $\psi(\hat\Sigma^{-1})$. Note that in many cases they coincide with the MLE. The purpose is to compare them with the BvM results obtained in this paper. We first review and define some notation. Recall that $\hat\sigma_{ij}$ is the $(i,j)$-th element of $\hat\Sigma$ and $\hat\omega_{ij}$ is the $(i,j)$-th element of $\hat\Sigma^{-1}$. We let $\hat\Delta_L$ and $\hat\Delta_Q$ be the plug-in LDA and QDA functionals, respectively. The corresponding asymptotic variances are denoted by $V_L^2$ and $V_Q^2$, defined in (7) and (8) respectively. As $p, n \to \infty$ jointly, the asymptotic normality of $\varphi(\hat\Sigma)$ or $\psi(\hat\Sigma^{-1})$ holds under different asymptotic regimes for different functionals. For comparison, we assume that $V_L$, $V_Q$, and the eigengap $\delta$ are at constant levels.
Theorem 5.1. Let $p, n \to \infty$ jointly. Then the asymptotic normality of $\hat\sigma_{ij}$ and $v^T\hat\Sigma v$ holds for any asymptotic regime of $(p,n)$; assuming $p^2/n = o(1)$, asymptotic normality holds for $\hat\omega_{ij}$, $v^T\hat\Sigma^{-1}v$, the eigenvalue functionals, and $\hat\Delta_L$; assuming $p^3/n = o(1)$, it holds for $\log\det(\hat\Sigma)$ and $\hat\Delta_Q$. Since the above results are more or less scattered in the literature, we do not present their proofs in this paper. Interested readers can derive these results using the delta method.
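As a quick illustration of the first claim, the simulation below checks the limiting variance $\sigma_{ii}\sigma_{jj} + \sigma_{ij}^2$ of $\sqrt n(\hat\sigma_{ij} - \sigma_{ij})$, the standard delta-method value for Gaussian data, on an arbitrary $3 \times 3$ covariance matrix.

```python
import numpy as np

# Simulate the frequentist asymptotic normality of the plug-in entry
# functional sigma_hat_ij and compare the empirical variance with the
# limit sigma_ii*sigma_jj + sigma_ij^2.
rng = np.random.default_rng(5)
n, p, reps = 400, 3, 2000
Sigma = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]])
L = np.linalg.cholesky(Sigma)

stats = []
for _ in range(reps):
    X = rng.standard_normal((n, p)) @ L.T    # rows ~ N(0, Sigma)
    S = X.T @ X / n
    stats.append(np.sqrt(n) * (S[0, 1] - Sigma[0, 1]))
print(np.var(stats), Sigma[0, 0] * Sigma[1, 1] + Sigma[0, 1] ** 2)
```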
We remark that the condition $p^2/n = o(1)$ is sharp for (9)-(13). For (9) and (10), sharpness follows from standard counterexamples. Since the functional $\hat\Delta_L$ is harder than $v^T\hat\Omega v$ (the latter is a special case of the former when $\mu_X^*$ and $\mu_Y^*$ are known), $p^2/n = o(1)$ is also sharp for (13). For (11) and (12), we have the following proposition showing that $p^2/n = o(1)$ is necessary.
The condition $p^3/n = o(1)$ is sharp for (14) and (15). If $p^3/n = o(1)$ does not hold, a bias correction is necessary for (14) to hold (see [4]). The condition $p^3/n = o(1)$ is necessary for (15) because the functional $\hat\Delta_Q$ contains the term $\log\det(\hat\Sigma)$.
In the next section, we discuss the asymptotic regimes of $(p, n)$ for BvM and compare them with the frequentist results listed in this section.

The Asymptotic Regime of (p, n)
The BvM results obtained in this paper assume different asymptotic regimes for the sample size $n$ and the dimension $p$. Ignoring $\log n$ factors and assuming a constant eigengap $\delta$ and constant asymptotic variances for LDA and QDA, the asymptotic regimes for $(p, n)$ are summarized in the following table.
The table has three columns: the asymptotic normality of $\varphi(\hat\Sigma)$ and $\psi(\hat\Sigma^{-1})$, and BvM with conjugate and non-conjugate priors, respectively. The purpose is to compare our BvM results with the classical frequentist asymptotic normality. The priors are the Wishart prior and the Gaussian prior considered in this paper. For discriminant analysis, we did not consider a conjugate prior due to space constraints. The conjugate prior in the LDA and QDA settings is the normal-Wishart prior; its posterior distribution can be decomposed as a marginal Wishart times a conditional normal. The analysis of the BvM result for this case is direct, and we claim the asymptotic regimes for LDA and QDA are $p^2 \ll n$ and $p^3 \ll n$ respectively, without giving a formal proof.
Comparing the first and second columns, the conditions on $p$ and $n$ that we need for the BvM results with a conjugate prior match the conditions for the frequentist results. The two exceptions are $\sigma_{ij}$ and $v^T\Sigma v$, for which the frequentist asymptotic normality holds with no assumption on $(p, n)$, while our technique of proof requires $p \ll n$. This is because our theory requires a set $A_n \subset \{\|\Sigma - \Sigma^*\| \le \delta_n\}$ for some $\delta_n = o(1)$ satisfying $\Pi(A_n \mid X^n) = 1 - o_P(1)$. The best rate of convergence for $\|\Sigma - \Sigma^*\|$ is $\sqrt{p/n}$, which leads to $p \ll n$. This assumption might be weakened if a theory different from ours were developed (or through direct calculation, taking advantage of conjugacy).
The comparison of the second and third columns suggests that using a non-conjugate prior requires stronger assumptions. We believe these stronger assumptions can all be weakened. The current stronger assumptions on $p$ and $n$ are caused by the technique we use in this paper to prove posterior contraction, namely Condition 1 in Theorem 2.1 and Theorem 2.2. The current way of proving posterior contraction in nonparametric Bayes theory only allows loss functions which are of the same order as the Kullback-Leibler divergence. In the covariance matrix estimation setting, we can thus only deal with the Frobenius loss, and we choose $A_n \subset \{\|\Sigma - \Sigma^*\|_F \le M\sqrt{p^2\log n/n}\}$. For functionals of the covariance such as $\sigma_{ij}$ and $v^T\Sigma v$, we need $A_n \subset \{\|\Sigma - \Sigma^*\| \le \delta_n\}$ for some $\delta_n$. We have to bound the spectral norm by the Frobenius norm,
$$\|\Sigma - \Sigma^*\| \le \|\Sigma - \Sigma^*\|_F \le M\sqrt{p^2\log n/n},$$
and require $M\sqrt{p^2\log n/n} \le \delta_n = o(1)$, which leads to $p^2 \ll n$ up to the logarithmic factor. For functionals of the precision matrix, we need $A_n \subset \{\sqrt p\,\|\Sigma - \Sigma^*\| \le \delta_n\}$. Again, we bound $\sqrt p\,\|\Sigma - \Sigma^*\| \le \sqrt p\,\|\Sigma - \Sigma^*\|_F \le M\sqrt{p^3\log n/n}$, which leads to $p^3 \ll n$ up to the logarithmic factor.

Covariance Priors
The general framework in Section 2 only considers priors defined on the precision matrix $\Omega$. However, it is sometimes more natural to use a prior defined on the covariance matrix $\Sigma$, for example, a Gaussian prior on $\Sigma$. In that case, the first conditions in Theorem 2.1 and Theorem 2.2 are hard to check. We propose a slight variation of this condition, so that our theory is also user-friendly for covariance priors. We first consider approximately linear functionals of $\Sigma$ satisfying (2); the first condition of Theorem 2.1 can then be replaced by its analogue stated directly in terms of a prior on $\Sigma$. We then consider approximately linear functionals of $\Omega$ satisfying (4); the first condition of Theorem 2.2 can be replaced in the same manner. With the new conditions, it is direct to check them for covariance priors by a change of variable, as is done in the proofs of Lemma 2.1 and Lemma 2.2. In particular, for the Gaussian prior on the covariance matrix, we claim that the conclusion of Lemma 2.2 holds. We do not expand the technical details for covariance priors in this paper due to space constraints.

Relation to Matrix Estimation under Non-Frobenius Loss
As we have mentioned at the end of Section 5.2, the current nonparametric Bayes technique for proving posterior contraction rates only covers losses which are of the same order as the Kullback-Leibler divergence; it cannot handle other, non-intrinsic losses [16]. In the Bayesian matrix estimation setting, whether we can show the following conclusion for a general non-conjugate prior remains open:
$$\Pi\big(\|\Sigma - \Sigma^*\| \le M\sqrt{p/n} \,\big|\, X^n\big) = 1 - o_P(1). \qquad (16)$$
This explains why there is so little literature in this field compared to the growing body of research using frequentist methods; see, for example, [5] and [6]. However, we observe that for the spectral norm loss,
$$\|\Sigma - \Sigma^*\| \le 2\sup_{v\in N}\big|v^T(\Sigma - \Sigma^*)v\big|,$$
where $N$ is a subset of $S^{p-1}$ with cardinality bound $\log|N| \le cp$ for some $c > 0$. The BvM result we establish for the functional $v^T\Sigma v$ indicates that, for each $v$, the posterior distribution of $|v^T(\Sigma - \Sigma^*)v|$ is of order $n^{-1/2}$. Therefore, heuristically, $2\sup_{v\in N}|v^T(\Sigma - \Sigma^*)v|$ should be of order $\sqrt{\log|N|}/\sqrt{n}$, which is $\sqrt{p/n}$. We will use this intuition as a key idea in our future research on Bayesian matrix estimation.
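The discretization bound above is the standard net argument; the sketch below illustrates it, substituting random unit vectors for a genuine $1/4$-net (so the printed supremum only approximates the one over a true net).

```python
import numpy as np

# Illustrate the net bound ||A|| <= 2 * sup_{v in N} |v' A v| for a
# symmetric matrix A and a 1/4-net N of the sphere.  Random unit
# vectors stand in for a true net here, giving a lower bound on the sup.
rng = np.random.default_rng(6)
p = 5
A = rng.standard_normal((p, p)); A = (A + A.T) / 2

V = rng.standard_normal((4000, p))
V /= np.linalg.norm(V, axis=1, keepdims=True)
quad_sup = np.abs(np.einsum('ij,jk,ik->i', V, A, V)).max()   # max |v' A v|
print(np.linalg.norm(A, 2), 2 * quad_sup)
```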
Once (16) is established for a non-conjugate prior (e.g., the Gaussian prior in this paper), we may use (16) to weaken the conditions in the third column of the table in Section 5.2. In fact, most entries of that column can then be weakened to match the conditions in the second column for a conjugate prior, as argued in Section 5.2.

Proofs

Before stating the proofs, we first display some lemmas. The following lemma is Lemma 2 in [10]. It allows us to prove BvM results through convergence of moment generating functions.

Lemma 6.1. Consider a random probability measure $P_n$ and a fixed probability measure $P$. Suppose that, for any real $t$, the Laplace transform $\int e^{tx}\,dP(x)$ is finite and $\int e^{tx}\,dP_n(x) \to \int e^{tx}\,dP(x)$ in probability. Then $P_n$ converges weakly to $P$ in probability.

The next lemma is an expansion of the Gaussian likelihood.
Lemma 6.2. The likelihood expansion (17) below holds for all $\Omega \in A_n$, with $A_n$ satisfying (1) or (3).
The next lemma is a concentration bound for the sample covariance matrix, giving an exponential tail bound for any $t > 0$.

Proof of Theorem 2.1. We are going to use Lemma 6.1 and establish the convergence of the moment generating function. We claim that (18) holds uniformly over $A_n$; the derivation of (18) will be given at the end of the proof. Define the posterior distribution conditioned on $A_n$ by $\Pi_{A_n}(B \mid X^n) = \Pi(A_n \cap B \mid X^n)/\Pi(A_n \mid X^n)$ for any $B$.
It is easy to see that $\sup_B\big|\Pi_{A_n}(B \mid X^n) - \Pi(B \mid X^n)\big| = o_P(1)$ (19), by the first condition of Theorem 2.1. Now we calculate the moment generating function of $\sqrt n\,\big(\varphi(\Sigma) - \varphi(\hat\Sigma)\big)\big/\big(\sqrt 2\,\|\Sigma^{*1/2}\Phi\Sigma^{*1/2}\|_F\big)$ under $\Pi_{A_n}(\cdot \mid X^n)$. By (18), it equals $e^{t^2/2}(1 + o(1))$ times the ratio $\int_{A_n}\exp\big(\ell_n(\Omega_t)\big)\,d\Pi(\Omega)\big/\int_{A_n}\exp\big(\ell_n(\Omega)\big)\,d\Pi(\Omega)$, and the last factor is $1 + o(1)$ by the second condition of Theorem 2.1. We have shown that the moment generating function of $\sqrt n\,\big(\varphi(\Sigma) - \varphi(\hat\Sigma)\big)\big/\big(\sqrt 2\,\|\Sigma^{*1/2}\Phi\Sigma^{*1/2}\|_F\big)$ under the distribution $\Pi_{A_n}(\cdot \mid X^n)$ converges to the moment generating function of $N(0,1)$ in probability. By Lemma 6.1 and (19), we have established the desired result.
To finish the proof, let us derive (18). Using the likelihood expansion in Lemma 6.2, we first show (20), where the $o(1)$ is uniform on $A_n$. Comparing (20) with (17) in Lemma 6.2, it is sufficient to bound the remainder term $R_1$. We use the following argument to bound $R_1$ on $A_n$.
Proof of Theorem 2.2. We follow the reasoning in the proof of Theorem 2.1 and omit some similar steps. Define Φ = −Ω * ΨΩ * .
It is easy to see that $\|\Omega^{*1/2}\Psi\Omega^{*1/2}\|_F = \|\Sigma^{*1/2}\Phi\Sigma^{*1/2}\|_F$. Then, by Lemma 6.2 and arguments similar to those in the proof of Theorem 2.1, we obtain the analogue of (20) uniformly on $A_n$. We are going to approximate $\sqrt n\,\mathrm{tr}\big((\Sigma - \hat\Sigma)\Phi\big)$ by $\sqrt n\,\big(\psi(\Omega) - \psi(\hat\Sigma^{-1})\big)$ on $A_n$. Define $\hat\Omega = \hat\Sigma^{-1}$. The assumption $rp^2/n = o(1)$ implies that $p/n = o(1)$; thus $\hat\Omega$ is well defined. By Lemma 6.3, the sample covariance concentrates as required. Using the notation $V^2 = 2\|\Omega^{*1/2}\Psi\Omega^{*1/2}\|_F^2$ and the singular value decomposition $\Psi = \sum_{l=1}^r d_l q_l q_l^T$, the approximation error on $A_n$ can be bounded term by term, where we use (23) in the second-to-last inequality. Hence, the approximation holds uniformly on $A_n$. The remaining part of the proof is the same as the corresponding steps in the proof of Theorem 2.1. Thus, the proof is complete.

Proof of Lemma 2.1
Proof of Lemma 2.1. The proof has two parts. In the first part, we establish the first condition of the two theorems by proving a posterior contraction rate. In the second part, we establish the second condition of the two theorems by showing that a change of variable is negligible under the Wishart density.

Part I. The posterior distribution of $\Omega \mid X^n$ is $W_p\big((n\hat\Sigma + I)^{-1},\, n + p + b - 1\big)$. Conditioning on $X^n$, let $Z_l \mid X^n \sim P_{(n\hat\Sigma + I)^{-1}}$ i.i.d. for each $l = 1, 2, \ldots, n + p + b - 1$. Then the posterior distribution of $\Omega$ is identical to the distribution of $\sum_{l=1}^{n+p+b-1} Z_l Z_l^T \mid X^n$. Define the set $G_n$; we have $P^n_{\Sigma^*}(G_n^c) \le C\exp(-cp)$ by Lemma 6.3, for some $c, C > 0$. The event $G_n$ implies $\|\hat\Sigma - \Sigma^*\| \le C\|\Sigma^*\|\sqrt{p/n}$, from which we can deduce the corresponding bounds. Using the obtained results, we can bound the deviation of the sample covariance, and the posterior deviation can be bounded similarly, where we use $W_l \sim N(0, I)$ in the corresponding calculations. In summary, we have proved the posterior contraction on $A_n$.

Part II. Note that the proof of this part is the same for both Theorem 2.1 and Theorem 2.2, by letting $\Phi = -\Omega^*\Psi\Omega^*$. We introduce the notation $\tilde\Phi$, a normalized version of $\Phi$, and set $\Omega_t = \Omega + 2tn^{-1/2}\tilde\Phi$. Now we study the integral $\int_{A_n}\exp(\ell_n(\Omega_t))\,d\Pi(\Omega)$. Let $N(p, b)$ be the normalizing constant of $W_p(I, p + b - 1)$. We compute $\int_{A_n}\exp(\ell_n(\Omega_t))\,d\Pi(\Omega)$ by a change of variable. The integrals are meaningful because $A_n \cup \big(A_n + 2tn^{-1/2}\tilde\Phi\big) \subset \{\Omega : \Omega > 0,\ \Omega = \Omega^T\}$.

Note that the inclusions $A'_n \subset A_n + 2tn^{-1/2}\tilde\Phi \subset A''_n$ are also true when $M', M, M''$ are large enough. Let $\|\tilde\Phi\|_N$ be the nuclear norm of $\tilde\Phi$, defined as the sum of the absolute values of its eigenvalues. Note that on $A''_n$ we have the required bound.

Proof of Lemma 2.2
Now we are going to prove Lemma 2.2. Like the proof of Lemma 2.1, it has two parts. The first part shows posterior contraction on some appropriate set $A_n$. Note that the Wishart prior is conjugate, so its posterior contraction can be calculated directly. For the Gaussian prior, its non-conjugacy requires us to apply some general results from nonparametric Bayes theory. To be specific, we follow the testing approach of [2] and [15]. An outline of the testing approach for proving posterior contraction in Bayesian matrix estimation can be found in Section 5 of [14]. We first state some lemmas.
for some constant C > 0.
The next lemma is Lemma 5.1 in [14].
Then for any b > 0, we have for some constant C > 0.
The next lemma is Lemma 5.9 in [14].

Proof of Lemma 2.2.
Like what we have done in the Wishart case, the proof has two parts.
In the first part, we establish the first condition of the two theorems by proving a posterior contraction rate. In the second part, we establish the second condition of the two theorems by showing that a change of variable is negligible under the Gaussian density.

Part I. Define $A_n = \{\|\Sigma - \Sigma^*\|_F \le M\sqrt{p^2\log n/n}\}$. Let us establish a test between the following hypotheses: $H_0 : \Omega = \Omega^*$ versus $H_1 : \Omega \in A_n^c \cap \mathrm{supp}(\Pi)$. We cover the alternative set by $N$ pieces and choose the smallest such $N$, which is determined by the covering number. By Lemma 6.6, for each piece there exists a test $\phi_j$ with exponentially small errors. Define $\phi = \max_{1\le j\le N}\phi_j$. Using a union bound to control the testing error, we obtain exponentially small type I and type II errors for sufficiently large $M$. We bound $\Pi(A_n^c \mid X^n)$ by a sum of three terms, the last being $P^n_{\Sigma^*}(D_n)$ with $D_n = \big\{\text{the posterior denominator is at most } \exp(-2p^2\log n)\big\}$.
In the upper bound above, the first two terms are bounded by the testing errors we have established. The last term can be bounded by combining the results of Lemma 6.4 and Lemma 6.5. Hence, we have proved the posterior contraction.

Part II. Let $\Pi_G$ induce a prior distribution on symmetric $\Omega$ with each upper triangular element independently following $N(0,1)$. The density of $\Pi_G$ can be written explicitly, where we use $\bar\Omega$ to zero out the lower triangular elements of $\Omega$ except the diagonal part, and $\xi_p$ is the normalizing constant. Recalling the notation $\tilde\Phi$ defined in the proof of Lemma 2.1, we may choose $M', M''$ arbitrarily close to $M$ such that $M' < M < M''$ and $A'_n \subset A_n + 2tn^{-1/2}\tilde\Phi \subset A''_n$; this can always be done because the shift is of smaller order than the radii. Therefore, using the same argument as in the proof of Lemma 2.1, the ratio $\int_{A_n}\exp(\ell_n(\Omega_t))\,d\Pi(\Omega)\big/\int_{A_n}\exp(\ell_n(\Omega))\,d\Pi(\Omega)$ behaves as required by the second condition. This completes the proof.

Proof of Technical Lemmas
Proof of Lemma 6.2. First, we show that $\Omega_t$ is a valid precision matrix on the event $A_n$, i.e., $\Omega_t > 0$. Using Weyl's theorem, we bound the smallest eigenvalue, where the first term is controlled on $A_n$; under the current assumptions, $\Omega_t$ is positive definite. Knowing that $\ell_n(\Omega_t)$ is well defined, we study $\ell_n(\Omega_t) - \ell_n(\Omega)$. Let $\{h_j\}_{j=1}^p$ be the eigenvalues of $\Sigma^{1/2}(\Omega - \Omega_t)\Sigma^{1/2}$. Then a Taylor expansion with an integral remainder of the form $\int_0^1(1-s)^3(\cdot)\,ds$ yields the expansion (17). The proof is complete.
Proof of Lemma 6.4. Define $\Pi_G$ to be the distribution which puts i.i.d. $N(0,1)$ variables on the upper triangular part of $\Omega$ and then takes the lower triangular part to satisfy $\Omega^T = \Omega$. Then, according to the definition of $\Pi$, the two measures are related by renormalization to the support set, for any $B$.
Since $p^2/n = o(1)$, the required bound follows. Calculating with the Gaussian density directly (for example, according to Lemma E.1 in [14]), we obtain the stated bound, where $Z \sim N(0,1)$. The proof is complete by observing that $\|\Omega^*\|_F^2 = o(p^2\log n)$ under the assumption.
uniformly on A n .
Proof. We expand both quantities in the brackets using the general notation $\ell(\mu_t, \Omega_t) - \ell(\mu, \Omega)$. Using a Taylor expansion as in the proof of Lemma 6.2 and the notation $\bar\Sigma$ for the centered sample covariance matrix, we obtain the expansion, where $\{h_j\}_{j=1}^p$ are the eigenvalues of $\Sigma^{1/2}(\Omega - \Omega_t)\Sigma^{1/2}$. The same proof as in Lemma 6.2 implies the corresponding bound. Therefore, we approximate by $\frac{2t\sqrt n}{V}\mathrm{tr}\big((\bar\Sigma - \Sigma)\Phi\big)$, and the approximation error is bounded on $A_n$ under the assumption $p^2/n = o(1)$, where we have used the fact that $\|\Phi\|/V \le C$.
Next, we approximate the linear term involving $\Omega^*\xi_X$; the difference is bounded on $A_n$ using the fact that $\|\xi_X\|/V \le C$. Using the same argument, we can also approximate the term involving $\Omega^*\xi_Y$. Now we approximate the quadratic terms. Using the same argument as in the proof of Lemma 6.2, we obtain the corresponding bounds, and the same bound holds for the $Y$-part. Therefore, the expansion holds on $A_n$. The proof is complete by combining all the approximations above.
uniformly on A n .
uniformly on $A_n$. The remainder of the proof is the same as the proof of Theorem 2.1.
The proof of Theorem 4.3 is very similar to that of Theorem 4.1. We simply state the technical steps in the following lemmas and omit the details.
uniformly on A n .
Lemma A.4. Under the same setting as Lemma A.3, further assume $V^{-1} = O(1)$; then the stated expansion holds uniformly on $A_n$.
Proof of Theorem 4.3. Combining Lemma A.3 and Lemma A.4, we obtain the expansion uniformly on $A_n$. The remainder of the proof is the same as the proof of Theorem 2.1.

B Proof of Theorem 4.2 & Theorem 4.4
In this section, we are going to prove Theorem 4.2 and Theorem 4.4. Due to the similarity of the two theorems, we only present the details of the proof of Theorem 4.4; the proof of Theorem 4.2 will be outlined. By the remark after Theorem 4.4, it is sufficient to check the two conditions in Theorem 4.4 for $X$ and $Y$ separately. Therefore, we only prove the $X$ part and omit the subscript $X$ from now on. Denote the prior for $(\Omega, \mu)$ by $\Pi = \Pi_\Omega \times \Pi_\mu$. The following lemma is a generalization of Lemma 6.5 to the nonzero-mean case.
Lemma B.1. Let $\epsilon$ be any sequence such that $\epsilon \to 0$, and define the set $K_n$ accordingly. Then the stated bound holds for some constant $C > 0$.
Proof. We renormalize the prior $\Pi$ as $\tilde\Pi = \Pi(K_n)^{-1}\Pi$, so that $\tilde\Pi$ is a distribution with support within $K_n$. Write $E_{\tilde\Pi}$ for the expectation under $\tilde\Pi$. Define the random variables $Y_i$ for $i = 1, \ldots, n$, where $c$ is a constant independent of $X_1, \ldots, X_n$; each $Y_i$ is a sub-exponential random variable with a computable mean. Thus, by Jensen's inequality, we obtain the bound, where in the last equality we defined the corresponding quantities for $i = 1, \ldots, n$. By a union bound, we reduce to two terms. In the proof of Lemma 5.1 of [14], we have shown the bound for the first term; hence, it is sufficient to bound the second term. Define $Z_i = \Omega^{*1/2}(X_i - \mu^*)$; then the second term can be written in terms of the $Z_i$, with $a = \Sigma^{*1/2}E_{\tilde\Pi}\big[\Omega(\mu - \mu^*)\big]$. By Bernstein's inequality (see, for example, Proposition 5.16 of [28]), we obtain the bound. Since $\|a\|_\infty \le \|a\| \le \sqrt{C'\epsilon^2}$ and $\epsilon \to 0$, the bound follows, and the conclusion follows from the definition of $K_n$. The following lemma proves prior concentration.
for some constant C > 0.
Proof. We decompose the prior probability, where the first term is lower bounded in Lemma 6.4. It is sufficient to lower bound the probability $\Pi_\mu\big(4\Lambda\|\mu - \mu^*\| \le \epsilon\big)$. By the definition of the Gaussian density, the stated lower bound holds for some constant $C > 0$.
Proof. Use the notation $\epsilon^2 = p^2\log n/n$, and consider the testing function $\phi$ defined by thresholding. Then the type I error is bounded, where $Z \sim N(0, I_{p\times p})$. We also have, for any $(\mu, \Omega)$ in the alternative set, the corresponding type II bound. Finally, it is sufficient to bound $P\big(\|Z\|^2 \ge CM^2 n\epsilon^2\big)$, which follows from Bernstein's inequality. The proof is complete.
Lemma B.4. Assume $\|\Sigma^*\| \vee \|\Omega^*\| \le \Lambda = O(1)$ and $\|\Sigma_1\| \vee \|\Omega_1\| \le 2\Lambda$. There exist small $\delta, \delta', \tilde\delta > 0$ depending only on $\Lambda$ such that, for any $M > 0$, there exists a testing function $\phi$ for which the stated error bounds hold for some constant $C > 0$, whenever $6\Lambda M^2\epsilon^2 \le \tilde\delta\|\Sigma_1 - \Sigma^*\|_F^2$.

Proof. Since the lemma is a slight variation of Lemma 5.9 in [14], we do not write the proof in full detail. We highlight the part where the current form differs from that in [14] and omit the similar parts, whose full details may be found in the proof of Lemma 5.9 of Gao and Zhou [14]. We use the same testing function, and we immediately have the type I bound, as is proved in [14]. Now we are going to bound $P^n_{(\mu,\Omega)}(1 - \phi)$ for every $(\mu, \Omega)$ in the alternative set. Note that we have the decomposition, where it is proved in [14] that the relevant quantity is bounded for some $\tilde\delta$ depending only on $\Lambda$. Using a union bound, [14] showed that the first term is bounded by $2\exp\big(-C\delta'\|\Sigma_1 - \Sigma^*\|_F^2\big)$. It is sufficient to bound the second term to close the proof; this is the only difference between this proof and the one in [14]. By assumption, the second term can be written as a Gaussian tail probability with $Z \sim N(0,1)$ and $a = \Sigma^{1/2}(\Omega^* - \Omega)(\mu - \mu^*)$. Using Hoeffding's inequality (see, for example, Proposition 5.10 of [28]), we obtain the corresponding bound according to the assumption. Thus, the second term is also exponentially small, for some $\delta'$ depending only on $\Lambda$. Therefore, $P^n_{(\mu,\Omega)}(1 - \phi) \le \exp\big(-C\delta'\|\Sigma_1 - \Sigma^*\|_F^2\big)$ for all $(\mu, \Omega)$ in the alternative set, and the proof is complete.
Proof of Theorem 4.2 and Theorem 4.4. According to the remark after Theorem 4.3, $\Pi(A_n \mid X^n, Y^n) = \Pi_X(A_{X,n} \mid X^n)\,\Pi_Y(A_{Y,n} \mid Y^n)$.
Thus, it is sufficient to show that both $\Pi_X(A_{X,n} \mid X^n)$ and $\Pi_Y(A_{Y,n} \mid Y^n)$ converge to 1 in probability. Since they have the same form, we treat them together by omitting the subscripts $X$ and $Y$. The posterior probability $\Pi(A_n^c \mid X^n)$ is given by the usual ratio, where we consider
$$A_n = \Big\{\|\mu - \mu^*\| \le M\sqrt{p^2\log n/n},\ \|\Sigma - \Sigma^*\|_F \le \tilde M\sqrt{p^2\log n/n}\Big\}$$
for some $M$ and $\tilde M$ sufficiently large. We are going to establish a test between the following hypotheses: $H_0 : (\mu, \Omega) = (\mu^*, \Omega^*)$ versus $H_1 : (\mu, \Omega) \in A_n^c \cap \mathrm{supp}(\Pi)$. Decompose $A_n^c$ as $A_n^c = B_{1n} \cup B_{2n}$, where
$$B_{1n} = \Big\{\|\mu - \mu^*\| > M\sqrt{p^2\log n/n}\Big\}, \qquad B_{2n} = \Big\{\|\mu - \mu^*\| \le M\sqrt{p^2\log n/n},\ \|\Sigma - \Sigma^*\|_F > \tilde M\sqrt{p^2\log n/n}\Big\}.$$

D Proof of Lemma 3.2, Lemma 3.3 & Proposition 5.1
Due to the similarity between Lemma 3.2 and Lemma 3.3, we only give the proof of Lemma 3.2. Let us study the linear approximation of the eigenvalue perturbation. In particular, we are going to find the first-order Taylor expansion of $\lambda_m(\Sigma) - \lambda_m(\hat\Sigma)$ and control the error term on the set $A_n$. We have the following spectral decompositions for the three covariance matrices $\Sigma$, $\hat\Sigma$, and $\Sigma^*$:
$$\Sigma = UDU^T, \qquad \hat\Sigma = \hat U\hat D\hat U^T, \qquad \Sigma^* = U^*D^*U^{*T}.$$
Clearly, the cardinality of each subset is bounded by the required quantity. We name the sets we have mentioned by $V_{k+1} = \cup_{l=1}^{k+1}V_{k+1,l}$.
For $l = 1$, we have the desired bound. In the same way, the bound also holds for the other values of $l$. Therefore, the conclusion follows.