Bayesian inference for spectral projectors of the covariance matrix

Let $X_1, \ldots, X_n$ be an i.i.d. sample in $\mathbb{R}^p$ with zero mean and covariance matrix $\mathbf{\Sigma^*}$. Classical principal component analysis estimates the projector $\mathbf{P^*_{\mathcal{J}}}$ onto the direct sum of some eigenspaces of $\mathbf{\Sigma^*}$ by its empirical counterpart $\mathbf{\widehat{P}_{\mathcal{J}}}$. Recent papers [Koltchinskii, Lounici (2017)], [Naumov et al. (2017)] investigate the asymptotic distribution of the Frobenius distance between the projectors, $\| \mathbf{\widehat{P}_{\mathcal{J}}} - \mathbf{P^*_{\mathcal{J}}} \|_2$. The problem arises when one tries to build an effective confidence set for the true projector. We consider the problem from a Bayesian perspective and derive an approximation for the posterior distribution of the Frobenius distance between the projectors. The derived theorems hold for non-Gaussian data: the only assumption we impose is the concentration of the sample covariance $\mathbf{\widehat{\Sigma}}$ in a vicinity of $\mathbf{\Sigma^*}$. The obtained results are applied to the construction of sharp confidence sets for the true projector. Numerical simulations illustrate good performance of the proposed procedure even on non-Gaussian data in a quite challenging regime.


Introduction
Let the observed data $X^n = (X_1, \ldots, X_n)$ be a collection of independent identically distributed zero-mean random vectors in $\mathbb{R}^p$ and let $X$ be a generic element of the sample. Recent results provide an approximation of the form (1.1) for the distribution of the Frobenius distance between the empirical and the true spectral projectors, where $\xi$ is a zero-mean Gaussian vector with a specific covariance structure and $\diamondsuit$ is an explicit error term. A similar approximation is obtained in the bootstrap world; this reduces the original problem to questions about Gaussian comparison and Gaussian anti-concentration for large balls.
This paper suggests looking at this problem from a Bayesian point of view. The standard approach to a nonparametric analysis of the posterior distribution is based on the prominent Bernstein–von Mises (BvM) phenomenon. The BvM result states some pivotal (Gaussian) behavior of the posterior. The paper [8] developed a general framework for a functional BvM theorem, while [35] used similar ideas to demonstrate asymptotic normality of approximately linear functionals of covariance and precision matrices. In particular, it can be used to justify the use of Bayesian credible sets as frequentist confidence sets for the target parameter; see [25,32,15,20,4,7] among others. In this work, we aim to address a similar question specifically for spectral projectors of the covariance matrix. It appears that the general BvM technique can be significantly improved and refined for the problem at hand. The use of the classical conjugate Wishart prior helps not only to build a numerically efficient procedure but also to establish precise finite-sample results for the posterior credible sets under mild and general assumptions on the data distribution. The key observation here is that, similarly to the bootstrap approach of [27], the credible level sets for the posterior are nearly elliptic, and the corresponding posterior probability can be approximated by a generalized chi-squared-type distribution. This allows us to apply the recent "large ball probability" bounds on Gaussian comparison and Gaussian anti-concentration from [16]. Moreover, in contrast to the latter paper [27], we do not require a Gaussian distribution of the data. Our results show that the posterior credible sets can be used as frequentist confidence regions even under possible model misspecification, when the true data-generating measure is not Gaussian.
We still work with the Gaussian likelihood; in this sense our procedure is pseudo-Bayesian, and the constructed credible sets should be referred to as pseudo-posterior credible sets; see [14]. In our study we allow the dimension $p$ to grow with the sample size $n$; however, we need to assume that "$p^3/n$ is small" in order to make our results meaningful.
The main contributions of this paper are as follows.
• We offer a new procedure for building elliptic confidence sets for the true projector based on Bayesian simulation from the Inverse Wishart prior. The procedure is fully data-driven and numerically efficient: its complexity is proportional to the squared dimension and independent of the sample size. Numerical simulations confirm good performance of the proposed method on artificial data, both Gaussian and non-Gaussian (not even sub-Gaussian).
• We establish novel results on the coverage properties of pseudo-posterior credible sets for a complicated non-linear problem of recovering the eigenspace of the sample covariance matrix. The results apply under mild conditions on the data distribution. In particular, we do not require Gaussianity of the observations.

The rest of the paper is structured as follows. Some notations are introduced in Section 2.1. Section 2.2 discusses the model. Our pseudo-Bayesian framework and the main result of the paper about the pseudo-posterior credible sets are described in Section 2.3. The use of such sets as frequentist confidence sets is discussed in Section 2.4. Some numerical results on simulated data are demonstrated in Section 3. Section 4 contains the proofs of the main theorems. Some auxiliary results from the literature and the rest of the proofs are collected in Appendix A and Appendix B, respectively.

Problem and main results
This section explains our setup and states the main results.

Notations
We will use the following notation throughout the paper. The space of real-valued $p \times p$ matrices is denoted by $\mathbb{R}^{p \times p}$, while $\mathbb{S}^+_p$ means the set of positive-semidefinite matrices. We write $I_d$ for the identity matrix of size $d \times d$; $\mathrm{rank}(A)$ and $\mathrm{Tr}(B)$ stand for the rank of a matrix $A$ and the trace of a square matrix $B$. Further, $\|A\|_\infty$ stands for the spectral norm of a matrix $A$, while $\|A\|_1$ means the nuclear norm. The Frobenius scalar product of two matrices $A$ and $B$ of the same size is $\langle A, B \rangle_2 \stackrel{\mathrm{def}}{=} \mathrm{Tr}(A^\top B)$, while the Frobenius norm is denoted by $\|A\|_2$. When applied to a vector, $\|\cdot\|$ means just its Euclidean norm. The effective rank of a square matrix $B$ is defined by $\mathtt{r}(B) \stackrel{\mathrm{def}}{=} \mathrm{Tr}(B) / \|B\|_\infty$. The relation $a \lesssim b$ means that there exists an absolute constant $C$, different from line to line, such that $a \le C b$, while $a \asymp b$ means that $a \lesssim b$ and $b \lesssim a$. By $a \vee b$ and $a \wedge b$ we mean the maximum and the minimum of $a$ and $b$, respectively. In the sequel we will often be considering intersections of events of probability greater than $1 - 1/n$. Without loss of generality, we will write that the probability measure of such an intersection is $1 - 1/n$, since this can be easily achieved by adjusting constants. Throughout the paper we assume that $p < n$.
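To make the matrix functionals above concrete, here is a minimal NumPy sketch of the three norms and the effective rank (the helper name `effective_rank` is ours, not the paper's):

```python
import numpy as np

def effective_rank(B):
    """Effective rank r(B) = Tr(B) / ||B||_inf (trace over spectral norm)."""
    return np.trace(B) / np.linalg.norm(B, 2)

A = np.diag([4.0, 1.0, 1.0, 1.0])
spectral = np.linalg.norm(A, 2)       # ||A||_inf: largest singular value, 4.0
nuclear = np.linalg.norm(A, 'nuc')    # ||A||_1: sum of singular values, 7.0
frob = np.linalg.norm(A, 'fro')       # ||A||_2: sqrt(16 + 3) = sqrt(19)
r = effective_rank(A)                 # 7 / 4 = 1.75, between 1 and rank(A) = 4
```

Note that $\mathtt{r}(B)$ can be much smaller than $\mathrm{rank}(B)$ when the spectrum decays quickly, which is what makes it the right complexity measure for covariance estimation.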

Setup and problem
Let $X_1, \ldots, X_n$ be i.i.d. zero-mean random vectors with $\mathrm{Var}(X) = \Sigma^*$. Without loss of generality, we can assume that $\Sigma^* \in \mathbb{S}^+_p$ is invertible; otherwise one can easily transform the data in such a way that the covariance matrix of the transformed data is invertible. Let $\sigma^*_1 \ge \ldots \ge \sigma^*_p$ be the ordered eigenvalues of $\Sigma^*$. Suppose that among them there are $q$ distinct eigenvalues $\mu^*_1 > \ldots > \mu^*_q$. Introduce the groups of indices $\Delta^*_r = \{j : \mu^*_r = \sigma^*_j\}$ and denote by $m^*_r$ the multiplicity factor (dimension) $|\Delta^*_r|$ for all $r \in \{1, \ldots, q\}$. The corresponding eigenvectors are denoted by $u^*_1, \ldots, u^*_p$. We will use the projector onto the $r$-th eigenspace of dimension $m^*_r$:
$P^*_r \stackrel{\mathrm{def}}{=} \sum_{j \in \Delta^*_r} u^*_j (u^*_j)^\top.$
We also introduce the spectral gaps $g^*_r$:
$g^*_r \stackrel{\mathrm{def}}{=} \min\{\mu^*_{r-1} - \mu^*_r,\ \mu^*_r - \mu^*_{r+1}\}. \quad (2.1)$
Suppose that $\widehat{\Sigma}$ has $p$ eigenvalues $\widehat{\sigma}_1 > \ldots > \widehat{\sigma}_p$ (distinct with probability one). The corresponding eigenvectors are denoted by $\widehat{u}_1, \ldots, \widehat{u}_p$. Suppose that $\widehat{\Sigma}$ is sufficiently close to $\Sigma^*$ in the spectral norm. Then, as shown in [21], we can identify clusters of the eigenvalues of $\widehat{\Sigma}$ corresponding to each eigenvalue of $\Sigma^*$ and therefore determine $\Delta^*_r$ and $m^*_r$ for all $r \in \{1, \ldots, q\}$. Then we can define the sample projector onto the $r$-th eigenspace of dimension $m^*_r$:
$\widehat{P}_r \stackrel{\mathrm{def}}{=} \sum_{j \in \Delta^*_r} \widehat{u}_j \widehat{u}_j^\top.$
Under the condition that the spectral gap is sufficiently large, [27] approximated the distribution of $n \|\widehat{P}_r - P^*_r\|_2^2$ by the distribution of a Gaussian quadratic form $\|\xi\|^2$ with $\xi \sim \mathcal{N}(0, \Gamma^*_r)$, where $\Gamma^*_r$ is a block matrix whose explicit form is given in [27]. Below we extend these results in two aspects. First, our approach allows us to pick a block of eigenspaces corresponding to an interval $\mathcal{J}$ in $\{1, \ldots, q\}$ from $r_-$ to $r_+$. Second, we relax the assumption of Gaussianity of the data. Let $\mathcal{J} = \{r_-, r_- + 1, \ldots, r_+\}$. Define also the subset of indices $I_{\mathcal{J}} \stackrel{\mathrm{def}}{=} \bigcup_{r \in \mathcal{J}} \Delta^*_r$ and introduce the projector onto the direct sum of the eigenspaces associated with $P^*_r$ for all $r \in \mathcal{J}$:
$P^*_{\mathcal{J}} \stackrel{\mathrm{def}}{=} \sum_{r \in \mathcal{J}} P^*_r.$
Its empirical counterpart is given by
$\widehat{P}_{\mathcal{J}} \stackrel{\mathrm{def}}{=} \sum_{r \in \mathcal{J}} \widehat{P}_r.$
For instance, when $\mathcal{J} = \{1, \ldots, q_{\mathrm{eff}}\}$ for some $q_{\mathrm{eff}} < q$, $\widehat{P}_{\mathcal{J}}$ is exactly what is recovered by PCA. Below we focus on $n \|\widehat{P}_{\mathcal{J}} - P^*_{\mathcal{J}}\|_2^2$ rather than $n \|\widehat{P}_r - P^*_r\|_2^2$.
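The projectors $P^*_{\mathcal{J}}$ and $\widehat{P}_{\mathcal{J}}$ are formed directly from an eigendecomposition. The following NumPy sketch (the helper name `projector` and all sizes are ours, chosen for illustration) builds both for a top eigenvalue cluster of multiplicity two:

```python
import numpy as np

def projector(S, idx):
    """Orthogonal projector onto the span of the eigenvectors of S whose
    eigenvalues have ranks `idx` (0-based, eigenvalues in decreasing order)."""
    w, U = np.linalg.eigh(S)                  # ascending eigenvalues
    V = U[:, np.argsort(w)[::-1][list(idx)]]  # eigenvectors of the cluster
    return V @ V.T

rng = np.random.default_rng(0)
p, n = 5, 2000
Sigma_star = np.diag([10.0, 10.0, 3.0, 1.0, 1.0])  # top eigenvalue, multiplicity 2
X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
Sigma_hat = X.T @ X / n                             # sample covariance (zero mean)

P_star = projector(Sigma_star, [0, 1])  # projector onto the top eigenspace
P_hat = projector(Sigma_hat, [0, 1])    # its empirical counterpart
stat = n * np.linalg.norm(P_hat - P_star, 'fro') ** 2   # n ||P_hat - P_star||_2^2
```

Even though the individual eigenvectors inside a degenerate cluster are not identifiable, the projector $V V^\top$ onto their span is, which is why the analysis is stated in terms of projectors.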

I. Silin and V. Spokoiny
The projector dimension for $\mathcal{J}$ is given by $m^*_{\mathcal{J}} = \sum_{r \in \mathcal{J}} m^*_r$. Its spectral gap can be defined, for $\mathcal{J} = \{r_-, r_- + 1, \ldots, r_+\}$, as
$g^*_{\mathcal{J}} \stackrel{\mathrm{def}}{=} \min\{\mu^*_{r_- - 1} - \mu^*_{r_-},\ \mu^*_{r_+} - \mu^*_{r_+ + 1}\}.$
It is easy to notice that when $\mathcal{J} = \{r\}$ this definition coincides with (2.1).
Our results apply under one rather mild and natural condition on the distribution of $X^n = (X_1, \ldots, X_n)$: the sample covariance matrix $\widehat{\Sigma}$ concentrates around the true covariance $\Sigma^*$ in the sense that
$\|\widehat{\Sigma} - \Sigma^*\|_\infty \le \delta_n \|\Sigma^*\|_\infty \quad (2.3)$
with probability $1 - 1/n$. Note that this condition does not use, say, independence of the $X_i$'s or the zero-mean property; everything else in our analysis is done conditioned on the data $X^n$. The value $\delta_n$ clearly depends on the underlying data distribution, but this formulation allows us to work with much wider classes of probability measures than just Gaussian or sub-Gaussian. For the Gaussian case one may take $\delta_n \asymp \sqrt{(\mathtt{r}(\Sigma^*) + \log(n))/n}$. Several more examples of possible distributions and the corresponding $\delta_n$ are provided in Appendix A; see Theorem A.1. So, throughout the rest of the paper we assume that the data satisfy condition (2.3).

Pseudo-Bayesian framework and credible level sets
Let $\Pi$ be a prior distribution on the set of considered covariance matrices $\Sigma$. Even though our data are not necessarily Gaussian, we can consider the Gaussian log-likelihood
$\ell_n(\Sigma) \stackrel{\mathrm{def}}{=} -\frac{n}{2} \log\det(\Sigma) - \frac{1}{2} \sum_{i=1}^n X_i^\top \Sigma^{-1} X_i$
(up to an additive constant). In the case of Gaussian data, the posterior measure of a set $A \subset \mathbb{S}^+_p$ can be expressed as
$\Pi(A \mid X^n) = \int_A \exp\{\ell_n(\Sigma)\}\, d\Pi(\Sigma) \Big/ \int \exp\{\ell_n(\Sigma)\}\, d\Pi(\Sigma).$
However, we can study this random measure for non-Gaussian data as well. As the Gaussian log-likelihood $\ell_n(\Sigma)$ does not necessarily correspond to the true distribution of our data, we call the random measure $\Pi(\cdot \mid X^n)$ a pseudo-posterior, [14]. Once a prior is fixed, we can easily sample matrices $\widetilde{\Sigma}$ from this pseudo-posterior distribution. Denote the eigenvalues of $\widetilde{\Sigma}$ by $\widetilde{\sigma}_1 > \ldots > \widetilde{\sigma}_p$ (assume they are distinct with probability one) and the eigenvectors by $\widetilde{u}_1, \ldots, \widetilde{u}_p$. The corresponding projector onto the $r$-th eigenspace of dimension $m^*_r$ is
$\widetilde{P}_r \stackrel{\mathrm{def}}{=} \sum_{j \in \Delta^*_r} \widetilde{u}_j \widetilde{u}_j^\top,$
and the projector onto the direct sum of the eigenspaces associated with $\widetilde{P}_r$ for $r \in \mathcal{J}$ is
$\widetilde{P}_{\mathcal{J}} \stackrel{\mathrm{def}}{=} \sum_{r \in \mathcal{J}} \widetilde{P}_r.$
In this work we focus on the conjugate prior to the multivariate Gaussian distribution, that is, the Inverse Wishart distribution $\mathcal{IW}_p(G, p + b - 1)$ with $G \in \mathbb{S}^+_p$, $0 < b \lesssim p$. Its density is proportional to
$\det(\Sigma)^{-(2p+b)/2} \exp\{-\mathrm{Tr}(G \Sigma^{-1})/2\}.$
Some nice properties of the Inverse Wishart prior distribution allow us to obtain the following result, which we will use for the uncertainty quantification statements in the next section instead of the Bernstein–von Mises theorem.
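Sampling from this pseudo-posterior is routine thanks to conjugacy: with an Inverse Wishart prior, the pseudo-posterior is again Inverse Wishart with updated scale $G + n\widehat{\Sigma}$ and updated degrees of freedom. A minimal SciPy sketch follows; all sizes are illustrative, and note that SciPy's `df` convention for `invwishart` may be shifted by a constant relative to the paper's parameterization:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)
p, n, b = 4, 500, 1
G = np.eye(p)                                  # prior scale matrix
Sigma_star = np.diag([8.0, 4.0, 2.0, 1.0])
X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
Sigma_hat = X.T @ X / n

# Conjugacy: the pseudo-posterior is Inverse Wishart with scale G + n * Sigma_hat
# and degrees of freedom n + p + b - 1 (up to the df-convention caveat above).
post = invwishart(df=n + p + b - 1, scale=G + n * Sigma_hat)
draws = post.rvs(size=1000, random_state=rng)  # shape (1000, p, p)
post_mean = draws.mean(axis=0)                 # concentrates near Sigma_hat
```

For large $n$ the pseudo-posterior concentrates around $\widehat{\Sigma}$, so the draws can be passed through the eigendecomposition map to obtain samples of $\widetilde{P}_{\mathcal{J}}$.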

Theorem 2.1. Consider the prior $\Pi(\Sigma)$ given by the Inverse Wishart distribution $\mathcal{IW}_p(G, p + b - 1)$. Then, with probability $1 - 1/n$,

Gaussian approximation and frequentist uncertainty quantification for spectral projectors
For Gaussian data, Theorem 4.3 of [27] provides the explicit error bound (1.1) with the error term $\diamondsuit$ of the following form: The goal of this section is to extend this result to the case of a generalized spectral cluster and of non-Gaussian data. Before formulating the result, let us introduce the following auxiliary matrices Coupled with Theorem A.1, (ii), this allows us to bound the error term $\diamondsuit$ in the sub-Gaussian case as where, for simplicity, the characteristics of $\Sigma^*$ are hidden in the constants.
The proof of this result is presented in Appendix B. The obtained bound is worse than (2.5) when the full dimension $p$ is large. This is the price for the Gaussian approximation, which appears for non-Gaussian data. Our result makes use of the Gaussian approximation technique from [2]. Some recent developments in Gaussian approximation for the probability of a ball indicate that the bound (2.6) can be improved even further; see [30]. Comparison of the results of Theorem 2.1 and Theorem 2.2 reveals that the pseudo-posterior distribution of $n \|\widetilde{P}_{\mathcal{J}} - \widehat{P}_{\mathcal{J}}\|_2^2$ given the data perfectly mimics the distribution of $n \|\widehat{P}_{\mathcal{J}} - P^*_{\mathcal{J}}\|_2^2$ and, therefore, can be applied to the construction of elliptic confidence sets for the true projector. Specifically, for any significance level $\alpha \in (0; 1)$ (or confidence level $1 - \alpha$) we can estimate the true quantile by the following counterpart, which can be numerically assessed using Bayesian credible sets: Then the main results presented above imply the following corollary.
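The resulting confidence-set recipe can be sketched end to end: estimate the quantile $\gamma^{\circ}_\alpha$ by the empirical $(1-\alpha)$-quantile of the pseudo-posterior statistic and declare the elliptic set $\{P : n \|\widehat{P}_{\mathcal{J}} - P\|_2^2 \le \gamma^{\circ}_\alpha\}$. The helper name, sizes, and seed below are ours, and we take $b = 1$, $G = I_p$ as in the experiments of Section 3:

```python
import numpy as np
from scipy.stats import invwishart

def projector(S, k):
    """Projector onto the span of the top-k eigenvectors of S."""
    w, U = np.linalg.eigh(S)
    V = U[:, np.argsort(w)[::-1][:k]]
    return V @ V.T

rng = np.random.default_rng(2)
p, n, k, alpha = 5, 1000, 1, 0.05
Sigma_star = np.diag([10.0, 3.0, 2.0, 1.0, 1.0])
X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
Sigma_hat = X.T @ X / n
P_hat = projector(Sigma_hat, k)

# credible quantile: (1 - alpha)-quantile of n * ||P_tilde - P_hat||_F^2
draws = invwishart(df=n + p, scale=np.eye(p) + n * Sigma_hat).rvs(
    size=2000, random_state=rng)
stats = [n * np.linalg.norm(projector(S, k) - P_hat, 'fro') ** 2 for S in draws]
gamma = np.quantile(stats, 1 - alpha)

# the elliptic confidence set is {P : n * ||P_hat - P||_F^2 <= gamma};
# membership of the true projector can then be checked directly
P_star = projector(Sigma_star, k)
covered = n * np.linalg.norm(P_hat - P_star, 'fro') ** 2 <= gamma
```

By the coverage results above, `covered` should hold with frequency close to $1 - \alpha$ over repeated data sets, which is exactly what the simulations in Section 3 measure.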

Numerical experiments
This section shows by means of artificial data that the proposed pseudo-Bayesian approach works quite well even for large data dimension and limited sample size.
We also want to track how the quality depends on the sample size $n$ and the dimension $p$. In our experiments we first fix some true covariance matrix $\Sigma^*$ of size $p \times p$. Without loss of generality we consider only diagonal matrices $\Sigma^*$, so $\Sigma^*$ is defined by the distinct eigenvalues $\mu^*_r$ and the multiplicities $m^*_r$. We also specify the desired subspace that we want to investigate by fixing $\mathcal{J}$. Further, for different sample sizes $n$ we repeat the following two-step procedure.

The first step is to determine the quantiles of $n \|\widehat{P}_{\mathcal{J}} - P^*_{\mathcal{J}}\|_2^2$. For that we generate 3000 samples $X^n$, compute the corresponding $\widehat{P}_{\mathcal{J}}$ and then take the $\alpha$-quantiles of the obtained realizations of $n \|\widehat{P}_{\mathcal{J}} - P^*_{\mathcal{J}}\|_2^2$ for $\alpha$ from 0.001 to 0.999 with step 0.001.

The second step is to estimate the quantiles of the pseudo-posterior distribution of $n \|\widetilde{P}_{\mathcal{J}} - \widehat{P}_{\mathcal{J}}\|_2^2$. We generate 50 samples $X^n$ and for each realization we generate 3000 pseudo-posterior covariance matrices $\widetilde{\Sigma}$ from the Inverse Wishart distribution with $G = I_p$, $b = 1$. Then we compute the corresponding $\widetilde{P}_{\mathcal{J}}$ and take the $\alpha$-quantiles of $n \|\widetilde{P}_{\mathcal{J}} - \widehat{P}_{\mathcal{J}}\|_2^2$ just as in the first step. Namely, for each $\alpha$ we get 50 quantile estimates $\gamma^{\circ}_\alpha(j)$, $j \in \{1, \ldots, 50\}$ (ordered in ascending order), and take their median $\gamma^{\circ}_\alpha(25)$.

For the true quantiles from the first step and the medians of the quantile estimates from the second step we build a QQ-plot, which consists of the points with coordinates $(\gamma_\alpha, \gamma^{\circ}_\alpha(25))$ for various $\alpha$. We expect the constructed QQ-plot to be close to the identity line, indicating that the two distributions are close to each other. We also present a table with the median coverage probabilities and the interquartile ranges of the coverage probability for the desired confidence levels $1 - \alpha$. In the first experiment:
• $\mathcal{J} = \{1\}$, so we investigate the one-dimensional principal subspace given by $P^*_1$.
The QQ-plots are depicted in Figure 1, while the coverage probabilities and the interquartile ranges are presented in Table 1.
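A scaled-down sketch of this two-step procedure (far smaller $p$, $n$, and replication counts than in the paper, one data set instead of 50, and illustrative names throughout) might look like:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(3)
p, n, k = 5, 300, 1
Sigma_star = np.diag([10.0, 3.0, 2.0, 1.0, 1.0])
alphas = np.linspace(0.05, 0.95, 19)

def projector(S, k):
    w, U = np.linalg.eigh(S)
    V = U[:, np.argsort(w)[::-1][:k]]
    return V @ V.T

P_star = projector(Sigma_star, k)

def sample_cov_and_stat():
    X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
    S = X.T @ X / n
    return S, n * np.linalg.norm(projector(S, k) - P_star, 'fro') ** 2

# Step 1: "true" quantiles of n * ||P_hat - P_star||_F^2 over Monte-Carlo samples
true_stats = [sample_cov_and_stat()[1] for _ in range(400)]
true_q = np.quantile(true_stats, alphas)

# Step 2: pseudo-posterior quantiles of n * ||P_tilde - P_hat||_F^2, one data set
Sigma_hat = sample_cov_and_stat()[0]
P_hat = projector(Sigma_hat, k)
draws = invwishart(df=n + p, scale=np.eye(p) + n * Sigma_hat).rvs(
    size=400, random_state=rng)
post_stats = [n * np.linalg.norm(projector(S, k) - P_hat, 'fro') ** 2
              for S in draws]
post_q = np.quantile(post_stats, alphas)
# a QQ-plot of (true_q, post_q) should hug the identity line
```

With the paper's full replication counts, the second step would be repeated over 50 data sets and the pointwise median of the posterior quantiles plotted against the true quantiles.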
The setup of this experiment is exactly the same as the second example of [27], so the performance of our pseudo-Bayesian method can be directly compared with the performance of the Bootstrap approach (cf. Figure 2 and Table 2 of [27]). The accuracy of these two procedures is approximately the same.
In the second experiment we check how our method performs on non-Gaussian data. We generate each component of the vectors $X_j$ independently, yielding a diagonal covariance matrix. In addition to the Gaussian distribution, we also consider the following three options: the uniform distribution on the interval $[-a; a]$, the Laplace distribution with scaling parameter $a$, and the discrete uniform distribution on the three values $\{-a, 0, a\}$. In each case the parameter $a$ is chosen to match the variance fixed earlier on the diagonal of the covariance matrix. So, the parameters of the experiment are as follows:
• $p = 100$, $m^*_1 = 3$, $m^*_2 = 3$, $m^*_3 = 3$, and the rest of the multiplicities $m^*_4, \ldots, m^*_{91}$ are one.
• $\mu^*_1 = 25$, $\mu^*_2 = 20$, $\mu^*_3 = 15$, $\mu^*_4 = 10$, $\mu^*_5 = 7.5$, $\mu^*_6 = 5$, and the rest of the eigenvalues $\mu^*_7, \ldots, \mu^*_{100}$ are drawn from the uniform distribution on $[0; 3]$.
• The first nine components were generated according to the uniform, Laplace, discrete, Gaussian, Laplace, discrete, Laplace, Laplace, and uniform distributions, respectively. The rest of the components are Gaussian.
• $\mathcal{J} = \{1, 2, 3\}$, so we investigate the nine-dimensional subspace given by $P^*_1 + P^*_2 + P^*_3$.
The QQ-plots are depicted in Figure 2, while the coverage probabilities and the interquartile ranges are presented in Table 2. The performance of the proposed procedure is very good except in the case when the sample size is of the same order as the dimension. However, this regime lies beyond the scope of our results. If we have enough data, the method demonstrates very good results even in such challenging situations as recovering a direct sum of several subspaces from non-Gaussian (even not sub-Gaussian) data.
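The variance-matching choices of the scale parameter $a$ follow from the standard variance formulas of the three laws; a minimal sketch (the helper name and target variance are ours):

```python
import numpy as np

def sample_component(dist, var, size, rng):
    """Draw i.i.d. zero-mean values with prescribed variance `var`;
    the scale parameter a is matched to the variance in each case."""
    if dist == "uniform":        # uniform on [-a, a]: Var = a^2 / 3
        a = np.sqrt(3.0 * var)
        return rng.uniform(-a, a, size)
    if dist == "laplace":        # Laplace with scale a: Var = 2 a^2
        a = np.sqrt(var / 2.0)
        return rng.laplace(0.0, a, size)
    if dist == "discrete":       # uniform on {-a, 0, a}: Var = 2 a^2 / 3
        a = np.sqrt(1.5 * var)
        return rng.choice([-a, 0.0, a], size=size)
    return rng.normal(0.0, np.sqrt(var), size)  # Gaussian fallback

rng = np.random.default_rng(4)
n = 200_000
variances = {d: sample_component(d, 7.5, n, rng).var()
             for d in ["uniform", "laplace", "discrete", "gaussian"]}
# every empirical variance should be close to the target 7.5
```

Matching the variances this way keeps $\Sigma^*$ identical across distributions, so any difference in coverage is attributable to the shape of the law rather than its scale.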

Main proofs
This section collects the proofs of the main results. Some additional technical statements are postponed to the Appendix.

Proof of Theorem 2.1
The Inverse Wishart prior $\mathcal{IW}_p(G, p + b - 1)$ is conjugate to the multivariate Gaussian distribution, so our pseudo-posterior $\Pi(\Sigma \mid X^n)$ is $\mathcal{IW}_p(G + n\widehat{\Sigma}, n + p + b - 1)$. We will actively use the following well-known property of the Wishart distribution: For brevity, in this section we will use the notation $n_p \stackrel{\mathrm{def}}{=} n + p + b - 1$, and we assume that $b \lesssim p$. As we will see, this assumption helps us to simplify the bounds, while the case $b \gtrsim p$ does not bring any gain. Moreover, define where $Z_j \mid X^n \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_p)$. Then $\widetilde{\Sigma}^{-1} \mid X^n$ can be represented as We may think that in the "posterior" world all randomness comes from $E_{n,p}$. Moreover, due to Theorem A.1, (i), there is a random set $\Upsilon$ such that on this set
$\|E_{n,p}\|_\infty \lesssim \sqrt{(\log(n_p) + p)/n_p} \lesssim \sqrt{(\log(n) + p)/n},$
and its pseudo-posterior measure is at least $1 - 1/n$.

Step 1. First, we will need the following lemma.
Lemma 4.1. The following holds on the random set Υ : Note that Hence, Finally, the observations that finish the proof.
The condition on the significant spectral gap for $\Sigma^*$ and the bound (2.3) on the operator norm of $\widehat{\Sigma} - \Sigma^*$ imply a significant spectral gap for the empirical covariance $\widehat{\Sigma}$. The crucial Lemma A.2, applied with the central projector $\widehat{P}_{\mathcal{J}}$ in place of $P^*_{\mathcal{J}}$, allows us to obtain a bound on how close the linear operator

Lemma 4.2.
The following holds on the random set $\Upsilon$: where helps to obtain the next result showing that $L_{\mathcal{J}}(\widetilde{\Sigma} - \widehat{\Sigma})$ can be approximated

Lemma 4.3. It holds
where the remainder $R_{\mathcal{J}}$ fulfills on the random set $\Upsilon$ Proof. Define $R_{n,p}$ by Its spectral norm can be bounded as Therefore for $\widetilde{\Sigma} - \widehat{\Sigma}$ we have From $\Sigma^{1/2}_{n,p} E_{n,p} \Sigma^{1/2}_{n,p}$ we pass to $\widetilde{\Sigma}$ where we introduce the remainder terms They can be bounded as Hence, omitting higher order terms, on $\Upsilon$ we have

Moreover, which provides the desired bound. Similarly, we have where the last inequality holds on Υ .
The results of Lemmas 4.2 and 4.3 yield, on the random set $\Upsilon$,
$\sqrt{n}\, \| \widetilde{P}_{\mathcal{J}} - \widehat{P}_{\mathcal{J}} - S_{\mathcal{J}} \|_2 \lesssim \Delta_0 + \Delta_1.$
In addition, Thus, taking into account the bound for $\|S_{\mathcal{J}}\|_2$ and neglecting higher order terms, on $\Upsilon$ we obtain where

Step 2. The norm $n \|S_{\mathcal{J}}\|_2^2$ can be decomposed as follows: for $k \in I_{\mathcal{J}}$, $l \notin I_{\mathcal{J}}$, ordered in some particular way that will become clear later.
Note that $n \|S_{\mathcal{J}}\|_2^2 = \|\xi_{\mathcal{J}}\|^2$. Clearly, for each $k \le p$ and $j \le n_p$ Then the components can be rewritten as for $k \in I_{\mathcal{J}}$, $l \notin I_{\mathcal{J}}$. To understand the covariance structure of $\xi_{\mathcal{J}}$, consider one more pair $(k', l')$ and investigate the covariance: with $\delta_{k,k'} = 1(k = k')$. Therefore, the covariance matrix of $\xi_{\mathcal{J}}$ is diagonal: Let us fix arbitrary $k \in I_{\mathcal{J}}$, $l \notin I_{\mathcal{J}}$ and upper-bound the corresponding term of the sum. We will extensively use $|\widehat{\sigma}$ Since $n/n_p \le 1$, the first term is controlled by where we introduced $\varepsilon_k = \widehat{\sigma}_k - \sigma^*_k$ and $\varepsilon_l = \widehat{\sigma}_l - \sigma^*_l$. Then, crossing out the term $\sigma^*_k \sigma^*_l (\sigma^*_k - \sigma^*_l)^2$ in the numerator, we obtain On the event where $\|\widehat{\Sigma} - \Sigma^*\|_\infty \le g^*_{\mathcal{J}}/4$ we have $|\varepsilon_k|, |\varepsilon_l| \le |\sigma^*_k - \sigma^*_l|/4$, therefore we can omit the terms $\varepsilon_k \varepsilon_l$ and $(\varepsilon_k - \varepsilon_l)^2$, paying a constant factor for that: As to the denominator, again considering the event where $\|\widehat{\Sigma} - \Sigma^*\|_\infty$ Hence, As to the term The last inequality uses the fact that $|\sigma^*_k - \sigma^*_l| \le \|\Sigma^*\|_\infty$, which is rather crude; however, it allows us to write the final bound in a convenient form and does not worsen the result.
Putting this all together, we get which provides the desired result once we notice that it can be bounded by both $2p\, \mathrm{Tr}(\Sigma^{*2})$ and $2p\, m^*_{\mathcal{J}} \|\Sigma^*\|_\infty$.
Unfortunately, the entries $\xi_{k,l}$ of $\xi_{\mathcal{J}}$ are not Gaussian because of the product $\eta_{k,j} \eta_{l,j}$. This does not allow us to apply the Gaussian comparison Lemma A.4. To get around this issue, we condition on $\widetilde{P}_{\mathcal{J}} Z$. Namely, in the "posterior" world the random vectors $\widetilde{P}_{\mathcal{J}} Z_j$ and $(I_p - \widetilde{P}_{\mathcal{J}}) Z_j$ are Gaussian and uncorrelated, therefore independent, so we can condition on $Z_{\mathcal{J}} \stackrel{\mathrm{def}}{=} (\widetilde{P}_{\mathcal{J}} Z_1, \ldots, \widetilde{P}_{\mathcal{J}} Z_{n_p})$ to get that $S_{\mathcal{J}}$ is, conditionally on $X^n, Z_{\mathcal{J}}$, a Gaussian random vector with the covariance matrix It holds similarly to the above with probability one. Now we combine the obtained bounds. For $\Delta_2$ defined above and arbitrary $x \in \mathbb{R}$ it holds Since $n \|S_{\mathcal{J}}\|_2^2 \mid X^n \stackrel{d}{=} \|\xi_{\mathcal{J}}\|^2 \mid X^n$, $\bar{\Delta}_2 \lesssim \Delta_2$ with probability $1 - 1/n$, and taking (4.2) into account, we deduce with probability $1 - 1/n$. Subtracting $\mathbb{P}(\|\xi_{\mathcal{J}}\|^2 \le x)$ and taking the supremum of both sides, we get The first term in the right-hand side is bounded by By definition of $\gamma^{\circ}_\alpha$ the previous two inequalities yield Hence, Finally, these inequalities imply the following bound which concludes the proof.
(i) Gaussian distribution $\mathcal{N}(0, \Sigma^*)$. In this case, define $\delta_n$ as $\delta_n \asymp \sqrt{(\mathtt{r}(\Sigma^*) + \log(n))/n}$;
(ii) sub-Gaussian distribution. In this case, define $\delta_n$ as $\delta_n \asymp \sqrt{(p + \log(n))/n}$;
(iii) a distribution supported in some centered Euclidean ball of radius $R$. In this case, define $\delta_n$ as $\delta_n \asymp R \sqrt{\log(n)/(n \|\Sigma^*\|_\infty)}$;
(iv) log-concave probability measure. In this case, define $\delta_n$ as $\delta_n \asymp \sqrt{\log^6(n)\, p/n}$.
Then in all the cases above the following concentration result for $\widehat{\Sigma}$ holds with the corresponding $\delta_n$: $\|\widehat{\Sigma} - \Sigma^*\|_\infty \le \delta_n \|\Sigma^*\|_\infty$ with probability at least $1 - 1/n$.

Proof. (i) See [22], Corollary 2. (ii) This is a well-known simple result presented in a range of papers and lecture notes; see, e.g., [29], Theorem 4.6. (iii) See [33], Corollary 5.52. Usually the radius $R$ is taken such that

The following lemma is a crucial tool when working with spectral projectors.
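The concentration phenomenon behind condition (2.3) is easy to observe numerically. A minimal check for Gaussian data (sizes and the helper name are illustrative) compares the spectral-norm deviation of the sample covariance for a small and a large sample:

```python
import numpy as np

rng = np.random.default_rng(5)
p = 20
Sigma_star = np.eye(p)

def spectral_deviation(n):
    """||Sigma_hat - Sigma*||_inf for one standard Gaussian sample of size n."""
    X = rng.standard_normal((n, p))
    S = X.T @ X / n
    return np.linalg.norm(S - Sigma_star, 2)

d_small, d_large = spectral_deviation(500), spectral_deviation(50_000)
# the deviation shrinks roughly like sqrt(p / n) as n grows
```

For $\Sigma^* = I_p$ the effective rank equals $p$, so cases (i) and (ii) give the same rate here; a faster-decaying spectrum would make the Gaussian bound (i) strictly better.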
One more lemma from [16] describes how close the distributions of the norms of two Gaussian elements are in terms of their covariance operators. Note that the bound is dimension-free.

Lemma A.4 (Gaussian comparison). Let $\xi$ and $\eta$ be Gaussian elements in a Hilbert space $H$ with zero mean and covariance operators $\Sigma_\xi$ and $\Sigma_\eta$, respectively. Then the following inequality holds: Proof. See [16], Theorem 2.1.
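The qualitative content of the lemma, namely that close covariance operators force close distributions of the squared norms, can be observed empirically. The following sketch (all sizes, the perturbation, and the grid are our illustrative choices) estimates the sup-distance between the CDFs of $\|\xi\|^2$ and $\|\eta\|^2$ for two nearby covariances:

```python
import numpy as np

rng = np.random.default_rng(6)
p, N = 10, 100_000
cov_xi = np.diag(np.linspace(2.0, 1.0, p))
cov_eta = cov_xi + 0.01 * np.eye(p)       # small perturbation of the covariance

xi = rng.multivariate_normal(np.zeros(p), cov_xi, size=N)
eta = rng.multivariate_normal(np.zeros(p), cov_eta, size=N)
sq_xi = np.sort((xi ** 2).sum(axis=1))    # realizations of ||xi||^2
sq_eta = np.sort((eta ** 2).sum(axis=1))  # realizations of ||eta||^2

# empirical sup-distance between the CDFs of ||xi||^2 and ||eta||^2
grid = np.linspace(0.0, 60.0, 601)
F_xi = np.searchsorted(sq_xi, grid) / N
F_eta = np.searchsorted(sq_eta, grid) / N
sup_dist = np.abs(F_xi - F_eta).max()     # small when the covariances are close
```

Shrinking the perturbation drives `sup_dist` toward the pure Monte-Carlo noise level, in line with the dimension-free bound of the lemma.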

B.1. Proof of Theorem 2.2
The proof consists of three steps.