Robust PCA and pairs of projections in a Hilbert space

Abstract: This is a study of principal component analysis performed on a statistical sample. We assume that this sample is made of independent copies of some random variable ranging in a separable real Hilbert space, which covers data in function spaces as well as data represented in reproducing kernel Hilbert spaces. Based on some new inequalities about the perturbation of nonnegative self-adjoint operators, we provide new bounds on the statistical fluctuations of the principal component representation under the draw of the statistical sample. We suggest two kinds of improvements to decrease these fluctuations. The first is to use a robust estimate of the covariance operator, for which non-asymptotic bounds on the estimation error are available under weak polynomial moment assumptions. The second is to use a modification of the projection on the principal components based on functional calculus applied to the covariance operator; using this modified projection, we can obtain bounds that do not depend on the spectral gap but on a more favorable factor. In the appendix, we provide a new approach to the analysis of the relative positions of two orthogonal projections that is useful for our proofs and has an interest of its own.


Introduction
Principal Component Analysis (PCA) is a classical tool for dimensionality reduction that relies on the spectral properties of the covariance matrix. In this paper we consider a data set (X 1 , . . . , X n ) made of independent copies of a random variable X taking its values in a real separable Hilbert space H.
The basic idea of PCA is to reduce the dimensionality of X by projecting it onto a finite-dimensional linear subspace while keeping the variance as high as possible. As is well known, this subspace is the linear span of the eigenvectors of the covariance operator associated with the largest eigenvalues, called the principal components of X.
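In finite dimension, this construction can be sketched in a few lines; the synthetic data and the choice r = 2 below are purely illustrative assumptions.

```python
import numpy as np

def pca_projection(X, r):
    """Project the rows of X onto the span of the r leading
    eigenvectors of the empirical covariance matrix."""
    Xc = X - X.mean(axis=0)                  # center the sample
    cov = Xc.T @ Xc / len(Xc)                # empirical covariance
    eigvecs = np.linalg.eigh(cov)[1]         # eigenvalues come in ascending order
    top = eigvecs[:, ::-1][:, :r]            # r leading eigenvectors
    return Xc @ top @ top.T                  # orthogonal projection of the data

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ np.diag([5.0, 4.0] + [1.0] * 8)
Z = pca_projection(X, r=2)
```

The projected sample `Z` lives in a two-dimensional subspace of the ambient space, which is the point of the dimension reduction.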
Several results can be found in the literature concerning the non-asymptotic setting. These results rely on sharp non-asymptotic bounds for the approximation error of the covariance matrix (e.g. Rudelson [18], Tropp [22], Vershynin [23]).
The PCA in a separable Hilbert space that we consider here includes the analysis of samples in a functional space (PCA for functional data, Ramsay and Silverman [17]) and of samples embedded in a reproducing kernel Hilbert space. The latter is for example the case of kernel PCA, which uses the kernel trick to embed the dataset in a reproducing kernel Hilbert space in order to get a representation with a simplified geometry (e.g. Schölkopf, Smola, Müller [21], Zwald, Bousquet, Blanchard [25], Shawe-Taylor, Williams, Cristianini, Kandola [19], [20]). We consider the covariance operator $\Sigma = \mathbb{E}\bigl[\langle \cdot, X - \mathbb{E}(X) \rangle \, (X - \mathbb{E}(X))\bigr]$, where $\mathbb{E}$ is the expectation with respect to the law of the random vector $X$, or the Gram operator $G = \mathbb{E}\bigl[\langle \cdot, X \rangle \, X\bigr]$, whose principal eigenvectors provide the directions with maximum energy instead of maximum variance. Moreover, as we will show, the study of $\Sigma$ can be deduced from the study of $G$. We assume that the law of $X$ is unknown, so that we cannot work directly with $\Sigma$ but have to construct an estimator $\widehat{\Sigma}$ as a function of the sample $(X_1, \ldots, X_n)$. Results concerning the estimation of the spectral projectors of the covariance operator by their empirical counterparts in a Hilbert space can be found in Koltchinskii and Lounici [14], [15]. These authors study the problem in the case of Gaussian centered random vectors, based on the bounds obtained in [13], in the setting where both the sample size $n$ and the trace of the covariance operator are large.
A question that arises in standard PCA is how to determine the number of relevant components. A common choice is to maximise the gap between the lowest eigenvalue that is kept and the next one.
This type of choice is justified by the fact that the bounds available for the statistical deviations of the representation depend on the inverse of this spectral gap.
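As an illustration of this rule, the following sketch selects r by maximizing the empirical gap; the eigenvalue list and the helper name `gap_rule` are ours, not from the paper.

```python
import numpy as np

def gap_rule(eigvals, r_max=None):
    """Return the cut-off r maximizing the empirical spectral gap
    lambda_r - lambda_{r+1} (r is 1-based)."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # decreasing order
    if r_max is None:
        r_max = len(lam) - 1
    gaps = lam[:r_max] - lam[1:r_max + 1]   # gap after each kept eigenvalue
    return int(np.argmax(gaps)) + 1

lam = [6.0, 5.5, 5.2, 2.0, 1.9, 0.5]
r = gap_rule(lam)   # the largest gap is 5.2 - 2.0, so r = 3
```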
Our goal is to improve these bounds by improving on the one hand the choice of the estimator $\widehat{\Sigma}$ of the covariance operator and on the other hand the choice of the representation itself, to make principal component analysis more robust to the statistical fluctuations induced by the draw of the sample $(X_1, \ldots, X_n)$.
So the kind of robustness we are after is not the same as in Candès, Li, Ma and Wright [3], who show that it is possible to recover the principal components of a data matrix when the observations are contained in a low-dimensional space but arbitrarily corrupted by additive noise. Our approach provides two kinds of robustness. The first idea is to replace the projection on the principal eigenvectors with an alternative based on functional calculus applied to the covariance operator. The fluctuations of this modified projection from sample to sample can be bounded in terms of a quantity that is more favorable than the spectral gap.
While this improvement is of no help if we are precisely interested in performing a projection on the span of a given number of eigenvectors, in many situations, PCA is used more loosely to shrink the dimension of the data space while keeping as much of the variance as possible. One example of such a case is k-means unsupervised clustering for high-dimensional data. The usual recommendation is to avoid using directly the k-means algorithm in a high-dimensional space, but rather to perform a PCA reduction first. In such a context, except in some restrictive models, there is no reason why there should always be a large gap between meaningful and meaningless eigenvalues. Nonetheless, we still need a stable dimension reduction method, because it is still desirable to minimise the fluctuations of the cluster boundaries when the statistical sample used to compute them is replaced by another one. To start with, we cannot hope to compute stable clusters if we base the clustering on a change of representation that is sample dependent. This is where our modified projection may help: it will remain weakly dependent on the statistical sample choice (the precise meaning of this statement being provided by a non-asymptotic bound), even when no large spectral gap is available.
The second kind of robustness consists in using an estimator of the covariance operator that has sub-exponential non-asymptotic deviation bounds under polynomial moment assumptions, as explained in Section 3 on page 3912.
The paper is divided into two parts: one is devoted to the perturbation theory of nonnegative self-adjoint operators, the other to the statistical analysis of PCA. In the appendix, we propose a new treatment of the analysis of the relative positions of two projections, which we need for the proofs and which also has some interest of its own.

A contribution to the perturbation theory of nonnegative self-adjoint operators
Proposition 2.1. Let $A$ and $B$ be two nonnegative self-adjoint operators on $H$, let $(p_i)_{i \geq 1}$ (resp. $(q_i)_{i \geq 1}$) form an orthonormal basis of eigenvectors of $A$ (resp. $B$), and let $\lambda_1 \geq \lambda_2 \geq \cdots$ be the eigenvalues of $A$ (resp. $\mu_1 \geq \mu_2 \geq \cdots$ the eigenvalues of $B$), sorted in decreasing order and satisfying $\lim_{i \to \infty} \lambda_i = \lim_{i \to \infty} \mu_i = 0$. Let us consider the orthogonal projectors on the span of the $r$ first eigenvectors of each operator, defined as $\Pi_r(A) = \sum_{i=1}^{r} \langle \cdot, p_i \rangle p_i$ and $\Pi_r(B) = \sum_{i=1}^{r} \langle \cdot, q_i \rangle q_i$, and define the spectral gaps $\gamma_r(A, B)$. The differences $\Pi_r(A) - \Pi_r(B)$ and $A - B$ are then related by bounds on $\|\Pi_r(A) - \Pi_r(B)\|_\infty$ and $\|\Pi_r(A) - \Pi_r(B)\|_{HS}$ in terms of $\|A - B\|_\infty$ and the spectral gaps, where $\|\cdot\|_\infty$ is the operator norm and $\|\cdot\|_{HS}$ is the Hilbert-Schmidt norm.

Proof. Our proofs are essentially based on the identity stated above and its consequence, as well as on the relative positions of two projections, as described in the appendix. Eq. (2.1) is implied by the corresponding identities. According to Lemma A.4 on page 3921, where $S$ denotes the unit sphere of $H$, we have, for any $u \in S \cap \operatorname{Im} \Pi_r(B)$, on the one hand the bound given there, and on the other hand $u = \sum_{j=1}^{r} \langle u, q_j \rangle q_j$ with $\sum_{j=1}^{r} \langle u, q_j \rangle^2 = \|u\|^2 = 1$. Using the Cauchy-Schwarz inequality and assuming without loss of generality that $\mu_r - \lambda_{r+1} \geq 0$, we get

I. Giulini
Exchanging the roles of $A$ and $B$, this proves the analogous bound when $\lambda_r - \mu_{r+1} \geq 0$. According to Lemma A.4 again, assuming without loss of generality that $\mu_r - \lambda_{r+1} \geq 0$, and using eq. (2.3), we obtain a further bound; exchanging $A$ and $B$ gives its counterpart, and therefore the combined estimate. Let us now assume without loss of generality that $\lambda_r - \mu_{r+1} > 0$ and $\mu_r - \lambda_{r+1} > 0$ (because otherwise $\gamma_r(A, B) = 0$ and there is nothing to prove). Applying eq. (2.2) to $\Pi_r(A)$ and $\Pi_r(B)$ shows the announced result.

Except maybe for the precise definition of the spectral gap, the last two results are well known, but usually proved in a different way, by applying the Cauchy integral formula to the resolvent. One advantage of our proof is that it leads to a definition of the spectral gap satisfying $\gamma_r(A, B) \geq (\lambda_r - \lambda_{r+1})/2$, which is bounded from below independently of the value of $B$, so that we do not need to assume that the perturbation $B - A$ is small in any sense to get a meaningful bound.
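The flavor of such perturbation bounds is easy to check numerically. The sketch below compares the operator norm of the projector difference with the classical Davis-Kahan-type ratio $2\|A - B\|_\infty / (\lambda_r - \lambda_{r+1})$; the constant 2 reflects the lower bound $\gamma_r(A, B) \geq (\lambda_r - \lambda_{r+1})/2$ and is used here only as a safe illustration, not as the sharp constant of Proposition 2.1.

```python
import numpy as np

def proj_top_r(M, r):
    """Orthogonal projector on the span of the r leading eigenvectors."""
    vecs = np.linalg.eigh(M)[1][:, ::-1][:, :r]
    return vecs @ vecs.T

rng = np.random.default_rng(1)
d, r = 8, 3
U = np.linalg.qr(rng.normal(size=(d, d)))[0]        # random orthogonal matrix
lam = np.array([10, 9, 8, 2, 1.5, 1, 0.5, 0.1])     # gap lambda_3 - lambda_4 = 6
A = U @ np.diag(lam) @ U.T
E = rng.normal(size=(d, d)) * 0.05
B = A + (E + E.T) / 2                               # small symmetric perturbation

lhs = np.linalg.norm(proj_top_r(A, r) - proj_top_r(B, r), 2)  # operator norm
rhs = 2 * np.linalg.norm(A - B, 2) / (lam[r - 1] - lam[r])
```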

Proposition 2.2.
In the same setting as in Proposition 2.1 on page 3905, let us consider an arbitrary $L$-Lipschitz function $f : \mathbb{R}_+ \to \mathbb{R}$ and define $f(A) = \sum_i f(\lambda_i) \langle \cdot, p_i \rangle p_i$ and $f(B) = \sum_i f(\mu_i) \langle \cdot, q_i \rangle q_i$. Then $\|f(A) - f(B)\|_{HS} \leq L \, \|A - B\|_{HS}$.

Proof. First remark that
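The Hilbert-Schmidt stability of this functional calculus can be illustrated numerically. The clipping function below is just one convenient 1-Lipschitz example (extended to the whole real line to sidestep tiny negative eigenvalues of the perturbed matrix); it is not a construction taken from the paper.

```python
import numpy as np

def apply_fun(M, f):
    """Functional calculus: apply f to the spectrum of a symmetric matrix M."""
    lam, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(f(lam)) @ vecs.T

f = lambda t: np.clip(t, 0.0, 1.0)        # a 1-Lipschitz function (L = 1)

rng = np.random.default_rng(2)
d = 6
M = rng.normal(size=(d, d))
A = M @ M.T                               # nonnegative self-adjoint operator
E = rng.normal(size=(d, d))
B = A + (E + E.T) / 2                     # symmetric perturbation of A

lhs = np.linalg.norm(apply_fun(A, f) - apply_fun(B, f))  # Hilbert-Schmidt norm
rhs = 1.0 * np.linalg.norm(A - B)                         # L * ||A - B||_HS
```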

Moreover, for any positive integer $r$, the corresponding bound holds. Let us now choose $r$ such that $r \leq r^* \leq r + 1$, and remark that for this choice of $r$, on the one hand, the first inequality holds, where the last step will be proved later, in Proposition 2.3, while on the other hand the relevant function is also non-decreasing on $\mathbb{R}_+$. Then, for any positive integer $i$, the stated comparison follows; in particular, taking for $g$ the constant function equal to $\|A - B\|_\infty$, we obtain the supremum bound.

Proof. Let us start from a variational characterization in which the infimum is taken over all families of $i - 1$ linearly independent vectors $e_1, \ldots, e_{i-1}$. This is similar to inequality (1.10) of [6] and is also related to eq. (6.73) on page 60 of [11] (or page 62 of [12]). Consequently,

On the other hand,

Statistical study of principal component analysis in a separable Hilbert space
In this section, we will apply the previous bounds on the perturbation of a self-adjoint operator to the estimation of the principal components of a random vector X taking its values in a real separable Hilbert space H.
In [8, Proposition 3.4] we constructed an estimator $\widehat{G}$ of the Gram operator $G = \mathbb{E}\bigl[\langle \cdot, X \rangle \, X\bigr]$ satisfying the following properties (where we have made the choice of $\sigma_n$ explained just after [8, eq. (2.13)]).

Proposition 3.1. Assume that
where $\kappa$ and $T$ are known constants. Remark that as a consequence the corresponding moment bound holds. Assume without loss of generality that $\kappa \geq 3/2$. Let $\delta, \epsilon > 0$ and define $g_n$ accordingly. Remark that for any fixed positive value of $t$, $g_n(t)$ is of order $n^{-1/2} + \epsilon \, \|G\|_{HS}$, and more precisely, where $O$ represents numerical constants in non-asymptotic bounds. For any value of $\epsilon > 0$, we can compute an estimator $\widehat{G}_n$ of $G$, based on a sample $(X_1, \ldots, X_n)$ made of $n$ independent copies of $X$, whose computation cost depends on $\epsilon$, such that, with probability at least $1 - 2\delta$, for any $u \in S$,
$$\bigl| \langle u, G u \rangle - \langle u, \widehat{G}_n u \rangle \bigr| \leq g_n\bigl(\langle u, G u \rangle\bigr) \quad \text{and} \quad \bigl| \langle u, G u \rangle - \langle u, \widehat{G}_n u \rangle \bigr| \leq g_n\bigl(\langle u, \widehat{G}_n u \rangle\bigr).$$
Moreover, the functions $g_n$ and $t \mapsto t - g_n(t)$ are non-decreasing on $\mathbb{R}_+$. As a consequence, the corresponding comparisons hold on the same event of probability at least $1 - 2\delta$ as above. Moreover, if $\lambda_1 \geq \lambda_2 \geq \cdots$ are the eigenvalues of $G$ and $\widehat{\lambda}_1 \geq \widehat{\lambda}_2 \geq \cdots$ are the eigenvalues of $\widehat{G}_n$, counted with their multiplicities, then on the event of probability at least $1 - 2\delta$ mentioned above the analogous eigenvalue comparisons hold. The fact that $g_n$ is non-decreasing is proved in [8, Lemma 7.7]; the fact that $t \mapsto t - g_n(t)$ is non-decreasing is a straightforward consequence of the fact that $\gamma_n$ is non-increasing.
This proposition, along with the results of the previous section concerning the perturbation of self-adjoint operators, can be used to estimate the fluctuations of $\Pi_r(\widehat{G}_n)$, the estimated principal component representation of $X$. In the following corollary we give both theoretical and empirical bounds for these fluctuations, as well as their generalization to the estimation of Lipschitz functionals $f(G)$.
and for any $L$-Lipschitz function $f$. Let us give an example of the use of a Lipschitz functional $f(G)$ instead of a projection $\Pi_r(G)$. For any positive real parameters $a > b \geq 0$, let us define
$$f_{a,b}(t) = \min\Bigl\{ \max\Bigl\{ \frac{t - b}{a - b}, \, 0 \Bigr\}, \, 1 \Bigr\}, \qquad t \in \mathbb{R}_+.$$
This is a Lipschitz function with Lipschitz constant $(a - b)^{-1}$. There is a unique pair of indices $r \leq s \in \mathbb{N}$ such that $\lambda_r \geq a > \lambda_{r+1}$ and $\lambda_s > b \geq \lambda_{s+1}$. We can express $f_{a,b}(G)$ as
$$f_{a,b}(G) = \Pi_r(G) + \sum_{i=r+1}^{s} f_{a,b}(\lambda_i) \, \langle \cdot, p_i \rangle \, p_i,$$
where $(p_i)$ is an orthonormal basis of eigenvectors of $G$ and where $f_{a,b}(\lambda_i) \in (0, 1)$ for $r < i \leq s$. We see therefore that $f_{a,b}(G)$ lies between $\Pi_r(G)$ and $\Pi_s(G)$, in the sense that, as quadratic forms, $\Pi_r(G) \preceq f_{a,b}(G) \preceq \Pi_s(G)$. Consequently, the energy kept by the representation $f_{a,b}(G) X$ also lies in between:
$$\mathbb{E}\bigl\|\Pi_r(G) X\bigr\|^2 \;\leq\; \mathbb{E}\bigl\|f_{a,b}(G) X\bigr\|^2 \;\leq\; \mathbb{E}\bigl\|\Pi_s(G) X\bigr\|^2.$$
In the same way, there is a unique pair of indices $\widehat{r} \leq \widehat{s}$ such that $\widehat{\lambda}_{\widehat{r}} \geq a > \widehat{\lambda}_{\widehat{r}+1}$ and $\widehat{\lambda}_{\widehat{s}} > b \geq \widehat{\lambda}_{\widehat{s}+1}$, and $f_{a,b}(\widehat{G}_n)$ is also a kind of interpolation between the two projections $\Pi_{\widehat{r}}(\widehat{G}_n)$ and $\Pi_{\widehat{s}}(\widehat{G}_n)$, since it can be written as
$$f_{a,b}(\widehat{G}_n) = \Pi_{\widehat{r}}(\widehat{G}_n) + \sum_{i=\widehat{r}+1}^{\widehat{s}} f_{a,b}(\widehat{\lambda}_i) \, \langle \cdot, q_i \rangle \, q_i,$$
where $(q_i)$ is an orthonormal basis of eigenvectors of $\widehat{G}_n$.
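Assuming the piecewise-linear form of $f_{a,b}$ (0 below $b$, 1 above $a$, linear in between, which matches the stated Lipschitz constant $(a - b)^{-1}$), the interpolated representation can be sketched as follows; the data are synthetic and the function name `soft_projection` is ours.

```python
import numpy as np

def soft_projection(M, a, b):
    """f_{a,b}(M): keep eigenvalues >= a, drop eigenvalues <= b,
    and interpolate linearly on the band (b, a)."""
    assert a > b >= 0
    lam, vecs = np.linalg.eigh(M)
    w = np.clip((lam - b) / (a - b), 0.0, 1.0)
    return vecs @ np.diag(w) @ vecs.T

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5)) @ np.diag([4.0, 3.0, 1.0, 0.5, 0.1])
G = X.T @ X / len(X)                     # empirical Gram matrix
F = soft_projection(G, a=4.0, b=0.5)     # lies between two spectral projectors
```

All eigenvalues of `F` are in [0, 1], so the operator sits between the two projectors in the quadratic-form order, as described above.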
The benefit of using $f_{a,b}(\widehat{G}_n)$ instead of $\Pi_s(G)$ to map the data into a space of dimension $s$ is that the fluctuations of this representation now depend on the larger multi-step gap $a - b$, which can be as large as $\lambda_r - \lambda_{s+1}$, rather than on the single-step gap $\lambda_r - \lambda_{r+1}$.
Another option is to use a data-dependent function $f = f_{\widehat{\lambda}_r, \widehat{\lambda}_s}$ for some pair of indices $r < s$. Our deviation bounds being uniform in the choice of $f$, they allow for a data-dependent $f$, so that we get for instance, with probability at least $1 - 2\delta$, that

Proposition 3.2 (Energy estimates). We will show here that it is possible to get an empirical estimate of the energy contained in the estimated representation.
In the previous setting, with probability at least 1 − 2δ, for any r ∈ N,

and for any indices $r < s$. Moreover, with probability at least $1 - 2\delta$, a further bound holds, which can also be written in an equivalent form. Indeed, on the same event of probability at least $1 - 2\delta$, since the functions $g_n$ and $t \mapsto t - g_n(t)$ are non-decreasing, the first comparison follows; on the other hand, the reverse comparison also holds, ending the proof of eq. (3.2). From eq. (3.6) and the deviation bound holding with probability at least $1 - 2\delta$, we obtain eq. (3.3). Observe now an identity that leads to eq. (3.4) when combined with eq. (3.7). Finally, eq. (3.5) is a consequence of eq. (3.2).

If we want to exploit the inequality $\|f(A) - f(B)\|_{HS} \leq L \, \|A - B\|_{HS}$ of Proposition 2.2 on page 3909, we can use an estimator $\widehat{G}$ of $G$ such that $\|\widehat{G} - G\|_{HS}$ is properly controlled. One such estimator is given by Minsker in [16, Corollary 4.3]. It is based on a multidimensional extension of the median-of-means estimator, and in this setting the only assumption on the data is that $\mathbb{E}\bigl[\|X\|^4\bigr] < \infty$. Minsker's estimator satisfies, with probability at least $1 - \delta$, for any $a \geq b \in \mathbb{R}_+$, the corresponding Hilbert-Schmidt deviation bound. Remark that since $\|\widehat{G} - G\|_\infty \leq \|\widehat{G} - G\|_{HS}$, Minsker's estimator can also be used in conjunction with the operator norm bounds of Proposition 2.1 on page 3905 and Proposition 2.2 on page 3909, but this would give looser inequalities. Note that the estimator we proposed in [8] has a better proved operator norm bound than Minsker's, at least in some cases; it therefore makes sense to use our estimator in conjunction with operator norm bounds instead of Minsker's. Indeed, under the assumptions of Proposition 3.1 on page 3912, considering that $\operatorname{Tr}(G)^2 \leq \mathbb{E}\bigl[\|X\|^4\bigr] \leq \kappa \operatorname{Tr}(G)^2$, we do not weaken Minsker's bound much by replacing $\mathbb{E}\bigl[\|X\|^4\bigr]$ with $\kappa \operatorname{Tr}(G)^2$, at least when $\kappa$ is of order one. Making this substitution for the sake of comparison, we get one rate for Minsker's estimator, whereas for our estimator we get the bound described in Proposition 3.1, where $O$ represents numerical constants in non-asymptotic bounds. We have dropped the additional term in $\epsilon$ present in eq. (3.1), since it can be made arbitrarily small depending on the computation cost of the estimator. The difference between the two bounds is better understood by making a parallel with the finite-dimensional case. Here, after normalization by $\|G\|_\infty$, $\operatorname{Tr}(G)$ plays the role that would be played by the dimension $d$. Minsker's estimate uses the Hilbert-Schmidt norm; in other words, in the finite-dimensional case, it estimates the matrix $G$ considered as a vector of coefficients of dimension $d \times d$. Therefore Minsker gets a convergence rate of order $d^2/n$, whereas, considering $G$ as a matrix and working with the operator norm, we can get a rate in $d/n$ instead (although for a different estimator). In our bound, the interplay between $\log(\delta^{-1})$ and $\operatorname{Tr}(G)$, the substitute for the dimension, is also more favorable.
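In the spirit of the median-of-means construction underlying Minsker's estimator, here is a deliberately simplified sketch: block means of $X_i \otimes X_i$ combined by an entrywise median. This is a stand-in for illustration only, not Minsker's actual matrix median.

```python
import numpy as np

def mom_gram(X, n_blocks):
    """Median-of-means Gram estimate: average X_i X_i^T within blocks,
    then take an entrywise median across the block means (a simplified
    stand-in for a genuine matrix median)."""
    blocks = np.array_split(X, n_blocks)
    means = np.stack([B.T @ B / len(B) for B in blocks])
    return np.median(means, axis=0)

rng = np.random.default_rng(4)
X = rng.standard_t(df=5, size=(1000, 4))   # heavy tails, finite fourth moment
G_hat = mom_gram(X, n_blocks=10)
```

The median step is what buys robustness: a few blocks contaminated by heavy-tailed draws cannot drag the estimate far, whereas a plain average could.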
So far we considered the Gram operator $G = \mathbb{E}\bigl[\langle \cdot, X \rangle \, X\bigr]$, whereas principal component analysis is most often concerned with the covariance operator $\Sigma = \mathbb{E}\bigl[\langle \cdot, X - \mathbb{E}(X) \rangle \, (X - \mathbb{E}(X))\bigr]$. We can nevertheless use the results we presented for $G$ in order to study $\Sigma$. One way to make the link between the two settings is to consider two independent copies $(X_1, X_2)$ of $X$ and to remark that
$$\mathbb{E}\Bigl[\bigl\langle \cdot, (X_1 - X_2)/\sqrt{2} \bigr\rangle \, (X_1 - X_2)/\sqrt{2}\Bigr] = \Sigma,$$
so that $\Sigma$ turns out to be the Gram operator of $(X_1 - X_2)/\sqrt{2}$, and can be estimated as such from a sample $(X_1, \ldots, X_{2n})$ of size $2n$ by forming the reduced centered sample $(X_{2i-1} - X_{2i})/\sqrt{2}$, $1 \leq i \leq n$, of size $n$.
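This reduction can be sketched as follows; the sample below is synthetic, and the direct empirical covariance `Sigma_emp` serves only as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(5)
# A sample with a non-zero mean, so that centering actually matters:
X = rng.normal(loc=2.0, size=(2000, 3)) @ np.diag([3.0, 1.0, 0.5])

# Reduced centered sample Y_i = (X_{2i-1} - X_{2i}) / sqrt(2):
Y = (X[0::2] - X[1::2]) / np.sqrt(2)

# The Gram matrix of Y estimates the covariance of X (the mean cancels):
Sigma_hat = Y.T @ Y / len(Y)
Sigma_emp = np.cov(X, rowvar=False)      # direct empirical covariance
```

The pairing halves the effective sample size but removes the need to estimate the mean, which is exactly what makes the Gram-operator results applicable.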

Appendix A: Pairs of Orthogonal Projections
In this appendix we introduce some results on orthogonal projectors that are interesting in their own right. Let $P, Q : H \to H$ be two orthogonal projectors with finite ranks, defined on some separable real Hilbert space $H$. Let $S$ be the unit sphere of $H$. The description of the relative positions of $P$ and $Q$, or equivalently of $\operatorname{Im}(P)$ and $\operatorname{Im}(Q)$, is a classical topic that goes back at least to [10]. More recently it has been treated in [6, 12]. We would like to bring a contribution to this question based on the use of an orthonormal basis of eigenvectors of $P + Q$. In such a basis, the description splits into an orthogonal sum of problems of dimension one or two, reducing the question to the relative positions of two lines in the plane. This decomposition is related to the notion of principal angles introduced by Jordan [10]. We obtain it from a straightforward construction that, to our knowledge, was not available in the literature, although the relation between the spectrum of $P + Q$ and principal angles is proved in [7].
Let us start by listing the properties of the eigenvectors of P + Q.
Lemma A.1. Let $x \in S$ be an eigenvector of $P + Q$ with eigenvalue $\lambda$. Then $\lambda \in [0, 2]$ and the following holds:
1. if $\lambda = 0$, then $Px = Qx = 0$, so that $x \in \ker(P) \cap \ker(Q)$;
2. if $\lambda = 1$, then $x = Px + Qx$, where $Px \in \operatorname{Im}(P) \cap \ker(Q)$ and $Qx \in \ker(P) \cap \operatorname{Im}(Q)$;
3. if $\lambda = 2$, then $x \in \operatorname{Im}(P) \cap \operatorname{Im}(Q)$;
4. if $\lambda \notin \{0, 1, 2\}$, then $(P - Q)x$ is an eigenvector of $P + Q$ with eigenvalue $2 - \lambda$.
Proof. The operator $P + Q$ is self-adjoint, nonnegative, of finite rank, and $\|P + Q\|_\infty \leq 2$, so that there is a basis of eigenvectors and all eigenvalues are in the interval $[0, 2]$. In case 1, $0 = \langle (P + Q)x, x \rangle = \|Px\|^2 + \|Qx\|^2$, so that $Px = Qx = 0$. In case 2, $PQx = P(x - Px) = 0$ and similarly $QPx = Q(x - Qx) = 0$, so that $x = Px + Qx$, where $Px \in \operatorname{Im}(P) \cap \ker(Q)$ and $Qx \in \ker(P) \cap \operatorname{Im}(Q)$. In case 3, $2 = \langle (P + Q)x, x \rangle = \|Px\|^2 + \|Qx\|^2 \leq 2$, with equality only if $Px = Qx = x$. In case 4, remark that $(P + Q)(P - Q) = (P - Q)\bigl(2 - (P + Q)\bigr)$. Therefore $(P - Q)x$ is an eigenvector of $P + Q$ with eigenvalue $2 - \lambda \neq \lambda$. Remark now that $(P - Q)x \neq 0$, since $\lambda \notin \{0, 2\}$; similarly for the symmetric statements.

Lemma A.2. There is an orthonormal basis
Proof. As already explained in the beginning of the proof of Lemma A.1, there exists a basis of eigenvectors of P + Q. From the previous lemma, we have that all eigenvectors in the kernel of P + Q are in ker(P ) ∩ ker(Q), and on the other hand obviously ker(P ) ∩ ker(Q) ⊂ ker(P + Q) so that ker(P + Q) = ker(P ) ∩ ker(Q).
In the same way, the previous lemma proves that the eigenspace corresponding to the eigenvalue 2 is equal to $\operatorname{Im}(P) \cap \operatorname{Im}(Q)$. It also proves that the eigenspace corresponding to the eigenvalue 1 is included in, and consequently equal to, $\bigl(\operatorname{Im}(P) \cap \ker(Q)\bigr) \oplus \bigl(\ker(P) \cap \operatorname{Im}(Q)\bigr)$, so that we can form an orthonormal basis of this eigenspace by taking the union of an orthonormal basis of $\operatorname{Im}(P) \cap \ker(Q)$ and an orthonormal basis of $\ker(P) \cap \operatorname{Im}(Q)$. Consider now an eigenspace corresponding to an eigenvalue $\lambda \in \,]0, 1[\, \cup \,]1, 2[\,$ and let $x, y$ be two orthonormal eigenvectors in this eigenspace. From the previous lemma, remark that
$$\langle (P - Q)x, (P - Q)y \rangle = \langle (P - Q)^2 x, y \rangle = \lambda(2 - \lambda) \langle x, y \rangle = 0, \qquad \|(P - Q)x\|^2 = \lambda(2 - \lambda) \neq 0.$$
Therefore, if $x_1, \ldots, x_k$ is an orthonormal basis of the eigenspace $V_\lambda$ corresponding to the eigenvalue $\lambda$, then $(P - Q)x_1, \ldots, (P - Q)x_k$ is an orthogonal system in $V_{2-\lambda}$. If this system did not span $V_{2-\lambda}$, we could add to it an orthogonal unit vector $y_{k+1} \in V_{2-\lambda}$, so that $x_1, \ldots, x_k, (P - Q)y_{k+1}$ would be an orthogonal set of non-zero vectors in $V_\lambda$, contradicting the fact that $x_1, \ldots, x_k$ is an orthonormal basis of $V_\lambda$. Therefore $\bigl(\lambda(2 - \lambda)\bigr)^{-1/2}(P - Q)x_1, \ldots, \bigl(\lambda(2 - \lambda)\bigr)^{-1/2}(P - Q)x_k$ is an orthonormal basis of $V_{2-\lambda}$. Performing this construction for all the eigenspaces $V_\lambda$ such that $\lambda \in \,]0, 1[\,$ completes the construction of the orthonormal basis described in the lemma.
Proof. According to Lemma A.1, the transformed family is another orthogonal basis of $H$. Each vector of this basis is either in $\operatorname{Im}(P)$ or in $\ker(P)$, and more precisely
$$P x_1, \ldots, P x_m, \; x_{2m+1}, \ldots, x_p, \; x_{q+1}, \ldots, x_s \in \operatorname{Im}(P).$$
This proves the claim of the lemma concerning $P$. Since $P$ and $Q$ play symmetric roles, this proves also the claim concerning $Q$, mutatis mutandis.

Moreover, for any orthonormal basis $e_1, \ldots, e_r$ of $\operatorname{Im}(P)$, the identity (A.2) holds.

Proof. As $P - Q$ is a self-adjoint operator, we have $\|P - Q\|_\infty = \sup_{x \in S} \|(P - Q)x\|$. Remark that the basis described in Lemma A.2 is also a basis of eigenvectors of $(P - Q)^2$, as follows from Lemma A.1. If $q - 2m > 0$, then $\|P - Q\|_\infty = 1$ and $q - p > 0$, according to Lemma A.1, so that $\|(P - Q)x_{p+1}\| = 1$, where $x_{p+1} \in \operatorname{Im}(Q)$. If $q = 2m$ and $m > 0$, consider an index $i$ at which the largest eigenvalue of $(P - Q)^2$ is attained. Since $x_i$ and $x_{m+i}$ are two eigenvectors of $(P - Q)^2$ corresponding to this eigenvalue, all the non-zero vectors of $\operatorname{span}\{x_i, x_{m+i}\}$ (including $P x_i$) are also eigenvectors in the same eigenspace, so that $\sup_{x \in S} \|(P - Q)x\|$ is reached in $\operatorname{Im}(P)$. Finally, if $q = 0$, then $P - Q$ is the null operator, so that $\sup_{x \in S} \|(P - Q)x\|$ is reached everywhere, including in $\operatorname{Im}(P) \cap S$. This concludes the proof of eq. (A.1). As for eq. (A.2), since its left-hand side is the trace of $P(P - Q)^2 P$, its value is independent of the choice of the orthonormal basis $(e_1, \ldots, e_r)$ of $\operatorname{Im}(P)$. It is therefore enough to prove eq. (A.2) for some special choice of orthonormal basis of $\operatorname{Im}(P)$. Remembering that $p - 2m = q - p$, as stated in Corollary A.1, we see that the identity follows.

Our study of the eigenvectors of $P + Q$ is related to the notion of principal angles, as shown in [7]. Nevertheless, in [7] the properties of $P + Q$ and its spectrum are deduced from other results about principal angles, whereas here we show conversely that the above fairly simple and elementary study of $P + Q$ can be used to derive a description of principal angles.
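The relation between the spectrum of P + Q and principal angles discussed here can be verified numerically: on each two-dimensional block with principal angle θ ∈ (0, π/2), the eigenvalues of P + Q are 1 + cos θ and 1 − cos θ. The sketch below builds two random subspaces and checks the full predicted spectrum; the setup (dimensions, seeds) is of course only an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(6)
d, rp, rq = 7, 3, 2
Up = np.linalg.qr(rng.normal(size=(d, rp)))[0]   # orthonormal basis of Im(P)
Uq = np.linalg.qr(rng.normal(size=(d, rq)))[0]   # orthonormal basis of Im(Q)
P, Q = Up @ Up.T, Uq @ Uq.T                      # orthogonal projectors

# Cosines of the principal angles, via the SVD of Up^T Uq:
cosines = np.linalg.svd(Up.T @ Uq, compute_uv=False)

# Predicted spectrum of P + Q: pairs 1 +/- cos(theta_k), eigenvalue 1 on the
# leftover directions of Im(P) (here rp > rq), zeros on ker(P) inter ker(Q).
expected = np.sort(np.concatenate([[0.0] * (d - rp - rq),
                                   [1.0] * (rp - rq),
                                   1 - cosines, 1 + cosines]))
spectrum = np.sort(np.linalg.eigvalsh(P + Q))
```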
Proposition A.1. In the previous setting (of finite-rank orthogonal projectors in a separable real Hilbert space), the principal angles $\theta_k$ between $\operatorname{Im}(P)$ and $\operatorname{Im}(Q)$ are recursively defined by maximizing $\cos(\theta_k) = \langle u_k, v_k \rangle$, with $\theta_k \in [0, \pi/2]$.
So let us now assume without loss of generality that this is the case from the beginning (that is, from iteration one); in other words, let us assume that $s - q = 0$. Let us write the decomposition given by Lemma A.3 on page 3921, and remark that the corresponding vectors realize the required maximality property, so that the construction can be iterated. Repeating this reasoning $m$ times shows that $\cos(\theta_k) = \lambda_{k - s + q} - 1$, for $s - q + 1 \leq k \leq s - q + m$,
and describing the possible choices of $(u_k, v_k)$ as above. We can then assume without loss of generality that $s - q = m = 0$, or equivalently that $\operatorname{Im}(P) \subset \ker(Q)$ and $\operatorname{Im}(Q) \subset \ker(P)$. In this case we can choose for $u$ any vector in $\operatorname{Im}(P) \cap S = \operatorname{span}\{x_{2m+1}, \ldots, x_p\} \cap S$ and for $v$ any vector in $\operatorname{Im}(Q) \cap S = \operatorname{span}\{x_{p+1}, \ldots, x_q\} \cap S$, and then $\langle u, v \rangle = 0$. This proves that $\theta_k = \pi/2$ when $s - q + m + 1 \leq k \leq s - q + p - m$, and shows the possible choices of $(u_k, v_k)$ in this range of values of the index $k$.