High dimension low sample size asymptotics of robust PCA

Abstract: Conventional principal component analysis is highly susceptible to outliers. In particular, a sufficiently outlying single data point can draw the leading principal component toward itself. In this paper, we study the effects of outliers for high dimension and low sample size data, using asymptotics. The non-robust nature of conventional principal component analysis is verified through inconsistency under multivariate Gaussian assumptions with a single spike in the covariance structure, in the presence of a contaminating outlier. In the same setting, the robust method of spherical principal components is consistent with the population eigenvector for the spike model, even in the presence of contamination.


Introduction
Principal components analysis (PCA) is widely used for high dimensional data (Jolliffe [1]), including high dimension, low sample size (HDLSS) data. Classical PCA represents the data using orthogonal components, ordered by successively maximizing the variability explained in the data. For mean-centered data, the principal components can be derived from a spectral decomposition of the sample covariance matrix. Both the sample mean and covariance are sensitive to outlying observations, so classical PCA tends to be unreliable in the presence of outliers.
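As a concrete reference point (our illustration, not part of the paper), a minimal sketch of classical PCA via the spectral decomposition of the sample covariance matrix:

import numpy as np

def classical_pca(X):
    """Classical PCA of a d x n data matrix X whose columns are samples.
    Returns eigenvalues (descending) and eigenvectors of the sample covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)   # mean-center each variable
    S = Xc @ Xc.T / Xc.shape[1]              # d x d sample covariance matrix
    vals, vecs = np.linalg.eigh(S)           # spectral decomposition
    order = np.argsort(vals)[::-1]           # reorder from largest to smallest
    return vals[order], vecs[:, order]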
There are two viewpoints that give complementary understandings of conventional PCA. A simple view is eigen analysis of the covariance matrix; a second view is finding directions of maximal variation. For conventional PCA, the two approaches give the same solution. In developing approaches to robust PCA, each viewpoint has led to useful methods, which are quite different in nature.
Using the first idea, Devlin and Gnanadesikan [2] performed an eigen analysis of a robust estimate of the covariance matrix to develop a robust version of PCA. Their proposed robust covariance was based on robust location and scale estimators, which replaced the usual sample means and covariances. A problem with eigen analysis of such a robust covariance matrix is that it is very challenging to obtain non-negative definite covariance matrix estimates, especially in HDLSS contexts. An interesting solution to that problem is the "minimum volume ellipsoid" (Rousseeuw [3,4]), a multivariate extension of the least median of squares. However, this method requires d < n, again rendering it unusable for HDLSS data.
Li and Chen [5] used the second viewpoint of PCA, focusing on the notion of direction of maximal variation. They proposed searching directly for the optimal direction maximizing an M-estimator of scale, an approach called projection pursuit. To find more than the first component, they subtract the projected residuals from the data and apply projection pursuit again. While their method is well defined, i.e. the optimum exists, it is computationally intractable in high dimensions.
Locantore et al. [6] proposed a simple robust alternative to PCA. The idea is to first project the data onto a sphere, which reduces the influence of the outliers. Then classical PCA is performed on the transformed data, resulting in Spherical PCA (SPCA). When the data follow a Gaussian distribution with a single large eigenvalue, the many data points in the stretched ellipsoid will project to ice caps on the sphere, so SPCA will find essentially the same direction of maximal variation. SPCA has a close relationship to the idea of "multivariate signs"; see Oja [7] for a good introduction to this area. In particular, the good robustness properties of SPCA are not surprising, because it is just the PCA of the sign representations of the data.
The asymptotic behavior of classical PCA for HDLSS data has been established by Jung and Marron [8] under various versions of the spike eigenvalue model, with one or only a few large eigenvalues (Johnstone and Silverman [9]). They explored conditions under which the conventional PCA was consistent in terms of the spike parameter α.
The major contributions of this manuscript are as follows. SPCA is shown to be consistent under the HDLSS asymptotic regime, under the same conditions as PCA. In the presence of a Gaussian outlier, conventional PCA is shown to be inconsistent. However, SPCA is shown to still be consistent in the presence of the outlier. The implications of these results are important, as they establish SPCA as an important and robust tool and an attractive alternative to PCA. Here, robustness with respect to outliers and SPCA are for the first time studied rigorously in the HDLSS asymptotic context.

Notation
Let A = [X₁, X₂, . . . , Xₙ] be a d × n data matrix, with fixed sample size n and large dimension d → ∞, where the samples X_j = (X_{1j}, . . . , X_{dj})ᵀ, j = 1, 2, . . . , n, are independent identically distributed (i.i.d.) random vectors with mean zero and unknown population covariance matrix Σ, i.e. X_j ∼ N(0, Σ). PCA is essentially equivalent to the singular value decomposition (SVD) of the mean-centered data matrix, A = Û S V̂ᵀ, where Û is the left sample eigenvector matrix and V̂ is the right sample eigenvector matrix. The population eigenvector matrix U is the eigenvector matrix of Σ, and its first column is the first population eigen direction μ₁. The diagonal entries of S are the square roots of the non-zero eigenvalues of both AAᵀ and AᵀA. The sample covariance matrix is Σ̂ = AAᵀ/n. We use "f(d) ∼ g(d)" to denote lim_{d→∞} f(d)/g(d) = 1 and "f(d) → constant" to denote lim_{d→∞} f(d) = constant. We also use "=_L" to denote equality in law. The symbol "≫" denotes "much greater than".
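To make the SVD–eigen duality concrete, here is a minimal numerical check (our sketch, not from the paper) that the non-zero eigenvalues of AAᵀ and AᵀA are both the squared singular values of A:

import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 10                                # HDLSS shape: d >> n
A = rng.standard_normal((d, n))

U_hat, s, Vt_hat = np.linalg.svd(A, full_matrices=False)
dual_eigs = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigenvalues of the n x n dual matrix

print(np.allclose(dual_eigs, s**2))            # True: both equal the squared singular values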

Spiked covariance model
One challenge of HDLSS data is that conventional principal component analysis may give inaccurate estimates of the population eigenvalues and eigenvectors. For example, all but n of the eigenvalues of Σ̂ must be 0. HDLSS asymptotics have provided many useful insights through studying the case where a small subset of the eigenvalues is much larger than the rest (Jung and Marron [8]). This is called the spiked covariance model. The spiked covariance model assumes a covariance matrix of the type Σ = U Λ Uᵀ, with Λ = diag(τ₁, τ₂, . . . , τ_p, σ, . . . , σ). In this manuscript, we consider the informative simple case where Σ = diag(d^α, 1, . . . , 1), for α > 0. After understanding this simple scenario, it is easier to extend the results to the full multi-spike model (Jung and Marron [8]).
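For later reference, a small sampler for this single-spike model (our sketch; the parameter values are arbitrary):

import numpy as np

def spike_sample(d, n, alpha, rng):
    """Draw n i.i.d. N(0, Sigma) columns, with Sigma = diag(d**alpha, 1, ..., 1)."""
    Z = rng.standard_normal((d, n))
    Z[0, :] *= d ** (alpha / 2)   # inflate the first coordinate: the spike
    return Z

X = spike_sample(d=5000, n=20, alpha=1.5, rng=np.random.default_rng(1))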

Spherical PCA
The Spherical Principal Components Analysis (SPCA) procedure was derived by Locantore et al. [6] as a robust functional data analysis method. The idea is to perform classical PCA on the data, projected onto a unit sphere. Let c be the L1 median and Y_i = (X_i − c)/||X_i − c|| be the projected data. Locantore's procedure consists of using the eigenvectors of the covariance matrix of the Y_i.
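A minimal sketch of this procedure (ours, not from the paper; the L1 median is computed here by Weiszfeld-type fixed-point iterations, one standard choice):

import numpy as np

def l1_median(X, n_iter=200, tol=1e-8):
    """Spatial (L1) median of the columns of X, via Weiszfeld iterations."""
    c = X.mean(axis=1)
    for _ in range(n_iter):
        r = np.maximum(np.linalg.norm(X - c[:, None], axis=0), tol)
        w = 1.0 / r                              # downweight distant points
        c_new = (X * w).sum(axis=1) / w.sum()
        if np.linalg.norm(c_new - c) < tol:
            break
        c = c_new
    return c

def spherical_pca(X):
    """SPCA: center at the L1 median, project each sample onto the unit
    sphere, then do classical PCA on the projected data."""
    Y = X - l1_median(X)[:, None]
    Y = Y / np.maximum(np.linalg.norm(Y, axis=0), 1e-12)
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    return U, s**2      # sample eigenvectors and (unnormalized) eigenvalues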
Let Ã be A's projection onto the sphere. Correspondingly, the SVD of Ã is Ã = Û* S* V̂*ᵀ, where we use * to denote estimates on the sphere. For example, let μ̂₁* be the first column of Û*, the left sample eigenvector matrix of Ã.
A simple example illustrates how SPCA, i.e. projection onto a sphere, leads to robust PCA. A two dimensional dataset is simulated from the multivariate normal distribution with Σ = (1, 0.9; 0.9, 1). Among the 100 samples, 2 are outliers (big solid dots in Figure 1). The direction of sample PC1 (dashed line) is influenced by the extreme outliers, so it has a large angle with the population PC1 (thin solid line). After projecting the data onto the sphere, the outliers (triangle symbols) show little influence on the sample eigenvectors, which overlay the population PC1. This example motivates handling outliers by using SPCA for HDLSS data (Locantore et al. [6]). In this manuscript, we assume the data have been centered, i.e. c = 0.
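A sketch reproducing this kind of two dimensional comparison (ours; the outlier positions and seed are arbitrary, and the data are treated as centered, c = 0):

import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal(np.zeros(2), Sigma, size=98).T   # 98 clean samples
X = np.hstack([X, np.array([[-8.0, -9.0], [8.0, 7.0]])])     # 2 gross outliers near (-1, 1)

pc1 = np.linalg.eigh(np.cov(X))[1][:, -1]          # conventional sample PC1

Y = X / np.linalg.norm(X, axis=0)                  # project samples onto the unit circle
spc1 = np.linalg.eigh(np.cov(Y))[1][:, -1]         # spherical sample PC1

pop = np.array([1.0, 1.0]) / np.sqrt(2)            # population PC1 of Sigma
for name, v in [("PCA", pc1), ("SPCA", spc1)]:
    ang = np.degrees(np.arccos(np.clip(abs(v @ pop), 0.0, 1.0)))
    print(name, round(ang, 1), "degrees from population PC1")

The PCA angle should be large here because the two outliers pull PC1 toward the minor axis, while the SPCA angle stays small: on the circle the outliers are just 2 points out of 100.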

Consistency and strong inconsistency
In our HDLSS study of the impact of outliers on PCA and the usefulness of SPCA in countering that, we use the definition of HDLSS consistency and strong inconsistency (Jung and Marron [8]).
• Consistency: The direction μ̂₁ is consistent with its population counterpart μ₁ if Angle(μ̂₁, μ₁) →p 0° or 180° as d → ∞. Note that 180° is included because the sign of the eigen direction μ̂₁ is arbitrary.
• Strong inconsistency: The direction μ̂₁ is said to be strongly inconsistent with its population counterpart μ₁ if Angle(μ̂₁, μ₁) →p 90° as d → ∞.
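In simulations, this angle can be tracked with a small helper (ours), folding the arbitrary sign into the range [0°, 90°]:

import numpy as np

def angle_deg(u, v):
    """Angle between directions u and v in degrees, ignoring sign."""
    c = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, 0.0, 1.0)))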

Underlying Gaussian model
In this section, we investigate the behavior of the first SPCA component computed from Gaussian distributed data when d → ∞ and n is fixed.
Theorem 2.1. Assume X_j ∼ N(0, Σ), j = 1, . . . , n, i.i.d., with Σ = diag(d^α, 1, . . . , 1) for α > 0, n fixed and d → ∞. When α > 1, Angle(μ̂₁*, μ₁) →p 0° or 180°, i.e. the SPCA direction μ̂₁* is consistent with μ₁; when 0 < α < 1, Angle(μ̂₁*, μ₁) →p 90°, i.e. μ̂₁* is strongly inconsistent.
This shows that SPCA provides the same consistency properties as conventional PCA.
Proof. Write X_j = (d^{α/2} Z_{j1}, Z_{j2}, . . . , Z_{jd})ᵀ, where the Z_{jl}, j = 1, . . . , n, l = 1, . . . , d, have the standard normal distribution. It follows that ||X_j||² = d^α Z²_{j1} + Σ_{l=2}^d Z²_{jl}. (2.2) For the assumed form of Σ, the population eigenvector with respect to the largest eigenvalue is μ₁ = e₁ = (1, 0, . . . , 0)ᵀ. For the n = 1, i.e. one sample, case, the spherical sample eigenvector μ̂₁* is X₁/||X₁||. Therefore the inner product of the spherical sample and population eigenvectors is ⟨μ̂₁*, μ₁⟩ = d^{α/2} Z_{11}/||X₁||. For α > 1, the term d^α Z²_{11} dominates (2.2), so ⟨μ̂₁*, μ₁⟩ →p Z_{11}/|Z_{11}| = ±1, i.e. Angle(μ̂₁*, μ₁) → 0° or 180° w.p. 1/2, and this is consistency.
In the case of 0 < α < 1, for j ≠ k, by the Law of Large Numbers (1/d) Σ_{l=2}^d Z_{jl} Z_{kl} →p 0, so the off-diagonal entries of ÃᵀÃ vanish asymptotically, i.e. ÃᵀÃ ∼ I_{n×n}, with the largest sample eigenvalue 1 and an arbitrary set of eigenvectors. Therefore the first column of Û* is a random direction. Thus, using HDLSS results from Hall, Marron and Neeman [10], the angle between the SPCA eigenvector and the dominant population eigenvector tends to 90°.
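A quick numerical check of this dichotomy (our sketch; SPCA is computed here as PCA of the sphered columns, with c = 0 as assumed above):

import numpy as np

def spca_angle(d, n, alpha, rng):
    """Angle (degrees) between the first SPCA direction and e1, spike model."""
    X = rng.standard_normal((d, n))
    X[0, :] *= d ** (alpha / 2)               # Sigma = diag(d**alpha, 1, ..., 1)
    Y = X / np.linalg.norm(X, axis=0)         # project samples onto the sphere
    u1 = np.linalg.svd(Y, full_matrices=False)[0][:, 0]
    return np.degrees(np.arccos(min(abs(u1[0]), 1.0)))

rng = np.random.default_rng(3)
for alpha in (1.5, 0.5):                      # alpha > 1 vs alpha < 1
    print("alpha =", alpha,
          [round(spca_angle(d, 20, alpha, rng), 1) for d in (100, 1000, 10000)])

As d grows, the printed angles should shrink toward 0° for α = 1.5 and approach 90° for α = 0.5, matching the theorem.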

Impact of outliers
All the properties mentioned so far are based on the assumption that the data follow a Gaussian distribution. In practice, real data often violate that assumption. This happens when there are large outliers, which may not be easily distinguishable, and which can severely impact conventional PCA, as shown in Figure 1. When we encounter a potential outlier, one natural viewpoint is that the observation resulted from a mistake or other extraneous effect, and should be discarded (Hampel et al. [11]). In other situations (Huber and Ronchetti [12]), outliers can carry useful information, and should only be downweighted, not deleted. In the HDLSS study of Locantore et al. [6], the second viewpoint was particularly relevant: outliers conveyed important information about the underlying population, which would have been lost by simply dropping those observations.
To model a scenario with outliers, we assume the data come from a contaminated normal distribution in which the majority of samples are from a specified multivariate Gaussian distribution, but a small proportion are from a multivariate Gaussian distribution with much higher variance. In particular, we assume the first n − 1 samples come from the spiked normal model of Section 1.2 and the last sample X_n ∼ N(0, Σ₂), where Σ₂ = d^β I_{d×d}. For β large, there will be a distinct outlier coming from a random direction.

Theorem 3.1. Assume the first n − 1 samples are i.i.d. N(0, Σ₁) with Σ₁ = diag(d^α, 1, . . . , 1), and the nth sample X_n ∼ N(0, Σ₂), with Σ₂ = d^β I_{d×d}, where α, β ∈ R⁺, with n fixed and d → ∞. Then:
• when 1 < α and β < α, Angle(μ̂₁, μ₁) →p 0° or 180°, i.e. the direction of μ̂₁ is consistent with μ₁;
• for α < 1 or α < β, Angle(μ̂₁, μ₁) →p 90°, i.e. the direction of μ̂₁ is asymptotically perpendicular to μ₁, i.e. is strongly inconsistent.
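A sampler for this single-outlier model (our sketch; parameter values arbitrary):

import numpy as np

def contaminated_sample(d, n, alpha, beta, rng):
    """First n-1 columns from the spike model; the last column is an
    N(0, d**beta * I) outlier pointing in a random direction."""
    X = rng.standard_normal((d, n))
    X[0, :n-1] *= d ** (alpha / 2)   # spike in the clean samples
    X[:, n-1] *= d ** (beta / 2)     # inflate the whole last sample
    return X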
Theorem 3.1 mathematically quantifies the extent to which μ̂₁ is severely influenced by outliers.
Proof. For the nth sample, ||X_n||² = d^β Σ_{l=1}^d Z²_{nl} ∼ d^{β+1}. (3.1) In conventional principal component analysis, the sample covariance matrix is Σ̂ = AAᵀ/n.
• For the case of 1 < α and β < α, by (2.2) and (3.1), AAᵀ ∼ diag((n − 1)d^α, 0, . . . , 0), where the symbol ∼ means elementwise asymptotic approximation as d → ∞. By eigen analysis, there is only one dominant eigenvalue of AAᵀ/n, and μ̂₁ is consistent with μ₁.
• For the case of α < β with fixed n and d → ∞, since (n − 1)d^α + d^β ∼ d^β, in the sense that ((n − 1)d^α + d^β)/d^β → 1, the sample covariance matrix of A approximates Σ₂ as d → ∞. Thus the first sample eigenvector has a random direction.

Fig 1. Two dimensional example showing how outliers (red dots) can strongly impact the conventional PC1 direction (red dashed vector) and how spherical PCA downweights the influence of outliers (blue triangles), giving an SPC1 direction (green vector) which is a much better estimate of the true first population eigen direction (black line).
• In the case of α < 1: if α < 1 < β, then this belongs to the case with α < β; if α, β < 1, by (2.2), for any j < n we have ||X_j||²/d →p 1, and ⟨X_j, X_n⟩/d →p 0. Similarly, for any j, k < n, j ≠ k, ⟨X_j, X_k⟩/d →p 0. Hence the dual sample covariance matrix satisfies (1/d) AᵀA ∼ ( I_{n−1} 0 ; 0 0 ). Based on Proposition 1 of Jung and Marron [8], λ̂₁/d → a constant greater than 0, and Angle(μ̂₁, μ₁) →p 90° as d → ∞.
From the above, we conclude that conventional PCA is very sensitive to an outlier. However, as shown in Figure 1, if we project the data onto the sphere, SPCA can be very robust to outliers. This is studied asymptotically in the next theorem.
Theorem 3.2. Under the contaminated model of Theorem 3.1, with n fixed and d → ∞, the SPCA direction μ̂₁* is consistent with μ₁ when α > 1, and strongly inconsistent when 0 < α < 1.

Note that this result is independent of β, which can be arbitrarily large and still not affect the consistency of SPCA.
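The contrast between Theorems 3.1 and 3.2 is easy to see numerically; a sketch (ours; α, β, n and the seed are arbitrary choices):

import numpy as np

rng = np.random.default_rng(4)
alpha, beta, n = 1.5, 2.5, 20          # alpha > 1, but beta > alpha: PCA should fail
for d in (100, 1000, 10000):
    X = rng.standard_normal((d, n))
    X[0, :n-1] *= d ** (alpha / 2)     # spiked clean samples
    X[:, n-1] *= d ** (beta / 2)       # one gross outlier
    u_pca = np.linalg.svd(X, full_matrices=False)[0][:, 0]
    Y = X / np.linalg.norm(X, axis=0)  # spherical projection
    u_spca = np.linalg.svd(Y, full_matrices=False)[0][:, 0]
    print(d,
          "PCA:", round(np.degrees(np.arccos(min(abs(u_pca[0]), 1.0))), 1),
          "SPCA:", round(np.degrees(np.arccos(min(abs(u_spca[0]), 1.0))), 1))

The PCA angle should drift toward 90° as d grows, while the SPCA angle should shrink toward 0°, as the two theorems predict.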
Proof. In the case of α > 1, the (j, n)th entry (j ≠ n) of the matrix ÃᵀÃ is ⟨Y_j, Y_n⟩, and the (j, k)th entry (j ≠ n, k ≠ n, j ≠ k) is ⟨Y_j, Y_k⟩. For α > 1, where d^α ≫ d, (2.2) gives ||X_j||² ∼ d^α Z²_{j1} for j ≠ n, so ÃᵀÃ is asymptotically the rank 1 matrix given by the outer product of the vector (Z_{11}/|Z_{11}|, Z_{21}/|Z_{21}|, . . . , Z_{n−1,1}/|Z_{n−1,1}|, Z_{n1})ᵀ. The maximum eigenvalue of ÃᵀÃ can be derived as (Z_{11}/|Z_{11}|)² + (Z_{21}/|Z_{21}|)² + · · · + (Z_{n−1,1}/|Z_{n−1,1}|)² + Z²_{n1} = n − 1 + Z²_{n1}, and its corresponding eigenvector is proportional to this vector. Following the same argument as in the proof of Theorem 2.1, the first element of μ̂₁* tends to ±1, hence |⟨μ̂₁*, μ₁⟩| →p 1. We conclude that SPCA is consistent when α > 1 with an outlier sample.
In the case of 0 < α < 1, 0 < β < 1, for j ≠ n, by (4.7) and the same logic as in (4.6), it follows that the (j, n)th entry of the matrix ÃᵀÃ tends to 0. For j ≠ k, j, k ≠ n, the (j, k)th entry of ÃᵀÃ is the same as in (4.8). Thus ÃᵀÃ ∼ I_{n×n}, with the largest sample eigenvalue 1 and no fixed eigenvector corresponding to it. Therefore the first column of Û* is random. On the high dimensional sphere, the angle between the sample eigenvector and the dominant population eigenvector tends to 90°.
In robust statistics, distributions containing outliers are commonly modeled by the contaminated normal model: X ∼ (1 − ε) N(0, Σ₁) + ε N(0, Σ₂), for a small contamination proportion ε. A good overview of these ideas and their historical background can be found in Huber [13]. The model with a single outlier in Theorems 3.1, 3.2, 4.2 and 4.3 is slightly different from this, because the number of outliers is not random. This was done to give simpler and more revealing insights. We conjecture that entirely parallel results can be derived for the contaminated normal model.
In the case of 0 < α_m < · · · < α_{t+1} < 1, for j ≠ k, by the Law of Large Numbers (1/d) Σ_{l=2}^d Z_{jl} Z_{kl} →p 0, and by (4.4), it follows that the off-diagonal (j, k)th entries of the matrix ÃᵀÃ vanish asymptotically, i.e. ÃᵀÃ ∼ I_{n×n}, with the largest sample eigenvalue 1 and an arbitrary set of eigenvectors. Therefore the first column of Û* is a random direction, and the spherical PC_l direction μ̂_l* is strongly inconsistent for l = t + 1, . . . , m.
However, conventional PCA is not robust to outliers.
• For the case of α_l < β with fixed n and d → ∞, since (n − 1)d^{α_l} + d^β ∼ d^β, in the sense that ((n − 1)d^{α_l} + d^β)/d^β → 1, the sample covariance matrix of A approximates Σ₂ as d → ∞. Thus the first sample eigenvector has a random direction.
• The proof of the case of α_l < 1 is very similar to that in Theorem 3.1. In particular, for any j < n, k < n, j ≠ k, ⟨X_j, X_k⟩/d →p 0.

Spherical PCA gives the same clean asymptotic properties as in Theorem 4.1 when we have contaminated samples.
Proof. The main proof is very similar to the proof of Theorem 4.1. Here we only show the case with α₁ > 1. In the case of α_l > 1, the (j, n)th entry (j ≠ n) of the matrix ÃᵀÃ is handled using (4.2), and the (j, k)th entries (j ≠ n, k ≠ n, j ≠ k) of ÃᵀÃ behave as in the proof of Theorem 3.2, so that ÃᵀÃ is again asymptotically of rank 1. We can then show that |⟨μ̂₁*, μ₁⟩| →p 1. Iteratively, for l = 2, . . . , t, (Ã − P_{μ̂₁*, . . . , μ̂*_{l−1}} Ã)ᵀ(Ã − P_{μ̂₁*, . . . , μ̂*_{l−1}} Ã) is a rank 1 matrix. Using arguments as in Theorem 3.2, we get |⟨μ̂_l*, μ_l⟩| →p 1, therefore Angle(μ̂_l*, μ_l) →p 0° or 180° for l = 1, 2, . . . , t, i.e. the spherical PC_l direction μ̂_l* is consistent with μ_l.

Discussion
A key assumption of this paper, made to give direct access to the critical robustness insights, is that the data are properly centered. In practice, the centering can be an important issue. Centering using the L1 M-estimate is recommended (Locantore et al. [6]), because that is intuitively consistent with spherical PCA. An interesting potential approach, suggested by a reviewer, is to tackle the centering issue by applying SPCA to the pairwise differences of the data. An interesting open problem is the impact of this estimation on the asymptotics, which we conjecture will be negligible. Detailed investigation of this can be done essentially using Taylor expansion methods on (ĉ − c), where ĉ is the sample version of the L1 M-estimate and c is the true population center. This is not pursued here for two reasons. First, the asymptotic behavior of (ĉ − c) needs to be analyzed, and to our knowledge this does not appear in the literature. Second, the relatively streamlined and insightful proofs we currently have (about the robustness of the PCA direction, which is the point of this paper) would tend to be obscured by (ĉ − c) terms appearing in the analysis.
Also worthwhile would be extensions of the theory in other directions, including more general distributional assumptions and outlier configurations. Another challenge for future work is the special case at the boundary α = 1, where conventional PCA was explored in Jung, Sen and Marron [14]. We believe that parallel results can also be established under appropriate non-Gaussian models, using e.g. sufficient moment conditions, based on ideas from Yata and Aoshima [15]. Theorems 3.2 and 4.3 suggest good breakdown properties of SPCA. Another interesting open problem is precise quantification of the breakdown point (Hampel et al. [11]).