Geometric consistency of principal component scores for high‐dimensional mixture models and its application

In this article, we consider clustering based on principal component analysis (PCA) for high-dimensional mixture models. We present theoretical reasons why PCA is effective for clustering high-dimensional data. First, we derive a geometric representation of high-dimension, low-sample-size (HDLSS) data taken from a two-class mixture model. With the help of the geometric representation, we give geometric consistency properties of sample principal component scores in the HDLSS context. We then extend the geometric representation and provide geometric consistency properties for multiclass mixture models. We show that PCA can cluster HDLSS data under certain conditions in a surprisingly explicit way. Finally, we demonstrate the performance of the clustering using gene expression datasets.

High-dimension, low-sample-size (HDLSS) data arise in many areas of modern science, such as genomic microarrays, medical imaging, and so on. In recent years, substantial work has been done on HDLSS asymptotic theory, where the sample size n is fixed or n/d → 0 as the data dimension d → ∞. Hall, Marron, and Neeman (2005), Ahn, Marron, Muller, and Chi (2007), Yata and Aoshima (2012), and Lv (2013) explored several types of geometric representations of HDLSS data. Jung and Marron (2009) showed inconsistency properties of the sample eigenvalues and eigenvectors in the HDLSS context. Yata and Aoshima (2012, 2013) developed the noise-reduction methodology to provide consistent estimators of both the eigenvalues and eigenvectors together with the principal component (PC) scores in the HDLSS context. Shen, Shen, Zhu, and Marron (2016) and Hellton and Thoresen (2017) also provided several asymptotic properties of the sample PC scores in the HDLSS context.
The HDLSS asymptotic theory was created under the assumption that either the population distribution is Gaussian or the random variables in a sphered data matrix have a ρ-mixing dependency. However, Yata and Aoshima (2010) developed an HDLSS asymptotic theory without such assumptions. Moreover, they created a new principal component analysis (PCA) method, called the cross-data-matrix methodology, that can construct unbiased estimators in HDLSS nonparametric settings. Based on the cross-data-matrix methodology, Aoshima and Yata (2011) developed a variety of inferential methods for HDLSS data, such as given-bandwidth confidence regions, two-sample tests, classification, variable selection, regression, and pathway analysis (see Aoshima et al., 2018, for a review).
PCA is an important visualization and dimension-reduction technique for high-dimensional data. Furthermore, PCA is quite popular for clustering high-dimensional data (see section 9.2 in Jolliffe, 2002, for details). For clustering of HDLSS gene expression data, see Armstrong et al. (2002) and Pomeroy et al. (2002). Liu, Hayes, Nobel, and Marron (2008) and Ahn, Lee, and Yoon (2012) gave binary split-type clustering methods for HDLSS data. Borysov, Hannig, and Marron (2014) considered hierarchical clustering for high-dimensional data. Li and Yao (2018) considered model-based clustering for a high-dimensional mixture. Given this background, we focus on high-dimensional structures of multiclass mixture models via PCA. In this article, we consider asymptotic properties of PC scores for high-dimensional mixture models and apply them to cluster analysis in HDLSS settings. The main contribution of this article is to give theoretical reasons why PCA is effective for clustering HDLSS data.
Suppose there are k independent, d-variate populations, Π_i, i = 1, …, k, where Π_i has an unknown mean vector μ_i and an unknown (positive-semidefinite) covariance matrix Σ_i. We consider a mixture model to classify a dataset into k (≥2) groups. We assume that any sample is taken with mixing proportion π_i from Π_i, where π_i ∈ (0, 1) and ∑_{i=1}^k π_i = 1, but the label of the population is missing. We assume that k and the π_i's are independent of d.
We consider a mixture model whose probability density function (or probability function) is given by

f(x) = ∑_{i=1}^k π_i f_i(x; μ_i, Σ_i),  (1)

where x ∈ R^d and f_i(x; μ_i, Σ_i) is a d-dimensional probability density function (or probability function) of Π_i having mean vector μ_i and covariance matrix Σ_i. Suppose we have a d × n data matrix X = [x_1, …, x_n], where the x_j, j = 1, …, n, are independently taken from Equation (1). We assume n ≥ k. Let n_i = #{j | x_j ∈ Π_i for j = 1, …, n} and π̂_i = n_i/n for i = 1, …, k, where #A denotes the number of elements in a set A. We assume that n and the n_i's are independent of d. Let μ and Σ be the mean vector and the covariance matrix of Equation (1), respectively. Then, we have that

μ = ∑_{i=1}^k π_i μ_i and Σ = ∑_{i=1}^k π_i {Σ_i + (μ_i − μ)(μ_i − μ)^T}.  (2)

We note that E(x | x ∈ Π_i) = μ_i and Var(x | x ∈ Π_i) = Σ_i for i = 1, …, k. We denote the eigendecomposition of Σ by Σ = HΛH^T, where Λ = diag(λ_1, …, λ_d) with eigenvalues λ_1 ≥ ⋯ ≥ λ_d ≥ 0 and H = [h_1, …, h_d] is an orthogonal matrix of the corresponding eigenvectors. Let x_j − μ = HΛ^{1/2}(z_1j, …, z_dj)^T for j = 1, …, n. Then, (z_1j, …, z_dj)^T is a sphered data vector from a distribution with the identity covariance matrix; E{(z_1j, …, z_dj)^T} = 0 and Var{(z_1j, …, z_dj)^T} = I_d, where I_d denotes the d-square identity matrix. The ith true PC score of x_j is given by

s_ij = h_i^T(x_j − μ) = λ_i^{1/2} z_ij.

We note that Var(s_ij) = λ_i for all i, j. Let Δ_i = ||μ_i||² for i = 1, …, k, where ||·|| denotes the Euclidean norm. Here, we assume Δ_1 ≥ ⋯ ≥ Δ_{k−1} without loss of generality. We also assume μ_k = 0 for the sake of simplicity.
Remark 1. The sample covariance matrix is invariant under location shifts, so that one can replace the μ_i's with μ′_i = μ_i − μ_k for all i. Hence, for any inference on Σ by the sample covariance matrix, one can assume μ_k = 0 without loss of generality.
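To fix ideas, here is a minimal numerical sketch of the model setup; the Gaussian components and all parameter values are our illustrative assumptions, since Equation (1) is not restricted to Gaussian components. It generates a two-class mixture, forms μ and Σ as in Equation (2), and checks that the true PC scores s_ij have variance close to λ_i:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-class setting (all parameter values are hypothetical):
# Gaussian components with mu_2 = 0, as assumed in the article.
d, n = 20, 50
pis = np.array([0.4, 0.6])
mus = [np.ones(d), np.zeros(d)]
Sigmas = [np.eye(d), 2.0 * np.eye(d)]

# Mean vector and covariance matrix of the mixture, Equation (2):
mu = sum(p * m for p, m in zip(pis, mus))
Sigma = sum(p * (S + np.outer(m - mu, m - mu)) for p, S, m in zip(pis, Sigmas, mus))

# Draw x_1, ..., x_n from the mixture (1); the labels are kept here only
# to illustrate the theory (they are missing in practice).
labels = rng.choice(2, size=n, p=pis)
X = np.column_stack([rng.multivariate_normal(mus[g], Sigmas[g]) for g in labels])

# True PC scores s_ij = h_i^T (x_j - mu); note Var(s_ij) = lambda_i.
lam, H = np.linalg.eigh(Sigma)
lam, H = lam[::-1], H[:, ::-1]                 # sort eigenvalues descendingly
s = H.T @ (X - mu[:, None])
print(np.var(s[0]), lam[0])                    # sample variance of s_1j vs lambda_1
```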
As the sign of an eigenvector is arbitrary, we assume that h_i^T μ_i ≥ 0 for i = 1, …, k − 1, without loss of generality. In addition, we assume the cluster means are more spread out than the within-class variation, in the sense of the following condition:

Condition 1. λ_max(Σ_i)/Δ_{k−1} → 0 as d → ∞ for i = 1, …, k.

Here, λ_max(M) denotes the largest eigenvalue of any positive-semidefinite matrix M. We consider clustering x_1, …, x_n into one of the Π_i's in HDLSS situations. When k = 2, Yata and Aoshima (2010) gave the following result. We denote the angle between two nonzero vectors x and y by Angle(x, y) = cos^{−1}{x^T y/(||x|| · ||y||)}. By noting that μ_2 = 0, under Condition 1, it holds that Angle(h_1, μ_1) → 0 as d → ∞, from the fact that λ_1/Δ_1 = h_1^T Σ h_1/Δ_1 = π_1 π_2 (h_1^T μ_1)²/Δ_1 + o(1) as d → ∞ under Condition 1. Furthermore, for the normalized first PC score s_1j/λ_1^{1/2}, j = 1, …, n, it holds that

plim_{d→∞} s_1j/λ_1^{1/2} = (π_2/π_1)^{1/2} when x_j ∈ Π_1, and plim_{d→∞} s_1j/λ_1^{1/2} = −(π_1/π_2)^{1/2} when x_j ∈ Π_2.  (3)

Here, "plim" denotes convergence in probability. This result is a special case of Theorem 2 in Section 3; see Remark 8. One would be able to cluster the x_j's into two groups if s_1j were accurately estimated in HDLSS situations.
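The limits in Equation (3) can be checked numerically. In the sketch below we take an illustrative setting with Σ_1 = Σ_2 = I_d and μ_1 = 1_d (so Δ_1 = d and h_1 is exactly μ_1/||μ_1||), and watch the normalized first true PC score approach (π_2/π_1)^{1/2} or −(π_1/π_2)^{1/2} as d grows:

```python
import numpy as np

rng = np.random.default_rng(1)
pi1, pi2 = 0.25, 0.75                      # hypothetical mixing proportions

for d in [10, 100, 1000, 10000]:
    mu1 = np.ones(d)                       # mu_2 = 0, so Delta_1 = ||mu_1||^2 = d
    mu = pi1 * mu1                         # mixture mean
    h1 = mu1 / np.sqrt(d)                  # with Sigma_1 = Sigma_2 = I_d, h_1 = mu_1/||mu_1|| exactly
    lam1 = pi1 * pi2 * d + 1.0             # top eigenvalue of pi1*pi2*mu1 mu1^T + I_d
    x1 = mu1 + rng.standard_normal(d)      # one draw from Pi_1
    x2 = rng.standard_normal(d)            # one draw from Pi_2
    print(d, h1 @ (x1 - mu) / lam1**0.5, h1 @ (x2 - mu) / lam1**0.5)

# limits: (pi2/pi1)**0.5 = 1.73... for Pi_1 and -(pi1/pi2)**0.5 = -0.57... for Pi_2
```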
In this article, we consider asymptotic properties of sample PC scores for Equation (1) in the HDLSS context in which d → ∞ while n is fixed. In Section 2, we first derive a geometric representation of HDLSS data taken from the two-class mixture model. With the help of the geometric representation, we give geometric consistency properties of sample PC scores in the HDLSS context. We show that PCA can cluster HDLSS data under certain conditions in a surprisingly explicit way. In Section 3, we investigate asymptotic behaviors of true PC scores for the k(≥3)-class mixture model and provide geometric consistency properties of sample PC scores when k ≥ 3. In Section 4, we demonstrate the performance of clustering based on sample PC scores using gene expression datasets. We show that real HDLSS datasets exhibit the geometric consistency properties.

PC SCORES FOR TWO-CLASS MIXTURE MODEL
In this section, we consider PC scores for the two-class (k = 2) mixture model.

Preliminary
The sample covariance matrix is given by S = (n − 1)^{−1}(X − X̄)(X − X̄)^T, where X̄ = [x̄, …, x̄] with x̄ = n^{−1}∑_{j=1}^n x_j. Then, we define the n × n dual sample covariance matrix by S_D = (n − 1)^{−1}(X − X̄)^T(X − X̄). We note that rank(S_D) ≤ n − 1. Let λ̂_1 ≥ ⋯ ≥ λ̂_{n−1} ≥ 0 be the eigenvalues of S_D. Then, we define the eigendecomposition of S_D by

S_D = ∑_{i=1}^{n−1} λ̂_i û_i û_i^T,

where û_i = (û_i1, …, û_in)^T denotes a unit eigenvector corresponding to λ̂_i. As the sign of the û_i's is arbitrary, we assume û_i^T z_i ≥ 0 for all i without loss of generality, where z_i = (z_i1, …, z_in)^T. Note that S and S_D share the nonzero eigenvalues. Let

ẑ_ij = n^{1/2} û_ij for i = 1, …, n − 1; j = 1, …, n.

We note that ẑ_ij is an estimate of s_ij/λ_i^{1/2} (= z_ij) from the facts that ĥ_i = {(n − 1)λ̂_i}^{−1/2}(X − X̄)û_i and ĥ_i^T(x_j − x̄) = {(n − 1)λ̂_i}^{1/2} û_ij, where ĥ_i denotes a unit eigenvector of S corresponding to λ̂_i. Let X_0 = X − μ1_n^T and P_n = I_n − n^{−1}1_n 1_n^T, where 1_n = (1, …, 1)^T. We note that S_D = P_n X_0^T X_0 P_n/(n − 1). We consider the sphericity condition: tr(Σ²)/tr(Σ)² → 0 as d → ∞. Note that the sphericity condition is equivalent to "λ_1/tr(Σ) → 0 as d → ∞." When one can assume that X is Gaussian, or that Z = (z_ij) is ρ-mixing and the fourth moments of each variable in Z are uniformly bounded, under the sphericity condition, Jung and Marron (2009) suggested a geometric representation as follows:

(n − 1)S_D/tr(Σ) = P_n + o_P(1) as d → ∞.  (4)

Remark 2. Yata and Aoshima (2012) showed that Equation (4) holds under the sphericity condition and Var(||x_j − μ||²)/tr(Σ)² → 0 as d → ∞.
From Equation (4), we observe that the nonzero eigenvalues of S_D become deterministic as the dimension increases, whereas the eigenvectors of S_D do not uniquely determine directions. In addition, Hellton and Thoresen (2017) presented asymptotic properties of the sample PC scores when Z is ρ-mixing. We note that Equation (1) does not presuppose that X is Gaussian or that Z is ρ-mixing; see section 4.1.1 in Qiao, Zhang, Liu, Todd, and Marron (2010) for details. In the present article, we present new asymptotic properties of the sample PC scores for Equation (1).
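The quantities above are directly computable. The following sketch (the function name is ours) computes S_D and the normalized sample PC scores ẑ_ij = n^{1/2} û_ij from a d × n data matrix:

```python
import numpy as np

def dual_pc_scores(X, m):
    """Compute the first m normalized sample PC scores z_hat_ij = n^{1/2} u_hat_ij
    via the n x n dual sample covariance matrix S_D (a sketch of the preliminaries)."""
    d, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)     # X - X_bar
    SD = Xc.T @ Xc / (n - 1)                   # dual matrix; rank(S_D) <= n - 1
    lam, U = np.linalg.eigh(SD)
    lam, U = lam[::-1][:m], U[:, ::-1][:, :m]  # the m largest eigenpairs
    return lam, np.sqrt(n) * U.T               # row i holds (z_hat_i1, ..., z_hat_in)
```

Note that the article's sign convention û_i^T z_i ≥ 0 involves the unobservable z_i, so in practice the sign of each eigenvector must be fixed by some deterministic convention.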

Geometric representation and consistency property of PC scores when k = 2
We will find a geometric representation for Equation (1); the finding is completely different from Equation (4). We assume the following conditions, Conditions 2–4:
We define r = (r_1, …, r_n)^T by r_j = π̂_2 when x_j ∈ Π_1 and r_j = −π̂_1 when x_j ∈ Π_2. The following result gives a geometric representation for Equation (1) when k = 2.
When S_D ≠ O, we note that û_1^T 1_n = 0, so that û_1^T P_n = û_1^T. Thus, from Equation (5), the first eigenvector of S_D uniquely determines a direction. In fact, by noting that r^T 1_n = 0 and ||r||² = nπ̂_1π̂_2, we have the following results (Corollary 1) for the first eigenvector and the PC scores when k = 2. Using Corollary 1, one can cluster the x_j's into two groups by the signs of the ẑ_1j's. We considered an easy example in which Δ_1 = d and Σ_1 ≠ Σ_2 but tr(Σ_1) = tr(Σ_2) = d, so that Conditions 2–4 hold. We set n_1 = 1 and n_2 = 2; that is, we took n = 3 samples with x_1 ∈ Π_1 and x_2, x_3 ∈ Π_2. In Figure 1, we displayed scatter plots of 20 independent pairs of ±û_1 when (a) d = 5, (b) d = 50, (c) d = 500, and (d) d = 5,000. We denoted r = (2/3, −1/3, −1/3)^T by the solid line and 1_n = (1, 1, 1)^T by the dotted line. We note that Angle(û_1, 1_n) = π/2 when S_D ≠ O. We observed that all the plots of ±û_1 gather on the surface of the orthogonal complement of 1_n. Also, the plots appear closer to r as d increases. Thus, one can cluster the x_j's into two groups by the signs of the ẑ_1j's.
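This example can be reproduced in a few lines. The exact specification of Σ_1 and Σ_2 used in the article is not recoverable here, so the sketch below uses an illustrative stand-in consistent with the stated constraints (μ_1 = 1_d, μ_2 = 0, Σ_1 = I_d, and a diagonal Σ_2 ≠ Σ_1 with tr(Σ_2) ≈ d), and monitors Angle(û_1, r):

```python
import numpy as np

rng = np.random.default_rng(2)
r = np.array([2/3, -1/3, -1/3])               # r for n_1 = 1, n_2 = 2 (n = 3)

for d in [5, 50, 500, 5000]:
    sd2 = np.tile([1.5, 0.5], d)[:d]          # hypothetical diagonal Sigma_2, tr ~ d; Sigma_1 = I_d
    angles = []
    for _ in range(20):
        x1 = np.ones(d) + rng.standard_normal(d)                        # x_1 from Pi_1 (mu_1 = 1_d)
        x23 = [np.sqrt(sd2) * rng.standard_normal(d) for _ in range(2)] # x_2, x_3 from Pi_2 (mu_2 = 0)
        X = np.column_stack([x1] + x23)
        Xc = X - X.mean(axis=1, keepdims=True)
        SD = Xc.T @ Xc / 2                                              # n - 1 = 2
        u1 = np.linalg.eigh(SD)[1][:, -1]                               # first eigenvector of S_D
        u1 = u1 if u1 @ r >= 0 else -u1                                 # resolve the arbitrary sign
        angles.append(np.degrees(np.arccos(np.clip(u1 @ r / np.linalg.norm(r), -1, 1))))
    print(d, round(np.mean(angles), 1))       # Angle(u_1, r) decreases as d grows
```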

PC SCORES FOR MULTICLASS MIXTURE MODEL
In this section, we consider PC scores for the k(≥ 3)-class mixture model.

Asymptotic behaviors of true PC scores when k ≥ 3
We assume the following condition (Condition 5):

Remark 7. We consider the case when all the elements of the μ_i's are constants (not depending on d), such as in the settings of Figures 3 and 4. Note that Δ_1 ≫ ⋯ ≫ Δ_{k−1} under Condition 5. We emphasize that Conditions 1–4 become strict as k increases under Condition 5.
We have the following results.

Theorem 2. Under Conditions 1 and 5, it holds that, for j = 1, …, n, the true PC scores converge in probability, as d → ∞, to class-specific points determined by the mixing proportions (Equation (6)).
Figure 3. Toy example to illustrate the asymptotic behaviors of true principal component scores when k = 3. We plotted (z_1j, z_2j), denoted by small circles when x_j ∈ Π_1, by small triangles when x_j ∈ Π_2, and by small squares when x_j ∈ Π_3. The dashed triangle consists of three vertices, (1, 0), (−1, 2^{1/2}), and (−1, −2^{1/2}), which are the theoretical convergent points.

Figure 4. Toy example to illustrate the asymptotic behaviors of true principal component scores when k = 4. We plotted (z_1j, z_2j, z_3j). The dashed triangular pyramid was given by Equation (6).

Remark 8. The consistency in Equation (3) is equivalent to Equation (6) with k = 2 and i = 1.

Corollary 2. Under Conditions 1 and 5, it holds that for
For example, when k = 3, from Equation (6), we have, for j = 1, …, n, that the first PC score separates Π_1 from the other classes and the second PC score separates Π_2 from Π_3: one can check whether x_j ∈ Π_1 or not by the first PC score and, if x_j ∉ Π_1, check whether x_j ∈ Π_2 or x_j ∈ Π_3 by the second PC score. In general, one can cluster the x_j's using at most the first k − 1 PC scores, as in the sketch below.
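As a rough illustration of this sequential scheme (not the article's formal procedure), the following sketch clusters n samples from the matrix of the first k − 1 normalized PC scores; the zero thresholds match the balanced toy geometries of Figures 3 and 4 and would need adjusting for unbalanced mixing proportions:

```python
import numpy as np

def sign_split_clusters(Z):
    """Cluster n samples from the signs of the first k-1 normalized PC scores
    (rows of Z): the i-th score singles out Pi_{i+1} among the samples that
    remain after the previous splits. Illustrative sketch with zero thresholds,
    which suit balanced mixing proportions."""
    m, n = Z.shape                      # m = k - 1
    labels = np.full(n, m)              # samples never split off fall into Pi_k
    remaining = np.arange(n)
    for i in range(m):
        pos = Z[i, remaining] > 0
        labels[remaining[pos]] = i      # zero-based: label i means Pi_{i+1}
        remaining = remaining[~pos]
    return labels
```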

REAL-DATA EXAMPLES
We demonstrate the performance of clustering, based on sample PC scores, using gene expression datasets.

Clustering when k = 2
We analyzed microarray data by Chiaretti et al. (2004), in which the dataset consists of 12,625 (= d) genes and 128 samples. The dataset has two tumor cellular subtypes, Π_1: B cell (95 samples) and Π_2: T cell (33 samples). Refer to Jeffery, Higgins, and Culhane (2006) as well. We checked the behaviors of the PC scores using several samples from the two tumor cellular subtypes. We considered three cases: (a) n = 10 samples consisting of the first five samples from each of Π_1 and Π_2 (i.e., n_1 = 5 and n_2 = 5), (b) n = 40 samples consisting of the first 20 samples from each of Π_1 and Π_2 (i.e., n_1 = 20 and n_2 = 20), and (c) n = 128 samples consisting of n_1 = 95 samples from Π_1 and n_2 = 33 samples from Π_2. In the top panels of Figure 6, we displayed scatter plots of the first two PC scores, (ẑ_1j, ẑ_2j), for (a), (b), and (c). From Corollary 1, we denoted (π̂_2/π̂_1)^{1/2} and −(π̂_1/π̂_2)^{1/2} by dotted lines. For (a), we observed that the estimated PC scores give good performances: the first PC scores gathered around (π̂_2/π̂_1)^{1/2} or −(π̂_1/π̂_2)^{1/2}. For (b), the estimated PC scores gave adequate performances except for two points from Π_2. Those two samples, the ninth and twentieth samples of Π_2, are probably outliers; in fact, the two points are far from the cluster of Π_2. The other 38 samples were perfectly classified into the two groups by the signs of the first PC scores. As for (c), although there seemed to be two clusters apart from the two outlying samples, we could not classify the dataset by the signs of the first PC scores. This is probably because π̂_1 and π̂_2 are unbalanced: from Equation (2), when the mixing proportions are unbalanced, λ_1 becomes small, so the first eigenspace is possibly affected by the other eigenspaces and the first PC scores appear in the wrong direction.

We tested the clustering after excluding the two outlying samples, using the remaining 31 samples for Π_2. We considered the following three cases for samples from Π_1: (d) the first 16 samples from Π_1, so that n_1 = 16, n_2 = 31, n = 47, and π̂_1/π̂_2 ≈ 0.5; (e) the first 31 samples from Π_1, so that n_1 = 31, n_2 = 31, n = 62, and π̂_1/π̂_2 = 1; and (f) the first 62 samples from Π_1, so that n_1 = 62, n_2 = 31, n = 93, and π̂_1/π̂_2 = 2. In the bottom panels of Figure 6, we displayed scatter plots of (ẑ_1j, ẑ_2j) for (d), (e), and (f). For (d) and (e), we observed that the estimated PC scores give good performances. As for (f), although there seemed to be two clusters, we could not classify the dataset by the signs of the first PC scores. Note that π̂_1 and π̂_2 are unbalanced in both (d) and (f); even so, the estimated PC scores worked well in (d). We obtained an estimate of the ratio of the largest eigenvalues, λ_max(Σ_1)/λ_max(Σ_2), as 1.598 by the noise-reduction methodology given by Yata and Aoshima (2012). The first eigenspace of Σ in (d) is less affected by the first eigenspaces of the Σ_i's than in (f), as Σ = π_1π_2(μ_1 − μ_2)(μ_1 − μ_2)^T + π_1Σ_1 + π_2Σ_2. This is probably the reason why the estimated PC scores gave good performances even in (d).
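For readers who wish to replicate this type of analysis, the following sketch outlines case (a); the file name and loading step are hypothetical placeholders, since data access is dataset-specific, and the reference values follow Corollary 1:

```python
import numpy as np

# Hypothetical loading step: X is the d x n expression matrix for case (a),
# that is, d = 12625 genes and the first five samples from each subtype.
X = np.loadtxt("chiaretti_expression_case_a.txt")   # placeholder file name
n1, n2 = 5, 5

d, n = X.shape
Xc = X - X.mean(axis=1, keepdims=True)
SD = Xc.T @ Xc / (n - 1)                  # n x n dual sample covariance matrix
U = np.linalg.eigh(SD)[1]
z1, z2 = np.sqrt(n) * U[:, -1], np.sqrt(n) * U[:, -2]   # (z_hat_1j, z_hat_2j)

pi1, pi2 = n1 / n, n2 / n
print("reference values:", (pi2 / pi1) ** 0.5, -(pi1 / pi2) ** 0.5)
# split by the signs of the first PC scores (the overall sign of u_1 is
# arbitrary, so the two groups may come out swapped):
groups = (z1 > 0).astype(int)
```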

Clustering when k ≥ 3
We analyzed microarray data by Bhattacharjee et al. (2001), in which the dataset consists of five lung carcinoma types with d = 3,312. We used only four classes: Π_1: pulmonary carcinoids (20 samples), Π_2: normal lung (17 samples), Π_3: squamous cell lung carcinomas (21 samples), and Π_4: adenocarcinomas (20 samples), so that n_1 = 20, n_2 = 17, n_3 = 21, and n_4 = 20. Note that Π_4 originally had 139 samples; we used only the first 20 samples from Π_4 in order to keep the sample sizes balanced with the other classes. We first considered clustering when k = 3 under the following setups: (a) the dataset consists of Π_1, Π_2, and Π_3 (n = 58); (b) the dataset consists of Π_1, Π_2, and Π_4 (n = 57); and (c) the dataset consists of Π_1, Π_3, and Π_4 (n = 61). In Figure 7, we displayed scatter plots of the first two PC scores, (ẑ_1j, ẑ_2j), for each of (a), (b), and (c). Also, we displayed the triangle given by Equation (7) with k = 3 using Theorem 3. We observed that the estimated PC scores give good performances: the three clusters gathered around the respective vertices for (a), (b), and (c). Next, we considered clustering when k = 4, using Π_i, i = 1, …, 4, so that n = 78. In Figure 8, we displayed scatter plots of the first three PC scores. The scores did not seem to converge to the theoretical convergent points given by Equation (7) in Theorem 3. This is probably because the conditions of Theorem 3 become strict as k increases (see Remark 7), so the convergence is slower than in the case k = 3 of Figure 7. However, there seemed to be four separate clusters, one for each Π_i.
Finally, we introduce an example using next-generation sequencing datasets. Shen, Shen, Zhu, and Marron (2012, 2016) gave a scatter plot of the first two PC scores for the next-generation sequencing cancer data by Wilhelm and Landry (2009), in which the dataset consists of three classes with d = 1,709 and n = 180. See Figure 9, which was given as figure 1 of Shen et al. (2012). The three clusters appear to form a triangle, as in Figure 7.
Figure 7. Scatter plots of the first two principal component scores, supposing k = 3, in the dataset of Bhattacharjee et al. (2001): (a) Π_1, Π_2, and Π_3; (b) Π_1, Π_2, and Π_4; (c) Π_1, Π_3, and Π_4. We denoted samples by small circles when x_j ∈ Π_1, by small triangles when x_j ∈ Π_2, by small squares when x_j ∈ Π_3, and by small inverted triangles when x_j ∈ Π_4. The theoretical convergent points are denoted by the vertices of the triangle.

Figure 8. Scatter plots of the first three principal component scores, supposing k = 4, in the dataset of Bhattacharjee et al. (2001): (i) (ẑ_1j, ẑ_2j); (ii) (ẑ_1j, ẑ_3j); (iii) (ẑ_1j, ẑ_2j, ẑ_3j). The dashed triangles and triangular pyramid were given by Equation (7).

Clustering: Special case
We analyzed microarray data by Armstrong et al. (2002), in which the dataset consists of three leukemia subtypes with 12,582 (= d) genes. We used two classes, Π_1: acute lymphoblastic leukemia (24 samples) and Π_2: mixed-lineage leukemia (20 samples), so that n_1 = 24, n_2 = 20, and n = 44. In Figure 10, we displayed scatter plots of the first three PC scores. We observed that the dataset is perfectly separated by the signs of the second PC scores. This figure looks completely different from Figure 6. This is probably because the largest eigenvalue, λ_max(Σ_1) or λ_max(Σ_2), is too large. When k = 2, we give the following result (Proposition 2) to explain this phenomenon: under its assumptions, there exists some positive integer i⋆ such that one can cluster the x_j's into two groups by the i⋆th PC score even when Condition 1 is not met.

Figure 10. Scatter plots of the first three principal component scores, supposing k = 2, in the dataset of Armstrong et al. (2002).

Furthermore, Proposition 2 assumes that i⋆ is distinct, in the sense that the corresponding eigenvalue is separated from the others. We estimated the largest eigenvalues using the noise-reduction methodology given by Yata and Aoshima (2012). By noting Remark 1, we considered Δ_1 as Δ_1 = ||μ′_1||² = ||μ_1 − μ_2||². We estimated Δ_1 using an unbiased estimator given by Aoshima and Yata (2014). Then, we obtained the estimates of (λ_max(Σ_1)/Δ_1, λ_max(Σ_2)/Δ_1) as (0.465, 0.787), so that Condition 1 is obviously not met. In addition, by estimating the π_i's by the π̂_i's, we had π̂_2 λ_max(Σ_2) > π̂_1 π̂_2 Δ_1. Thus, the first eigenspace of Σ is probably the first eigenspace of Σ_2, as Σ = π_1π_2(μ_1 − μ_2)(μ_1 − μ_2)^T + π_1Σ_1 + π_2Σ_2. Thus, i⋆ in Proposition 2 must be 2. This is the reason why the dataset can be separated by the signs of the second PC scores in Figure 10.
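For completeness, here is a sketch of the eigenvalue estimation step. The noise-reduction estimator below is written from our reading of Yata and Aoshima (2012) and should be checked against the original paper before use:

```python
import numpy as np

def noise_reduced_eigenvalues(SD):
    """Noise-reduction estimator, as we recall it from Yata and Aoshima (2012):
    lambda_tilde_j = lambda_hat_j - (tr(S_D) - sum_{l<=j} lambda_hat_l) / (n - 1 - j),
    for j = 1, ..., n - 2. Stated as an assumption; see the original paper."""
    n = SD.shape[0]
    lam = np.sort(np.linalg.eigvalsh(SD))[::-1]   # lambda_hat_1 >= lambda_hat_2 >= ...
    j = np.arange(1, n - 1)                       # j = 1, ..., n - 2
    return lam[:n - 2] - (np.trace(SD) - np.cumsum(lam)[:n - 2]) / (n - 1 - j)
```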

CONCLUDING REMARKS
In this article, we considered the mixture model given by Equation (1) in high-dimensional settings. We studied asymptotic properties of both the true PC scores and the sample PC scores for the high-dimensional mixture model. We gave conditions under which PCA is very effective for clustering HDLSS data. We showed theoretically that HDLSS data can be classified by the signs of the first several PC scores. However, we have to say that, in actual HDLSS data analyses, one may encounter cases such as those in Figures 6c and 10, where the dataset cannot always be classified by the signs of the first several PC scores. Several reasons should be considered: (i) actual HDLSS datasets often include several outliers, (ii) the regularity conditions are not met, and (iii) the mixing proportions π_i are quite unbalanced. Thus, we recommend the following three steps: (i) apply PCA to the HDLSS data; (ii) using the PC scores, map the dataset onto a feature space such as the first three eigenspaces; and (iii) apply general clustering methods, such as the k-means method, to the feature space. However, the number of clusters k is unknown in general. We emphasize that the first k − 1 eigenvalues are quite spiked for the model (1). Recently, Jung, Lee, and Ahn (2018) proposed a test of the number of spiked components for high-dimensional data. Thus, one may apply the test to the choice of k for clustering.
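The three-step recipe can be sketched as follows (NumPy and scikit-learn; the choice m = 3 mirrors the suggestion of using the first three eigenspaces, and k is taken as known or chosen by a spiked-components test):

```python
import numpy as np
from sklearn.cluster import KMeans

def pca_then_kmeans(X, k, m=3, seed=0):
    """The three steps recommended above: (i) dual PCA, (ii) map each sample to
    its first m normalized PC scores, (iii) k-means in that feature space."""
    d, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    SD = Xc.T @ Xc / (n - 1)                 # n x n dual sample covariance matrix
    U = np.linalg.eigh(SD)[1]
    feats = np.sqrt(n) * U[:, ::-1][:, :m]   # n x m matrix of normalized PC scores
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(feats)
```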

APPENDIX B. ADDITIONAL PROPOSITION
When Condition 5 is not met, Theorem 3 does not hold. However, in Figure 5c, we could find three separate clusters of Π_i, i = 1, 2, 3, even though Condition 5 is not met. To explain this phenomenon, we give the following result.
Proposition 3. Assume Conditions 2–4 and 6. Then, under the condition:

Proof. By noting that û_i^T 1_n = 0 for i = 1, …, k − 1 when rank(S_D) ≥ k − 1, from Equation (A13), we can conclude the result. ▪

By noting that û_i = (ẑ_i1, …, ẑ_in)^T/n^{1/2}, it follows from Proposition 3 that, for sufficiently large d, the estimated PC scores depend only on the structure of V^T V even when Condition 5 is not met. Then, as rank(V^T V) = k − 1, there must be k separate clusters for Π_i, i = 1, …, k, in the first k − 1 PC spaces, as seen in Figure 5c.