Effective Methodologies for Statistical Inference on Microarray Studies



Introduction
A common feature of high-dimensional data such as genetic microarrays is that the data dimension is extremely high while the sample size is relatively small. Data of this type are called high-dimension, low-sample-size (HDLSS) data. HDLSS data present substantial challenges that force us to reconsider existing methods in multivariate statistical analysis. Unfortunately, most conventional methods are known to break down in HDLSS situations, and alternative methods are often highly sensitive to the curse of dimensionality. In this chapter, we present modern statistical methodologies that are very effective for drawing statistical inference from HDLSS data. We focus on a series of effective HDLSS methodologies developed by Aoshima and Yata (2011) and Yata and Aoshima (2009, 2010a,b, 2011a,b). We demonstrate how well those methodologies perform and how they bring new insight into research on prostate cancer. In Section 2, we first consider principal component analysis (PCA) for microarray data in order to visualize a data structure with tens of thousands of dimensions by projecting it onto a low-dimensional PC space. We note that classical PCA cannot sufficiently visualize the latent structure of microarray data because of the curse of dimensionality. We overcome the difficulty with the help of the cross-data-matrix (CDM) methodology developed by Yata and Aoshima (2010a,b). Next, in Section 3, we consider effective clustering for microarray data. We apply the CDM methodology to estimating the principal component (PC) scores. We show that a clustering method based on the CDM-based first PC score effectively classifies individuals into two groups. We demonstrate accurate clustering by using the prostate cancer data given by Singh et al. (2002). Further, in Section 4, we consider effective classification for microarray data. We pay special attention to the quadratic-type classification methodology developed by Aoshima and Yata (2011). We give a sample size determination for the classification so that the misclassification rates are controlled by a prespecified upper bound. We examine how well the classification methodology performs by using several microarray data sets. Finally, in Section 5, we consider a variable selection procedure to select a set of significant variables from microarray data. In most gene expression studies, it is important to select relevant genes for classification so that researchers can identify the smallest possible set of genes that still achieves good predictive performance. We implement the two-stage variable selection procedure, developed by Aoshima and Yata (2011), that provides screening of variables in the first stage. We then select a significant set of associated variables from among the candidate variables in the second stage. We show that the selection procedure assures high accuracy by eliminating redundant variables. We identify predictive genes that classify patients according to disease outcomes for prostate cancer.

PCA for high-dimension, low-sample-size data
Suppose we have a $p \times n$ data matrix $X = [x_1, \ldots, x_n]$ with $p > n$, where $x_k = (x_{1k}, \ldots, x_{pk})^T$, $k = 1, \ldots, n$, are independent and identically distributed as a $p$-dimensional distribution having mean $\mu$ and positive-definite covariance matrix $\Sigma$. The eigen-decomposition of $\Sigma$ is given by $\Sigma = H \Lambda H^T$, where $\Lambda$ is a diagonal matrix of eigenvalues $\lambda_1 \ge \cdots \ge \lambda_p\ (> 0)$ and $H = [h_1, \ldots, h_p]$ is a matrix of corresponding eigenvectors. Then, $Z = \Lambda^{-1/2} H^T (X - [\mu, \ldots, \mu])$ is considered as a $p \times n$ sphered data matrix from a distribution with zero mean and the identity covariance matrix. Here, we write $Z = [z_1, \ldots, z_p]^T$ and $z_j = (z_{j1}, \ldots, z_{jn})^T$, $j = 1, \ldots, p$. We assume that the fourth moments of each variable in $Z$ are uniformly bounded and $\|z_j\| \neq 0$ for $j = 1, \ldots, p$, where $\|\cdot\|$ denotes the Euclidean norm. We note that the multivariate distribution assumed here does not have to be a normal distribution, $N_p(\mu, \Sigma)$, and the random variables in $Z$ do not have to be regulated by a $\rho$-mixing condition. We consider a general setting as follows:
$$\lambda_j = a_j p^{\alpha_j}\ (j = 1, \ldots, m) \quad \text{and} \quad \lambda_j = c_j\ (j = m+1, \ldots, p). \qquad (1)$$
Here, $a_j\ (>0)$, $c_j\ (>0)$ and $\alpha_j$ ($\alpha_1 \ge \cdots \ge \alpha_m > 0$) are unknown constants preserving the ordering $\lambda_1 \ge \cdots \ge \lambda_p$, and $m$ is an unknown positive integer. The model (1) is an extension of the multi-component model, or spiked covariance model, given by Johnstone and Lu (2009). It is a quite general model for high-dimensional data. For example, the mixture model given by (6) in Section 3 yields the model (1), as shown in (7). One would also find the model (1) in highly correlated, high-dimensional data analyses such as graphical models, high-dimensional regression models, and so on.
The sample covariance matrix is given by $S = (n-1)^{-1} \sum_{k=1}^{n} (x_k - \bar{x})(x_k - \bar{x})^T$, where $\bar{x} = \sum_{k=1}^{n} x_k / n$; in HDLSS situations it is convenient to work with the $n \times n$ dual matrix $S_D = (n-1)^{-1} (X - \bar{X})^T (X - \bar{X})$, where $\bar{X} = [\bar{x}, \ldots, \bar{x}]$. Note that $S_D$ and $S$ share non-zero eigenvalues. Let $\hat{\lambda}_1 \ge \cdots \ge \hat{\lambda}_{n-1}\ (\ge 0)$ be the eigenvalues of $S_D$. Let us write the eigen-decomposition of $S_D$ as $S_D = \sum_{j=1}^{n-1} \hat{\lambda}_j \hat{u}_j \hat{u}_j^T$, where the $\hat{u}_j$'s are the eigenvectors corresponding to the $\hat{\lambda}_j$'s, with $\|\hat{u}_j\| = 1$ and $\hat{u}_i^T \hat{u}_j = 0$ ($i \neq j$).
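Since only the $n \times n$ dual matrix is needed, the eigenvalues $\hat{\lambda}_j$ are cheap to compute even for very large $p$. The following Python sketch (a helper of ours, not code from the chapter) computes them without ever forming the $p \times p$ matrix $S$:

```python
import numpy as np

def dual_sample_eigenvalues(X):
    """Eigenvalues of the n x n dual matrix S_D for a (p, n) matrix X.

    S_D shares its non-zero eigenvalues with the p x p sample
    covariance matrix S, so the p x p matrix is never formed.
    """
    p, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)   # center each variable
    S_D = Xc.T @ Xc / (n - 1)                # n x n dual matrix
    lam = np.linalg.eigvalsh(S_D)[::-1]      # descending order
    return lam[: n - 1]                      # at most n - 1 are non-zero
```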

Naive PCA in HDLSS situations
Yata and Aoshima (2009) gave sufficient conditions for the consistency property of the sample eigenvalues: for $j = 1, \ldots, m$, it holds that $\hat{\lambda}_j / \lambda_j \to_p 1$ under the conditions: (YA-i) $p \to \infty$ and $n \to \infty$ for $j$ such that $\alpha_j > 1$; (YA-ii) $p \to \infty$ and $p^{2-2\alpha_j}/n \to 0$ for $j$ such that $\alpha_j \in (0, 1]$. Here, $\to_p$ denotes convergence in probability. If $z_{jk}$, $j = 1, \ldots, p$ ($k = 1, \ldots, n$), are independent, the above conditions sharpen to necessary and sufficient conditions as follows: (YA-i') $p \to \infty$ and $n \to \infty$ for $j$ such that $\alpha_j > 1$; (YA-ii') $p \to \infty$ and $p^{2-2\alpha_j}/n \to 0$ for $j$ such that $\alpha_j \in (0, 1]$. For the details, including the limiting distribution of $\hat{\lambda}_j$, see Yata and Aoshima (2009). If the population distribution is $N_p(\mu, \Sigma)$, one may consider that $z_{jk}$, $j = 1, \ldots, p$ ($k = 1, \ldots, n$), are independent. When $\alpha_j > 1$, the sample size $n$ is free of $p$ in (YA-i) or (YA-i'). However, when $\alpha_j \in (0, 1]$, naive PCA runs into difficulty in HDLSS data situations in view of (YA-ii) or (YA-ii'). Consider the simple case that $p = 10000$, $\lambda_1 = p^{1/2}$ and $\lambda_2 = \cdots = \lambda_p = 1$. Then, we observe from (YA-ii) that we need $n \gg p^{2-2\alpha_1} = p = 10000$. This makes naive PCA quite inconvenient for the experimenter in HDLSS data situations.
Remark 2.2. The condition (ii) given by Theorem 2.1 (or Theorem 2.2) is a sufficient condition for the case of $\alpha_j \in (0, 1/2]$. If more information is available about the population distribution, the condition (ii) can be relaxed to give consistency under a broader range of $(p, n)$. For example, when the population distribution is $N_p(\mu, \Sigma)$, the asymptotic properties hold under the broader range of $(p, n)$ given by (ii) of Corollary 2.1 (or Corollary 2.2).

Remark 2.3.
In view of Theorem 2.1 compared with (2), the CDM methodology successfully relaxes the condition for the case that $\alpha_j > 1/2$. The conditions given by Theorem 2.1 are not continuous in $\alpha_j$ at $\alpha_j = 1/2$. On the other hand, the conditions given by Corollaries 2.1 and 2.2 are continuous in $\alpha_j$. When we apply the CDM methodology, we simply divide $X$ into $x_1, \ldots, x_{n_{(1)}}$ and $x_{n_{(1)}+1}, \ldots, x_n$ as in (3). In general, there exist $\binom{n}{n_{(1)}}$ ways to divide $X$ into $X_1$ and $X_2$. The CDM methodology can be generalized as follows, as sketched in the code below:

[Generalized cross-data-matrix (GCDM) methodology]

(Step 1) Set the iteration number $T$. Set $t = 1$.
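As a concrete illustration, the following Python sketch implements one division of the CDM idea, estimating eigenvalues from the singular values of the cross data matrix between two halves of the sample, together with GCDM-style averaging over $T$ random divisions. The function names and the exact normalization are our reading of Yata and Aoshima (2010a,b), not code from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def cdm_eigenvalues(X1, X2):
    """CDM estimates of eigenvalues: singular values of the cross
    data matrix between the two halves X1 (p x n1) and X2 (p x n2)."""
    n1, n2 = X1.shape[1], X2.shape[1]
    X1c = X1 - X1.mean(axis=1, keepdims=True)
    X2c = X2 - X2.mean(axis=1, keepdims=True)
    SD1 = X1c.T @ X2c / np.sqrt((n1 - 1) * (n2 - 1))  # n1 x n2 cross matrix
    return np.linalg.svd(SD1, compute_uv=False)       # descending order

def gcdm_eigenvalues(X, n1, T=100):
    """GCDM: average the CDM estimates over T random divisions of X."""
    p, n = X.shape
    acc = np.zeros(min(n1, n - n1))
    for _ in range(T):
        idx = rng.permutation(n)
        lam = cdm_eigenvalues(X[:, idx[:n1]], X[:, idx[n1:]])
        acc += lam[: acc.size]
    return acc / T
```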

Performances
We observed that naive PCA requires a sample size $n$ that depends on $p$ when $\alpha_j \in (1/2, 1]$ in (2). On the other hand, the CDM methodology allows the experimenter to choose $n$ free of $p$ for the case that $\alpha_j > 1/2$, as in Theorem 2.1 or Corollary 2.1. The CDM methodology might thus make it possible to give feasible estimation of eigenvalues for HDLSS data with extremely small $n$ compared to $p$. We first considered a normal distribution case. Independent pseudorandom normal observations were generated from $N_p(0, \Sigma)$ with $p = 1600$. We considered the case that $\lambda_1 = p^{2/3}$, $\lambda_2 = p^{1/3}$ and $\lambda_3 = \cdots = \lambda_p = 1$ in (1). We used samples of size $n = 20(20)120$ to define the data matrix $X: p \times n$ for the calculation of $S_D$ in naive PCA, whereas we divided each sample into $X_1: p \times n_{(1)}$ and $X_2: p \times n_{(2)}$ for the calculation of $S_{D(1)}$ in the CDM methodology. The findings were obtained by averaging the outcomes of $R = 1000$ replications. Suppose that the $r$-th replication yields the estimates $\hat{\lambda}_{jr}$ and $\tilde{\lambda}_{jr}$ ($r = 1, \ldots, R$) of $\lambda_j$ from naive PCA and the CDM methodology, respectively, and write $\hat{\lambda}_j = \sum_{r=1}^{R} \hat{\lambda}_{jr}/R$ and $\tilde{\lambda}_j = \sum_{r=1}^{R} \tilde{\lambda}_{jr}/R$. We considered the two quantities A: $\hat{\lambda}_j/\lambda_j$ and B: $\tilde{\lambda}_j/\lambda_j$. Figure 1 shows the behaviors of both A and B for the first two eigenvalues. Judging from the behavior of A, naive PCA does not give feasible estimation within this range of $n$: the sample size was not large enough to use the eigenvalues of $S_D$ in such a high-dimensional space. In view of the behavior of B, on the other hand, the CDM methodology gave surprisingly reasonable estimates for such HDLSS data sets and seems to perform excellently, as expected from the theory. Next, we considered a non-normal distribution case.
Independent pseudorandom observations were generated from a $p$-variate $t$-distribution, $t_p(0, \Sigma, \nu)$, with mean zero, covariance matrix $\Sigma$ and degrees of freedom $\nu = 15$. We considered the case that $\lambda_1 = p^{2/3}$, $\lambda_2 = p^{1/3}$ and $\lambda_3 = \cdots = \lambda_p = 1$ in (1), as before. We fixed the sample size at $n = 60$ and set the dimension to $p = 400(400)2000$. As in Figure 1, the findings were obtained by averaging the outcomes of 1000 replications. Figure 2 shows the behaviors of the two quantities A: $\hat{\lambda}_j/\lambda_j$ and B: $\tilde{\lambda}_j/\lambda_j$ for the first two eigenvalues. Again, the CDM methodology seems to perform excellently, as expected from the theory: one can observe the consistency of $\tilde{\lambda}_j$ for all $p = 400(400)2000$. We conducted simulation studies for other settings as well and verified the superiority of the CDM methodology over naive PCA in various HDLSS data situations.
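A reduced version of the normal-case experiment can be reproduced with the sketch below, reusing `dual_sample_eigenvalues` and `cdm_eigenvalues` from the earlier sketches. The diagonal covariance ($\Sigma = \Lambda$), the sample size $n = 40$, and the 100 replications are simplifications of ours:

```python
import numpy as np

rng = np.random.default_rng(1)

p, n = 1600, 40
# Spiked eigenvalues as in model (1): lambda_1 = p^(2/3), lambda_2 = p^(1/3), rest 1.
lam_true = np.ones(p)
lam_true[0], lam_true[1] = p ** (2 / 3), p ** (1 / 3)

ratios_naive, ratios_cdm = [], []
for _ in range(100):                       # fewer replications than the chapter's 1000
    X = rng.standard_normal((p, n)) * np.sqrt(lam_true)[:, None]
    lam_hat = dual_sample_eigenvalues(X)   # naive PCA (earlier sketch)
    lam_til = cdm_eigenvalues(X[:, : n // 2], X[:, n // 2:])  # one CDM division
    ratios_naive.append(lam_hat[:2] / lam_true[:2])
    ratios_cdm.append(lam_til[:2] / lam_true[:2])

print("A (naive):", np.mean(ratios_naive, axis=0))   # drifts above 1
print("B (CDM):  ", np.mean(ratios_cdm, axis=0))     # close to 1
```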

Clustering for high-dimension, low-sample-size data
Suppose we have a mixture model to classify a data set into two groups. We assume that each observation is sampled with mixing proportions $w_j$ from two populations, $\Pi_1$ and $\Pi_2$, and that the label of the population is missing. We consider a mixture model whose p.d.f. (or p.f.) is given by
$$f(x) = w_1 f_1(x) + w_2 f_2(x), \qquad (6)$$
where the $w_j$'s are positive constants such that $w_1 + w_2 = 1$ and $f_i$ is the p.d.f. (or p.f.) of $\Pi_i$ having mean vector $\mu_i$ and covariance matrix $\Sigma_i$. Let $\mu$ and $\Sigma$ be the mean vector and the covariance matrix of the mixture model. Then, we have that
$$\mu = w_1 \mu_1 + w_2 \mu_2 \quad \text{and} \quad \Sigma = w_1 \Sigma_1 + w_2 \Sigma_2 + w_1 w_2 (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T.$$
We suppose that $x_k$, $k = 1, \ldots, n$, are independently taken from (6) and define a $p \times n$ data matrix $X = [x_1, \ldots, x_n]$. Let $\lambda_{11}$ and $\lambda_{21}$ be the largest eigenvalues of $\Sigma_1$ and $\Sigma_2$, and write $\Delta = \|\mu_1 - \mu_2\|^2$. We assume that $\Delta = c p^{\beta}$, where $c$ and $\beta$ are positive constants, and that $\lambda_{11}/\Delta \to 0$ and $\lambda_{21}/\Delta \to 0$ as $p \to \infty$. Then, as for the largest eigenvalue, $\lambda_1$, of $\Sigma$ and the corresponding eigenvector, $h_1$, we have that
$$\lambda_1 = w_1 w_2 \Delta \{1 + o(1)\} \quad \text{and} \quad h_1 = \pm (\mu_1 - \mu_2) \Delta^{-1/2} \{1 + o(1)\} \quad \text{as } p \to \infty. \qquad (7)$$
We note from (7) that the mixture model given by (6) has the form of the model (1). Moreover, taking the sign of $h_1$ so that $h_1^T(\mu_1 - \mu_2) > 0$, the first PC score $s_{1k} = h_1^T(x_k - \mu)$ concentrates around $w_2 \Delta^{1/2}$ when $x_k$ comes from $\Pi_1$ and around $-w_1 \Delta^{1/2}$ when $x_k$ comes from $\Pi_2$. Thus one would be able to classify the data set $\{x_1, \ldots, x_n\}$ into two groups if $s_{1k}$ is effectively estimated in HDLSS data situations. In the rest of this section, we borrow the symbols of Section 2.
One can calculate the singular vectors $\tilde{u}_{j(i)}$ from the eigenvectors of $S_{D(i)} S_{D(i)}^T$. Let $\mathrm{MSE}(\tilde{s}_j) = n^{-1} \sum_{k=1}^{n} (\tilde{s}_{jk} - s_{jk})^2$ denote the sample mean-square error of the $j$-th PC score estimates. Note that $\mathrm{Var}(s_{jk}) = \lambda_j$. Then, Yata and Aoshima (2010b) gave the following properties of the CDM-based PC scores.
The CDM-based PC scores can be generalized as follows:

[GCDM methodology for PC scores]

(Step 1) Set the iteration number $T$. Set $t = 1$.

(Step 7)
If t < T, put t = t + 1 and go to Step 2; otherwise go to Step 8.

(Step 8)
Estimate the $j$-th PC score of $x_k$ by $\tilde{s}_{jk(T)} = \sum_{t=1}^{T} \tilde{s}_{jkt}/T$ for each $j$ and $k$.
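The following Python sketch shows one way to realize these steps: estimating first PC scores from each random division via the cross data matrix, averaging over $T$ divisions, and clustering by the sign of the averaged first PC score. The score scaling is our reading of Yata and Aoshima (2010b), and the sign-alignment heuristic is ours, so treat this as an illustration rather than the authors' exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

def gcdm_pc_scores(X, n1, T=1000, j_max=1):
    """GCDM estimates of the first j_max PC scores (illustrative).

    For each random division, the left and right singular vectors of
    the cross data matrix give PC directions for the two halves; the
    scores are rescaled by the estimated eigenvalue and averaged.
    """
    p, n = X.shape
    scores = np.zeros((j_max, n))
    for _ in range(T):
        idx = rng.permutation(n)
        i1, i2 = idx[:n1], idx[n1:]
        X1c = X[:, i1] - X[:, i1].mean(axis=1, keepdims=True)
        X2c = X[:, i2] - X[:, i2].mean(axis=1, keepdims=True)
        SD = X1c.T @ X2c / np.sqrt((len(i1) - 1) * (len(i2) - 1))
        U, lam, Vt = np.linalg.svd(SD)
        for j in range(j_max):
            u, v = U[:, j], Vt[j]
            if u.sum() < 0:          # fix the arbitrary sign per division
                u, v = -u, -v
            scores[j, i1] += np.sqrt((len(i1) - 1) * lam[j]) * u
            scores[j, i2] += np.sqrt((len(i2) - 1) * lam[j]) * v
    return scores / T

# Cluster by the sign of the averaged first PC score, e.g.:
# labels = (gcdm_pc_scores(X, n1=9, T=1000)[0] > 0).astype(int)
```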

Demonstration
We analyzed the gene expression data on prostate cancer given by Singh et al. (2002); refer to Pochet et al. (2004) for details of the data set. The data set consisted of 12600 ($= p$) genes and 34 microarrays, with 9 samples from Normal Prostate and 25 samples from Prostate Tumors. As for Prostate Tumors, we chose the first 9 samples, giving 18 ($= n$) microarrays in total: 9 samples from Normal Prostate and 9 samples from Prostate Tumors. We assumed the mixture model given by (6) for the data set. We defined the data matrix $X: 12600 \times 18$, set $(n_{(1)}, n_{(2)}) = (9, 9)$ and $T = 1000$, and focused on the first three PC scores. We randomly divided $X$ into $X_1: 12600 \times 9$ and $X_2: 12600 \times 9$, and calculated $\tilde{s}_{jkt}$, $k = 1, \ldots, 18$, for $j = 1, 2, 3$. According to the GCDM methodology, we repeated this operation $T = 1000$ times and obtained $\tilde{s}_{jk(T)}$, $k = 1, \ldots, 18$; $j = 1, 2, 3$, as the estimate of the $j$-th PC score of $x_k$. We also obtained $(\tilde{\lambda}_{1(T)}, \tilde{\lambda}_{2(T)}, \tilde{\lambda}_{3(T)}) = (2.77 \times 10^8, 1.62 \times 10^8, 6.34 \times 10^7)$. Figure 3 gives the scatterplots of the first three PC scores on the (PC1, PC2) and (PC1, PC3) planes. As observed in Figure 3, Normal Prostate (blue points) and Prostate Tumors (red points) are separated clearly, especially along the first PC score (PC1) axis: all the first PC scores of the Normal Prostate samples are negative, whereas those of the Prostate Tumors samples are positive. This observation is theoretically supported by the arguments in Section 3.1.

Classification for high-dimension, low-sample-size data
Suppose we have two independent, $p$-dimensional populations, $\Pi_i$, $i = 1, 2$, having unknown mean vector $\mu_i = (\mu_{i1}, \ldots, \mu_{ip})^T$ and unknown positive-definite covariance matrix $\Sigma_i$ for each $i$. We do not assume that $\Sigma_1 = \Sigma_2$. The eigen-decomposition of $\Sigma_i$ ($i = 1, 2$) is given by $\Sigma_i = H_i \Lambda_i H_i^T$, where $\Lambda_i$ is the diagonal matrix of eigenvalues $\lambda_{i1} \ge \cdots \ge \lambda_{ip}$ and $H_i$ is a matrix of the corresponding eigenvectors. For a $p \times n_i$ data matrix $X_i = [x_{i1}, \ldots, x_{in_i}]$ from $\Pi_i$, $Z_i = \Lambda_i^{-1/2} H_i^T (X_i - [\mu_i, \ldots, \mu_i])$ is considered as a $p \times n_i$ sphered data matrix from a distribution with zero mean and the identity covariance matrix. Here, we write $Z_i = [z_{i1}, \ldots, z_{ip}]^T$ and $z_{ij} = (z_{ij1}, \ldots, z_{ijn_i})^T$, $j = 1, \ldots, p$. Note that $E(z_{ijl}^2) = 1$ and $E(z_{ijl} z_{ij'l}) = 0$ for $i = 1, 2$; $j\ (\neq j') = 1, \ldots, p$; $l = 1, \ldots, n_i$. We assume that $\lambda_{ip} > 0$ ($i = 1, 2$) as $p \to \infty$ and that the fourth moments of each variable in $Z_i$ are uniformly bounded. In this section, we assume the following condition for the $\Pi_i$'s: (A-i) $z_{ijl}$, $j = 1, \ldots, p$, are independent for $i = 1, 2$.
One of the population distributions satisfying (A-i) is $N_p(\mu_i, \Sigma_i)$. We also assume the following condition for the $\Sigma_i$'s as necessary:

Discriminant rule for HDLSS data
Let $x_0$ be an observation vector on an individual belonging to $\Pi_1$ or to $\Pi_2$. Having recorded $x_{i1}, \ldots, x_{in_i}$ from each $\Pi_i$, we estimate $\mu_i$ and $\Sigma_i$ by $\bar{x}_{in_i} = \sum_{j=1}^{n_i} x_{ij}/n_i$ and $S_{in_i} = \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{in_i})(x_{ij} - \bar{x}_{in_i})^T/(n_i - 1)$.
Note that $E\big(\|\bar{x}_{1m} - \bar{x}_{2m}\|^2 - \sum_{i=1}^{2} \mathrm{tr}(S_{im})/m\big) = \|\mu_1 - \mu_2\|^2$. Let $z_{\alpha_\star}$ denote the upper $\alpha_\star$ point of the standard normal distribution, with $\alpha_\star \in (0, 1/2)$. From the fact that $\Delta \ge \|\mu_1 - \mu_2\|^2$, we design a lower bound of $\Delta$ by
$$\Delta_L = \|\bar{x}_{1m} - \bar{x}_{2m}\|^2 - \sum_{i=1}^{2} \mathrm{tr}(S_{im})/m - z_{\alpha_\star} \sqrt{\widehat{\mathrm{Var}}\big(\|\bar{x}_{1m} - \bar{x}_{2m}\|^2\big)}$$
for sufficiently small $\alpha_\star$, so that $\Delta \ge \Delta_L$ holds with probability approximately $1 - \alpha_\star$ or more.

Demonstration
We analyzed the gene expression data given by Chiaretti et al. (2004), whose data set consisted of 12625 ($= p$) genes and 128 samples. Note that the expression measures were obtained with the three-step robust multichip average (RMA) preprocessing method; refer to Pollard et al. (2005) for details. The data set had two tumor cellular subtypes, $\Pi_1$: B-cell (95 samples) and $\Pi_2$: T-cell (33 samples). We set $\alpha = 0.1$, $\beta = 0.02$ and $m = 6$. Our goal was to construct a discriminant rule ensuring both $1 - e(2|1) \ge 0.9$ and $1 - e(1|2) \ge 0.98$ whenever $\Delta \ge \Delta_L$. First, we took the first 6 samples from each $\Pi_i$ as a pilot sample. According to Remark 4.5, we calculated $\|\bar{x}_{1m} - \bar{x}_{2m}\|^2 - \sum_{i=1}^{2} \mathrm{tr}(S_{im})/m = 1890$ and $\widehat{\mathrm{Var}}(\|\bar{x}_{1m} - \bar{x}_{2m}\|^2) = 87860$. Setting $\alpha_\star = 0.01$, so that $z_{\alpha_\star} = 2.33$, we designed the lower bound of $\Delta$ as $\Delta_L = 1890 - 2.33\sqrt{87860} \approx 1199$. According to (14), the total sample sizes were given by $(N_1, N_2) = (10, 6)$. So, we took the next 4 ($= N_1 - m$) samples from $\Pi_1$; on the other hand, since $N_2 = m$, we did not take additional samples from $\Pi_2$. We then calculated $\hat{\gamma}$ from $\mathrm{tr}(S_{1N_1})$ and $\mathrm{tr}(S_{2N_2})$. We constructed the three discriminant rules by using the common data sets of sizes $(N_1, N_2) = (10, 6)$. In Table 2, we investigated their performances by using the remaining samples of sizes $(95 - N_1, 33 - N_2) = (85, 27)$ as test data sets. We observed that the discriminant rule given by (15)
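As a numeric check of the design step above, the sketch below recomputes $\Delta_L$ from the pilot quantities reported in the text; the helper function is ours, and the two-stage sample-size formula (14) is not reproduced since it is not shown here:

```python
import math

def delta_lower_bound(corrected_diff_sq, var_hat, z_alpha=2.33):
    """Lower confidence bound Delta_L from pilot-sample quantities.

    corrected_diff_sq : ||x1m - x2m||^2 - sum_i tr(S_im)/m  (already
                        bias-corrected, as reported in the text).
    var_hat           : estimated Var(||x1m - x2m||^2).
    """
    return corrected_diff_sq - z_alpha * math.sqrt(var_hat)

print(delta_lower_bound(1890, 87860))   # about 1199
```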

Variable selection for high-dimension, low-sample-size data
Suppose we have two independent, $p$-dimensional populations, $\Pi_i$, $i = 1, 2$, having unknown mean vector $\mu_i = (\mu_{i1}, \ldots, \mu_{ip})^T$ and unknown positive-definite covariance matrix $\Sigma_i$ for each $i$. We do not assume $\Sigma_1 = \Sigma_2$. We consider an effective methodology to select a significant set of associated variables from among high-dimensional data sets. That is, we consider testing the hypotheses
$$H_{0j}: \mu_{1j} = \mu_{2j} \quad \text{vs.} \quad H_{1j}: \mu_{1j} \neq \mu_{2j}, \qquad j = 1, \ldots, p, \qquad (16)$$
simultaneously. Our interest is to select the set of significant variables $D = \{j : \mu_{1j} \neq \mu_{2j}\}$. Assume that $|D| = S$ for some $S \ge 1$, where $|D|$ denotes the number of elements of the set $D$. A variable selection procedure $\hat{D}$ maps the data into subsets of $\{1, \ldots, p\}$. We are interested in designing $\hat{D}$ to ensure both that the asymptotic family-wise error rate (FWER) is 0, i.e., $P(\hat{D} \cap D^c \neq \emptyset) \to 0$, and that the asymptotic average power (AP) is 1, i.e., $E(|\hat{D} \cap D|)/S \to 1$.

Variable selection procedure for HDLSS data
Let $\sigma_i = \max_{1 \le j \le p} \sigma_{(i)j}$ ($i = 1, 2$), where $\sigma_{(i)j}$, $j = 1, \ldots, p$, are the diagonal elements of $\Sigma_i$. We assume that $\sigma_{(i)j} < \infty$ for $i = 1, 2$; $j = 1, \ldots, p$, and that $E\{\exp(t |x_{ijl} - \mu_{ij}| / \sigma_{(i)j}^{1/2})\} < \infty$, $i = 1, 2$; $j = 1, \ldots, p$, for some $t > 0$. For testing the hypotheses (16), we take samples $x_{il} = (x_{i1l}, \ldots, x_{ipl})^T$, $l = 1, \ldots, n_i$, from each $\Pi_i$ and compute $\bar{x}_{ij} = \sum_{l=1}^{n_i} x_{ijl}/n_i$ for each variable $j$; each hypothesis is then tested by checking whether $|\bar{x}_{1j} - \bar{x}_{2j}|$ exceeds a designed threshold. A two-stage variable selection procedure is attractive here: screening of the variables in the first stage, followed by selection of a significant set of associated variables from among the candidate variables in the second stage. Aoshima and Yata (2011) proposed the following effective methodology:

(Step 1) Take pilot samples of size $m$ from each $\Pi_i$ and screen the $p$ variables to obtain a set $\tilde{D}$ of candidate variables; see the sketch after Step 2. Let $\tilde{S} = |\tilde{D}|$. Define the additional sample size $N$ for each $\Pi_i$ as a function of $\tilde{S}$, where $\xi \in (0, 1]$ and $\varepsilon \in (0, 1]$ are chosen constants (the explicit formula is given in Aoshima and Yata, 2011).

(Step 2)
Regarding $j \in \tilde{D}$, take new samples $x_{ijl}$, $l = m+1, \ldots, m+N$, from each $\Pi_i$. Then, test the hypothesis for each $j \in \tilde{D}$ by comparing $|\bar{x}_{1j} - \bar{x}_{2j}|$, recomputed from all $m + N$ samples, with a second-stage threshold, and define $\hat{D}$ as the set of variables in $\tilde{D}$ that pass the test. Select the variables in $\hat{D}$.
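To make the two-stage structure concrete, here is a minimal Python sketch under simplifying assumptions of ours: both stages threshold squared mean differences, the first-stage cutoff is deliberately loose so the screen keeps all truly significant variables, and the additional sample size is a placeholder rather than the chapter's formula (which is not reproduced above). The function and parameter names are hypothetical:

```python
import numpy as np

def two_stage_select(X1_pilot, X2_pilot, draw_more, delta, slack=0.5):
    """Illustrative two-stage variable selection.

    X1_pilot, X2_pilot : (m, p) pilot samples from Pi_1 and Pi_2.
    draw_more          : callback (i, N) -> (N, p) extra samples from Pi_i.
    delta              : we seek variables with |mu_1j - mu_2j|^2 > delta.
    slack              : looseness of the first-stage screen (our choice).
    """
    m, p = X1_pilot.shape
    diff = X1_pilot.mean(axis=0) - X2_pilot.mean(axis=0)
    cand = np.where(diff ** 2 > slack * delta)[0]   # stage 1: loose screen
    N = m  # placeholder; the chapter sets N from the screened set's size
    X1 = np.vstack([X1_pilot, draw_more(1, N)])     # stage 2: fresh samples
    X2 = np.vstack([X2_pilot, draw_more(2, N)])
    diff2 = X1.mean(axis=0) - X2.mean(axis=0)
    return cand[diff2[cand] ** 2 > delta]           # final selection
```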

Demonstration
We analyzed the gene expression data on prostate cancer given by Singh et al. (2002). The data were pre-processed as described in Jeffery et al. (2006). The data set consisted of 12600 ($= p$) genes and two groups, $\Pi_1$: Normal Prostate (50 samples) and $\Pi_2$: Prostate Tumors (52 samples).

Variable selection procedure
We set $\delta = 1.5$; our goal was to find the variables $j$ such that $|\mu_{1j} - \mu_{2j}|^2 > 1.5$. We chose the pilot sample size for each $\Pi_i$ as $m = 18\ (= O(\log p))$ and took the first 18 samples from each $\Pi_i$ as pilot samples, as given in Table 3.
Regarding $j \in \tilde{D}$, we took additional samples $x_{ijl}$, $l = m+1, \ldots, m+N$, of size $N = 18$ from each $\Pi_i$, as given in Table 4. The required sample size in the two-stage variable selection procedure was thus $m + N = 36$ for each $\Pi_i$. On the other hand, the required sample size in the single-stage variable selection procedure given by (20) with (19) was $n_i \ge (\log p)^{1+\zeta}/\delta = 59.43$ with $\zeta = 1.0$, as computed below. The two-stage variable selection procedure thus allows the experimenter to reduce the cost of sampling in the second stage.
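For reference, the single-stage bound works out as follows with $p = 12600$, $\zeta = 1.0$ and $\delta = 1.5$ (natural logarithm assumed):
$$n_i \;\ge\; \frac{(\log p)^{1+\zeta}}{\delta} \;=\; \frac{(\log 12600)^2}{1.5} \;\approx\; \frac{(9.44)^2}{1.5} \;\approx\; 59.43,$$
so about 60 samples per group would be required, against $m + N = 36$ per group for the two-stage procedure.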

Classification after variable selection
In Section 4, we considered a two-stage discriminant procedure in HDLSS data situations. In some cases, the experimenter encounters a situation in which the required sample size, $N_i$, is much larger than the available sample size: this happens when $\Delta = \|\mu_1 - \mu_2\|^2 + (\mathrm{tr}(\Sigma_1) - \mathrm{tr}(\Sigma_2))^2 / \max_{i=1,2}\{2\,\mathrm{tr}(\Sigma_i)\}$ is much smaller than the $\mathrm{tr}(\Sigma_i^2)$'s. In that case, we recommend that the experimenter consider classification based only on the selected variables. We selected the set of significant variables $\hat{D} = \{556, 7412, 8662, 11552\}$ given in (26). We set $n_1 = n_2 = m + N = 36$, where $m$ and $N$ were given in Section 5.2.1. Let us write the selected 4-variable data as $x_{il} = (x_{i556l}, x_{i7412l}, x_{i8662l}, x_{i11552l})^T$, $i = 1, 2$, for the $l$-th sample. Then, we considered a typical quadratic discriminant rule that classifies $x_0$ into $\Pi_1$ if
$$(x_0 - \bar{x}_{1n_1})^T S_{1n_1}^{-1} (x_0 - \bar{x}_{1n_1}) + \log|S_{1n_1}| < (x_0 - \bar{x}_{2n_2})^T S_{2n_2}^{-1} (x_0 - \bar{x}_{2n_2}) + \log|S_{2n_2}| \qquad (27)$$
and into $\Pi_2$ otherwise, where $x_0$ is an observation vector of the 4 variables on an individual belonging to $\Pi_1$ or to $\Pi_2$, $\bar{x}_{in_i} = \sum_{l=1}^{n_i} x_{il}/n_i$ and $S_{in_i} = \sum_{l=1}^{n_i} (x_{il} - \bar{x}_{in_i})(x_{il} - \bar{x}_{in_i})^T/(n_i - 1)$, $i = 1, 2$. We compared the discriminant rule given by (27) after variable selection with the rules given by (10) with $\gamma = 0$, DLDR and DQDR. Note that the three competitors were constructed from the original (12600-variable) data without variable selection. In Table 5, we investigated their performances by using test data sets consisting of the $50 - n_1 = 14$ remaining samples from $\Pi_1$ and the $52 - n_2 = 16$ remaining samples from $\Pi_2$. We observed that the discriminant rule given by (10) with $\gamma = 0$ performed badly for $x_0$ belonging to $\Pi_1$: Normal Prostate. We inspected the condition of Theorem 4.1 for these data sets and found, according to Remark 4.4, that $\max_{i=1,2}\{\mathrm{tr}(S_{in_i(1)} S_{in_i(2)})\}/(\hat{\Delta}^2 \min_{i=1,2}\{n_i\}) = 0.15$, so that $\max_{i=1,2}\{\mathrm{tr}(\Sigma_i^2)\}/(\Delta^2 \min_{i=1,2}\{n_i\})$ seems not to be sufficiently small. This may be the reason why Theorem 4.1 is not applicable to the present data sets. On the other hand, we observed that the discriminant rule given by (27)
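For reference, a typical quadratic discriminant rule of the form just described can be coded in a few lines. This is the standard equal-prior QDA computation on the selected variables, not the chapter's rule (10); the function name is ours:

```python
import numpy as np

def qda_classify(x0, X1, X2):
    """Classify x0 by a typical (equal-prior) quadratic rule.

    X1, X2 : (n_i, d) training samples of the d selected variables
    (d = 4 genes in the demonstration). Returns 1 or 2.
    """
    def score(x, X):
        xbar = X.mean(axis=0)
        S = np.cov(X, rowvar=False)          # unbiased sample covariance
        d = x - xbar
        _, logdet = np.linalg.slogdet(S)
        return d @ np.linalg.solve(S, d) + logdet
    return 1 if score(x0, X1) < score(x0, X2) else 2
```

With $n_i = 36$ samples and only $d = 4$ variables, the sample covariance matrices are well conditioned, which is precisely why classification after variable selection avoids the HDLSS difficulty discussed above.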