A biological evaluation of six gene set analysis methods for identification of differentially expressed pathways in microarray data.

Gene-set analysis of microarray data evaluates biological pathways, or gene sets, for their differential expression by a phenotype of interest. In contrast to the analysis of individual genes, gene-set analysis utilizes existing biological knowledge of genes and their pathways in assessing differential expression. This paper evaluates the biological performance of five gene-set analysis methods testing “self-contained null hypotheses” via subject sampling, along with the most popular gene-set analysis method, Gene Set Enrichment Analysis (GSEA). We use three real microarray analyses in which differentially expressed gene sets are predictable biologically from the phenotype. Two types of gene sets are considered for this empirical evaluation: one type contains “truly positive” sets that should be identified as differentially expressed; and the other type contains “truly negative” sets that should not be identified as differentially expressed. Our evaluation suggests advantages of SAM-GS, Global, and ANCOVA Global methods over GSEA and the other two methods.


Introduction
Analytic methods of microarray data were initially formulated to identify individual genes that are differentially expressed according to a phenotype of interest [1]. Biological inference with microarray data, however, often focuses on the identifi cation and interpretation of pathways (or gene sets) that are differentially expressed according to a phenotype. Prior to the publication of Gene Set Enrichment Analysis (GSEA) in 2003 [2], such pathway-level inference was conducted unsystematically, often subjectively and manually by investigators going through the results of an individual-gene analysis. We would like to emphasize that the methods considered in this paper focus on identifying a-priori defi ned pathways, and not searching for statistically signifi cant gene sets in Gene Ontology, by taking into account its hierarchical structure. However, gene set analysis methods considered here can be applied to a collection of a-priori defi ned gene sets from Gene Ontology.
GSEA proposed a systematic approach for evaluating gene sets for their differential expression between two classes of a phenotype. Using the Kolmogorov-Smirnov statistic, GSEA assesses the degree of "enrichment" of a set of genes (e.g. a pathway) in the entire range of the strength of associations with the phenotype of interest. GSEA has been modifi ed in 2005 [3] and has been used widely in gene-set analyses of microarray data. Following the proposal of GSEA, a number of gene-set analysis methods have been proposed.
The goal of this paper is to compare the performance of various gene-set analysis methods biologically. Our strategy for the biological comparison is to use microarray data with phenotypes that are known to be associated with certain gene sets (pathways). We used the 60 human cancer cell lines microarray expressions dataset (the NCI-60), assembled by the National Cancer Institute for anticancer drug discovery. To defi ne the phenotype, we utilized the list of mutation status for 56 of the 60 cell lines provided for each of 24 genes studied by Ikediobi et al. [4]. We restricted our attention to genes where the mutation occurred in more than 10 cell-lines so that the performances of gene-set analysis methods can be compared in a statistically meaningful manner. There were four such genes among the 24 genes: 40 cell lines with mutant p53 gene and 16 cell lines with wild-type p53 gene; 31 cell lines with mutant Cyclin-dependent kinase inhibitor 2A (CDKN2A or p16) gene and 25 cell lines with wild-type CDKN2A gene; 11 cell lines with mutant Phosphatase and tension homologue (PTEN ) gene and 45 cell lines with wild-type PTEN gene; 11 cell lines with mutant Kirsten rat sarcoma viral oncogene homolog (KRAS ) gene and 45 cell lines with wildtype KRAS gene. For the CDKN2A mutation based phenotype, for example, pathways (or gene sets) that involve, or are closely related to, CDKN2A have a clear biological basis for being differentially expressed between the mutant-class and the wildtype-class of the phenotype of interest. On the other hand, pathways (or gene sets) that are far from CDKN2A's functions have a biological basis for not being differentially expressed between the two classes of the phenotype. Using these pathways that are "biologically expected" or "biologically unexpected" to be associated with the phenotype, we can compare sensitivity (true positive rate) for identifying the "biologically expected" pathways as differentially expressed and specifi city (true negative rate) for identifying the "biologically unexpected" pathways as not differentially expressed, across various gene-set analysis methods under the framework of Receiver Operating Characteristic (ROC) Analysis [5]. Here we used three microarray datasets corresponding to phenotypes defi ned by CDKN2A, PTEN, and p53. We did not study KRAS-defi ned phenotype comparison because there were only four biologically expected gene sets for KRAS, using the defi nition of "truly positive" gene sets described in Materials and Methods, which was insuffi cient for any statistical evaluation.
The recent review of gene-set analysis methods by Goeman and Bühlmann [6] provided an excellent summary of the methods. They made an important distinction among the gene-set analysis methods: those testing "self-contained null hypotheses" via. subject sampling and those testing "competitive null hypotheses" via. gene sampling. They pointed out that the competitive hypothesis testing via. gene sampling is subject to serious errors in calculating and interpreting statistical signifi cance of gene sets, because of its implicit or explicit untenable assumption of stochastic independence across genes. Following the argument of Goeman and Bühlmann, we consider in this paper fi ve methods that test "self-contained null hypotheses" via. subject sampling. We also include GSEA in the comparison as it is the most commonly-used method of gene-set analysis to date. The fi ve methods that test "self-contained null hypotheses" via. subject sampling are: SAM-GS [7]; Global Test [8]; ANCOVA Global Test [9]; the method of Tian et al. [10]; the method of Tomfohr et al. [11]. Briefl y, SAM-GS by Dinu et al. [7] is a gene-set analysis method that extended the individual-gene analysis method of SAM to gene-set analyses. It can also be seen as a modifi cation of Dempster's Test [13] for the two-sample multivariate mean comparison under a small-sample setting where the standard Hotelling's T cannot be applied. Global Test was proposed by Goeman et al. [8] modeling and testing differential gene expression by use of randomeffects logistic regression models. Mansmann and Meister [9] proposed ANCOVA Global Test in which the roles of phenotype and genes were exchanged in the regression modeling framework of Global Test. Tian et al. [10] assessed the significance of a gene set by taking the mean of t-test statistic values of genes in the gene set as a gene-set test statistic and evaluating its signifi cance by a permutation of phenotype labels. Tomfohr et al. [11] reduced the gene set's expression into a single summary value by taking the fi rst principal component of expressions of genes in the gene set and performed a phenotype-label permutation test of the single summary. GSEA was initially proposed by Mootha et al. [2] using the Kolmogorov-Smirnov statistic to quantify the degree of "enrichment" of a set of genes in the entire range of the strength of associations with the phenotype. GSEA was later modifi ed by Subramanian et al. [3]. Although GSEA is not a method for testing "self-contained null hypotheses" via. subject sampling, we included it here for comparison, as it is the most widely-used method of gene-set analyses.
For Global Test and ANCOVA Global Test, we previously found that it is necessary to standardize gene expression to eliminate the dominance within a gene-set by its member(s) with large variances [14]. We used the following equation to standardize the gene expression scores:

Results
To compare the performance of the six methods, we used three gene-mutation based phenotypes (i.e. mutated vs. wild-type) in the NCI-60 microarray data: CDKN2A; PTEN; and p53. The gene sets accompanying the datasets were those used by Subramanian et al. [3]. For each of the three genes that defi ned the phenotype, we consider those gene sets containing that specifi c gene as "truly positive", in the sense that a good gene set analysis method should identify those gene sets as being associated with the phenotype. For CDKN2A, there were 17 such gene sets. To defi ne the "truly negative" gene sets, we search for genes whose frequency of appearance in the gene sets catalog was also 17, and identifi ed 4 such genes. When searching PUBMED for links of each of these 4 genes with CDKN2A, we found that only PRKACB (protein kinase, cAMPdependent, catalytic, beta) has no close link with CDKN2A, the gene defi ning the phenotype. We used the 17 gene sets, containing PRKACB, as "truly negative" gene sets. For the other two phenotypes, the "truly positive" and "truly negative" pathways were defi ned in the same way. For PTEN, there were 18 "truly positive" pathways; 5 other genes appeared 18 times in the gene sets catalog. We used PRKAR2B (protein kinase, cAMP-dependent, regulatory, type II, beta) to defi ne the "truly negative" pathways, as we did not fi nd any close links of this gene with PTEN, the gene defi ning the phenotype. For p53, there were 25 "truly positive" pathways; only one other gene appeared 25 times in the gene sets catalog: RAC1 (ras-related C3 botulinum toxin substrate 1). We used RAC1 to defi ne the "truly negative" pathways, as we did not fi nd any close links of this gene with p53, the gene defi ning the phenotype. In the remaining of this section, we present the performance of the six gene set analysis methods on identifying these "truly positive" and "truly negative" gene sets.
Identifi cation of "truly positive" pathways: sensitivity of the methods Table 1 shows the p-values of 17 "truly positive" pathways for differential expression by the  Table 2, although SAM-GS found more signifi cant gene sets than Global Test, and ANCOVA Global Test. The results for p53 comparison, presented in Supplementary File, Table S1 online, agreed closely for SAM-GS, Global Test, and ANCOVA Global Test, and the majority of the 25 pathways showed statistically significant differential expression by the phenotype with p-value Ͻ0.05. On the other hand, the other three gene-set analysis methods identifi ed only a few "truly positive" pathways with p-value Ͻ0.05.
Identifi cation of "truly negative" pathways: specifi city of the methods Table 3 shows the p-values of 17 "truly negative" pathways for differential expression by the phenotype, for CDKN2A mutation vs. wild-type comparison, according to the six gene-set analysis methods. Under the binomial assumption in a random sampling of 17 observations with p = 0.05 of "failure", the probability observing 0, 1, 2, 3 "failure" in 17 observations are 0.42, 0.37, 0.16, and 0.04, respectively. None of the six gene-set analysis methods showed inconsistent results with the expected false positive numbers based on the binomial assumption. The results of the PTEN mutation vs. wild type analysis are presented in Table 4 for the 18 "truly negative" pathways for differential expression by the phenotype, according to the six gene-set analysis methods. Under the binomial assumption in a random sampling of 18 observations with p = 0.05 of "failure", the probability observing 0, 1, 2, 3 "failure" in 18  Table S2 online. Under the binomial assumption in a random sampling of 25 observations with p = 0.05 of "failure", the probability observing 0, 1, 2, 3 "failure" in 25 observations are 0.28, 0.37, 0.23, and 0.09, respectively. None of the six gene-set analysis methods showed inconsistent results with the expected false positive numbers based on the binomial assumption.

ROC analysis of the performance of the six gene-set analysis methods
Since the p-value threshold of 0.05 is arbitrary, we applied the ROC analysis to the classifi cation of the "truly positive" and "truly negative" pathways that is independent of a specifi c choice of the threshold. That is, for each of the six gene-set analysis methods, we used the p-value for differential expression as the classifi er of the two classes of pathways and evaluated the classifi cation performance by the area under the ROC curve.  [14] the standardized versions of global tests are more powerful than the un-standardized versions, although the differences in performance may not be too large.
Any method that provides correct p-values can be used in conjunction with FDR methods such as the q-value method [15]. We included the FDR values for GSEA in our presentation, because it has a specifi c way of computing these values, and also because GSEA is the most popular method. Although we choose the p-values to rank the methods, we expect similar ranking based on the FDR values.

Discussion
Proper evaluation of bioinformatics methods for microarray data analysis is not simple to perform. Simulation studies are useful for evaluating properties of methods under certain simplifi ed conditions. It is, however, not possible to simulate complex correlations and noise properties that exist in real microarray data in measuring gene expression. On the other hand, evaluating data analysis methods empirically based on real microarray datasets is also subject to limitations, among which the most critical ones would perhaps be the question of generalizability of fi ndings due to the use of specifi c datasets and not knowing what the underlying true expression profi les are. In this paper, we chose to evaluate the six gene-set analysis methods empirically using three analyses with the NCI-60 microarray data, each of the three corresponding to a phenotype defined based on mutated vs. wild-type of known cancer genes: CDKN2A, PTEN, and p53. Our goal was to compare gene-set analysis methods based on biological criteria. Although our strategy does not overcome the generalizability limitation, we were able to address the issue of unknown underlying expression profi les by utilizing the phenotype defi ned by mutation of a gene and analyzing biologically expected, and unexpected, differentially expressed gene sets. The comparison of the six methods with respect to the true-positive and true-negative rates showed varying biological performance of these gene-set analysis methods, suggesting advantages of SAM-GS, Global, and ANCOVA Global methods over GSEA and the other two methods. These results are consistent to methodological features of the six methods. We provide two methodological remarks here that are relevant to the interpretation of the observed results. First, the six methods do not consider directions of the gene expression and the phenotype of interest in the same way. Both the method of Tian et al. and GSEA separate positive (over-expressed) and negative (under-expressed) associations of gene expression with the phenotype. That is, if some genes in a pathway are overexpressed and others are under-expressed by the phenotype, they work towards canceling each other in measuring the strength of association in these two methods. On the other hand, SAM-GS, Global Test, and ANCOVA Global Test take both directions of association as an indication of the association. Thus, pathways with a mixture of over-expressed and under-expressed genes are more likely to be identifi ed as being signifi cantly associated with the phenotype by these three methods than the method of Tian et al. or GSEA. The method of Tomfohr et al. initially reduces the gene expression of a pathway by taking the fi rst principal component of the pathway's gene expression without considering the phenotype. Thus, unless the direction of the fi rst principal component is the direction along which the phenotype-associated expression differences appear, the method does not capture the phenotype-associated differential gene expression. Second, the six methods are not testing the same statistical null hypothesis. In the case of GSEA, for example, the null hypothesis tested is that genes of a pathway are not clustered along the axis of an association measure, such as correlation between the phenotype and gene expression [7]. This null hypothesis does not correspond to hypotheses of biological interest: if the genes of a pathway are clustered around the correlation values of zero, for instance, GSEA would still identify such pathways as being associated with the phenotype. In SAM-GS, Global Test, and ANCOVA Global Test, the null hypothesis is properly formulated statistically in three different ways.
There is a caveat in our approach to defi ning the "truly positive" and "truly negative" gene sets. In the p53 analysis, for example, we took the gene sets that include TP53 as "biologically expected truly positive" gene sets and those that include RAC1 as "biologically expected truly negative" gene sets. We recognize that the appropriateness of the selection of these pathways may be debatable. In this example, there is one gene set, "arf pathway", overlapping between the truly positive and truly negative gene sets. None of the six methods found evidence of signifi cance for this pathway. There is no overlap in the CDKN2A analysis. In the PTEN analysis, "ins pathway" was listed as both "truly positive" and "truly negative". PTEN is a tumour suppressor involved in cell cycle progression as an inhibitor of insulin (INS) signaling.
In roundworms, a PTEN homologue has been related to development and longevity regulation through the INS-like pathway [12]. In addition to the overlapping issue, our defi nition of "truly positive" needs a remark. We considered gene sets as "truly positive" if they included the gene whose mutation defi ned the phenotype. For example, by including the CDKN2A gene that plays key biological functions in cell cycle regulations, the gene sets with CDKN2A are likely to be infl uenced by its mutation, at least partially, and, therefore, serve as biologically tenable "truly positive" gene sets. Note, however, that these gene sets may not necessarily be regulated primarily by the phenotypedefi ning gene.
In summary, our biological evaluation illustrated some appreciable performance differences among the six gene-set analysis methods. Our evaluation results are consistent throughout the three datasets in the sense that, SAM-GS, Global Test, and ANCOVA Global Test performed considerably better than the other three methods. More biologicallyoriented evaluations of microarray analysis methods are needed, including those for gene-set analysis, for identifying truly effective gene-expressionanalysis tools for biology and medicine.

Materials and Methods
The phenotypes being compared in the microarray dataset To compare performance of the six methods, we obtained the NCI-60 microarray dataset from http://discover.nci.nih.gov/cellminer [16]. The microarray experiment of the NCI-60 was conducted by hybridizing 60 cancer cell lines' mRNAs to Affy U95(A-E) chip. The expression data were normalized using RMA [17]. These arrays contain 49,064 ProbeSets and their expressions were reduced from the probe level to the gene level of 17,693 unique genes by a method described in the GSEA website [http://www.broad.mit.edu/gsea], by taking the maximum probe set expression of each gene in each sample. The mutation status of each cell line was based on the analysis of Ikediobi et al. [4]. According to Ikediobi et al. 59 of the 60 cell lines were made available, and a total of 56 independent cell lines were used for the mutation analysis: there were three pairs among the 59 cell lines that seemed to have been derived from the same individual. The synonymous pairs are the following: (a) ''breast'' cancer line NCI/ADR-RES and the ovarian cancer OVCAR-8, which have identical TP53 and ERBB2 variants and 99% genotype similarity; (b) the melanoma line M14 and the ''breast'' cancer line MDA-MB-435, which have identical BRAF, CDKN2A, and TP53 variants and 97% genotype similarity; and (c) two glioma lines SNB-19 and U251, which have identical TP53, CDKN2A, and PTEN variants and 96% genotype similarity. Lists of mutation status for each of the 56 cell lines were provided for each of 24 cancer genes studied by Ikediobi et al. [4]. We restricted our attention to four of the 24 genes where the mutation occurred in more than 10 celllines: p53 (40 mutated vs. 16 wild-type); CDKN2A (31 mutate vs. 25 wild-type); PTEN and KRAS (both 11 mutated vs. 45 wild-type). Of these four genes, we did not study KRAS-defi ned phenotype comparison because there were only four "truly positive" gene sets for KRAS, using the defi nition of "truly positive" gene sets described in the next two subsections, which was insuffi cient for any statistical evaluation. sets containing gene sets reported in manually curated databases and 50 sets containing genes reported in various experimental papers. Following the GSEA paper [3], we restricted the set size to be between 15 and 500, resulting in 310 pathways to be examined.

CDKN2A mutation vs wild-type
The phenotype of interest in this microarray experiment was defi ned by mutation of a specifi c gene, CDKN2A. Thus, gene sets that are expected on the basis of biology to be differentially expressed include those which involve CDKN2A gene as a gene set member. We considered gene sets that contain CDKN2A as a gene set member as "truly positive" gene sets. In the CDKN2A analysis, there were 17 such gene sets. A good gene-set analysis method should identify these gene sets as having differential expression between the mutant and wild-type classes. Regarding "truly negative" pathways, we searched the list of all genes on the microarray and identifi ed 4 more genes whose frequency of appearances in the 310 pathways was also 17. A search on PUBMED indicated PRKACB (protein kinase, cAMP-dependent, catalytic, beta) as the only gene among the four with no close link with CDKN2A. We, therefore, used the 17 pathways that included PRKACB as "truly negative" pathways.
As a measure of gene-set analysis performance, statistical signifi cance (p-value) of differential expression by the phenotype was computed for each of the 34 pathways by each of the six methods and tabulated. In the case of GSEA, which is not a method for testing "self-contained null hypotheses" via. subject sampling, both p-values and False Discovery Rate (FDR) provided by Subramanian et al. [3] were tabulated. In addition, we evaluated the classifi cation performance of the six methods by the ROC analysis [4]. Conditioned on the fact that the 17 "truly positive" gene sets include CDKN2A as a member, we assumed conditional independence among the 17 gene sets and estimated the sensitivity, or true-positive rate, of classifi cation (i.e. the proportion of the 17 "truly positive" gene sets which were correctly classifi ed as truly positive), given a threshold of p-value. Similarly, conditioned on the fact that the 17 "truly negative" gene sets include PRKACB as a member, we assumed conditional independence among the 17 "truly negative" gene sets and estimated the specifi city, or true-negative rate, of classifi cation (i.e. the proportion of the 17 "truly negative" gene sets which were correctly classifi ed as truly negative), given a threshold of p-value. By varying the threshold, we can draw an ROC curve, a curve that goes through points of (sensitivity, 1-specifi city) across the whole range of classifi cation threshold. The area under the ROC curve of each of the six methods was calculated by Mann-Whitney Statistic, and pairwise differences of the area under the ROC curve among the six methods were tested by the method of DeLong et al. [18]. Detailed calculations of the confi dence intervals and pairwise differences tests are illustrated on a working example in the above mentioned reference. We used STATA to run these analyses.

PTEN mutation vs wild-type
The phenotype of interest in this analysis was defi ned by mutation of PTEN. We considered gene sets that contain PTEN as a member as "truly positive" gene sets. In the PTEN dataset, there were 18 such gene sets. Regarding "truly negative" gene sets, we searched the list of all genes on the microarray and identifi ed 5 more genes whose frequency of appearances in the 310 pathways was also 18. A search on PUBMED indicated PRKAR2B (protein kinase, cAMP-dependent, regulatory, type II, beta) as the only gene among the fi ve with no close link with PTEN. We, therefore, used the 18 gene sets that included PRKAR2B as "truly negative" gene sets. The ROC analysis was conducted analogous to the CDKN2A analysis.
TP53 mutation vs wild-type The phenotype of interest in this analysis was defi ned by mutation of TP53. We considered gene sets that contain TP53 as a member as "truly positive" gene sets . In the p53 analysis, there were 25 such gene sets. Regarding "truly negative" gene sets, we searched the list of all genes on the microarray and identifi ed one more gene whose frequency of appearances in the 310 pathways was also 25, the same frequency as TP53. This gene was RAC1 (ras-related C3 botulinum toxin substrate 1) for which we did not fi nd close link with p53. We, therefore, used the 25 pathways that included RAC1 as "truly negative" gene sets. The ROC analysis was conducted analogous to the CDKN2A analysis.

Contributions
ID and YY developed the SAM-GS methodology and designed/conducted the methodological study. TM, GE, KF, GJ, and PH performed the mouse transplant microarray study and introduced the gene-set analysis problem to ID and YY, including the GSEA methodology. JP provided biological interpretations of the analysis results of three realworld datasets. QL and AA contributed significantly to data analysis, refi nement of SAM-GS, and programming. The manuscript was written primarily by ID, YY, and JP, and critically reviewed and revised by all authors. All authors read and approved the fi nal manuscript.