Dual KS: Defining Gene Sets with Tissue Set Enrichment Analysis

Background: Gene set enrichment analysis (GSEA) is an analytic approach that simultaneously reduces the dimensionality of microarray data and enables ready inference of the biological meaning of observed gene expression patterns. Here we invert the GSEA process to identify class-specific gene signatures. Because our approach uses the Kolmogorov-Smirnov approach both to define class-specific signatures and to classify samples using those signatures, we have termed this methodology "Dual-KS" (DKS).
Results: The optimum gene signature identified by the DKS algorithm was smaller than those identified by the other methods to which it was compared in 5 out of 10 datasets. The estimated error rate of DKS using the optimum gene signature was smaller than the estimated error rate of the random forest method in 4 out of the 10 datasets, and was equivalent in two additional datasets. DKS performance relative to the other benchmarked algorithms was similar to its performance relative to random forests.
Conclusions: DKS is an efficient analytic methodology that can identify highly parsimonious gene signatures useful for classification in the context of microarray studies. The algorithm is available as the dualKS package for R as part of the bioconductor project.


Background
Gene set enrichment analysis (GSEA), first described by Subramanian et al., 1 is an analytic approach that simultaneously reduces the dimensionality of microarray data and enables ready inference of the biological meaning of observed gene expression patterns. GSEA entails grouping genes together into either empirically or theoretically defined signatures. Combining genes improves the signal-to-noise ratio of the data, since the random variations in multiple genes tend to cancel each other out. This decrease in variability favors smaller sample sizes and more efficient study design. Furthermore, to the extent that the signatures are defined in some systematic fashion, GSEA facilitates inference as to the biological processes that are relevant to the phenotype being studied. This approach has been used by us and others to identify novel biologic characteristics of tumor samples. 2-5 Here we extend this analytic methodology to the task of tissue diagnosis. We invert the GSEA process described above to identify class-specific gene signatures that may subsequently be used for sample classification. Rather than looking for bias of a subset of genes in an ordered list of expression levels for a given sample, we look for bias of a subset of samples in an ordered list of samples for a given gene. By applying this methodology across the genome, we can identify those genes whose expression is most strongly biased in a given subset of samples. This can be conceptualized as "tissue-set enrichment analysis".
The approach to quantifying gene set enrichment described by Subramanian et al utilizes a variation of the Kolmogorov Smirnov rank-sum statistic (KS). KS measures how biased a subset of items is in the ordered list of all items. 1,6 In the context of GSEA, genes are sorted based on expression level for a given sample, and then KS measures the extent to which the genes in a given signature occur early or late in that ordered list as opposed to being randomly distributed throughout the list. We use tumor subtype to define the subsets and the result is a gene signature defining a specific tumor class. These signatures can then be applied to classification of unknown samples using the traditional GSEA approach. Since the GSEA approach measures the extent to which a set of genes is highly expressed in a given sample relative to all genes, it is important that the selected genes satisfy both of the following requirements: 1. The selected genes are specific: they are over-(or under-) expressed in the class of interest relative to other classes and 2. The selected genes are distinctive: The selected genes have the highest (or lowest) expression in samples of the class of interest among all over-(or under-) expressed genes.
Because our approach uses the KS approach both to define class specific signatures and to classify samples using those signatures, we have termed this methodology "Dual-KS" (DKS). Here we compare our algorithm to alternative classification methods and also examine the effects of several variations to the algorithm. Software implementing the algorithm is available as the dualKS package through the bioconductor project (http://www.bioconductor.org).
The approach has several advantages. First, it is well suited to identifying signatures amenable to use for GSEA-style classification. Second, it is noniterative and deterministic; i.e. it is computationally efficient and produces exactly the same gene set from run to run. Third, it produces highly parsimonious gene signatures amenable to downstream validation. Small gene sets are desirable for focusing subsequent laboratory investigation, or when clinical assay development is contemplated using low- or medium-throughput assay technologies. Finally, the algorithm is well suited to identifying gene signatures that are unique to a particular class of samples even where more than two classes are considered (the "multi-class case"), as is true, for example, of renal tumors, for which there are many histological subtypes to be distinguished.
Among alternative approaches to discriminant analysis and classification, we would like to highlight the gene selection methodology described by Diaz-Uriarte and Alvarez de Andres, 7 which is an adaptation of the random forest algorithm first described by Breiman. 8 This approach is also designed to identify highly parsimonious gene signatures, including unique gene signatures in the multi-class case. They have previously benchmarked their algorithm against several common alternatives, namely support vector machines (SVM), 9 K-nearest neighbors (KNN) with and without variable selection, 10 diagonal linear discriminant analysis (DLDA), 10 and nearest shrunken centroids (SC). 11 We are indebted to them for their thorough treatment of these alternative methodologies, and we refer here to their benchmarking results for comparison.
There are a variety of other alternatives that could be considered. A more basic approach to discriminant analysis is to use statistical tests such as t-tests or Wilcoxon rank-sum tests to compare expression levels of each gene between two groups of samples. However, in the multi-class case, it is not always obvious which pairwise comparisons are biologically most meaningful. For example, if classes A, B, and C are to be compared, should the comparisons be A vs. C and B vs. C, or A vs. B+C and B vs. A+C? And if the former type of comparison is made, the identified gene signatures may not be unique, since the same gene may distinguish both A from C and B from C. More sophisticated methodologies such as biclustering 12 are well suited to the multi-class case, but may likewise identify genes characteristic of several (though not all) of the possible classes. The COPA algorithm, initially developed to identify candidate chromosomal translocations, 13 can identify gene pairs that are deregulated relative to normal or other reference samples in mutually exclusive subsets of disease-related samples. However, these subsets have traditionally been data driven, i.e. identified based on expression levels of the gene pair in question and not based on phenotype as identified by the investigator. Presumably the approach could be adapted to sample classes fixed a priori by the investigator, in which case it could identify discriminant gene pairs for up to two classes of samples relative to reference samples. Finally, the PPST algorithm 14 is based on quantile scores of gene expression values in normal and disease tissues. While designed for the two-class case, this algorithm has the unique feature of identifying genes that are very high in one subset of the disease samples relative to normal, but very low in a separate subset.
While we do not benchmark DKS against these algorithms (because they do not identify unique gene signatures, are not designed to address the multi-class case, and/or do not describe a unique, analogous methodology for classification of new samples after gene selection), each has important strengths and represents a useful alternative in the appropriate experimental context.
In the balance of this paper we describe in detail our algorithm and its variants. We then estimate its error rate and compare it to the previously published methods mentioned above. As will become apparent, no single methodology is suitable in every circumstance. However, our DKS algorithm can efficiently produce extremely small yet highly robust gene signatures in many situations and therefore we propose that it is worthy of consideration for inclusion in the gene expression analysis workflow.

Results and Discussion

Algorithm
Identification of discriminant genes
Given a G × N gene expression matrix X for G genes and N samples and a classification vector Y = (y_1, …, y_N), where y_j is the biological classification of sample j, we calculate the matrix U as follows. For each gene i, the N samples are sorted in decreasing order of their expression values in row i, and scoring function (1) accumulates a running sum over the ordered samples that rises when a sample of a given class is encountered and falls otherwise. The result is a G × K matrix U, such that for each gene we have the maximum of the running sum for each class. By sorting the genes based on decreasing u_il for a given class l, we can identify those genes that are most upwardly biased in a specific class in terms of their ordered expression levels.
Conversely, to identify the degree of downregulation of each gene in each class, for each gene i we sort its N expression values in increasing order and compute the analogous scoring function (2). The result is a G × K matrix D, such that for each gene we have the maximum of the running sum for each class when the expression values are sorted in increasing order. By sorting the genes based on decreasing d_il for a given class, we can identify those genes that are most downwardly biased in a specific class in terms of their ordered expression levels.
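The two scoring passes can be sketched in Python (a minimal illustration rather than the package's R implementation; the step sizes of +1/n_l for samples of class l and −1/(N − n_l) for all other samples are an assumed normalization):

```python
import numpy as np

def ks_scores(X, y, decreasing=True):
    """KS-style running-sum scores for every gene and class.

    X is a G x N expression matrix, y the length-N class labels.
    Returns {class: length-G vector of maximum running sums}.
    With decreasing=True this yields U (upward bias); with
    decreasing=False it yields D (downward bias).
    """
    G, N = X.shape
    y = np.asarray(y)
    scores = {}
    for l in sorted(set(y)):
        member = (y == l)
        n_l = member.sum()
        # assumed increments: rise for class-l samples, fall otherwise
        step = np.where(member, 1.0 / n_l, -1.0 / (N - n_l))
        s = np.empty(G)
        for i in range(G):
            # order samples by expression of gene i
            order = np.argsort(-X[i]) if decreasing else np.argsort(X[i])
            s[i] = np.cumsum(step[order]).max()
        scores[l] = s
    return scores
```

The difference U − D described below would then be computed per class, e.g. `ks_scores(X, y, True)[l] - ks_scores(X, y, False)[l]`.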
One limitation of this approach is that, for a given gene, some samples of a given class may occur very early in the ordered list (leading to an elevated value of u), while other samples of that class occur very late in the list (leading to an elevated value of d). In the worst-case scenario, half may occur at the very beginning of the list and half at the very end; this is arguably the worst possible classifier. To penalize such genes, we take as our final score the difference between u and d. Therefore, we denote the matrix U_Δ = U − D = (u_il − d_il), and for each class l select the genes with the highest scores in column l of the matrix U_Δ. These genes are selected because their expression in class l is higher than in other classes.
Likewise, we denote the matrix D_Δ = D − U = (d_il − u_il) and for each class l select the genes with the highest scores in column l of this matrix. To guarantee equal weight to each class in the classification stage described later, we select an equal number of genes for each class. We identify the optimum number of genes per class through cross-validation.
Variation 1: Weighted dualKS score
Another potential pitfall of our approach may occur when a gene has high expression in a given class relative to other classes, but not relative to other genes. Since we will subsequently calculate KS statistics gene-wise (rather than sample-wise) in the classification step, it is important that the selected genes not only have biased expression in the class of interest, but also be among the highest (or lowest) expressed genes. To favor genes that satisfy both requirements, as opposed to only the first, we can weight each gene according to its average expression in a particular class. We define the weight for gene i and class l as w_il = log(G/R_il), where G is the total number of genes and R_il is the rank of gene i's average expression in class l among the G genes' average expression values in class l. As an illustration, suppose we have a 3 × 3 matrix of class-average expression values (G = 3 genes, K = 3 classes).
Here, for example, for gene i = 1 and class l = 1, R_11 is 2 because the rank of the class-average 823.64 in the first column is 2. Consequently, w_11 = log(3/2). Based on this weighting strategy, genes with the highest absolute expression in a given class are more likely to be included in the signature for that class. The weighted scoring functions û and d̂ incorporate these weights, and the corresponding weighted score matrices are denoted Û_Δ and D̂_Δ.
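The weight computation can be sketched as follows (illustrative Python; taking rank 1 as the highest class-average, a natural logarithm, and applying the weights multiplicatively to the KS scores are assumptions on top of the text):

```python
import numpy as np

def dks_weights(avg_expr):
    """w_il = log(G / R_il) for a G x K matrix of class-average expression.

    R_il is the rank of gene i's class-l average among all G genes,
    with rank 1 assumed to be the highest average, so the most highly
    expressed gene in a class receives the largest weight, log(G).
    """
    G, K = avg_expr.shape
    order = np.argsort(-avg_expr, axis=0)   # column-wise, highest first
    ranks = np.empty_like(order)
    for l in range(K):
        ranks[order[:, l], l] = np.arange(1, G + 1)
    # the weighted scores would multiply each gene's KS score by w_il
    return np.log(G / ranks)
```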

Variation 2: Rescaled dualKS score
As an alternative to weighting, we also investigated the utility of rescaling the calculated KS statistic by an empirically determined scaling factor such that the range of possible scores is constrained to fall between 0 and 1. The scaling factor is simply the maximum score obtained for a given signature among the training samples. When the KS statistic is divided by this scaling factor, arbitrary differences in expression level between signatures are eliminated.
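A minimal sketch of the rescaling step (assuming non-negative enrichment scores):

```python
def rescale(train_scores, new_score):
    """Divide a signature's KS score by the maximum score that the
    signature achieved on the training samples, constraining rescaled
    scores to a comparable 0-1 range and removing arbitrary
    between-signature differences in expression level."""
    return new_score / max(train_scores)
```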

Classification
Once a suitable gene signature has been identified, new samples may be classified based on their expression values x = (x_1, …, x_n; x_{n+1}, …, x_{2n}), where (x_1, …, x_n) and (x_{n+1}, …, x_{2n}) denote the expression values of the upregulated and downregulated gene signatures, respectively. Here, the enrichment score of each class-specific signature is determined, and the sample is assigned to the class whose signature achieves the highest score for that sample. Specifically, for each class l we define the signature for that class as the t highest-scoring genes from column l of U_Δ and of D_Δ (or, in the weighted case, of Û_Δ and D̂_Δ), denoted u*_l and d*_l for the up- and downregulated genes, respectively. As stated above, to guarantee equal weight to each class, we select an equal number of genes for each class. Therefore, n = t × K, where t is determined empirically.
We then sort the n upregulated gene expression values (x_1, …, x_n) in decreasing order and compute the KS enrichment score of each class-specific signature over this ordered list; the downregulated signature values are scored analogously.
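The classification rule can be sketched as follows (illustrative Python, not the dualKS R code; signatures are represented here as gene-index sets, and the KS increments follow the same assumed normalization as in the training sketch):

```python
import numpy as np

def classify(x, signatures):
    """Assign a sample to the class whose signature is most enriched.

    x is the sample's length-G expression vector; signatures maps each
    class to (up_idx, down_idx), the indices of its up- and
    downregulated signature genes.  Upregulated genes are rewarded for
    appearing early in the expression-sorted gene list, downregulated
    genes for appearing late.
    """
    G = len(x)
    order = np.argsort(-np.asarray(x))   # genes, highest expression first

    def ks(idx, ranked):
        member = np.isin(ranked, idx)
        n = member.sum()
        step = np.where(member, 1.0 / n, -1.0 / (G - n))
        return np.cumsum(step).max()

    best, best_score = None, -np.inf
    for cls, (up, down) in signatures.items():
        score = ks(up, order) + ks(down, order[::-1])
        if score > best_score:
            best, best_score = cls, score
    return best
```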

Testing
In order to test our algorithm, we downloaded published data 6,15-22 that had been used in a previous study benchmarking classification methodologies. 7 We used the same error rate estimation algorithm that was used in that paper (the 0.632+ bootstrap), 23 allowing direct comparison of our algorithm to the algorithms tested previously. Error rates were estimated for each of the 10 benchmarking datasets using the default, weighted, and rescaled versions of our dualKS algorithm, with upregulated gene signatures ranging from 5 to 50 genes per class in increments of 5. Note that this range and increment were selected for illustration purposes. In principle the range could start at 1 and end at a number other than 50, and an increment of 1 would be ideal because we use the optimum gene signature to estimate the error rate; we use the coarser grid here to make the comparison of DKS variants (Fig. 1) clearer. With this range and increment, for most of the datasets, increasing the size of the gene signature above 15 genes per class did not improve error rates, and in some cases (Fig. 1) larger signatures performed more poorly than smaller ones, suggesting overfitting. The three variations of the DKS algorithm were roughly equivalent in terms of error rates for 6 of the 10 datasets. Where differences were observed, the rescaled variant always performed well, with the default algorithm performing more poorly in three of the 10 datasets, and the weighted algorithm performing more poorly in three datasets as well.
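The 0.632+ combination of resubstitution and bootstrap error can be sketched as follows (following Efron and Tibshirani's published formula; gamma denotes the no-information error rate, and tie-handling details are simplified):

```python
def err_632plus(err_train, err_boot, gamma):
    """0.632+ bootstrap error estimate.

    err_train: resubstitution (training-set) error
    err_boot:  leave-one-out bootstrap error
    gamma:     no-information error rate
    """
    err_boot = min(err_boot, gamma)      # cap at the no-information rate
    denom = gamma - err_train
    # relative overfitting rate R, clipped to [0, 1]
    R = (err_boot - err_train) / denom if denom > 0 else 0.0
    R = min(max(R, 0.0), 1.0)
    w = 0.632 / (1.0 - 0.368 * R)
    return (1.0 - w) * err_train + w * err_boot
```

When there is no overfitting (err_boot equals err_train), the weight w reduces to the familiar 0.632.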
The estimated error rate of DKS using the optimum gene signature was smaller than the estimated error rate of random forest in 4 out of the 10 datasets, and was equivalent in two additional datasets (Table 1). DKS outperformed the other benchmarked algorithms to a similar degree.
The optimum gene signature size for each dataset was identified as the gene set size that achieved the lowest estimated error rate (Table 2). One advantage of the random forest algorithm is that it favors parsimonious gene signatures. Nevertheless, the optimum gene signature identified by the DKS algorithm was smaller than the random forest gene signature in 5 out of 10 datasets.

Implementation, availability and requirements
The DKS algorithm has been implemented in the dualKS package for R. The package performs both training (gene signature identification) and classification. These two steps are separated such that the identified gene signatures can be exported for use in other applications, and signatures identified by other methodologies can be utilized in the KS-based classification implemented in the package. The package also defines a custom plot function to generate summary classification plots as shown in Figure 2.
The dualKS package is freely available through the bioconductor project (http://www.bioconductor.org/packages/2.3/bioc/html/dualKS.html). The package is open source and can run on any operating system that is capable of running R.

Conclusions
The DKS algorithm performs discriminant analysis based on tissue-set enrichment, analogous to the gene set enrichment used for classification. As such, it is a natural and intuitive choice for defining class-specific gene sets to be used in subsequent GSEA-type analyses. Furthermore, DKS can efficiently identify highly parsimonious gene signatures amenable to downstream validation. While no algorithm demonstrates superior performance in every context, DKS is an attractive methodology worthy of consideration for inclusion in the microarray analysis workflow.

Authors' Contributions
EJK conceived of the project. YY and EJK developed the algorithms, wrote the software, performed the analyses, and drafted the manuscript. NE and ZZ assisted with analysis. NE and BTT provided support and guidance for the project and reviewed the manuscript.