SCIA: A Novel Gene Set Analysis Applicable to Data With Different Characteristics

Gene set analysis is commonly used in functional enrichment and molecular pathway analyses. Most of the present methods are based on the competitive testing methods which assume each gene is independent of the others. However, the false discovery rates of competitive methods are amplified when they are applied to datasets with high inter-gene correlations. The self-contained testing methods could solve this problem, but there are other restrictions on data characteristics. Therefore, a statistically rigorous testing method applicable to different datasets with various complex characteristics is needed to obtain unbiased and comparable results. We propose a self-contained and competitive incorporated analysis (SCIA) to alleviate the bias caused by the limited application scope of existing gene set analysis methods. This is accomplished through a novel permutation strategy using a priori biological networks to selectively permute gene labels with different probabilities. In simulation studies, SCIA was compared with four representative analysis methods (GSEA, CAMERA, ROAST, and NES), and produced the best performance in both false discovery rate and sensitivity under most conditions with different parameter settings. Further, the KEGG pathway analysis on two real datasets of lung cancer showed that the results found by SCIA in both of the two datasets are much more than that of GSEA and most of them could be supported by literature. Overall, SCIA promisingly offers researchers more reliable and comparable results with different datasets.


INTRODUCTION
In recent years, gene set analysis (GSA) has become the most common method in functional genomics studies, because evaluating a single p-value for a gene set is statistically more powerful than genewise tests. Typically, by choosing gene sets that represent biological pathways, GSA can help to bring insights into biological mechanisms, cellular functions, and disease states (Kanehisa et al., 2012). Various statistical procedures for gene set testing have been proposed and can be divided into three generations roughly in chronological order (Khatri et al., 2012;Zyla et al., 2017).
The first generation of GSA used over-representation analysis (ORA), where the first step is to define differentially expressed genes (DEGs) and non-DEGs in the input gene list by a certain threshold (Beissbarth and Speed, 2004). Then, the proportion of DEGs between a given functional gene set and the background gene set are tested by hypergeometric, binomial, or chi-square distribution. This comparison of the DEG proportions is the original theory of competitive testing. ORA has been reported with minor variations by many different authors (Khatri and Draghici, 2005). Even though the ORA method seems simple and effective, there are two serious drawbacks. First, the information about the strength of gene expression is lost by gene binarization. Second, the assumption of inter-gene independence needed by the testing methods is not satisfied in most cases.
The second generation of GSA, known as functional class sorting (FCS), was proposed to avoid these deficiencies. Instead of defining genes as DEGs and non-DEGs, different univariate gene-level statistics such as t-statistic (Al-Shahrour et al., 2005;Tian et al., 2005), Q-statistic (Goeman et al., 2004), signalto-noise ratio (Subramanian et al., 2005), fold change score and Z-score (Kim and Volsky, 2005), or their trans-formations (Tian et al., 2005;Ackermann and Strimmer, 2009) are used to measure DEGs and overcome the first problem of ORA. Then, a gene-set-level statistic is aggregated by these gene-level statistics. Aggregation approaches can be sum, mean, median of the gene-level statistics (Jiang and Gentleman, 2007), or calculating statistics such as the Kolmogorov-Smirnov statistic (Mootha et al., 2003;Subramanian et al., 2005), Wilcoxon rank sum (Barry et al., 2005), or the max-mean statistic (Efron and Tibshirani, 2006). Because the distributions of gene-set-level statistics are usually unknown, permutation procedures are used to complete FCS tests. According to different null hypotheses and corresponding permutation objects, FCSs can be classified as competitive or self-contained methods.
Assuming that all the input genes are independent of each other, competitive methods usually permute gene labels but lose the inter-gene information, which causes the false discovery rate (FDR) to be uncontrolled when the inter-gene correlations are high. Self-contained methods test each gene set independently by permuting sample labels but lose all the information outside the given gene set, which causes the FDR to be uncontrolled when the percentage of DEGs in the background genes is high. Irrespective of the prerequisites for the permutation procedure, the ORA methods can be considered as generalized competitive methods, whereas the classical methods based on multiple linear regression (Mansmann and Meister, 2005;Kong et al., 2006), by definition, are special cases of self-contained methods.
To address the second problem of ORA, some competitive FCS methods that take account of the correlations among genes have been proposed. The method of Nam (2010) removed the bias caused by the inter-gene correlations, while the method of Wu and Smyth (2012) alleviated the problem by estimating Abbreviations: GSA, gene set analysis; ORA, over-representation analysis; DEGs, differentially expressed genes; FCS, functional class sorting; FDR, false discovery rate; CSSPN, condition-specific shortest-path network; SCIA, self-contained and competitive incorporated analysis. the variance inflation factor. However, the information of intergene correlations is partially neglected in these procedures, which causes reduced sensitivity or uncontrolled FDR. Self-contained FCS methods seem to be more powerful than competitive ones and do not assume that all the genes are independent, but their null hypothesis is usually over restrictive (Goeman et al., 2004;Tian et al., 2005;Khatri et al., 2012). They assume that the gene set does not contain any genes with expression levels that are associated with different experimental conditions. Under this hypothesis, a few DEGs may cause a given pathway to be defined as a significant differential pathway (Khatri et al., 2012). Although the method of Wu et al. (2010) moderated this hypothesis using a Monte Carlo based testing method, the parameter describing the least proportion of DEGs in a pathway is given arbitrarily instead of calculated by the expression of genes outside the gene set. Even though competitive methods are overwhelmingly more commonly used than self-contained methods in the genomic literature (Gatti et al., 2010), information is still lost during the permutation procedures. Thus, the collision of applicable scopes between self-contained and competitive methods remains unsolved.
The third generation of GSA, known as the pathway topology (PT)-based approach, is based on the large amount of publicly available pathway knowledge. Mitrea et al. (2013) introduced dozens of PT-based methods with different principles and applicable conditions. Most of these methods consider topological information as a weight that measures the centrality of nodes but ignores the spatiotemporal specificity of topological information and changes in the topological structure between different experimental conditions (Fang et al., 2012;Gu et al., 2012;Dona et al., 2017). On this basis, the method of Yuan et al. (2016) proposed a novel statistic that combines node (gene expression) changes with edge (inter-gene correlation) changes. The utilization of biological information greatly improved the performance of PT-based methods, however, the testing methods of them are essentially the same as FCS methods in that they perform the same pipeline (Mitrea et al., 2013). Therefore, the above defects of FCS methods are not solved by PTbased methods.
Here, we propose a new GSA method with less information loss that can alleviate the bias of self-contained and competitive methods caused by their limited applicability. First, to capture all the information within a given gene set like other self-contained methods, a powerful multivariate statistic C is developed to test node changes and edge changes simultaneously. We chose Hotelling's T 2 , a self-contained statistic with the ability to penalize gene collinearity (Ackermann and Strimmer, 2009), for node testing because of its suitability for overcoming the limitation of competitive methods, and linear regression to test the edge changes among genes. Because of the additivity of chisquare distributed variables, these two statistics are transformed to the chi-square scale and summed up to get the C statistic. Second, we developed a novel permutation procedure based on a condition-specific shortest-path network (CSSPN, proposed by Dezso et al., 2009). The genes in the CSSPN are selectively permuted instead of permuting the whole gene labels as usual. This procedure does not disrupt inter-gene correlations but uses inter-pathway information from a priori biological networks, which creates a platform for the incorporation of self-contained, competitive, and PT-based methods. The whole pipeline is called self-contained and competitive incorporated analysis (SCIA), which has been implemented in an R package "SCIA" available on GitHub https://github.com/YiqunLiHIT/SCIA. Results from this study showed that the sensitivity and FDR of SCIA outperform four other commonly used GSA methods in most conditions in simulated datasets and the results are more stable with different real datasets of lung cancer.

Notations and Background Network
The main objective of SCIA is to detect gene sets that are differentially expressed under different experimental conditions. Here, we consider the gene set as pathway P for one experimental condition and P ′ for another. N 1 and N 2 are the sample size for P and P ′ , respectively. For convenience, we assumed that P and P ′ are under linear models: . , X n − X n ′ denote the vector of difference in the means of two groups. S and S ′ are the covariance matrices of P and P ′ , respectively. These notations are also used in the simulation studies.
We chose the background network of CSSPN as the Human Protein Reference Database (HPRD) network (Library et al., 2009), a centralized platform to visually depict and integrate information pertaining to do-main architecture, posttranslational modifications, interaction networks, and disease associations for each protein in the human proteome. Other comprehensive networks, such as the integrated network of seven common used networks in Edge Set Enrichment Analysis (Han et al., 2015) can also be used as the background network of SCIA.

C Statistic
The C statistic is proposed to measure the difference of a given gene set in different experimental conditions. It consists of two parts, the node difference model and the edge difference model. The node difference model is based on Hotelling's T 2 method: Under the self-contained null hypothesis H 0 : U = 0, T 2 follows a chi-square distribution with degrees of freedom equal to n representing genes in the given pathway with a sufficient sample size. This allows Hotelling's T 2 statistic to be combined with other statistics that also follow a chi-square distribution, because chi-square distributions are additive on the freedoms. There are many transformations of Hotelling's T 2 statistic which show its different characteristics. It can be transformed as: following an F distribution with the degree of freedom of n and N 1 + N 2 − n − 1 under a relatively small sample size. This allows Hotelling's T 2 statistic to be used alone when the sample size is insufficient. Typically, Hotelling's T 2 test is not only a node testing method but is related to the Pearson correlation coefficient. For convenience, assuming n = 2 and N 2 is big enough, the estimated value X i ′ (1 ≤ i ≤ 2) can be considered as constants µ i (1 ≤ i ≤ 2), then Hotelling's T 2 statistic can be transformed as: where t 1 and t 2 denote the t-statistics for the two component genes, and ρ represents the Pearson correlation coefficient between X 1 and X 2 . If t 1 = t 2 , Hotelling ′ s T 2 statistic can be simplified to: This transformation of T 2 indicates that when X 1 and X 2 are positively correlated and have similar changes in different experimental conditions, there would be a penalty on the Pearson correlation coefficient, which can avoid the disadvantages of the competitive methods. When X 1 and X 2 are negatively correlated but both have positive changes in different experimental conditions, which indicates that the correlation of X 1 and X 2 has changed in different experimental conditions, the T 2 statistic is would be more sensitive. Although Hotelling's T 2 statistic only slightly considers the correlations between genes, a statistically rigorous edge testing statistic is still needed. Based on the linear regression method, a Z-score-like statistic is combined with Hotelling's T 2 statistic in the C statistic.β i andβ i ′ can be estimated by the least square method. Then the Z-score-like B statistic can be written as: under the null hypothesis H 0 :β i =β i ′ , B i follows a standard normal distribution the same as the Z-score, and B 2 i follows a chi-square distribution and can be combined with Hotelling's T 2 statistic. Thus, we obtained the C statistic as: which follows a chi-square distribution with the degrees of freedom equal to n+(n−1), and can be used to test node changes and edge changes simultaneously. Notably, when the sample size is very small, T 2 and B 2 i will not obey the chi-square distribution, the parameter of SCIA about the correlation test should be set as "FALSE."

CSSPN-Based Permutation Procedure
To avoid the shortcoming of self-contained methods and utilize additional inter-pathway information from a priori biological networks, a CSSPN is built by SCIA. First, a set of DEGs should be selected as the terminal genes of CSSPN, and a set of initial genes can usually be selected in the same way. For each pair of genes (X i , X t ), where X i is in the initial gene set and X t is in the terminal gene set, all the shortest pathways are searched under a background network, such as HPRD (see section Notations and Background Network). When the results are not unique, the pathway with the highest C score will be chosen for a subpathway permutation procedure. In this procedure, 1,000 nodes are selected randomly as the initial gene set for each X t , which is the only terminal gene in this procedure. Assuming there are x shortest pathways, built by the randomly selected genes and X t , that have higher C scores than the given gene pair (X i , X t ), the permutation p-value of the sub-pathway (X i , X t ) is x/1,000. The permutation p-value and C statistic p-value are both adjusted using the method of Benjamini and Hochberg (1995), and only if the two p-values are <0.05, the sub-pathway is defined as a statistically significant pathway. Then, all the significant subpathways among the initial gene set and the terminal gene set are used to build the CSSPN. All the genes in the CSSPN can be considered as DEGs with edges and can be used in classical functional enrichment analysis.
In SCIA, background genes are used selectively in the CSSPN-based permutation procedure. Essentially, the selection of background genes means the information from the a priori biological network is utilized, because all the genes neighboring DEGs in the background network are used at a higher probability to establish the CSSPN. Additionally, because the permutation procedure does not destroy any inter-gene or inter-pathway structures, almost no information is lost in SCIA.

Simulated Data
The simulated data were generated under a linear model (Formula 1). Firstly, we generated the initial node X 1 of a given pathway P from the normal distribution N µ 1 , σ 1 2 . And then, the neighbor node X 2 = β 1 X 1 + ε 1 , X 3 = β 2 X 2 + ε 2 . . . . . . X n = β n−1 X n−1 + ε n were generated in the same way. Where ε i ∼ N 0, τ i 2 (1 < i ≤ n) was the residual error term. Similarly, we generated X 1 the pathway P ′ under another experimental condition. Under the H 0 hypothesis that there is no change in nodes and edges between different experimental conditions, we set the default simulating parameters as:µ 1 = µ 1 ′ = 1, σ 1 2 = σ 1 ′ 2 = 1,τ i 2 = τ i ′ 2 = 1, and β i = β i ′ = 0.5. In most of the following simulations without mentioned specially, the gene number n in a pathway was set as 5, the sample sizes N 1 and N 2 of different experimental conditions were both set as 100, and the simulations were repeated 1,000 times.

Scenarios
Four scenarios and 16 conditions were used to simulate different data structures and prove the extensive applicability of SCIA. The H 0 hypothesis condition was designed to evaluate the FDR and the H 1 hypothesis condition was designed to evaluate the sensitivity. The basic setting for the H 1 hypothesis is node or edge changes, with three additional conditions: sample size, intergene correlation, and percentages of DEGs in background genes that are outside the given pathway. In each scenario, only one additional condition is set as different values to highlight the robustness of SCIA. Thus, the four scenarios are: (1) Node change, 0% background DEGs, different correlations, and fixed sample size.

Evaluation of SCIA Performance With Simulated Data
To evaluate its performance, SCIA was compared with two powerful self-contained approaches, ROAST and NES, and two commonly used competitive approaches, CAMERA and GSEA (More details about these methods are stated in Supplementary Data Section 2). The application scope of these methods is quite different, so we compared SCIA with them under corresponding application conditions. As shown in Table 1, only competitive methods are suitable for scenario 3, and only self-contained methods are suitable for scenario 4.

SCIA Successfully Controls the FDR Under Different Inter-gene Correlations in Simulated Datasets
First, we compared SCIA with self-contained methods in scenario 1 under different inter-gene correlations in simulated datasets. The FDRs were well-controlled by all the three methods ( Table 2), and Figure 1 clearly shows the sensitivities of the three methods were quite similar, indicating the C statistic allowed SCIA to match the advantages of the self-contained methods. Noticeably, ROAST had high sensitivity under the  high inter-gene correlation. However, high sensitivity with intergene correlations close to 1 is not useful for combination with competitive approaches because a small percentage of highly correlated DEGs may produce unreasonable significant results. Second, we compared SCIA with competitive methods under scenario 2. Table 3 clearly shows that the FDR of GSEA lost control, which is common for competitive methods due to the correlation between genes, whereas CAMERA adjusted the high FDR only under a moderate inter-gene correlation of all genes but failed to control the FDRs under high inter-gene correlations. SCIA was the most robust method with well-controlled FDRs and similar sensitivities as CAMERA with comparable FDRs. Because there were no randomly selected DEGs in the given pathway, the SCIA results in scenarios 1 and 2 are comparable, which indicated that the information of background genes outside the given gene set was well-utilized by SCIA. A notable question is that the intersection ratio of the results obtained from SCIA and GSEA is decreasing with the increasing of inter-gene correlation, because GSEA is more sensitive in finding significant pathways with less but consistent expression changes. This result indicated that SCIA and GSEA could find different types of differentially expressed gene sets.

SCIA Has Higher Sensitivity and Lower FDR Than Two Competitive Methods Under High Percentages of DEGs in Background Genes
When the percentages of DEGs in background genes are high, there are likely to be relatively high overlaps between a given gene set and background DEGs. Therefore, self-contained methods are invalid in scenario 3 and SCIA was compared with competitive methods. Table 4 shows that SCIA had higher sensitivity than the other two methods and, interestingly, the FDR was negatively correlated with the percentage of DEGs in background genes. These results are reasonable and reflect the incorporation of different GSA methods in SCIA. Like other competitive methods, when the percentage of DEGs in background genes was high, SCIA assigned a competitive penalty of the significance to the given pathway, and when the percentage of DEGs in background genes was low, SCIA assumed only a few percentages of the DEGs would produce a significant result for the given pathway because there was no other explanation for these DEGs. Notably, in complex diseases such as cancer, DEGs usually account for more than 40% of the genes in a dataset, under which condition SCIA was the best method both in sensitivity and FDR.

SCIA Has Higher Sensitivity Than the Two Self-Contained Methods in Testing Changes of Inter-gene Correlations
Most competitive methods cannot simultaneously test node and edge changes; hence, we compared SCIA with self-contained methods under scenario 4 with the same H 0 hypothesis and FDRs ( Table 2) as scenario 1. The influence of different sample sizes was measured at the same time. Figure 2 shows that SCIA had the highest sensitivity and the slowest drop in sensitivity with decreasing sample sizes. However, when the sample size was 10 pairs, the sensitivity of SCIA dropped sharply because of the approximation of chi-square distribution (see method), which needs sample sizes of 15-30 pairs. Unsurprisingly, ROAST had the lowest sensitivity because it was not designed for this purpose. Besides, although the edge testing modules of SCIA and NES are quite similar, SCIA was more sensitive because edge changes are also considered by Hotelling's T 2 (see method), indicating SCIA does not simply superpose node testing and edge testing methods like NES.

Evaluation of SCIA Performance With Real Datasets
We applied SCIA to recover differentially expressed genes and pathways involved in lung squamous cell carcinoma (LUSC), a common type of non-small-cell lung cancer using two datasets, one from the NCBI's GEO (Gene Expression Omnibus) and one from TCGA (The Cancer Genome Atlas) database. The GEO dataset (Series Accession: GSE103512, Brouwer-Visser et al., 2017) contains 23 LUSC sub-type cancer samples and 9 normal samples. The LUSC dataset from TCGA contains 502 LUSC samples and 51 normal samples. The two LUSC datasets were used as input to compare the sensitivity and robustness of SCIA and GSEA. In the CSSPN-base permutation procedure of SCIA, all the genes were mapped to the HPRD network, then the top 2% of DEGs (about 200 in each dataset) were defined as the initial and terminal genes of CSSPN (see method). All the nodes in the CSSPN were used for classical functional enrichment analysis based on a hypergeometric test. Unlike the simulation studies, the adjustment of permutation pvalues (see method) should be moderate here. This is because, under the H 0 hypothesis of simulation studies, there is no relation between the background network and the given gene set, whereas, in real organisms, hundreds of genes in the background network will differentially expressed in response to the DEGs in the given gene set. Due to the C statistic p-values of all the single pathways were already Benjamini and Hochberg (1995) adjusted, we did not adjust the permutation p-value in the following analysis, indicating there are approximate 500 genes in the HPRD background network that, on average, are affected by the terminal DEGs. This p-value threshold is a parameter of SCIA and can be set as different scores according to different data and requirements.
The results of the KEGG functional enrichment analysis are shown in Supplementary Tables S1-S4. SCIA found 131 and 64 pathways and GSEA found 46 and 40 pathways in the GSE103512 and TCGA LUSC datasets, respectively. Among them, 55 (42%) SCIA pathways were common between the two datasets, whereas only 5 (11%) of the GSEA pathways were common between the two datasets. These results illustrated that there was little comparability between the two results of GSEA, while, SCIA could demonstrate common results in different lung cancer datasets and the individual differences in the two researches, implying the two results of SCIA with different datasets were comparable. More than 33 of the 55 SCIA pathways found in both of the two datasets have been reported previously to have relationships with lung cancer (Table 5), including the non-small cell lung cancer. While, most of these pathways were not detected by GSEA. This result showed that SCIA could find many positive pathways that GSEA could not, and the high proportion of results with literature supporting indicated that the intersection of results of SCIA with different datasets could increase the reliability. Further, SCIA produces a CSSPN, which can be considered simply as a set of DEGs. SCIA detected 41 DEGs in the two datasets, and more than 27 (Supplementary Table S5) of these genes have been reported previously to be related with lung cancer.

DISCUSSION
SCIA is the first GSA method that combines the advantages of self-contained, competitive, and PT-based methods. SCIA has three main advantages over the other methods as was shown by the simulation studies. First, SCIA is powerful and statistically rigorous under high inter-gene correlations, which are conditions under which most competitive methods lose control of FDR. Second, SCIA has higher sensitivity and minimum FDR compared to two competitive methods (GSEA, CAMERA) under a high proportion of DEGs in background genes, which are conditions that make most self-contained methods invalid. Moreover, SCIA uses an a priori biological network and performs better than ROAST and NES in testing edge (inter-gene correlation) changes. Overall, the FDR of SCIA was well-controlled and its sensitivity was higher than that of the other four methods tested (GSEA, CAMERA, ROAST, and NES) under most simulated conditions, highlighting the extensive applicability and unbiased results of SCIA.
The robustness of SCIA can be attributed to two aspects. First, its extensive applicability with reliable and unbiased results, as mentioned above, are the most important reasons. Second, through the CSSPN-based permutation strategy in SCIA, a reasonable hypothesis is innovatively combined with a priori biological information. Briefly, if DEGs can be mapped only in one gene set, a positive weight is added to them because there is no other explanation for the differential expressions of these genes. Therefore, for SCIA, comprehensiveness of the background networks is more important than its accuracy. However, when the a priori biological networks are more comprehensive, the hypothesis of SCIA becomes more reasonable and the results are more precise. This robustness gives SCIA the ability to calculate with different datasets and to integrate the results of SCIA with different datasets.
There are many potential applications for SCIA, including differential expression analysis (Dona et al., 2017), sub-pathway analysis (Martini et al., 2013), and micorRNA target gene prediction (Wang, 2008). First, all of the genes in the CSSPN can be considered as DEGs and used independently. In addition, CSSPN itself can be considered as a cascading effect pathway when the input data are from a knockout/over-expression experiment of a single gene. Second, if the function of differential pathways can be biologically confirmed, the sub-pathway of the given functional pathway can be built without the permutation procedure. Third, the choice of initial gene set is very flexible and can be tailored for different purposes. For instance, if the input data are derived from a microRNA knockout/overexpression experiment, the initial gene set can be select as the predicted target genes of the microRNAs, and the significant predicted targets will have more potential to be the targets of these microRNA in a specific experimental condition.