Introduction

Identifying biological pathways significantly regulated under various conditions has become one of the most common tasks in genome-wide expression profiling or association studies 1, 2, 3, 4. Biological pathways being investigated in such analyses can be obtained from various types of pathway databases. Several databases, such as KEGG 5, 6, WikiPathways 7 and Biocarta, manually create electronic graphs of structured pathways for cellular signaling and metabolic processes. Other databases, such as Gene Ontology (GO), PANTHER 8, 9, Reactome 10 and MSigDB 11, curate only gene composition information for pathways and usually do not provide graph representations. To develop general-purpose pathway analysis algorithms applicable to both types of pathway databases, it is a traditionally favored treatment to reduce “real” biological pathways featured with complex gene-gene linkages and topological arrangements into gene sets that are simply based on gene compositions of pathways.

The “gene set over-representation analysis” approach tests whether a gene set is over-represented in a given gene list 12. With Fisher's exact test or chi-square test, this approach holds a “competitive model” of significance assessment, in the sense that the gene set is tested against random genes from the genome 13. This approach is especially useful for studies with a very small sample size, though difficult in determining an optimal gene list from microarray data. Another approach, “gene set analysis (GSA)”, pioneered by Gene Set Enrichment Analysis (GSEA) 11, calculates a gene set statistically summarizing gene expression changes over the gene set, which are then compared to null distributions to evaluate significance. Gene set statistics proposed include the Kolmogorov-Smirnov type statistics in GSEA 11, the normalized mean of fold changes in Parametric Analysis of Gene Set Enrichment (PAGE) 14, the “maxmean” statistics in MAXMEAN-GSA 15, the quadratic sum of modified Student's T statistics in SAM-GS 16, the contrast between fold changes of genes within and outside gene sets in GAGE 17, as well as other choices such as mean 18, 19, absolute mean and mean of absolute (MeanAbs) of gene statistics 15. Choices of the null distributions are among a “competitive model”, a “self-contained model”, or a hybrid of the two 13, 15, 19. Unlike the “competitive model” where significance of a gene set is judged by comparing with the rest of genes in the genome, “self-contained model” tests whether the gene set itself is associated with the conditions under study. The “competitive model” has been adopted by Q1-test, PAGE and GAGE 14, 17, 19, the “self-contained model” by Q2-test 19, and a hybrid of the two models by GSEA and MAXMEAN-GSA 11, 15. These two approaches do not cover all gene set-based pathway analysis methods that have been proposed. For example, there are methods that test gene set association with sample conditions in logistic regression models 20, 21, 22, methods that focus on principal components of gene sets 23, 24, and methods that take only a core subset of genes for analysis 25.

Although reduction of “pathways” to “gene sets” seems inevitable to include pathway databases with gene compositions only, and also very convenient and fruitful for algorithm development, there are important issues worth critical consideration when examining carefully the difference between “pathways” and “gene sets”. In “gene sets”, the functional interactions among genes are ignored and all genes are completely equivalent to each other. This is not true in many “real” biological pathways, where some genes are indeed more “central” and “indispensable” than others, such as the p53 gene in the p53 signaling pathway. In addition, the potential curation errors in construction of pathway databases might have been misclassified into a pathway where some irrelevant genes should not be taken equivalently as other genes. Therefore, we realize that there is clearly a “constitutive non-equivalence” among genes in a pathway irrespective of the conditions under study, and that it is more appropriate to model a pathway with a set of genes with constitutively non-equivalent weights, rather than a set of essentially equivalent genes.

Here we constructed a gene functional association network based on protein-protein interactions (PPIs), co-annotations and co-expressions. We then used this network to determine the constitutive non-equivalence of genes and assign gene weights within pathways. These gene weights can be directly incorporated into classical GSA pipelines. We demonstrated that this gene-weighted GSA approach is reliable and reproducible for microarray data interpretation. We have implemented the network-based gene-weighting algorithm, as well as the gene-weighted version of GSA, in an open-source R package named GANPA (http://cran.r-project.org/web/packages/GANPA/index.html).

Results

Strategy of network-based gene weighting within pathways

Given the notion that genes are constitutively non-equivalent in pathways, we seek to develop a strategy to determine the gene non-equivalence in the form of gene weight. We reason that, if one gene is specifically associated with more genes in the pathway than expected, it is more likely to be functionally “important” in this pathway, and less likely to be randomly introduced into this pathway by curation errors. This association specificity could be estimated in a hypergeometric sampling model, and used to judge how strongly a gene is related to the pathway (Figure 1A, Materials and Methods).

Figure 1
figure 1

Network-based gene-weighting algorithm. (A) An illustration of the gene-weighting algorithm. A gene functional association network is required to provide inter-gene association linkages. For any gene Gi (red) in a given pathway S of K genes, its Mi functional associated genes (brown) are distributed within (Xi) and outside (Mi-Xi) the pathway. A hypergeometric distribution model can be used to estimate the specific associations between Gi and S, which reflects Gi's contribution to the pathway (see Materials and Methods). (B) Inter-gene functional associations and gene weights determined with gNET for the p53 hypoxia pathway (Biocarta). Each node represents a gene and its size proportional to gene weight. Genes are linked by functional associations in gNET. Genes with no functional associations with other genes are also shown. (C) Inter-gene functional associations and gene weights determined with FunCoup for the p53 hypoxia pathway (Biocarta). Genes are linked by functional associations in FunCoup.

To achieve this goal, one practical way is to establish a gene functional association network 26, which could provide inter-gene linkages for evaluating gene-to-pathway associations. We have constructed a comprehensive gene functional association network, referred to as gNET hereafter, from three types of gene associations: PPIs, co-annotation in GO Biological Process (BP), and co-expression in large-scale gene expression microarray data (Materials and Methods, Supplementary information, Data S2). We have controlled the association specificity by filtering PPIs without PubMed references and BP terms that are too general to provide specific association information (Materials and Methods, Supplementary information, Data S1). Using a gene functional association network has several appealing features as compared with the inter-gene linkages recorded in certain databases such as KEGG (Supplementary information, Data S1).

Such a network-based gene-weighting strategy is general-purpose and suitable for almost all public pathway databases in that it requires no additional information except for gene compositions. We performed a number of tests and found that this strategy provided biologically consistent gene weights for various pathways and functional gene sets (Supplementary information, Tables S1 and S2, and Data S3). Taking p53 gene for example, the computed gene weights suggest that it acts as a core gene in pathways such as the p53 hypoxia signaling pathway (Figure 1B) and p53 signaling pathway (with a high weight ranking top 1), and it participates but not dominates in pathways such as Huntington's disease pathway and cell cycle checkpoint pathway (ranking 72 and 102, respectively; Supplementary information, Table S1). Another example is the EGFR gene, which is suggested to be a more central gene in some pathways (such as AT1R pathway 27 and EGF pathway 28) than others (such as calcium signaling pathway 29 and cytokine-cytokine receptor interaction 30) (Supplementary information, Table S2). We have tested another functional association network named FunCoup 31, which produced similar gene weights in the pathways, with p53 hypoxia signaling pathway again taken as an example (Figure 1C). For those genes that have no functional associations with other genes in the pathway, we simply assigned a minimum weight (Figure 1B and 1C, Materials and Methods). This way, the gene composition information from pathway databases is always retained even when the gene association network suggests a poor association with the pathway for some genes. Although this approach seems a bit conservative, it could help improve the stability of the gene weight estimates, since a gene association network might not be necessarily comprehensive and accurate enough for all pathways from various databases.

Stability and robustness of the network-based gene weights

We next addressed whether the gene weights determined from the above strategy are indeed stable, knowing the false positives that could have been introduced in the data sources of gNET, such as PPIs, co-annotation and co-expression. As a previous report on human PPI network has suggested, there was 85.5% verification rate of yeast-two hybrid protein pairs by co-affinity purifications, indicating 14.5% false positiveness 32. We performed two types of perturbation to gNET to re-calculate gene weights for 833 pathways (KEGG, Biocarta and Reactome) extracted from the MSigDB collection (version 3.0) at Broad Institute. First, 30% of the linkages in gNET were removed randomly for 1 000 times. Second, all the linkages in gNET were randomized, while retaining the original network topology (that is, we shuffled the node labels but retained topological properties such as node degree distribution). We again took the p53 hypoxia signaling pathway as an example. Compared with the nearly complete destruction of gene weights by edge randomization, the gene weights were only quite slightly affected by the 30% edge removal perturbation (Figure 2A and 2B). Overall, for all the 833 pathways, network randomizations have caused an average decrease of gene weights as dramatically as more than 40% for a majority of the 833 pathways, whereas the 30% linkage removals only have a very minor effect on gene weights (around 10% decrease) (Figure 2C). Further, we have tested another functional association network, FunCoup 31, on weighting genes in the pathways (Supplementary information, Data S1). The weights assigned by these two independently constructed networks are quite similar (Figure 1C and data not shown).

Figure 2
figure 2

Robustness of network-based gene weights in pathways. (A) Gene weights in the p53 hypoxia pathway. Red: weights based on gNET. Black: weights after random removal of 30% network edges, 1 000 times to obtain the standard deviation. (B) Gene weights in the p53 hypoxia pathway. Red: weights based on gNET. Black: weights after topology-retaining network edge randomization, 1 000 times to obtain the standard deviation. (C) Distribution of average percentage of gene weight changes in all the 833 pathways. Blue: random removal of 30% network edge, 1 000 times. Red: topology-retaining network edge randomization, 1 000 times.

Taking the perturbation data here and the above weighting strategy design together, the network-based gene weights estimated in a hypergeometric model are generally stable and robust upon potential network perturbations, providing a strong support for development of pathway analysis methods with gene weights.

Incorporating gene weights into classical pathway analysis approaches

The gene weights determined from above accounting for the constitutive non-equivalence can be integrated into GSA pipelines to generate a weighted version that could potentially allow more specific evaluations of pathway significance (Figure 3, Materials and Methods). We have two motivations here for the validity of such a treatment. First, expression changes of core genes, as compared with other genes, can reasonably be considered as stronger evidences when deciding whether a pathway is regulated. Second, assigning smaller weights to less relevant genes present in a pathway allows a more reliable pathway significance assessment.

Figure 3
figure 3

Schematic diagram of classical gene set analysis approaches and GANPA. Left: classical gene set analysis approach; Right: GANPA. See Results, and Materials and Methods for details.

To illustrate the benefits of weighted pathway analysis, we choose MeanAbs as the pathway statistic, which is not too complicated, yet still effective in demonstrating the reliability and characteristics of a gene-weighted GSA approach 15, 33 (see Materials and Methods for details). Briefly, Student's T statistic (also referred to as “T score”) is used as the gene statistic to summarize differential expression, and the pathway statistic MeanAbs is simply an average over absolute gene statistics, taking into account both up- and downregulations. In the corresponding weighted pathway analysis, the pathway statistic is a weighted average of absolute gene statistics, and for convenience it will be referred to as W-MeanAbs (weighted MeanAbs) below. To test the null hypothesis of no associations between a pathway and study conditions (“self-contained model”), sample permutations provide an empirical null distribution for significance evaluation. Nonetheless, the nominal pathway statistics are subject to properties of different gene sets (for instance, W-MeanAbs would be affected by pathway size, gene weights and inter-gene correlations), and therefore are not directly comparable across different pathways (Supplementary information, Data S1). To allow inter-pathway comparisons with pathway statistics, we standardize the nominal statistic to obtain a normalized pathway statistic (NS), which is in general comparable among pathways under the null hypothesis of no association (Materials and Methods, Supplementary information, Data S1). To correct for multiple comparisons, we use false discovery rate (FDR) along with permutation P values as the standard method.

To test the effectiveness of W-MeanAbs, we first took the well-known p53 dataset for a comparative evaluation. The p53 dataset, consisting of gene expression profiling of 17 p53-wild-type (WT) and 33 p53-mutated (MUT) cancer cell lines, is an ideal benchmarking dataset for pathway analysis algorithms 11. We used the same gene sets from the original GSEA paper for p53 dataset 11, which were assigned gene weights based on gNET. We applied MeanAbs and W-MeanAbs to this dataset, with a 0.15 permutation FDR cutoff used to control error rates. Both methods have effectively identified p53-related pathways, yet with an obvious improvement in the significance from W-MeanAbs (Table 1). Several pathways well known to be related to p53 functions are only identified by W-MeanAbs: three apoptosis-related pathways (mitochondria pathway, ceramide pathway and BCL2 network), two cell cycle-related pathways (cell cycle regulator, cell cycle arrest), and one p53-related pathway (p53 signaling). Taking a look into these pathways would reveal features of the weighted approach. In cell cycle regulator pathway, high-weight genes such as CDKN1A, CDC6 and CDK5 showed strong expression changes (Figure 4B). In the BCL2 network, the high-weight genes such as BCL2, BCL2A1, BCL2L2, CASP8 and BID showed strong differential expression, whereas most of the low-weight genes such as IGF1, IL3, ARHA, APBA2BP and ACTB showed no significant expression changes (Figure 4C). Interestingly, HSP27 pathway was ranked after three p53 pathways (p53 UP pathway, p53 pathway, p53 hypoxia pathway) by W-MeanAbs, not as significant as that by MeanAbs, with a decrease in normalized statistic (NS) from 4.9 to 4.1. This seems to imply that the HSP pathway may be related to p53 functions, yet still not a central pathway. The central genes in this pathway, such as HSPB1, FASLG, APAF1, CASP3 and CYCS, did not have strong expression changes, whereas those heavily differentially expressed genes such as IL1A, FAS, BCL2, CASP9 and TNF are not quite specifically associated with HSP functions and thus only have a low weight (Figure 4A). To further verify the improvement of W-MeanAbs over MeanAbs, we reanalyzed the p53 dataset with gene sets weighted by another functional association network FunCoup. We observed very similar improvements, with the three apoptosis-related pathways and the two cell cycle-related pathways again identified only by W-MeanAbs, and a similar decrease in significance for the HSP27 pathway (Supplementary information, Table S3).

Table 1 Pathways significant in p53 dataset by MeanAbs and W-MeanAbs
Figure 4
figure 4

Gene functional association networks for selected pathways in p53 dataset. Node color reflects expression changes in the p53 dataset. Red: upregulated expression. Green: downregulated expression. Gray: not covered by microarray data. (A) HSP27 pathway, with a color panel showing the mapping between T scores and node colors. (B) Cell cycle regulator pathway. (C) BCL2 network.

Although we mainly used MeanAbs as the demonstration statistic throughout this work, it should be emphasized that the pathway gene weights are conceptually reasonable to be integrated with other statistics in most, if not all, GSA methods. As an example, we integrated gene weights with the adapted KS statistic used by GSEA (see Materials and Methods). We then applied W-GSEA and standard GSEA to the p53 dataset, with gene weights assigned by gNET. GSEA/W-GSEA commonly identified the radiation sensitivity pathway and three p53 pathways as significant at a 0.15 FDR, similar to the MeanAbs/W-MeanAbs analysis described above (Supplementary information, Table S4). Nonetheless, most of the commonly top pathways showed an improvement in the normalized pathway enrichment score (NES) in W-GSEA as compared with GSEA, such as the p53 pathway, p53 UP, radiation sensitivity pathway and p53 signaling, which is a similar case in W-MeanAbs as compared with MeanAbs (Table 1, Supplementary information, Table S4). Further, the HSP27 pathway was not identified as significant as MeanAbs, also similar to its decreased significance in W-MeanAbs as compared with MeanAbs (Table 1, Supplementary information, Table S4), which can be similarly explained (Figure 4A). However, GSEA/W-GSEA indeed appeared to identify fewer significant pathways, probably a consequence of the low power issue related to GSEA methodology as suggested by several other research groups 13, 16, 19, 34. For this reason, we will just stress here that gene weights can be reasonably combined with statistics as complex as GSEA, and continue using W-MeanAbs/MeanAbs as our standard method in the following analysis.

Pathway identification in a relatively simple human disease

One of the important applications of pathway analysis is to interpret global gene expression patterns in human diseases. Thus, it would be essential to see whether a novel pathway analysis algorithm indeed works well for disease datasets. Here we took asthma as an example. The data that we used consisted of airway epithelial samples from seven healthy and nine asthmatic children 35.

The MeanAbs and W-MeanAbs method identified 15 and 13 pathways, respectively, at a 0.05 permutation FDR, with 9 of them shared by two methods (Table 2). Among the shared pathways, RAC1 pathway, metabolism of carbohydrates, ECM pathway, CTCF pathway, Basigin interactions and glycolysis were ranked as the top 10 significant pathways in both methods (Table 2). Among them, Basigin interactions had a number of its genes differentially expressed, with ITGB1, L1CAM, SLC16A1 upregulated, and BSG, SLC3A1 and SLC16A3 downregulated (Figure 5A). In W-MeanAbs, Basigin interactions was reported to be even more significant in the sense that its core genes, such as BSG, SLC3A2 and ITGB1, were all dramatically differentially expressed (Figure 5A).

Table 2 Significant pathways in the asthma dataset by MeanAbs and W-MeanAbs
Figure 5
figure 5

Gene functional association networks for selected pathways in asthma dataset. Node color reflects expression changes in the asthma dataset. Red: upregulated expression. Green: downregulated expression. Gray: not covered by microarray data. (A) Basigin interactions (Reactome), with a color panel showing the mapping between T scores and node colors. (B) Fructose and mannose metabolism (KEGG). (C) Pyruvate metabolism pathway (Reactome). (D) Renin-angiotensin system (KEGG). (E) VEGF pathway (Biocarta). (F) VEGF pathway (Biocarta), with gene weights computed after correcting for multi-subunit proteins. Gene weights in A-E are not corrected for multi-subunit proteins. Each circle in E, F indicates a cluster of genes encoding one multi-subunit protein.

Then it would be interesting to examine pathways that are ranked top by one method but not by the other methods, which should clearly reflect the different characteristics of the two methods. The fructose and mannose metabolism pathway (KEGG), ranked 40 by MeanAbs, was ranked as the top 6 significant pathways by W-MeanAbs. In its 11 highly weighted genes, 7 of them, including PFKFB1, TPI1, ALDOB, HK1, PFKL, HK3 and ALDOA, showed a strong differential expression in asthma (Figure 4C). On the contrary, the pyruvate metabolism pathway (Reactome) and renin-angiotensin system (KEGG), ranked as the top 10 and top 12 significant pathways by MeanAbs, were ranked 32 and 19 by W-MeanAbs, respectively, and did not pass the significance level (Table 2). For pyruvate metabolism pathway, the genes are distinctively clustered into two groups according to the functional associations, the “Basigin group” and the “Pyruvate group” (Figure 5C). The “Pyruvate group” is undoubtedly the desired group here in a pathway named “Pyruvate metabolism”; yet in asthma, it is the “Basigin group” that was more dramatically changed in expression (Figure 5C), recalling the top significant pathway Basigin Interactions by MeanAbs/W-MeanAbs that we have just mentioned (Figure 5A). Surely there are some biological concerns that Reactome experts have put the “Basigin group” together with “Pyruvate group” in the “pyruvate metabolism pathway”, but the point here is that W-MeanAbs can automatically “double-check” and reliably “re-adjust” the two subgroups for their relative “contributions” to the whole pathway. The renin-angiotensin pathway, where several highly expression-regulated genes, such as CTSA, LNPEP and ANPEP, are poorly associated with the majority of genes of the pathway, whereas several high-weight genes such as AGT, ACE2 and CMA1 do not show obvious expression changes, is therefore judged by W-MeanAbs as not significant at the 0.05 permutation FDR level (Figure 5D and Table 2).

Taken together, a functional association network-based gene-weighting approach has unique strength in determining pathway significance given expression data, especially for pathways consisting of heterogeneous subgroups with non-equivalent association confidences to the whole pathway.

Correction for genes encoding multi-subunit proteins

In the asthma data, the VEGF pathway, consistently identified to be altered in asthma according to a number of reports 36, 37, 38, 39, appeared to be less significant in W-MeanAbs as compared with MeanAbs (Table 2). When looking into this pathway, we noticed one issue that should be solved for more accurate gene weighting. This issue, referred to as the “over-counting” problem, is closely related to the genes encoding multi-subunit proteins. Since genes encoding subunits for multi-subunit proteins are usually considered to be different genes, a multi-subunit protein can artificially take a higher proportion in a pathway than a single-subunit protein. Although this issue should also affect classical GSA, it is more severe in our original network-based gene-weighted approach. As clearly depicted in the VEGF pathway (Biocarta), the 8 genes encoding EIF2B and EIF2S proteins take a proportion of nearly 30% of this 29-gene pathway, and the enriched inter-gene linkages among the subunit genes elevated the weight of each other in our network-based weighting strategy. The actually less close association of these two multi-subunit proteins with the VEGF pathway further exaggerated the side effect, leading to an unexpected masking of the otherwise central role of VEGFA and its receptor genes (Figure 5E).

To determine whether this issue is prevalent enough for a specialized treatment, we prepared a collection of multi-subunit proteins and explored their prevalence in biological pathways. A list of genes encoding 82 multi-subunit proteins were extracted from human genes and provided as Supplementary information, Data S4. The majority of them have less than 5 subunits, yet few can have more than 10 (Figure 6A). 567 of the 833 pathways do not involve any of the 82 multi-subunit proteins, suggesting that a combination of subunit genes would not affect most of the pathways in an analysis (Figure 6B). Nonetheless, since as many as 266 pathways involve at least one multi-subunit protein, it is essential to design a refined version of the gene-weighting strategy so as to make the gene weights assigned for these pathways more accurate.

Figure 6
figure 6

Combination of subunit genes encoding multi-subunit proteins for pathway analysis. (A) Distribution of subunit numbers (size) for multi-subunit proteins. (B) Distribution of multi-subunit proteins (count) in all the 833 pathways. (C) An illustration of the gene-weighting algorithm with a correction for multi-subunit proteins. In this example pathway, two multi-subunit proteins are present, as shown by the two dashed circles. X (red) and Y (violet) are subunit- and protein-encoding genes, respectively. In the original gene-weighting scheme subunit- and protein-encoding genes were not distinguished; whereas in this corrected version, to solve such confusion, we replaced gene-level association counting by protein-level association counting. Therefore, in this corrected version, the within-pathway associations of X and Y are counted as 4 and 5, respectively, rather than 6 and 7. Outside-pathway associations are also counted at protein level.

To correct this problem, our refinement is to add two extensions to the original gene weighting (Figure 6C). First, we do association counting on the whole-protein level for multi-subunit proteins rather than on each of its subunits. Second, genes encoding subunits and single proteins are both assessed by their associations with the pathway. The new weights of the 833 pathways after correction for multi-subunit proteins are provided as Supplementary information, Data S5. We reanalyzed the asthma dataset with the new gene weights and observed obvious changes in the significance evaluation of RAC1 pathway and VEGF pathway (Table 2). It can be clearly seen that the new weights in the VEGF pathway are now much more consistent with biological knowledge, which recaptured its significance in asthma (Figure 5F and Table 2). As the subunit combination of multi-subunit proteins has been seamlessly integrated with the gene-weighted GSA pipeline, for simplicity, we will still refer to this new version of W-MeanAbs with the subunit combination as W-MeanAbs hereafter.

Conserved pathways across three breast cancer datasets

To further test the multi-subunit combined weighted approach, we collected three breast cancer datasets from GEO database, each with cancer samples and normal controls. As the comparison between malignant cancer samples and normal tissues usually generates a large number of differential genes and pathways (at 0.1 permutation FDR level, the three breast cancer datasets have 661, 663 and 653 significant pathways for MeanAbs, respectively; and 654, 664 and 649 for W-MeanAbs, respectively), we took the top pathways from each dataset analysis to see how many pathways were conserved across all datasets, which might be a more appropriate metric to evaluate the performance of a pathway identification method, especially with respect to reproducibility. We tried serial cutoffs ranging from top 30 to top 80 and W-MeanAbs frequently identified more conserved pathways (Supplementary information, Figure S1). When setting top 80 as the cutoff, MeanAbs and W-MeanAbs identified 9 and 14 conserved pathways, respectively, with 8 of them being common to both methods (Table 3). The 8 conserved pathways are mainly related to cytoskeleton regulation, cell motion, cancer signaling, p53 signaling, and cellular synthesis (Table 3). The 6 pathways unique to W-MeanAbs are all known to be related to breast cancers or features common to cancers such as proliferation-related pathways: PDGF pathway 40, 41, pancreatic cancer, downstream signal transduction (related to PDGF signaling 40, 41, see Reactome 10 for details), G2-M transition 42, 43, metabolism of nucleotides 44, and regulation of APC activators between G1-S and early anaphase 45, 46, 47. The cell cycle checkpoint pathway unique to MeanAbs is also common to multiple cancers. Nonetheless, the reproducibility is obviously better in W-MeanAbs than in MeanAbs, which is essentially a required capability of pathway analysis methods.

Table 3 Conserved pathways in three breast cancer datasets

Throughout the applications to various real datasets, we have shown that a weighted approach indeed offers an improved accuracy for pathways consisting of core genes and less relevant genes, as well as pathways with heterogeneous subgroups that have non-equivalent associations to the pathway. The potential weighting bias due to involvement of multi-subunit proteins has been further corrected to improve reliability. With the breast carcinomas datasets, we have demonstrated the good reproducibility of our weighted approach, which would be useful for cross-dataset comparisons that are common in human disease studies.

Discussion

Unlike conventional pathway analysis approaches that reduced biological pathways to gene sets and adopted an implicit assumption on gene equivalence, we have recognized the prevalence of “constitutive non-equivalence” within biological pathways. Two types of “non-equivalence” might be involved: (1) biological non-equivalence: this is supported by the prevalent presence of genes with non-equivalent “contributions” or “association strengths” to pathways, such as p53 as core gene in the p53 hypoxia pathway, as well as by the presence of non-equivalent subgroups in pathways, such as the “Basigin group” and “Pyruvate group” in the “pyruvate metabolism pathway”; (2) technical non-equivalence: this specifically refers to the potential misclassification of irrelevant genes in the curation process, as well as false positives in computational classifications, such as predicting disease prognosis signature genes from high-throughput data. We took a relatively conservative way in retaining the pathway gene compositions even when some genes were poorly associated with other genes in the pathway, in case the network was not comprehensive and precise enough. This would improve the stability of gene weight estimations. A good robustness of the gene weights has been illustrated by network perturbations and topology-retaining network randomizations. Throughout this work, we have demonstrated that “a set of genes with non-equivalent weights” serves as a better model to account for the two types of intrinsic non-equivalence in both pathways and functional gene sets, as compared with the classical view of “a set of equivalent genes”. Integration of the constitutive gene weights into GSA allows us to develop a network-based gene-weighting approach for pathway analysis, as implemented in an R package named GANPA (http://cran.r-project.org/web/packages/GANPA/index.html).

Recently, several algorithms have tried to demonstrate the possibility of incorporating topological information in pathway graphs into gene set significance analysis, notably Pathway-Express 48, SPIA 49, SEPEA 50 and PWEA 51. Pathway-Express defines a pathway score summarizing gene set over-representation probability and a gene “perturbation factor” accounting for both expression alteration and directional linkages, which is revised in SPIA by removing the mingling effects of expression changes in the “perturbation factor” 48, 49. Unlike Pathway-Express and SPIA, SEPEA and PWEA omit linkage directions in KEGG pathways 50, 51. SEPEA adopts two types of topological information, distances to terminal genes and inter-pathway distances, whereas PWEA considers only inter-gene distances 50, 51. These represent a recent research shift from the completely reduced gene set view of pathways to a more structural and functional view that is closer to real biological pathways. From this perspective, our work can be seen as among the similar research stream.

However, GANPA is unique in several aspects. The above four algorithms mainly touch the question of designing a pathway statistic to combine structural information and gene expression changes, while our method asks the question of weighting the non-equivalence of genes within pathways using a gene functional association network based on multiple sources of datasets including protein-protein interactions, co-annotations and gene co-expressions. As a network-based gene-weighting method, GANPA offers several notable advantages over the above four algorithms that rely on curated gene-gene linkages in KEGG. First, it is more general-purpose and widely applicable, as both pathways with graph representations (KEGG, WikiPathways, etc.) and pathways with gene compositions alone (Reactome, MSigDB, GO BP, PANTHER, etc.) can be used for weighted GSA. Second, the potentially misclassified genes by curation errors in pathway databases are likely to be automatically identified and down-weighted by integrating gene functional associations. Third, genes with many non-specific associations to other genes across various pathways are statistically readjusted in these pathways, by considering the whole association network in a hypergeometric model. This specificity seems not directly targeted by other algorithms, although, from a different aspect, SEPEA uses a distance rule to make some control over genes with high non-specific associations 50.

It is also worth noting that GANPA provides a generally applicable solution to the “over-counting” issue associated with multi-subunit proteins, which is implicit in classical GSA. The fundamental cause of this problem is that subunit-encoding genes are regarded equivalent to protein-encoding genes in classical gene set significance analysis. This is especially critical when a pathway contains a not-so-relevant multi-subunit protein with a number of subunit genes, as illustrated in the VEGF pathway example in this work. By combining both expressions and associations of subunit-encoding genes as a single component, we showed that GANPA works well for this issue in pathway analysis.

With the simple but effective MeanAbs and its corresponding weighted form W-MeanAbs on various datasets ranging from the specific p53 data to a relatively simple asthma disease, and to three studies on the complex breast cancer diseases, we demonstrated that our gene-weighting approach is biologically sound and effective for diverse datasets, and highly reproducible for comparisons across different studies. Taken together, we designed a network-based statistical algorithm to account for the constitutive nonequivalence of genes within pathways, proposed a “weighted gene set” model for biological pathways that can be integrated with regular GSA methods, and demonstrated the feasibility of this novel general-purpose approach for pathway analysis, which is therefore of a methodological and biological significance for future research.

Materials and Methods

Microarray data and preprocessing

The p53 dataset was downloaded from GSEA website (http://www.broadinstitute.org/gsea/). The expression data were already collapsed to genes. If essential, we replaced negative or absent measures with the row median, and removed rows with non-positive values. We log transformed the scale for downstream analysis. Asthma (GSE18965) and breast cancer datasets (GSE14548, GSE10780 and GSE3744) were downloaded from NCBI GEO database (http://www.ncbi.nlm.nih.gov/geo/). Raw data were preprocessed with customized CDF 52 using the RMA method 53, 54, 55.

Gene sets and gene functional association networks

Gene sets were downloaded from GSEA website (http://www.broadinstitute.org/gsea/). To compare with pathways identified by other methods using the p53 data, we used the earlier version accompanied by the previous study 11, and for all other data, the version 3 gene sets (KEGG, Biocarta, Reactome) were used. Gene sets with inter-gene associations, gene weights and/or expression changes were visualized using Cytoscape 56, 57, an open-source project for network visualization and analysis (http://www.cytoscape.org/).

Gene association network FunCoup 31 version 1.1 was downloaded from website (http://funcoup.sbc.su.se). gNET was constructed de novo with PPIs, co-expressions and GO (gene ontology). For PPIs, the following databases are used: BioGrid (http://thebiogrid.org), HPRD (http://www.hprd.org), DIP (http://dip.doe-mbi.ucla.edu), MINT (http://mint.bio.uniroma2.it/mint), IntAct (http://www.ebi.ac.uk/intact) and Reactome (http://www.reactome.org). Only PPIs with at least one PubMed reference are used. For co-expressions, we collected the raw data of 73 human gene expression microarray datasets (each >20 samples, see Supplementary information, Data S2) consisting of 2 798 samples from NCBI GEO database, and only gene pairs with >0.8 (or <−0.8 for negative associations) Pearson's correlation coefficient in at least 5 datasets were extracted. None of the datasets used for pathway analysis throughout this work were included in generation of co-expressed gene pairs to avoid potential bias caused by data reuse (Supplementary information, Data S2). For GO BP annotation resources (GOA, September 2010), we discarded too general terms for specificity concerns, and extracted only gene pairs sharing at least one lowest-level BP term in the GO hierarchy graph (Supplementary information, Data S1). By integrating the three types of gene associations, we obtained 1 278 362 unique gene pairs.

Network-based gene weighting in pathways

Suppose that the genome contains N total genes. Given a gene functional association network of this genome and a pathway S consisting of K genes, we have two variables to characterize the functional property of any gene Gi (i=1, 2, ..., K) in S: the number of associations between Gi and the K genes in S, designated as Xi, and the number of associations between Gi and the N genes in the genome, designated as Mi. Suppose that Gi has no specific functional associations with genes in S; then Xi is expected to follow a hypergeometric distribution,

with the expectation derived as

For gene Gi (i=1, 2, ..., K), the observed number of associations between S and it (Xi) is likely to be significantly larger than E(Xi), when there is a specific functional association between Gi and S. Therefore, we proposed to use a rescaled form of Xi-E(Xi) to quantify the relative association strength Wi between Gi and S,

where IA is an indicator function for the positive subset A: (0, +∞) on R, and a is some appropriate scaling base for which we used 2. Notice that the weight Wi defined in this way has a minimum value of 1, which is a basic-level weight for genes within a pathway (also see Results).

Non-weighted and weighted pathway significance tests

Gene expression microarray data are first summarized with a gene statistic, for example two-sample Student's T statistic, to evaluate differential expression. For a gene set S consisting of genes Gi (i=1, 2, ..., K), the observed gene statistic can be obtained as Ti. Pathway statistic can be derived from the gene statistics. The MeanAbs and corresponding W-MeanAbs statistics are the following:

The standard weighted form of GSEA (see original paper 11 for more details) and its corresponding W-GSEA statistics are as follows:

where,

and the maximum deviation function,

Note that the original GSEA also incorporates a “weight” |ri|, which is the correlation of Gi's expression profile with the phenotype profile, and is therefore dataset-dependent. On the other hand, the pathway gene weights proposed by us are irrespective of datasets. Therefore, it should be sound to multiply these two types of gene weights and integrate them into the KS statistic to derive a new version of GSEA, referred to as W-GSEA.

For single hypothesis tests with GSEA and W-GSEA, we followed exactly the standard GSEA methodology. Briefly, we applied the following steps: generation of permutation distribution, normalization of pathway statistic, and calculation of permutation P values. For single-hypothesis tests with MeanAbs and W-MeanAbs, we will describe the methodology in detail below. Given a gene set S consisting of K genes, the gene randomization distributions of SMeanAbs are generated from 1 000 random gene sets of size K, and designated as SMeanAbsrand; the sample permutation distributions of SMeanAbs are generated from 1 000 times the sample label permutations, and designated as SMeanAbsperm. The randomization and permutation versions of SMeanAbs can be similarly obtained as SW-MeanAbsrand and SW-MeanAbsperm, respectively. Based on the randomization (or permutation) distribution, P values prand (or pperm) can be computed by summarizing the probability of expecting a larger observation than the observed SMeanAbs. When sample size is sufficient, pperm is strict since inter-gene associations are retained by permutation. Throughout this work, we prefer to use pperm over prand. It should be noted that the SMeanAbsperm distribution is dependent on gene set size K, and therefore the observed statistics are not directly comparable among gene sets of different sizes. To correct this, we applied a simple normalization (or standardization) step to the pathway statistic, by subtracting the median of SMeanAbsperm and dividing by the standard deviation of SMeanAbsperm. The normalized pathway statistic, designated as NSMeanAbs, can be shown to be roughly comparable among pathways and also comparable between statistics with and without gene weight incorporation (Supplementary information, Data S1).

During pathway analysis, it is quite common to involve a number of pathways, which requires a multiple-hypothesis testing procedure to control global error rates. We applied Benjamini-Hochberg FDR method 58 to account for false discoveries during all GSEA/W-GSEA or MeanAbs/W-MeanAbs analysis, generating FDRrand and FDRperm for prand and pperm, respectively. Similarly, we prefer FDRperm over FDRrand.

Correction for multi-subunit proteins

Proteins that have multiple subunits each encoded by a different gene, causing an overestimated weight, should be considered and treated with some appropriate method. We took the combination approach, in which gene-gene linkages were mapped to the protein level, with subunit-encoding genes integrated into a single component. Both protein-encoding genes and subunit-encoding genes are weighted by calculating protein-level association strength between the gene and pathway, using exactly the same gene-weighting methodology described above.