Signalling pathway impact analysis based on the strength of interaction between genes

Signalling pathway analysis is a popular approach that is used to identify significant cancer‐related pathways based on differentially expressed genes (DEGs) from biological experiments. The main advantage of signalling pathway analysis lies in the fact that it assesses both the number of DEGs and the propagation of signal perturbation in signalling pathways. However, this method simplifies the interactions between genes by categorising them only as activation (+1) and suppression (−1), which does not encompass the range of interactions in real pathways, where interaction strength between genes may vary. In this study, the authors used newly developed signalling pathway impact analysis (SPIA) methods, SPIA based on Pearson correlation coefficient (PSPIA), and mutual information (MSPIA), to measure the interaction strength between pairs of genes. In analyses of a colorectal cancer dataset, a lung cancer dataset, and a pancreatic cancer dataset, PSPIA and MSPIA identified more candidate cancer‐related pathways than were identified by SPIA. Generally, MSPIA performed better than PSPIA.


Introduction
With the establishment of pathway databases, including the KEGG, BioCata, and Reactome databases, analysis of differentially expressed genes (DEGs) has become a dominant analytical method in systemic biological research. In such analysis, the first step is to list DEGs according to gene expression profiles. Next, pathways with significant number of DEGs are identified, after which gene ontology functional enrichment analysis is performed on sets of DEGs in affected pathways, allowing researchers and clinicians to better understand interactions between diseases and genes. Early methods of pathway analysis mainly included techniques based on overexpression analysis (ORA) and functional class scoring (FCS) [1][2][3]. ORA methods, such as Onto-Express [4,5] and GOEASE [6], determine differentially expressed pathways mainly according to the number of DEGs (NDE). FCS methods, such as gene set enrichment analysis (GSEA) [7], take into consideration coordinated variation of DEGs in each pathway. The main disadvantage of ORA and FCS methods is that they do not consider the position and interaction (activation or inhibition) of genes in signalling pathways. In order to overcome this disadvantage, Tarca et al. [8] introduced signalling pathway impact analysis (SPIA), which was the first signalling pathway analysis method to consider the impact of DEGs on pathway perturbation. Voichita et al. [9,10] proposed a gene weighting method to avoid human participation in significance determination during DEG screening, after which they applied SPIA to determine the importance of individual genes in signalling pathways. Ullah [11] improved the SPIA method by substituting the fold-change used in SPIA with the t-value produced by the limma software package to increase the accuracy of SPIA. Korucuoglu et al. [12] depicted gene relationships within signalling pathways as directed acyclic graphs (Bayesian network) and proposed Bayesian pathway analysis (BPA). Li et al. [13] integrated SPIA with subpathway recognition, proposing a subpathway recognition method based on a minimum spanning tree. Additionally, Zhao et al. integrated information related to protein-protein interactions and microRNAs to identify disease-associated pathways [14][15][16].
In a signalling pathway, the effect of upstream gene perturbation on downstream genes is closely related to the intensity of the interaction between such pairs of genes. In order to simplify model parameters, SPIA only considers the relationships of activation and inhibition; it does not consider the intensity of interactions between genes. However, interaction intensity can be measured by Pearson correlation analysis or mutual information computation for gene expression profiles. Therefore, we developed modified SPIA methods based on Pearson's correlation coefficient (PSPIA) and mutual information (MSPIA). Through the application of PSPIA and MSPIA to colon cancer, lung cancer, and pancreatic cancer datasets, we found that, in comparison with SPIA, GSEA, and BPA, proposed PSPIA and MSPIA recognise more pathways related to diseases and have more stable performance.

Data sets
In this paper, we use the following three cancer data sets: a colon cancer dataset, a lung cancer dataset, and a pancreatic cancer dataset. The colon cancer dataset included 12 colon cancer samples and 10 normal samples (the identification of the datasets in Gene Expression Omnibus dataset (ID) = GSE4107) [17]. The lung cancer dataset obtained by Li-Jen et al. (ID = GSE27262) [18,19]  database one by one. Next, the graphite package [21] was used to reconstruct each signalling pathway into a gene network. Finally, the DEGs identified using the limma package for R were mapped onto each gene network.

Significance analysis of signalling pathways [8]
Tarca et al. argued that the results of expression profile differences in one signalling pathway mainly include the overrepresentation of DEGs and the abnormal perturbation in a given subpathway. SPIA defines a new significance evaluation index, P G , which is calculated by the following formula where c = P NDE × P PERT , and P NDE and P PERT represent the significance of the DEG number and differential signal-induced perturbation, respectively, in a given pathway. The first probability P NDE = P(X ≥ N de |H 0 ) captures the significance of a given subpathway P i by an over-representation analysis of the NDE contained in the pathway. H 0 represents the null hypothesis, in which random DEGs appear in a given subpathway. The probability P NDE is obtained by assuming that the NDE follows a hypergeometric distribution. If the whole genome has a total of m genes, of which t are involved in the pathway under investigation, while the set of genes submitted for analysis has a total of n genes, of which r are involved in the same pathway, then the p-value can be calculated to evaluate the significance of the enrichment of a group of DEGs for that pathway as follows The second probability P PERT is calculated based on the amount of perturbation measured in each pathway. The gene perturbation factor is defined as where the term ΔE(g i ) represents the signed, normalised, measured expression change of gene g i (log fold-change if two conditions are compared). The second term in (3) is the sum of the perturbation factors of the gene g j directly upstream of target gene g i , normalised by the number of downstream genes of each such gene N ds (g j ). β ij quantifies the interaction between genes g i and g j (activation or inhibition). SPIA uses all |β| = 1 in order to minimise the number of model parameters. Detailed information can be found in [8].
To consider the influence of the interaction strength between two adjacent genes with different intensities on perturbation, formula (3) was modified as follows where w ij represents the intensity of the interaction between two genes. For PSPIA, the interaction intensity is represented by Pearson's correlation coefficient and w ij is calculated as follows where n represents the sample size of the expression profiles, E ik represents the expression value of gene g i in the kth sample; E i represents the average expression value of gene g i ; and s E i represents the standard deviation of the expression values of gene g i .
For MSPIA, mutual information is used to describe the non-linear interaction strength of two genes and w ij is calculated as follows: To calculate mutual information, we binarised the normalised microarray data as 1 (overexpression) and 0 (underexpression). X and Y represent the binarised expression value space of genes g i and g j , respectively, such as X = {1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0}. Then, the probability space of X is as follows Moreover, the probability space of Y is as follows The joint probability space of X and Y is as follows Finally, the mutual information between genes g i and g j can be caculated as formula (6).

Results and discussion
In this section, we compared the proposed PSPIA and MSPIA methods with the SPIA method by analysing three cancer data sets. Additionally, we also presented the results of the GSEA method and the recently proposed BPA method. The SPIA R package developed by Tarca et al. was applied directly to perform SPIA. GESA was performed using the enrichment analysis software developed by Subramanian. BPA was performed with software developed by Korucuoglu et al. We used the limma software package to identify DEGs as genes with a p-value <0.05. Fair comparisons of methods are difficult because of the unavailability of gold standard cancer-pathway association data. In addition to the differences in the techniques used in the compared methods, the calculations of the significance of the identified pathways are also performed differently. Therefore, one common method of comparison is to evaluate them using a specific p-value threshold for significance. In the PSPIA, MSPIA, SPIA, and BPA results, signalling pathways with p-values (false discovery rate (FDR)) <0.01 were considered as significant pathways. In the GSEA results, signalling pathways with q-values <0.01 were considered significant. Tables 1-3 list the significant pathways possibly related to cancer that were identified via PSPIA, MSPIA, and SPIA, respectively, in the colon cancer, lung cancer, and pancreatic cancer datasets, as well as the P G (FDR) of each pathway. In addition, pathways identified via GSEA and BPA are also indicated. Table 4 lists the number of significant pathways identified in each cancer dataset using the five tested methods.

Comparison of PSPIA, MSPIA, and SPIA results
First, the number of significant pathways identified via PSPIA and MSPIA was much greater than that identified via SPIA. This result suggests that consideration of the interaction intensity of genes improves the sensitivity of SPIA, i.e. the proposed methods can identify more potential disease-related pathways.
Second, some pathways identified via PSPIA and MSPIA, but not via SPIA, were related to particular type of cancer (see the last column of Tables 1-3). For example, among the seven pathways identified through PSPIA, but not through SPIA, in the colon cancer dataset, four pathways were related to colon cancer: colorectal cancer pathway [22], calcium signalling pathway [23], transcriptional misregulation in cancer, and salmonella infection [24]. The colorectal cancer pathway is directly related to colon cancer [22]. The calcium signalling pathway plays an important role in proliferation and migration of colon cancer cells [23]. Salmonella infection can reduce the risk of cancer migration, including that of colon cancer [24]. One pathway was identified via MPSIA in the colon cancer dataset that was not identified by SPIA.
Among the eight pathways identified through PSPIA, but not through SPIA, in the lung cancer dataset, three pathways are related to lung cancer: the natural killer cell-mediated cytotoxicity [25], tight junction [26], and salmonella infection [27]. Natural killer cell-mediated cytotoxicity pathway activation can aggravate human lung cancer. Expression of tight junction proteins, such as claudins, are up-regulated or down-regulated in lung cancer. Among the five pathways identified through MPSIA, but not through SPIA, in the lung cancer dataset, tight junction pathway [26] and salmonella infection pathway [27] are related to lung cancer.
Among the five pathways identified through PSPIA, but not through SPIA, in the pancreatic cancer dataset, four pathways are associated with pancreatic cancer: the gastric acid secretion pathway [28], salmonella infection pathway [29], cell cycle pathway [30,31], and pathogenic Escherichia coli infection pathway [32]. The pathological characteristics of pancreatic cancer indicate that hyperchlorhydria is directly related to pancreatic cancer. Activation of the cell cycle pathway can inhibit proliferation of pancreatic cancer cells. Some experiments suggest that A1-R, a Salmonella species, has a significant inhibitory effect  on low-passage pancreatic cancer cells. Among the eight pathways identified through MPSIA, but not through SPIA, in the pancreatic cancer dataset, seven pathways are associated with this cancer: gastric acid secretion pathway [28], salmonella infection pathway [29], cell cycle pathway [30,31], pathogenic Escherichia coli infection pathway [32], p53 signalling pathway [33], and Wnt signalling pathway [34][35][36].
In comparison with PSPIA, the number of significant pathways identified through MSPIA was larger in the lung cancer and pancreatic cancer datasets, but smaller in the colon cancer dataset. Apart from the quality of the data itself, one possible reason for this difference is that PSPIA was conducted based on the intensity of the linear correlations of genes, while MSPIA was based on the intensity of the more common non-linear relationship (including linear correlations). When the sample size is sufficiently large, MSPIA can capture more non-linear interactions between genes than can PSPIA. Conversely, in the case of a small sample size, the linear correlation intensity of genes reflected by PSPIA is more reliable than that provided by MSPIA.
Third, all pathways identified in the three cancer data sets via SPIA were also identified by MSPIA. In the colon cancer dataset, only the ECM-receptor interaction pathway [37] and PPAR signalling pathway [38], which have been reported to be related to colon cancer, were not identified via PSPIA. In the lung cancer dataset, only the calcium signalling pathway and salivary secretion were not identified via PSPIA. The calcium signalling pathway has been reported to be related to lung cancer. In the pancreatic cancer dataset, PSPIA identified all pathways identified via SPIA.
The pancreatic cancer pathway is obviously a significant pathway associated with pancreatic cancer, but it could not be identified by PSPIA, MSPIA, and SPIA with p-value <0.01. The p-values of this pathway were 0.058, 0.059, and 0.13 by PSPIA, MSPIA, and SPIA respectively. Therefore, the PSPIA and MSPIA methods identified this pathway more significantly than the SPIA method. This indicates that the two proposed methods are more efficient than the SPIA methods.
Finally, some pathways were identified by PSPIA, MSPIA, and SPIA that might have no association with the corresponding cancer. For example, the Parkinson's disease pathway, Alzheimer's disease pathway, and Huntington's disease pathway were identified in the colon cancer dataset by PSPIA, MSPIA, and SPIA. These anomalous results might have been caused by the quality of the microarray data set or by the inherent limitations of SPIA-based methods. The anomalous pathways were easily eliminated by the subpathway method [13].
In the three data sets, all pathways had p-values below 0.1, with the exception of the p53 signalling pathway, which had relatively larger p-values in the significance analysis by each of the three methods for the colon cancer and lung cancer datasets. In the PSPIA results, the p-values for regulation of the actin cytoskeleton (colon cancer dataset), ECM-receptor interaction (colon cancer dataset), regulation of the actin cytoskeleton (lung cancer dataset), MAPK signalling pathway (pancreatic cancer dataset), and p53 signalling pathway (pancreatic cancer dataset) were 0.068, 0.042, 0.078, 0.1 and 0.042, respectively. In the MSPIA results, the p-values for regulation of actin cytoskeleton (colon cancer dataset), regulation of actin cytoskeleton (lung cancer dataset), and MAPK signalling pathway (pancreatic cancer dataset) were 0.063, 0.06, and 0.05, respectively. In the SPIA results, the p-values for regulation of actin cytoskeleton (colon cancer dataset), regulation of actin cytoskeleton (lung cancer dataset), MAPK signalling pathway (pancreatic cancer dataset), and p53 signalling pathway (pancreatic cancer dataset), were 0.034, 0.062, 0.11, and 0.056, respectively. The number of pathways identified and corresponding p-values show that MSPIA had the best performance among the three methods, whereas SPIA performed slightly better than PSPIA.

Comparison with other methods
Classical GSEA and BPA methods were also selected for comparison. As shown in Table 4, when the p-value or q-value threshold was 0.01, the performance of the two methods was not stable. In the colon cancer and pancreatic cancer datasets, the two methods failed to identify any relevant pathway. In the lung cancer dataset, GSEA identified only two significant pathways, while BPA identified 29 significant pathways. In general, PSPIA, MSPIA, and SPIA were all better than GSEA and BPA. The main disadvantage of GSEA is that it only considers the total NDE in a pathway, but neglects the perturbation induced by differential signals. The major disadvantage of BPA is the loss of some pathway information caused by deleting some relationships according to the correlation of genes during the transformation of the signal pathway into a directed acyclic graph. In addition, in the case of a small sample size, the correlation may be distorted.

Conclusion
Recognition of pathways related to cancer or other diseases is of great importance for understanding their pathogenetic mechanisms and developing effective therapies. In recent years, researchers have proposed many pathway analysis methods. The advantage of SPIA and major reason for its popularity is the combination of information regarding DEGs and their perturbation within pathways. In real biological pathways, the transmission of a perturbation signal is closely related to the intensity of the interaction between genes. For this reason, in this article based on the SPIA, we adopted PSPIA and MSPIA to obtain the interaction intensity, which was integrated into the transmission process of the perturbation signal. Thus, two signalling pathway analysis methods based on Pearson's correlation coefficient and mutual information were proposed: PSPIA and MSPIA.
With the comparison of results from the colon cancer, lung cancer, and pancreatic cancer datasets, we found that the modified PSPIA and MSPIA methods recognised more signalling pathways possibly related to cancer than did the original SPIA method. In addition, among the significant pathways identified through PSPIA and MSPIA, some pathways were not identified through SPIA, but have been reported as related to some cancers in previous articles. Among the three data sets, MSPIA identified more significant pathways in lung cancer and pancreatic cancer datasets than did PSPIA, but identified fewer significant pathways in the colon cancer dataset. We believe that this difference may be related to the sample size. In the case of a large sample size, mutual information can better interpret the interaction between genes than can Pearson's correlation coefficient. In addition, through comparison of the capacity of the three methods to identify seven common cancer pathways, we found that MSPIA had the best performance, whereas SPIA was slightly better than PSPIA. Therefore, it can be concluded that consideration of the intensity of the interaction between genes can further improve the capacity of SPIA to recognise cancer-related pathways. In general, the recognition capacity of MSPIA was better than that of PSPIA.
We also investigated the pathway recognition capacities of the GSEA and BPA methods in the three cancer datasets. The results of the analyses indicate that the recognition capacity and stability of GSEA and BPA are inferior to those of SPIA and SPIA-based methods.

Acknowledgments
This work was funded in part by the National Science Foundation of China (grant nos. 61272018, 61174162, and 61572367), the Zhejiang Provincial Natural Science Foundation of China (grant nos. R1110261 and LY13F010007), and the Graduate Student Innovation Foundation of Wenzhou University (3162014037).