A novel path-specific effect statistic for identifying the differential specific paths in systems epidemiology

Biological pathways play an important role in the occurrence, development and recovery of complex diseases, such as cancers, which are multifactorial complex diseases that are generally caused by mutation of multiple genes or dysregulation of pathways. We propose a path-specific effect statistic (PSE) to detect the differential specific paths under two conditions (e.g. case VS. control groups, exposure Vs. nonexposure groups). In observational studies, the path-specific effect can be obtained by separately calculating the average causal effect of each directed edge through adjusting for the parent nodes of nodes in the specific path and multiplying them under each condition. Theoretical proofs and a series of simulations are conducted to validate the path-specific effect statistic. Applications are also performed to evaluate its practical performances. A series of simulation studies show that the Type I error rates of PSE with Permutation tests are more stable at the nominal level 0.05 and can accurately detect the differential specific paths when comparing with other methods. Specifically, the power reveals an increasing trends with the enlargement of path-specific effects and its effect differences under two conditions. Besides, the power of PSE is robust to the variation of parent or child node of the nodes on specific paths. Application to real data of Glioblastoma Multiforme (GBM), we successfully identified 14 positive specific pathways in mTOR pathway contributing to survival time of patients with GBM. All codes for automatic searching specific paths linking two continuous variables and adjusting set as well as PSE statistic can be found in supplementary materials. The proposed PSE statistic can accurately detect the differential specific pathways contributing to complex disease and thus potentially provides new insights and ways to unlock the black box of disease mechanisms.


Background
Biological pathways play a key role in the occurrence, development and recovery of complex diseases, such as cancers, which are multifactorial complex diseases that are generally caused by mutation of multiple genes or dysregulation of pathways [1]. Besides biological pathways, improving clinical treatment, and discovering drug targets and biomarkers, are a series of actions among molecules (including genes and protein, etc.) in a cell that lead to a certain product or a change in the cell [2][3][4][5][6]. Recently, with the still-ongoing development of highthroughput sequencing technology, the price is obviously falling, large numbers of related-pathway omics data are exponentially growing, thus it has become one of the most important resource to analyze biological pathways via high-throughput omics data [7]. During the past 10 years, several pathway knowledge databases have been built, such as KEGG, BioCyc, MetaCyc, Reactome, RegulonDB and PantherDB [8][9][10][11][12][13]. The establishment of these knowledge databases laid the foundation to study pathways contributing to the occurrence, development and recovery of complex diseases [14]. Pathway-related knowledge databases and omics data contain a wealth of disease-related knowledge and information, such as information on the related-pathway genes, molecular interactions in the same pathway, topology structure of pathways, gene expression, and so on. However, how to reveal the mechanism of biological regulation (e.g. SNP, gene and protein) on complex disease from observational pathwayrelated omics databases has become a great challenge.
Recently, some pathway analytical methods have been proposed to study human physiology, systems biology and modern drug development that provided the computational framework for data pathway analysis and biomarker selection [11][12][13][14][15][16][17]. These methods include functional enrichment analysis or gene set analysis (GSA) [14], pathway analysis within a Bayesian Network framework [15], pathway analysis approaches based on the adaptive rank truncated product statistic [16] and a sub-pathway-based approach to studying the joint effects of multiple genetic variants [17]. Although, these methods are suitable for omics data analysis in systems epidemiology, most of them fail to take into account the correlation degree and topological structure between nodes (e.g. gene, SNP, etc.) from biological network. Despite, Pathway Effect Measures (PEM) with a casecontrol design [13] fully utilizes the correlation relationship between nodes, it only considers the chain-specific effects and encounters difficulties in non-linear and interaction models. Specially, the estimation of chainspecific effect is different from the path-specific effect extracted from a complex network, the former one does not take into account the influence of other adjacent paths or nodes (e.g. parent or child nodes). Besides the chain effect is solely marginal statistical association, but the specific path effect is developed based on causal inference and needs to adjust for necessary covariates affecting specific path. Pearl [18] firstly defined pathspecific effects in the terms of causal diagrams. And Avin et al. [19] provided general necessary and sufficient conditions for their identification for a single exposure and outcome, while Shpitser [20] generalized these definitions and conditions to settings with multiple exposures, multiple outcomes, and possible hidden variables. Miles [21] developed a suite of multiply-robust, semiparametric efficient estimator for the path-specific effect. However, these methods tend to require a number of strict assumptions which are difficult to be verified in  practical applications, especially for complex network structures in biological fields. In order to reduce the computational burden, we proposed a series of simplification process for the topology structure of complex networks. Of note, the nodes on specific path are only influenced by their parent nodes according to Markov Independence property. After simplification, the path-specific effect statistic PSE is estimated under two conditions to detect the differential specific paths. Therefore, the statistic PSE combined the causal effect calculation under causal inference framework with the network comparison in systems epidemiology designs. To assess the performances of the statistic PSE, theoretical proofs and statistical simulations are conducted to evaluate the stability of type I error and power, and a real gene expression data in Mammalian Target Of Rapamycin (mTOR) pathway on survival time of glioblastoma multiforme (GBM) patients are further analyzed to validate the practicability of PSE statistic.

Application
Gliomas are the most common type of primary brain tumor, and are histologically differentiated as astrocytomas, oligodendrogliomas, and ependymomas. The World Health Organization (WHO) classifies central nervous system tumors into four different grades: I, II, III and IV. Grade IV glioblastoma multiforme (GBM) is the most frequent, devastating, and malignant astrocytic glioma. It is characterized by a high degree of cellularity, vascular proliferation, tumor cell chemoresistance, and necrosis. Even after neurosurgical resection, followed by aggressive chemotherapy and radiotherapy, GBM is still considered an incurable malignancy. No effective treatment agent against GBM has been identified [22][23][24].
The proposed PSE statistic was applied to analyze gene expression data in Mammalian target of rapamycin (mTOR) signal pathway ( Fig. 1) of 461 white people from TCGA datasets containing 12,071 genes by comparing the survival time (i.e. more VS. less than the mean survival time), and 39 genes are successfully mapped to this signaling pathway. The pathway mTOR, a key mediator of phosphatidyl-inositol-3-kinase (PI3K) signaling, has emerged as a compelling molecular target in glioblastoma patients, although clinical efforts to target mTOR have not been successful. Here, we support the evidence demonstrating that mTOR is a compelling molecular target for the survival event with GBM. It was approved by ethical committee of Medical Ethical Committee of Qilu Hospital, Shandong University, China.

Simulation
Type I error rate Tables 1 and 2 showed the type I error rates of total causal effects (TCE) of Calorific Excess on Myocardial Infarction and the path-specific effects along selected the specific path: Calorific Excess→Visceral Adiposi-t→Inflammatory Milieu→Atherosclerosis→Myocardial Infarction (Fig. 2), respectively. Table 1 revealed that the type I error rates of five methods are close to the given nominal level (α = 0.05) when sample size reached 1000 for total causal effects. While Table 2 illustrated that only permutation tests remained stable at the nominal level of 0.05, other methods deviated from the 0.05 nominal level, when sample size reached 1000 for pathspecific effects. Therefore, PSE statistic with permutation tests had better performances for testing total causal effect or path-specific effect. Table 3 showed that the powers of five methods almost remained invariant for testing total causal effects when varying across the average causal effects of edges on specific path and given the fixed effect difference δ = 1 (Case group vs. Control group). Table 4 showed the power of permutation tests got larger for path-specific effects when the average causal effect of each edge on target path became larger.

Statistical power
Besides, Tables 5 and 6 showed that varying across the effect difference δ of path-specific effects in case and control group, the powers of total causal effect and path-specific effect obviously elevated. Furthermore, we performed sensitivity analysis to observe whether the PSE statistic would be influenced by the parent nodes or child nodes of nodes on specific path. Tables 7 and 8 revealed that in most cases path-specific effect statistic PSE was not influenced by effects of their parent nodes or child nodes. According to above simulation performances, our proposed PSE with permutation test had better performances and kept robust in sensitivity analysis.
For the scenario of continuous variables, when comparing with the PEM [13] statistic with Bootstrap tests, our proposed PSE statistics accurately detected the differential pathway effects X 1 → X 2 → Y linking X 1 and Y under two conditions for the fixed effect difference. The PEM with Bootstrap tests detected some false positive specific pathways (Fig. 3).

Application results
Mammalian Target Of Rapamycin (mTOR), a key mediator of phosphatidyl-inositol-3-kinase (PI3K) signaling, has emerged as a compelling molecular target in glioblastoma patients, although clinical efforts to target mTOR have not been successful [22][23][24]. Figure 1 showed the mTOR pathway from KEGG dataset (www. kegg.jp) that have been verified to be associated with the survival time of glioblastoma multiforme (GBM). The data (sample size N = 461 white people) of this pathway ( Fig. 1) containing 39 genes in red boxes were downloaded from "The Cancer Genome Atlas" (TCGA, https://cancergenome.nih.gov/). We stratified the pathspecific effects according to the survival time T (T = 1 if survival time larger than mean survival time 16.65 months of patients diagnosed with GBMs, otherwise T = 0) and adjusted for confounders including age and sex in white people. Furthermore, we found 14 specific pathways with statisical significance (Table 9) contributing to GBM and corresponding 17 genes, SLC7A5, mLST8, Lipin-1, Tel2, CLIP-170, ATG1, SLC3A2, RNF152, eIF4B, GATOR1, STRAD, IGFR, IRS1, PDK1, TSC1/6, Rheb. These genes have also been verified in many studies. The pathway mTOR works through the PI3K pathway through 2 important complexes, each characterized by distinct binding partners that confer different activities. In complex with PRAS40, raptor, and mLST8/GbL, mTOR works as a downstream activator of PI3K/Akt signaling, associating growth factor signals with protein translation, cell growth, proliferation, and survival state. This complex is known as mTORC1. In complex with rictor, mSIN1, protor, and mLST8 (mTORC2), mTOR works as an upstream effector of Akt [24]. Upon growth factor receptor-mediated activation of PI3K, Akt is recruited to the membrane through the interaction of its pleckstrin homology domain with PIP3, thus exposing its activation loop and enabling phosphorylation at threonine 308 (Thr308) via the constitutively active phosphoinositide dependent protein kinase 1 (PDK1) [25][26][27]. Akt activates mTORC1 via inhibitory phosphorylation of TSC2, which along with TSC1, negatively regulates mTORC1 through inhibiting the Rheb GTPase, a positive regulator of mTORC1. mTORC1 impairs PI3K activation in growth factor-stimulated cells by downregulating IRS 1 and 2 and PDGFR [24,28,29]. The pathway mTORC1 regulates SREBP via regulating the nuclear entry of lipin 1, a phosphatidic acid phosphatase. Dephosphorylated, nuclear, catalytically active lipin 1 help nuclear remodeling and mediate the effects of mTORC1 on SREBP target gene, SREBP promoter activity, and nuclear SREBP protein abundance. Inhibition of mTORC1 in the liver significantly impairs SREBP function and makes mice resistant, in a lipin 1-dependent fashion, to the hepatic steatosis and hypercholesterolemia induced by a high fat and cholesterol diet. These findings developed lipin 1 as a key component of the mTORC1-SREBP pathway [25]. Some studies provided evidence that ATG1 was the preferred translation initiation site in 8MGBA, and that endogenous SETMAR were very stable proteins [25]. In summary these data allowed us to propose that endogenous SETMAR proteins can contain the α-peptide in their N-terminal part, at least at some stages of GBM biogenesis [26]. The gene Rheb acts downstream of TSC1/TSC2 and upstream of mTOR to regulate cell growth. Both IGF-IR and IGF-IIR were overexpressed in GBMs compared with normal brain (P < 10(− 4) and P = 0.002, respectively). Moreover, with regard to standard clinical factors, IGF-IR positivity was identified as an independent prognostic factor associated with shorter survival (P = 0.016) and was associated with a less favourable response to temozolomide [27]. The pathway mTOR regulates eIF4B phosphorylation in response to amino-acid refeeding [30]. Glioblastoma is the most aggressive brain cancer with the poor survival rate. A microRNA, miR-451, and its downstream molecules, STRAD, are known to play a centrol role in regulating a biochemical balance between rapid proliferation and invasion in the presence of metabolic stress in microenvironment [31].

Discussion
System epidemiology couples traditional epidemiology with modern high-throughput technologies which seek to integrate pathway-based (or network-based) analysis into observational study designs to enhance the understanding of biological processes in the human organism. It provides a ways to organize and study the interdependencies of factors (e.g., genes, proteins, metabolites) at a human population level. Within this framework, the identification of pathways effects responsible for specific diseases has been one of the essential tasks.
In the framework of bioinformatics, various methods existed for inferring biological networks aiming to mine underlying networks for identifying biological modules, clustering interactions, and topological features of the network such as degree and betweenness centrality [32][33][34]. Despite these procedures for distinguishing specific pathway (or network) topology between different disease status, statistical inference at a population level remains unsolved, and further development is still necessary. Because, in practice, complexity of network tend to render it difficult to accurately detect the pathway contribution to disease, the simplification process of complex network is very crucial for identifying the target pathways. Based on the aim of identification of pathspecific effects, we proposed a series of simplification process to simplify and abstract the topology structure  of complex network (Fig. 4). Of note, the nodes of pathspecific is only influenced by their parent nodes according to Markov Independence property. This simplification process greatly reduce the complexity of network structure and maintain the key factors affecting the target specific paths. Currently in the field of causal inference, most methods mainly focus on the simple and easily understandable causal diagrams, but the simplification is the crucial first step to feasibly serve to real world. After simplification, calculation and tests of pathspecific effect also became feasible. We proposed a nonparameter riverway conflux-based non-parameter causal diagram model for identifying the path-specific effects in systems epidemiology. Simulation studies revealed our proposed PSE with permutation tests could efficiently identify the statistically different pathways. Table 1 revealed that the type I error of five methods are close to the given nominal level (α = 0.05) when sample Jsize reached 1000. While Table 2 illustrated that only PSE with permutation tests remained stable, other methods deviated from the nominal level 0.05, when sample size were larger than 1000. Therefore, PSE statistic with permutation tests had better performances for testing total causal effect or path-specific effect. Table 3 revealed that the powers of five methods almost remained invariant for total causal effect when the average causal effects β of edge on the specific path became larger and the effect difference δ was set to 1. On the contrary, the power of permutation tests got larger and was close to 1 for the path-specific effect as average causal effect went up. Besides, Tables 5 and 6 revealed that varying across the effect differences δ, the power on total causal effect and path-specific effect obviously elevated. Furthermore, we performed sensitivity analysis to observe that in most situations, PSE statistic would be not influenced by the parent nodes or child nodes of nodes on specific path (Tables 7 and 8).
In application analysis, the proposed typical PSE statistic was applied to analyze gene expression data in Mammalian target of rapamycin (mTOR) signal pathway ( Fig.  1) of 461 white people from TCGA datasets containing 12,071 genes, 39 genes are successfully mapped to this signaling pathway. The pathway mTOR, a key mediator of phosphatidyl-inositol-3-kinase (PI3K) signaling, has emerged as a compelling molecular target in glioblastoma patients, although clinical efforts to target mTOR have not been successful. Here, we support the evidence demonstrating that mTOR is a compelling molecular target in GBM. Figure 1 showed the mTOR pathway from KEGG dataset (www.kegg.jp) that have been verified to be associated with the survival time of glioblastoma multiforme (GBM) [32][33][34]. The data (sample size N = 461 white people) concerning this pathway containing 39 genes in red boxes were downloaded from "The Cancer Genome Atlas" (TCGA). Our studies results obtained 14 statistically significant positive pathways (Table 9). We stratified the path-specific effects according to the survival time T (T = 1 if survival time larger than mean survival time 16.65 months of patients diagnosed with GBMs, otherwise T = 0) and adjusted for confounders including age and sex. Furthermore, we found 14 statistically positive specific pathways (Table 9) and most gene have been verified in many studies.

Conclusion
In the framework of systems epidemiology, the proposed permutation-based PSE are valid and powerful for detecting the specific pathway effect contributing to disease, thus potentially providing new insights and ways to unlock the black box of disease mechanism.   Table 8 The performances of PSE with permutation tests varying across the effect differences of edges from nodes on target path to their child nodes not on target path

Methods
Complex network simplification rules (Fig. 4) For specific path in complex network, we proposed some rules to simplify complex networks and extract specific path from complex network. Remove irrelevant variables from the causal diagram including (i) no causal effect on the variables of target path (e.g. C 1 in Fig. 4) and (ii) no causal effect on any variable in the adjustment set (e.g. C 2 in Fig. 4). These variables will not induce spurious association so can be ignored. Besides considering the influence of direct and indirect causal effect, we need to merge all direct and indirect causal paths between two variables. Finally, confounding paths or non-causal path remained in simplified causal diagram paths.

Path-specific effect statistic PSE
For the sake of illustration, we take the specific path X 1 → X 2 → Y (Fig. 5) as an example. We wish to calculate the path-specific effect based on the average causal effect. Firstly, according to the expectation dependence of X 1 and Y, we have and ∂EðY jx 1 Þ Take the causal diagram depicted in Fig. 5a as a special case, the average causal effect (ACE) of X 1 on Y is used to compare the effects of two different levels of X 1 , i.e., Fig. 3 The performances of PSE and PEM statistics for detecting three pathways Table 9 The detected pathways with statistical significance contributing to survival time in GBM patients 1 and x ″ 1 . Since p(c| do(x 1 )) = p(c) and p(y| c, do(x 1 )) = p(y| c, x 1 ), we obtain Similarly, we obtain the ACE of X 1 on X 2 as From p(c| do(x 2 )) = p(c) and p(y| c, do(x 2 )) = p(y| c, x 2 ), we have

Case of continuous variables
We first consider the case of continuous variables depicted in Fig. 5b.
Applying integration by parts and definite integration: If the distribution dependence is non-decreasing, so is the expectation dependence.
Theorem 1: For the specific path X 1 → X 2 → Y with confounders C, any x if satisfy the conditions:  Causal diagrams for specific path X 1 → X 2 → Y with C = (C 1 , C 2 ). a C 1 is independent of C 2 ; b C 1 is associated with C 2 ; By Eq. 2, the effect of X 2 on Y can be written as From above two equations, we obtain Similarly, for condition 2, we can obtain Thus we obtain In observational studies, in order to estimate the causal effect, we need to adjust for the parent nodes of nodes on the specific path. For instance, for the causal diagram in Fig. 6, according to the back-door criteria and do-calculus [18], the specific path effect of X 1 → X 2 → Y, we need to separately adjust for C 1 and C 2 by linear regression as follows, Case of discrete variables Similarly the results for case of discrete variables can be proved by substituding the partial differentiation and the integration into differencing between adjacent level and summation, respectively. We have fpðX 2 ≤mjc; x 1 Þ−pðX 2 ≤m−1jc; x 1 ÞgEðY jc; X 2 ¼ mÞ Thus, similar to Eq. 4, we obtain −EðY jc; X 2 ¼ mÞg: Similar to Eq. (1), we have (2), for binary X 1 , X 2 and Y, and C is a discrete variable which may have multiple values under the condition of [E(Y| c, X 2 = m + 1) − E(Y| c, X 2 = m)] ⊥ C. We have Extension to the case of multiple mediators In specific path X 1 → X 2 → ⋯ → X K → Y with continuous confounders C, for any x In observational studies, according to back-door criteria and do-calculus [18], for the causal diagram in Fig. 5, the specific path effect of X 1 → X 2 → Y via adjusting for the binary parent nodes C 1 and C 2 is Path-specific statistic for two comparisons The proposed path-specific effect statistic (PSE) was based on product of average causal effect (ACE) of each directed edge, and took difference under two conditions (e.g., exposure vs. non-exposure, case vs. control). In order to identify the specific path X 1 → Y the standardized path-specific effect in the exposure or case group was defined as While for non-exposure or control group, the standardized path-specific effect was The proposed PSE statistic was developed to test the difference of path-specific effects under two conditions.

Non-parametric permutation and bootstrap tests of PSE
To test whether the specific path contributed to the disease end point, we conducted a series of hypothesis tests. The permutation-based hypothesis tests were conducted as follows: 1) draw a large number of data on disease status (e.g., case and control group) without replacement and estimate PSE in each group, and make difference between two groups and then forms our statistic PSE; 2) Repeat this process to form a permutation distribution under the condition H 0 is true; 3) obtain the value of statistic actually observed in study and locate the value in permutation distribution to get the P value; 4) reject Fig. 6 The causal graph linking X 1 and Y in case and control groups. The dash colored line denotes the differential directed edge and X 1 → X 2 → Y is the unique differential path linking X 1 and Y