Inference of patient‐specific subpathway activities reveals a functional signature associated with the prognosis of patients with breast cancer

Abstract Breast cancer is one of the most deadly forms of cancer in women worldwide. Better prediction of breast cancer prognosis is essential for more personalized treatment. In this study, we aimed to infer patient‐specific subpathway activities to reveal a functional signature associated with the prognosis of patients with breast cancer. We integrated pathway structure with gene expression data to construct patient‐specific subpathway activity profiles using a greedy search algorithm. A four‐subpathway prognostic signature was developed in the training set using a random forest supervised classification algorithm and a prognostic score model with the activity profiles. According to the signature, patients were classified into high‐risk and low‐risk groups with significantly different overall survival in the training set (median survival of 65 vs 106 months, P = 1.82e‐13) and test set (median survival of 75 vs 101 months, P = 4.17e‐5). Our signature was then applied to five independent breast cancer data sets and showed similar prognostic values, confirming the accuracy and robustness of the subpathway signature. Stratified analysis suggested that the four‐subpathway signature had prognostic value within subtypes of breast cancer. Our results suggest that the four‐subpathway signature may be a useful biomarker for breast cancer prognosis.

prioritized biomarkers by detecting the correlations of gene expression with survival data and have poorly considered biological interactions and functions among genes.
Biological pathways are models containing structure information, such as interactions, regulation, modifications and binding, between genes. In addition, genes involved in the same pathway often perform a specific biological function together. 7 Pathway databases, such as the Kyoto Encyclopaedia of Genes and Genomes (KEGG), 8 provide useful pathway structure information. Many pathway identification methods, such as signalling pathway impact analysis (SPIA), have effectively uncovered dysregulated pathways underlying complex traits and human diseases. 9 However, entire pathways are often too large to enable accurate interpretation of relevant biological phenomena. In recent years, key subpathway regions representative of the entire corresponding pathway are believed to be more useful in terms of interpreting relevant biological phenomena, and abnormalities of these subpathway regions may contribute to the aetiology of diseases. [10][11][12][13][14][15] Subpathway-GM 10 was proposed to identify diseaserelevant subpathways by integrating information from genes and metabolites and pathway structure information within the given pathway. In the study, 16 statistically significant subpathways were identified as associated with metastatic prostate cancer, of which a subpathway region of histidine metabolism, which was ignored by the entire pathway analysis method, was demonstrated to be associated with prostate cancer cell migration in dose-dependent and time-dependent manners. SubpathwayMiner 15 used a subgraph mining method to find the subpathways in which all genes have highly similar functions, and 36 dysregulated subpathways enriched by differential expression genes were identified as associated with the initiation or progression of lung cancer. These studies showed that abnormal subpathways may play important roles in the progress of cancer.
Moreover, there are some other methods to identify subpathways from pathway topology. For instance, PATHOME (pathway and transcriptome information) first decomposed the pathways into linear paths (subpathways) from the top nodes to leaf nodes and then evaluated the significance of differential expression patterns between cancer and normal tissue along the subpathways. TEAK (Topology Enrichment Analysis frameworK) extracted linear and nonlinear subpathways with pathway topologies and scored them using the Bayes Net Toolbox to fit a context-specific Gaussian Bayesian network for each subpathway. MinePath facilitated the decomposition of pathways into their constituent subpathways, and then, the subpathways were matched with gene expression sample profiles in order to evaluate their functional status and to assess phenotype differential power. These subpathway analysis methods have mainly identified dysregulated subpathways by comparing the expression levels of their involved genes between tumour and normal tissues and then analysing the performance of the methods. In this way, patient-specific clinical or prognosis status of the subpathway is lost before the outcomes are obtained. Identification of prognosis-related subpathways may help to predict the survival of patients with cancer and may provide information for personalized therapy.
This study aimed to infer patient-specific subpathway activities for revealing a functional signature associated with the prognosis of patients with breast cancer. We identified a four-subpathway signature with the ability to predict the overall survival of patients with breast cancer and validated its prognostic value in five independent breast cancer data sets.

| Data summary
We collected six independent breast cancer data sets of 1502 patients. We first used the gene expression data set of 255 breast cancer patients in van de Vijver et al. 6 to identify the prognostic signature. Microarray expression profiling was performed using Agilent microarray technologies. The data set is frequently used in breast cancer studies as it contains abundant clinical characteristics, including age, tumour size, lymph node status, grade and oestrogen receptor status. The patients were randomly divided into a training set and (n = 147) and a test set (n = 148) with almost the same number.
We then enrolled five other independent breast cancer data sets (GSE7390, 16 GSE1456, 17 GSE3143, 18 GSE1992 19 and gene expression array in TCGA 20 ) to validate the prognostic signature. A total of 1207 patients were enrolled in these data sets, and each data set included more than 150 patients. Gene expression profiles were measured by different microarray platforms. We downloaded the clinical characteristics of patients in each set (if provided). For all of the above data sets, the summary of patients and clinical characteristics was listed in Table S1. The breast cancer patients in the training sets were assigned into groups with poor or good prognosis according to the status of patient death or not. In this study, the gene expression levels of the data sets were log2 transformed which may cause the raw data set to be approximate normal distribution. And then the gene expression data were z-score normalized across arrays.
For pathway data, we collected 236 biological pathways from the KEGG database, 8 which contained experimentally verified pathway structure information (eg, interactions, regulation, modifications and binding between genes). For each pathway, we converted the data into an undirected and unweighted gene-gene network on the basis of pathway structure information using the "iSubpath-wayMiner" system. 10,15 2.2 | Inference of patient-specific subpathway activities A subpathway was defined as a local region that induces a single connected component in the entire pathway. The subpathways are believed to be more useful in terms of interpreting the relevant biological phenomena, and abnormalities of subpathway regions may contribute to the progression of diseases. 10 We present a novel method that integrates pathway structure information with gene expression data to infer patient-specific subpathway activities. The HAN ET AL.
| 4305 expression values of each gene were overlaid on its corresponding gene in the pathways to integrate the pathway structure with gene expression. We searched for the subpathways within each pathway whose combined expression levels across the samples were highly discriminative of the status of cancer patients ( Figure 1A). For a particular subpathway sp k , we used a k to denote its vector of activity scores for the cancer patients in a study and used c to denote the corresponding vector of class labels (eg, good vs poor prognosis).
The activity vector a k was derived by normalizing the gene expression values g ij to z-transformed scores z ij ; for each gene, i has a mean l i = 0 and a standard deviation r i = 1 over all samples j. The jth element of a k corresponds to the subpathway activity of sample j, and the individual z ij of each member gene in the subpathway is averaged into a combined z-score with formula (1): where n is the number of genes involved in the subpathway. The discriminative potential of a candidate subpathway was evaluated by many types of statistics; we defined the discriminative score S(P k ) as the t-test statistic derived on the activity vector a between the groups of samples distinguished by the class label vector c.
For each given pathway, a greedy search algorithm as previously described 21 was performed to identify subpathways within the pathway network for which the discriminative scores S(sp k ) were locally maximal. The search algorithm starts from a seeded gene i and expands iteratively. At each iteration, the search algorithm considers addition of a gene from the neighbours of genes in the current subpathway. The additional gene is adopted if it yields a maximal score increase. The search will terminate and output candidate subpathways when no additional gene increases the score over a specified improvement rate r in which (1 + r) 9 S(P k ) to avoid over-fitting the expression data. In this study, we set r at 0.05. Furthermore, only subpathways with more than three genes and less than 50 genes were retained, which avoids overly narrow or broad functional subpathways.

| Selection of subpathways mostly related to prognostic classification
Based on the activity matrix of 922 significant subpathways, a random forest supervised classification algorithm was used to identify subpathways mostly related to prognostic classifications ( Figure 1B).
The reason for using the random forest algorithm is that it could incorporate interactions between subpathways and return measures of subpathway importance. In the algorithm, we implemented an iteration procedure to narrow down the subpathway set in which the least important subpathways were discarded at each iteration step. This strategy has been successfully used for subtype predictions of breast cancer based on gene expression profiles. 23 In detail, we constructed ten thousand trees at each iteration step and set the square root of the number of input subpathways to the size of randomly chosen subpathways at each node of the single classification tree. The important score for each subpathway was estimated F I G U R E 1 Identification of the subpathway signature in the training set. A, Gene expression profiles of tissue samples with phenotypes of good or poor prognosis were transformed into a "subpathway activity matrix." For a given subpathway sp k in a certain pathway network, the activity was a combined z-score derived from the expression of its individual genes. After overlaying the expression vector of each gene on its corresponding protein in the pathway network, subpathways with discriminative activities were identified via a greedy search algorithm. Based on permutation tests, the subpathway activity matrix of significant subpathways was obtained. B, The random forest supervised classification algorithm was used to identify the subpathways that were most related to prognostic classifications. An iteration procedure was implemented to narrow down the subpathway sets by discarding one-third of the least important lncRNAs at each step according to their importance score, and nine subpathways remained. C, Development of a prognostic classifier for all combinations of the nine subpathways using the risk score model. For each subpathway combination, patients were classified into high-risk and low-risk groups according to median risk scores, and the log-rank test was used to evaluate the performance of classifications. The signature with the largest value of Àlog(p) was selected as the final signature on out-of-bag samples by permutation testing, and one-third least important subpathways were discarded at each step. After each discard, we re-estimated the generalization error of the classification on out-of-bag samples. We found that the generalization error changed slightly at first, but it increased sharply when less than nine subpathways were retained ( Figure S1). Thus, nine subpathways mostly related to prognostic classifications were selected among the 922 significant subpathways. Detailed information on the nine subpathways is listed in Table S2. Previously, this algorithm was successfully applied to identify the lncRNAs that are most related to oesophageal cancer. 24 We implemented the algorithm with the R package ranger. 25

| Identification of the prognostic subpathway signature
We developed a signature by selecting a combination of the nine subpathways from the training set ( Figure 1C). There are 511 ( P C i 9 , i = 1, 2, . . ., 9) combinations of the nine subpathways. For each combination, we calculated the risk score for every patient from the activities of the subpathways by using a prognostic score model 26,27 as follows: where S is one of 511 combinations; a k is the activity of subpathway of a specific number of subpathways (i = 1, 2, . . ., 9), the subpathway with the smallest P in the log-rank test was selected for each i ( Figure 1C). After the results were compared, a four-subpathway signature was then defined as the final signature.

| Derivation of a four-subpathway prognostic signature
For prognostic signature analysis, the147 breast cancer patients in the training set were first assigned into groups with poor (41 patients) or good (106 patients) prognosis according to the status of patient death or not. We integrated the pathway structure with gene expression to infer the patient-specific subpathway activities that were associated with overall patient survival ( Figure 1A). In total, 4531 candidate subpathways were identified using a greedy search algorithm. Through the three different permutation analyses, 922 significant subpathways were obtained with the default threshold (Materials and methods). A heat map of the activities of 922 subpathways displayed clear differences between patients with good prognosis and poor prognosis ( Figure S2). The random forest supervised classification algorithm was applied to identify the subpathways most related to prognostic classifications ( Figure 1B). Nine subpathways were selected among the 922 significant subpathways.
We developed subpathway signatures that were constructed by any combination of the nine subpathways ( Figure 1C). In total, there were 511 candidate signatures (Materials and methods). For each candidate signature, we calculated the risk score for every patient based on the activities of the subpathways using a prognostic score model. In the training set, patients were classified into high-risk and low-risk groups according to median risk scores. We further priori- With this risk score formula, a patient in the training set was classified as high risk if the risk score was higher than the median risk score (0.1067) and as low risk if was not.

| A four-subpathway signature predicts overall survival of patients with breast cancer
Patients in the training set were divided into a high-risk group   Figure 2C and Table S3). The clinical characteristics of tumour grade and ER status were significantly different, whereas the other clinical characteristics (age, tumour size and LN status) were not significantly different between high-and low-risk group patients ( Table 2).
We used the prognostic value of the signature to classify patients from the five independent breast cancer data sets (GSE1992, GSE7390, GSE1456, GSE3143 and gene expression array in TCGA) to validate whether the four-subpathway signature had the same or similar prognostic value in different populations (Table S1).
For each independent set, patients were divided into high-risk and low-risk groups, and patients with high-risk scores had significantly shorter overall survival than those with low-risk scores ( Figure 2D-H and Table S3). Specifically, the P of log-rank tests in the data sets of GSE1992, GSE7390 and TCGA were all less than 0.001 ( Figure 2D-F). For the data sets of GSE1456 and GSE3143, the P of log-rank tests were less than 0.05 ( Figure 2G and H). For univariate Cox proportion hazard regression analysis, the HRs of all five validated sets were more than 1.2, and P was less than 0.01 (Table S3). These results indicate that the subpathway signature can be used for classifying patients with breast cancer. Moreover, three data sets (GSE1992, GSE7390 and TCGA) provided various clinical characteristics, and we analysed the differences in clinical characteristics between high-and low-risk group patients (Table S4). ER status differed significantly between the two groups across three data sets, and tumour grade differed significantly between the two groups in the GSE1992 and GSE7390 data sets.

| Survival prediction by the four-subpathway signature is independent of other clinical characteristics
We then assessed whether the survival prediction ability of the values of clinical characteristics in the multivariate Cox regression analysis. The results showed that the four-subpathway signature was an independent prognostic factor for overall survival across four data sets, except for the test set (Table S5). Meanwhile, in the test set, the P of the subpathway signature (P = .0503) has just exceed the significance threshold of 0.05. For patients with ERÀ breast cancer, the four-subpathway signature classified the patients into high-and low-risk groups in the training set (P = .029, Figure S3A) but not in the test set (P = .15, Figure S3B).

|
As with the test set, the four-subpathway signature showed similar prognostic value for ERÀ patients in the TCGA (P = .49, Figure S3C) and GSE7390 (P = .8, Figure S3D) data sets. Patients with ERÀ generally have a relatively poor prognosis. 31  This may result in a non-significant P in the log-rank test.
We also did stratified analyses of patients with HER2+ and HER2À patients from the TCGA and GSE1992 data sets, and obtained similar results (see Appendix S1 and Figure S4).

| Comparison of survival prediction power between clinical characteristics and the foursubpathway signature
We performed time-dependent ROC analysis 32  Vertical hash marks indicate censored data ER status, tumour grade and LN status were calculated using the R package time ROC. 32 The four-subpathway signature showed better prediction of survival than age and tumour size in regard to overall survival across the five data sets (P < .05, Figure 4A-E). The predictive ability of the four-subpathway signature was significantly different in the training set and test set (P = 2.54e-09 and P = .027, Figure 4A and B) compared with the predictive ability of ER status, but it did not show a significant difference in the GSE1992, GSE7390 and TCGA data sets (P > .05; Figure 4C-E). Meanwhile, there was no significant difference in these measures between the four-subpathway signature and tumour grade in the test set and GSE1992 set ( Figure 4B and C).

| All four subpathways of the signature are essential for its prognostic value
We constructed all possible signatures containing one to four subpathways in our signature to confirm that all four subpathways of the signature were essential for the four-subpathway signature.
After the survival of these newly constructed signatures was compared with the original four-subpathway signature using the log-rank test, none of the signatures with fewer than four subpathways was consistently associated with overall patient survival in all seven data sets in our study (see Table S6). Furthermore, we tested the correlation between each pair of the four subpathways in the signature. The results showed that the maximum Pearson correlation coefficient between the four subpathway activities is less than 0.6 (Figure S5). These results indicate that all four subpathways are essential for the prognostic value of the signature.

| Comparison of the prognostic power of the four-subpathway signature and a gene-based signature
We then compared the prognostic power of the subpathway-based signature and a gene-based signature on overall survival. The NKI70 gene signature (MammaPrint) 5 is one of the most commonly used gene-based models for breast cancer prognosis prediction, and it has been approved by the Food and Drug Administration for use in clinics.
We applied the NKI70 method 6

| Functional analysis of the four-subpathway signature
The samples were divided into high-risk and low-risk groups according to the four-subpathway signature. For the signature, the activities of the four subpathways displayed significant differences between the high-risk and low-risk groups in the training set (t-test P < .05, Figure 2A). We then investigated the changes of genes involved in the four subpathways in the context of gene expression data. Also, the expression levels of the involved genes displayed significant differences between high-risk and low-risk groups ( Figure 5).
This result indicates that the subpathways in the signature may have an important biological function associated with the survival of patients with cancer.
We next sought to explore the biological function of the subpathways of the prognostic signature in breast cancer tumorigenesis and development. For path:04390_17 in the hippo signalling pathway, genes in the subpathways were mapped to the original pathway, and a local region was identified ( Figure S8). This local region corresponds to key pathway downstream transcription co-activators: YAP/TAZ, which has been reported to be associated with regulating cell growth, differentiation and apoptosis. 33,34 Specifically, elevated yes associated protein 1 (YAP1) activity has been correlated to a poor prognosis in several cancers, 35,36 TEA domain transcription factor 1 (TEAD1) has been shown to be the major YAP1 partner in breast cancer cell lines, 37 and fibroblast growth factor 1 (FGF1) plays an important role in the regulation of cell survival, cell division, angiogenesis and cell differentiation. 38 The other three subpathways have also been reported to be associated with the development of breast cancer (see Appendix S1 and Figures S9, S10, S11).

| DISCUSSION
Breast cancer is increasingly recognized as being highly heterogeneous. 1,2 Routine clinical practice, such as the TNM staging system, is not adequate for breast cancer prognosis. However, prognostic assessment is crucial for earlier diagnosis and more personalized treatment. Recently, the identification of molecular signatures with high-throughput biological data was a promising approach to predicting the prognosis of patients with cancer. 5 However, a common problem with using high-throughput data is the "curse-of-dimensionality" problem with many more genes than sample numbers. We overcame this problem by incorporating gene expression data into pathway structure and using a greedy algorithm to search the subpathways associated with the prognosis of patients. Although a number of methods for identifying subpathways associated with cancer have been proposed, 10-15 the advantage of our method is that it considered patient-specific clinical and prognosis information in the identification of subpathways, and constructed subpathway activity profiles for each patient in the data set.
The prognostic value of the four-subpathway signature was verified in the test set and five independent validation sets. The gene expression profiles of these data sets were measured by five different platforms of two chip company (Agilent and Affymetrix) ( Table S1). The four-subpathway signature was found to robustly predict the survival of patients with breast cancer from the above data sets. The good prognosis prediction of the signature in these data sets may have benefited from the construction of patient-specific subpathway activities. We compared the prognostic value of the four-subpathway signature with that of the MammaPrint gene signature, which is one of the most common clinically used gene-based models for breast cancer prognosis prediction. The predictive performance of the four-subpathway signature was superior to that of the MammaPrint signature across the breast cancer data sets used in our study based on the log-rank test and time-dependent ROC analysis ( Figures S6 and S7). The subpathway signature, which incorporates higher-order information of biological pathways, may be the reason for the production of more robust results than the genebased signature.
We also found that there were only three overlapping genes between the MammaPrint signature and our subpathway signature.  30 but this outcome is not always the case. In our stratified analysis, the four-subpathway signature effectively classified patients with ER+ into highand low-risk groups across four data sets ( Figure 3A-D). This result indicates that the subpathway signature may help doctors to identify high-risk patients in the ER+ cohort, which may further lead to tailored treatment. For patients with the ERÀ subtype, which generally have a poor prognosis in clinical practice, 31 a high proportion of these patients was classified into high-risk groups by the prognostic value of the four-subpathway signature, which resulted in a non-significant P in the log-rank test. Thus, a more specific signature for classifying patients of breast cancer with the ERÀ subtype remains to be developed.
We further compared the survival prediction power between clinical characteristics and the four-subpathway signature. The results showed that the four-subpathway signature displayed better predictions of survival than age or tumour size in regard to overall survival across five sets based on time-dependent ROC analysis (Figure 4). Thus, we expect that the four-subpathway signature can assist doctors in clinical practice.
Currently, one limitation of the method is that it classifies samples into high-risk or low-risk group according to the median risk score in the training set, which is not interpretable in clinical settings. This is a common limitation for the current bioinformatics method for identifying prognosis signature based on high-throughput experimental data. Another limitation is that normally prognosis is defined as 5-year survival or 10-year survival; however, the data for all patients with breast cancer with follow-up information at least F I G U R E 5 Heatmap of the expression levels after z-score transformation for the genes involved in the four-subpathway signature in the training set 5 year or 10 year are generally unavailable. Instead, we defined poor or good prognosis according to the status of patient death or not in the study. We believe that the prognosis classification performance of the methods will be considerably improved once we have a more complete survival data. To make our method clinically usable, some essential biological experiments need to be performed to test the association between the four-subpathway signature and pathogenic mechanism of breast cancer.