Development of Prognostic Signature Based on Pan-cancer Proteomics

Background: Genomic data alone are insufficient for predicting cancer prognosis. Proteomics can improve our understanding of the etiology and progression of cancer and improve the assessment of cancer prognosis. The Clinical Proteomic Tumor Analysis Consortium (CPTAC) has generated extensive proteomics data for the vast majority of tumors, which makes a proteomic pan-cancer analysis possible.
Methods: Proteomics data and clinical features of cancer patients were collected from CPTAC. We screened 69 differentially expressed proteins (DEPs) with R software. GO and KEGG analyses were performed to clarify the functions of these proteins. A DEPs-based prognostic model was built with a least absolute shrinkage and selection operator (LASSO)-Cox regression model. Time-dependent receiver operating characteristic analysis was used to evaluate the ability of the prognostic model to predict overall survival (OS).
Results: A total of 69 DEPs were screened in five different types of cancer: hepatocellular carcinoma (HCC), lung adenocarcinoma (LUAD), pediatric brain tumors (Children's Brain Tumor Tissue Consortium, CBTTC), clear cell renal cell carcinoma (CCRC) and uterine corpus endometrial carcinoma (UCEC). The DEPs were related to cell metabolism, cell proliferation and the extracellular matrix. A 24-DEPs classifier for predicting OS was then developed with the LASSO-Cox regression model in the training cohort and validated in the validation cohort.
Conclusions: In the present study, we identified a DEPs-based survival-predictor model that predicts survival in most cancers. To our knowledge, this is the first study to use proteomics to construct a pan-cancer prognosis model, which could accurately and effectively predict the survival of patients with most cancers.


Introduction
Cancer was the second leading cause of death worldwide in 2017 [1]. Cancer deaths have increased year by year, from 7.62 million in 2007 to 9.56 million in 2017. In 2018, 18.1 million people worldwide were diagnosed with various types of cancer [2]. Despite significant progress in treatment, the lack of timely diagnosis and the high cost of treatment still prevent many patients from receiving effective therapy, which remains a reason for the low 5-year survival rate of most cancers [3]. To develop optimal anti-cancer treatment protocols and elucidate the mechanisms of tumorigenesis, it is essential to estimate the prognosis of tumor patients [4]. Although many studies have used RNA sequencing data from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) to identify tumor prognostic biomarkers and construct prognostic models [5,6], genomic data alone remain insufficient and imprecise for predicting cancer prognosis.
It is widely acknowledged that tumor cells are characterized by rapid generation and abnormal proliferation. Hence, tumor tissues regulate protein expression and promote the production of proteins associated with cancer progression [7]. Moreover, proteins are the functional effectors of cellular processes as well as the targets of the vast majority of therapeutics [8]. Therefore, proteomics can improve our understanding of cancer etiology and progression and improve the assessment of cancer prognosis [9]. Although most previous studies have focused on the effect of an individual protein on cancer prognosis [10-12], cancer is a heterogeneous disease that involves not only individual proteins but also interactions among proteins with different functions. The Clinical Proteomic Tumor Analysis Consortium (CPTAC) project has generated a large amount of mass spectrometry-based proteomics data for the vast majority of tumors [13]. Based on the proteomics data from CPTAC, we aimed to combine multiple proteins to construct a pan-cancer prognostic model.
In the current study, we screened differentially expressed proteins (DEPs) in five cancers: hepatocellular carcinoma (HCC), uterine corpus endometrial carcinoma (UCEC), pediatric brain tumors (Children's Brain Tumor Tissue Consortium, CBTTC), lung adenocarcinoma (LUAD) and clear cell renal cell carcinoma (CCRC). Next, we explored the roles of the DEPs in cancer and the relationships among them. Finally, a DEPs-based survival-predictor model was developed to predict survival in the vast majority of cancers.

Identification of DEPs between tumor tissues and adjacent nontumorous tissues
For the proteomic data from CPTAC, background correction, quantile normalization and batch normalization were performed using R software (version 3.6.1). The protein expression values of the five cancers were normalized with the "sva" package. The Bioconductor (http://www.bioconductor.org) package "limma" was employed for DEP screening. A |log2 fold change| > 1 and an adjusted P value < 0.05 were set as the cut-off criteria.
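The cut-off criteria can be illustrated with a short sketch. The Python snippet below is a minimal illustration and not the limma workflow used in the study. It assumes that the adjusted P values are Benjamini-Hochberg corrected (the text does not name the correction method), and `ProteinStat`, `bh_adjust` and `screen_deps` are hypothetical names introduced only for this example.

```python
from typing import NamedTuple


class ProteinStat(NamedTuple):
    name: str
    log2_fc: float  # log2 fold change, tumor vs. adjacent nontumorous tissue
    p_value: float  # raw (unadjusted) P value


def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Assumption: BH is the correction behind "adjusted P value".
    """
    m = len(p_values)
    # Walk from the largest P value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def screen_deps(stats, lfc_cutoff=1.0, alpha=0.05):
    """Keep proteins with |log2 FC| > lfc_cutoff and adjusted P < alpha."""
    adjusted = bh_adjust([s.p_value for s in stats])
    return [
        s.name
        for s, q in zip(stats, adjusted)
        if abs(s.log2_fc) > lfc_cutoff and q < alpha
    ]
```

Both conditions must hold at once, so a protein with a large fold change but a non-significant adjusted P value is excluded, as is a highly significant protein with a small fold change.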

PPI network construction
The PPI network of the DEPs was constructed with STRING (version 11.0), and a combined score > 0.9 (high confidence) was set as the cut-off criterion. The results from STRING were visualized with Cytoscape software (http://www.cytoscape.org/).
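The filtering step can be sketched as follows. This is a minimal illustration that assumes the interactions have already been exported as (protein, protein, combined score) triples with scores on a 0 to 1 scale. The function name `high_confidence_network` is hypothetical, and the scores in the example are invented for illustration and are not STRING output.

```python
def high_confidence_network(edges, proteins, cutoff=0.9):
    """Keep interactions above the cutoff and list proteins left without partners."""
    kept = [(a, b, score) for a, b, score in edges if score > cutoff]
    connected = {p for a, b, _ in kept for p in (a, b)}
    isolated = [p for p in proteins if p not in connected]
    return kept, isolated


# Illustrative input only: these scores are made up.
example_edges = [("CDK1", "MCM2", 0.99), ("CDK1", "ENO3", 0.40)]
kept, isolated = high_confidence_network(example_edges, ["CDK1", "MCM2", "ENO3"])
```

Proteins returned in `isolated` have no interaction above the threshold and would be left out of the visualized network.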

Construction of DEPs-based classifiers
Using univariate Cox regression models, we identified individual DEPs as prognostic factors for OS at a P value < 0.05. The least absolute shrinkage and selection operator (LASSO)-Cox regression model [14] was used to identify the most accurate predictive DEPs for OS. Correlations among the prognostic DEPs were computed with the R packages "ggcorrplot" and "statn".

Predictive performance of the DEPs-based classifiers
Each patient's risk score was obtained by multiplying the expression of each DEP selected by LASSO by its coefficient and summing the products. Patients were then stratified into two risk groups at the median risk score. Survival was analyzed with the Kaplan-Meier method and the log-rank test. Time-dependent receiver operating characteristic (tdROC) analysis was used to assess the performance of single DEPs and of the classifiers with the "timeROC" package of R software. The area under the curve (AUC) of the tdROC reflected predictive accuracy. P values < 0.05 were considered statistically significant.
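The risk score and the median split can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the score is taken as the sum of coefficient-weighted expression values, patients with a score strictly above the cut-off are labelled high risk (the text does not say how ties at the median are handled), and `risk_score` and `stratify` are hypothetical names.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Sum of expression values weighted by the LASSO-Cox coefficients.

    expression: dict mapping protein name -> expression value for one patient
    coefficients: dict mapping protein name -> non-zero regression coefficient
    """
    return sum(coef * expression[protein] for protein, coef in coefficients.items())


def stratify(scores, cutoff=None):
    """Label each patient "high" or "low" risk.

    If no cutoff is given, the median of the supplied scores is used.
    Assumption: scores strictly above the cutoff count as high risk.
    """
    if cutoff is None:
        cutoff = median(scores.values())
    groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
    return groups, cutoff
```

Passing an explicit `cutoff` allows a threshold derived in one cohort to be reused unchanged in another cohort.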

Data analysis
Student's t test, the Wilcoxon test, and other data processing were performed with SPSS 19.0. Kaplan-Meier analysis was performed with the "survminer" package of R software. For all tests, P < 0.05 was considered statistically significant.
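For readers unfamiliar with the Kaplan-Meier method, the product-limit estimate can be sketched as below. This is an illustrative pure-Python version, not the survminer implementation used in the study, and `kaplan_meier` is a hypothetical name. Events are coded 1 for death and 0 for censoring.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up time of each patient
    events: 1 if the patient died at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        n_at_risk -= leaving
    return curve
```

Censored patients contribute to the number at risk until their last follow-up time but never trigger a drop in the curve.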

Flow chart
The workflow is shown in Figure 1.

GO analysis and KEGG analysis
To explore the role of the 69 DEPs in tumors, we conducted GO and KEGG analyses. The 69 DEPs were mainly associated with the following biological processes: carboxylic acid biosynthetic process, organic acid biosynthetic process, G1/S transition of mitotic cell cycle, cell cycle G1/S phase transition, monocarboxylic acid biosynthetic process, glucose metabolic process, hexose metabolic process, and DNA replication (Figure 3A). The results also indicated that the DEPs were mainly associated with the following cellular components: nuclear chromosome part, extracellular matrix, telomeric region and MCM complex (Figure 3A). In addition, the DEPs were related to molecular functions such as extracellular matrix structural constituent, carbohydrate binding, helicase activity and monosaccharide binding (Figure 3A). Consistent with the GO analysis, KEGG analysis showed that the DEPs primarily contributed to the following pathways: Cell cycle, Glycolysis / Gluconeogenesis, DNA replication, Carbon metabolism, Pentose phosphate pathway, and Fructose and mannose metabolism (Figure 3B). Furthermore, combining the GO cluster diagram and the GO chord diagram, we found that the DEPs involved in DNA replication, Cell cycle, and Arginine and proline metabolism were mainly highly expressed, whereas those associated with terms such as Carbon metabolism and Fructose and mannose metabolism included both up-regulated and down-regulated proteins (Figure 3C, D).

DEPs Interaction Clusters Common across Five Cancers
The 69 DEPs were used for network analysis, and almost half of them formed an interaction network after proteins that acted independently were removed (Figure 4A). These interacting proteins were roughly separated into four groups with CDK1, ENO3, argininosuccinate synthase (ASS1) and versican core protein (VCAN) as the cores (Figure 4A, B). CDK1 was observed to be the key hub protein that [...] (1-year AUC = 0.437) (Supplementary Figure 2). Although the 3-year AUC of PLOD2 reached 0.722, its 1-year and 2-year AUCs were unsatisfactory. In summary, although the ten proteins can be used as biomarkers of cancer prognosis, none of them alone could accurately predict OS.

Construction of the DEPs-based survival-predictor model
To obtain a better model, multiple DEPs were combined to predict the survival of cancer patients. We first conducted univariate Cox analyses in the training cohort and identified 33 DEPs related to survival (Figure 5A). We then entered all 69 DEPs into the LASSO-Cox regression model in the training cohort. Based on the results, 24 prognostic DEPs with non-zero regression coefficients were finally chosen as potential prognostic biomarkers for the OS of cancer patients (Figure 5B, C). The detailed information on the DEPs used to construct the prognostic signature is summarized in Table 1 [...] Figure 5D and 5E. Among these proteins, the correlation coefficients between CDK1 and MKI67, P4HA2 and P4HA1, PGM5 and IL33, and PGM5 and DES were all greater than 0.5.

Evaluation of the survival-predictor model
Based on the survival-predictor model, we divided the cancer patients into two equally sized groups, high risk and low risk, at the median risk score (cut-off value 0.250379) (Figure 6A). The patient information is shown in Table 2 and Table 3. The expression heatmap of the 24 DEPs in the high-risk and low-risk groups is also shown in Figure 6A. We then estimated the accuracy of the 24-DEPs model in predicting survival. The Kaplan-Meier survival curves showed that survival rates were significantly lower in the high-risk group (P < 0.001) (Figure 6B). The ROC analysis showed that the 1-year, 2-year, and 3-year AUCs of the 24-DEPs survival-predictor model were 0.764, 0.754, and 0.742, respectively (Figure 6C). Notably, the AUC of the 24-DEPs survival-predictor model exceeded the AUCs of the ten proteins described above (Supplementary Figure 2). Thus, compared with any single protein, the 24-DEPs survival-predictor model was a more accurate and powerful predictor.
To further validate the model, we applied the same 24-DEPs survival-predictor model and cut-off point to stratify patients in the validation cohort (CBTTC) (Figure 6D). The survival analysis again indicated that the high-risk group had worse OS (P < 0.001) (Figure 6E). The result of the ROC analysis was also satisfactory: 1-year AUC = 0.724, 2-year AUC = 0.689, and 3-year AUC = 0.671 (Figure 6F). In conclusion, the 24 DEPs-based classifier could accurately predict survival not only in the training cohort but also in the validation cohort.
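The meaning of an AUC at a fixed time point can be illustrated with a simplified sketch. The Python function below is not the estimator implemented in the "timeROC" package, which handles censoring by inverse probability of censoring weighting. Here, as a stated simplification, patients who died by the horizon are treated as cases, patients followed beyond the horizon as controls, and patients censored before the horizon are excluded. The name `auc_at_time` is hypothetical.

```python
def auc_at_time(scores, times, events, horizon):
    """Simplified AUC of a risk score for death by `horizon`.

    Cases: patients with an event (1) at or before the horizon.
    Controls: patients still under follow-up after the horizon.
    Patients censored at or before the horizon are ignored (simplification).
    """
    cases = [s for s, t, e in zip(scores, times, events) if e == 1 and t <= horizon]
    controls = [s for s, t in zip(scores, times) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control at this horizon")
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5  # ties count as half
    return concordant / (len(cases) * len(controls))
```

Under this definition, an AUC of 0.5 means the risk score ranks cases above controls no better than chance, and an AUC of 1.0 means every case has a higher score than every control.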

Discussion
As a complex disease, cancer involves not only DNA alterations but also changes in protein expression and modification [15]. With technological improvements, CPTAC has generated comprehensive mass spectrometry-based proteomic data for most cancers [13], which provides a unique opportunity for pan-cancer proteomics research with sufficient data.
In the current study, we first screened 69 differentially expressed proteins in five types of cancer tissue. More importantly, the expression trend of the DEPs was consistent in all five cancers, which indicated that these proteins were not specific to any one type of cancer. Among the DEPs, CDK1 plays an important role in progression into the mitotic phase and can drive the cell cycle in all cell types [16]. Previous studies also showed that CDK1 expression was upregulated in a majority of tumor tissues and correlated with the prognosis of cancer patients [17-19]. MCM2, MCM3, MCM4, MCM5, MCM6 and MCM7 form the MiniChromosome Maintenance 2-7 complex, which was exported by the CDKs to trigger DNA replication [20]. In brief, CDK1 interacted with the MCM2-7 complex to participate in the cell cycle, which was consistent with the GO and KEGG analyses. Furthermore, we found that CDK1, as a key hub protein, interacted with other DEPs to form an interaction cluster. In addition to the MCM2-7 complex, other proteins in the cluster, such as RRM2, PRKAR2B, and MKI67, also influenced the growth and division of tumor cells by participating in the cell cycle [21-23]. Most DEPs related to the cell cycle were up-regulated, which was consistent with the vigorous growth and division of tumor cells.
The 69 DEPs were involved not only in the cell cycle but also in cell metabolism (Figure 3A, B). Since metabolic reprogramming is a well-established hallmark of cancer, alterations in the expression of metabolism-related proteins are common in tumors [24]. According to Figure 4, the metabolism-related DEPs were roughly divided into two groups: carbohydrate metabolism-related proteins and amino acid metabolism-related proteins. ENO3, FBP1, FBP2, GPD1, and ALDOB are all glycolytic pathway-related proteins with inhibitory effects on tumors [25-28]. For instance, ALDOB disrupted redox homeostasis by reducing the levels of fructose 1,6-bisphosphate in tumor cells, which could inhibit tumor cell proliferation [26]. Previous research also showed that although gluconeogenesis was frequently suppressed in tumors, re-expression of gluconeogenesis enzymes such as FBP1 could inhibit tumor growth [28]. As an enzyme responsible for the biosynthesis of arginine in most body tissues, ASS1 was downregulated in multiple diverse cancers, reprogramming arginine metabolism to make tumor cells more aggressive [29]. Moreover, in our results, these tumor-inhibiting metabolism-related proteins were also down-regulated. In contrast, PYCR1, another protein related to amino acid metabolism, was highly expressed; it maintains the redox balance of tumor cells and prevents apoptosis by synthesizing proline [30].
In addition to the DEPs associated with metabolism and cell proliferation, quite a few DEPs were associated with the extracellular matrix. As a large extracellular matrix proteoglycan, VCAN regulated proliferation, adhesion, invasion, and metastasis in a vast majority of tumor cells, and VCAN expression was associated with poor prognosis in most cancers [31-33]. THBS2 was also an extracellular matrix protein and promoted cell migration and angiogenesis [34]. Unlike VCAN and THBS2, DCN, although associated with the extracellular matrix, could antagonize many tyrosine kinase receptors to inhibit tumor development and progression [35]. Taken together, of the four DEP interaction clusters, one was involved in cell growth and division, one in carbohydrate metabolism, one in amino acid metabolism, and one in extracellular matrix regulation. To summarize, the functions of the 69 DEPs fell into three main categories: cell proliferation and division, cellular metabolism, and extracellular matrix regulation.
In the following step, we performed Kaplan-Meier survival analyses of the 69 DEPs one by one and found that only 10 DEPs were significantly correlated with survival in multiple cancers. Of the ten proteins, the studies discussed above identified RRM2, PLOD2, MKI67, MCM5, and CDK1 as promoting cancer progression and FBP1, FBP2, ENO3, GPD1, and ASS1 as inhibiting it, which was consistent with our results (Supplementary Figure 1). [...] identified to contribute to the prognosis of many cancers [18,25,40-43]. Although IL33 and EHD3 did not belong to any of the three groups mentioned above, some studies showed that they could inhibit the proliferation of tumor cells [44,45]. In addition to these widely studied proteins, there were still several proteins whose roles in cancer were unclear, such as PLCX3, PHYHD1 and PLIN3, which provides a new direction for cancer research. Although no research has yet explored the specific ways in which they interact, correlation analysis showed that PGM5 was related to IL33 and DES. Therefore, we inferred that PGM5 may be involved in the regulation of tumor inflammation and the extracellular matrix by regulating metabolism. Based on the 24 DEPs-based classification, we divided the cancer patients into two groups in the training cohort. The Kaplan-Meier survival analysis and the ROC analysis showed that the 24-DEPs survival-predictor model was a better predictor than any single protein (Figure 6B, C). We further verified this grouping method in the validation cohort, where the two groups also showed significantly different survival rates (Figure 6E). Therefore, the DEPs-based survival-predictor model predicted survival well and is applicable to most cancers, which could contribute to therapeutic decision-making.
However, this study has several limitations. First, it mainly explored the effect of the differentially expressed proteins on predicting the OS of multiple cancers. It would be interesting to combine proteomics with genomics and even metabolomics to predict pan-cancer OS in the future. Second, the current study was a retrospective study based on the CPTAC database, so more prospective studies are still needed. Moreover, the protein data in this study were derived from clinical tissue specimens, which limits clinical application. It would be clinically valuable to discover tumor biomarkers in easily accessible blood samples.

Conclusions
In summary, our study screened 69 differentially expressed proteins in five cancers. We then confirmed that these DEPs were mainly associated with cell proliferation and division, cellular metabolism and the extracellular matrix. Using the LASSO regression method, we selected 24 DEPs. Notably, the DEPs-based survival-predictor model could accurately predict OS in multiple cancers. To our knowledge, this is the first study to use proteomics to construct a pan-cancer prognosis model, and the results indicate that pan-cancer analysis may complement single-cancer analysis in the identification of prognostic differentially expressed proteins.

Availability of data and materials:
The datasets used during the current study are available from the corresponding author on reasonable request.

Competing interests:
The authors declare that they have no competing interests.
Funding: This work was not supported by any funding.
Authors' contributions: HQS designed the current study. WGH, WQW and YKX analyzed the data and wrote the manuscript. All authors read and approved the final version of the manuscript and agreed to be accountable for all aspects of the research in ensuring that the accuracy or integrity of any part of the work is appropriately investigated and resolved.

Figure 1. The workflow of this work.
