Construction of a protein-based model for prognosis prediction of kidney renal clear cell carcinoma: an investigation based on functional proteomics data

Several studies have shown prognostic value of gene-based models at mRNA level in kidney renal clear cell carcinoma (KIRC). However, protein-based models for prognosis prediction of KIRC are rarely reported, then we conduct this study. Proteomics and clinical data of KIRC were acquired from The Cancer Proteome Atlas (TCPA) and The Cancer Genome Atlas (TCGA), respectively. Prognosis-associated proteins were screened by univariate Cox regression analysis and log-rank test. Patients were grouped into training set and testing set. The protein-based prognostic model was constructed by lasso Cox regression analysis in training set and validated in test set and whole set. over-expressed


Results
A ve-protein model (ACC1, IGFBP2, MIG6, PEA15 and RAD51) was constructed for KIRC prognosis prediction. It could well classify KIRC patients into high-risk and low-risk group with signi cantly different survival. The results was validated in the testing set and whole set. Age, AJCC stage and risk score based on the ve-protein model were identi ed as independent prognostic parameters and they were used to construct a nomogram. The calibration plot showed the nomogram had good agreement between predicted and actual outcomes. Time-dependent ROC curves revealed the nomogram performed best in predicting the 1-year, 3-year and 5-year overall survival (OS) compared with other independent prognostic parameters. DCA demonstrated the nomogram had obviously clinical net bene t. Furthermore, several cancer-related biological signaling pathways were enriched in the functional enrichment analysis.

Conclusion
Our study developed an effective protein-based model to predict the OS of KIRC, which may help clinicians to offer individual treatment.

Background
Kidney cancer is one of the most lethal urological diseases globally. Kidney renal clear cell carcinoma (KIRC) is the highest mortality, invasion, and metastasis subtype, which accounts for about 70-75% of kidney cancer [1,2]. With the increasing incidence and mortality in recent years, KIRC remains a big challenge worldwide [3]. It was reported that the newly diagnosed cases in the USA had increased to 73,820 in 2019 and resulted in nearly 14,770 deaths [4]. Approximately 15% RCC patients have progressed into distant metastasis at clinical diagnosis, and the 5-year survival rates are less than 10% in patients with metastasis [5]. Thus, accurate prognostic models are urgently needed to distinguish the high-risk patients with poor survival. Conventional parameters such as TNM stage are not able to provide enough accurate predictions because of the great heterogeneity of KIRC and the predictive ability is far from satisfying [6].
In recent years, with widespread use of high-throughput sequencing technology to screen biomarkers of cancer, several studies have shown the great potential of gene-based model at mRNA level in predicting the survival of KIRC [7][8][9]. Since lots of gene regulation happens post-transcriptionally, functional proteomics still represents a powerful approach to reveal the pathophysiology of cancer. However, protein-based models for prognosis prediction of KIRC are rarely reported. Our study aimed to identify a prognostic model base on proteins for KIRC using data from The Cancer Proteome Atlas (TCPA).
TCPA is a comprehensive, cutting-edge, protein-centric pan-cancer analytic platform which is freely available at http://tcpaportal.org. Using the reverse-phase protein arrays (RPPAs) platform, a technology that can quantitatively assess lots of protein markers in thousands of samples in a sensitive, costeffective and high-throughput manner, approximately 8,000 tumor samples across 32 cancer types through The Cancer Genome Atlas (TCGA) project [10,11]. TCPA has become a vital bioinformatics resource because of the advantage of quantitative protein expression and survival data from TCGA patient.

Data extraction
We downloaded the protein expression data of 445 KIRC patients from TCPA database (http://tcpaportal.org) and the clinical information of 537 KIRC patients from TCGA database (https://portal.gdc.cancer.gov/), respectively. Next, we matched the patients' clinical information and protein expression data and removed the patients with a follow-up period less than one month. Finally, 433 patients with protein expression data and survival information were identi ed. Identi cation of prognosis-related proteins Expression data of 233 proteins were included in each patient. Univariate Cox regression analysis and log-rank test were used to screen the prognostic proteins.
Construction of the protein-based prognostic model 433 patients were randomly grouped into a training set and testing set. The training set was applied to construct the Cox regression hazards model, while the testing set and the whole set was applied to validate the performance of the model. The potential risk proteins were identi ed by Lasso regression analysis and then they were used to construct a protein-based prognostic model. Patients' risk score was calculated by the following computational formula: Risk score = Coe cient (protein 1) x Expression value of protein 1 + Coe cient (protein 2) x Expression value of protein 2 + ⋯ + Coe cient (protein i) x Expression value of (protein i). "Protein i" mean the i th selected protein, and "coe cient (protein i)" mean the corresponding regression coe cient [12]. The optimal cut-off value of risk score was investigated by the R package "survminer" [13]. Patients were separated into high-risk and low-risk group base on the cutoff value. The higher risk score means worse survival for KIRC.
Validation of the protein-based prognostic model The risk score of each patient in the testing set and whole set was calculated with the same prognostic model. Then the ROC analysis, Kaplan-Meier (KM) survival analysis was performed to validate the model's predictive power. Besides, the protein expression level of the prognostic proteins was explored in the Human Protein Atlas online database (http://www.prote inatl as.org).
Independence prognostic value of the protein-based model The independent prognostic value of protein-based model was assessed by the methods of Univariate Cox regression analysis, multivariate Cox regression analysis and strati ed analysis. The association between the risk score and clinicopathological features was also explored.
Constructing and estimating a predictive nomogram All of the independent prognostic parameters identi ed were used to construct a composite nomogram. The calibration plot of the nomogram was applied to check the consistency between predictive and actual outcome of KIRC. Time-dependent ROC curves were plotted to compare the predictive accuracy and decision curve analysis (DCA) was used to compare the clinical net bene t among nomogram and other independent prognostic parameters.

Functional enrichment analysis
We performed the correlation analysis to identi ed the co-expressed proteins with the prognostic proteins in the model. The correlation coe cients > 0.4 and P value < 0.001 was considered signi cantly. Then gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways analysis were conducted on Metascape (http://metas cape.org/gp/ index .html#/main/step1 ).

Statistical analysis
All preprocessing processes and statistical analysis were realized by R software v3. 6

Identi cation of prognosis-related proteins
The study was conducted as shown in the owchart (Fig. 1). We identi ed 77 proteins signi cantly related to overall survival (OS) of KIRC including 35 proteins with HR > 1 and 42 proteins with HR < 1 ( Fig. 2A). It mean the high expression of proteins was associated with poor survival of patients when HR > 1.

Construction of a ve-protein prognostic model in the training set
Totally 433 patients with a follow-up period longer than one month were randomly divided into training set (n = 216) and testing set (n = 217). There was not signi cant difference in clinical characteristic between two groups (Table 1). Lasso Cox regression analysis were performed in the training set ( Fig. 2B) and ve risk proteins were successfully identi ed ( Table 2). The proteins were acetyl-CoA carboxylase 1 (ACC1), insulin-like growth factor binding protein 2 (IGFBP2), mitogen-inducible gene 6 (MIG6), proliferation and apoptosis adaptor protein 15 (PEA15) and RAD51 recombinase (RAD51). Finally, a protein-based prognostic model was established by the ve proteins. The risk score = 0.438 * Expression of ACC1 + 0.251 * Expression of IGFBP2-0.229 * Expression of MIG6 + 0.283 * Expression of PEA15 + 0.348 * Expression of RAD51. According to the optimal cut-off value of risk score, patients were divided into high-risk and low-risk group (Fig. 3A). More dead patients corresponded to the higher risk score in the training set (Fig. 3D). Heatmap of the ve prognostic proteins presented an obviously different expression between the high-risk and low-risk group (Fig. 3G). The KM curve show that patients in high-risk group suffered a signi cantly poorer OS than patients in low-risk group (p < 0.001) (Fig. 4A). The area under the ROC (AUC) for 1-year, 3-year and 5-year OS were 0.789, 0.753, and 0.785, respectively. (Fig. 4D). Taking together, the results indicated that our ve-protein model had a well performance in survival prediction for KIRC.  Validation of the ve-protein prognostic model in the testing set and whole set Validation was conducted in the testing set and whole set with the same method above. Patients were also divided into high-risk and low-risk group with the same cut-off value (Fig. 3B-C) and More death events occurred in the high-risk group (Fig. 3E-F). Different expression of ve proteins was also demonstrated between high-risk and low-risk group (Fig. 3H-I). Meanwhile, the KM curve show the OS was signi cantly poorer in the high-risk group than that in the low-risk group (both p < 0.001) (Fig. 4B-C). The AUC were 0.730, 0.705, 0.725 and 0.763, 0.732, 0.755 for 1-year, 3-year and 5-year OS in the testing set and whole set, respectively (Fig. 4E-F). All of the results were consistent with that in the training set which suggested the reliability of our model.
Moreover, the expression of the ve proteins was also validated in the Human Protein Atlas online database. The immunohistochemistry (IHC) staining show a higher expression of ACC1, PEA15 and RAD51 in tumor tissue than that in normal tissue (Fig. 2C). However, data of IGFBP2 and MIG6 was not found.
Independent prognostic value of the ve-protein model in the whole set Univariate Cox regression and multivariate Cox regression analysis were performed to detect whether the risk score calculated by our ve-protein model had independent prognostic value from other clinicopathological parameters including age, gender, grade and American Joint Committee on Cancer (AJCC) stage. Univariate Cox regression analysis indicated that age, grade, AJCC stage and risk score were related to the OS of KIRC (Fig. 5A). However, only age, AJCC stage and risk score were con rmed as independent prognostic factors by multivariate Cox regression analysis (Fig. 5B). Moreover, strati ed analysis revealed that patients in high-risk group had signi cantly poorer OS than patients in low-risk group according to the age, gender, grade and AJCC stage (all p < 0.001) (Fig. 6A-D). Furthermore, the risk score was also signi cantly increasing with the grade (p < 0.001), AJCC stage (p < 0.001) and age (p = 0.033) progress, but not with gender (p = 0.41) (Fig. 6E-H).

Constructing and estimating the protein-based nomogram in the whole set
In order to construct a more reliable predictive model for clinical practice, a composite nomogram was established including three independent prognostic factors (age, AJCC stage and risk score) (Fig. 5C). The calibration plot indicated that the 1-year, 3-year and the 5-year OS predicted by nomogram was in good agreement with the actual outcome ( Fig. 7A-C). The AUC of nomogram was 0.855, 0.817 and 0.799 for 1-year, 3-year and 5-year OS, respectively, which were larger than that of age, AJCC stage and risk score (Fig. 7D-F). The DCA demonstrated the nomogram could improve clinical net bene t especially for 5-year OS (Fig. 7G-I). Taking together, the results suggested that our protein-based nomogram could perform best in predicting the OS for KIRC.

Functional enrichment analysis
Correlation analysis show that most of the co-expressed proteins were associated with RAD51 (Fig. 8A). Several cancer-promoting proteins were positive related to the ve proteins in the model such as TIGAR and EEF2 (Fig. 8B). Apoptotic signaling pathway, p53 signaling pathway and PI3K-Akt signaling pathway were enriched by the functional enrichment analysis of the co-expressed proteins (Fig. 8C-D).

Discussion
It is still a big challenge globally for the treatment of kidney cancer due to the high incidence and mortality. TNM stage is the most important prognostic predictor for cancers including KIRC, but the clinical outcomes can differ greatly even among patients with the same stage because of the heterogeneity of KIRC [14]. Therefore, novel biomarkers are necessary for the establishment of more accurate prognostic models. Recently, more and more proteins have been reported to be an independent prognostic factors for KIRC, but none study has focused on protein-based models for prognosis prediction of KIRC [15][16][17].
In our study, we rstly screened the prognosis-related proteins with the data form TCPA. Next, a veprotein model (including ACC1, IGFBP2, MIG6, PEA15 and RAD51) was developed and assessed for its prognostic value in KIRC. The risk score calculated by the model was an independent prognostic factor for KIRC and it could e ciently divide patients into high-risk and low-risk group whose OS were signi cantly different. The consistent results were also obtained in the subgroup analysis based on age, gender, grade and AJCC stage. Furthermore, the higher risk score was signi cantly associated with higher grade, AJCC stage and age. Additionally, in order to further improve the predictive power of our veprotein model, a nomogram integrating risk score and other independent prognostic parameters (age and AJCC stage) was constructed. The AUC of the constructed nomogram were 0.855, 0.817 and 0.799 for 1- year, 3-year and 5-year OS, respectively, larger than that of any parameters contained in nomogram. An AUC > 0.60 was regarded as acceptable for predictions, and an AUC > 0.75 was deemed to have excellent predictive value [18,19]. Thus, our nomogram had a perfect predictive power. Additionally, the calibration plot of the nomogram revealed satis ed agreement between the predicted and actual outcome. Finally, the DCA indicated the nomogram could improve clinical net bene t for KIRC. Collectively, these ndings suggested our ve-protein model could effectively distinguish high-risk patients with poor survival and the protein-based nomogram could accurately predict the OS for KIRC which might help clinicians make best clinical decisions.
Cancer-related proteins play a vital role in the survival of patients with cancer, all the ve proteins in our risk model were reported to be involved in cancer. ACC1, IGFBP2, PEA15 and RAD51 are negative prognostic proteins with HR > 1, while MIG6 is a positive prognostic protein with HR < 1. ACC1 is a kind of lipid metabolism enzyme and converts from acetyl CoA to malonyl CoA which is the rate limiting step in de novo fatty acid synthesis (FASyn) [20]. Lots of cancers demonstrate an increase in FASyn and decrease of FASyn inhibits the tumor growth [21,22]. It is observed an overexpression of ACC1 in several kinks of cancers, including breast cancer, lung cancer and prostate cancer [23,24]. Higher ACC1 level is signi cantly related to poor survival in patients with lung cancer [25]. However, the role of ACC1 in KIRC is still rarely reported. IGFBP2 belongs to IGFBP family and functions as a carrier of IGF-I and IGF-II [26]. Several studies have reported that IGFBP2 can promote tumorigenesis, tumor angiogenesis and metastasis [27,28]. Overexpression of IGFBP2 is found to be associated with metastasis and poor survival in lung cancer [29]. Furthermore, IGFBP2 is reported to play important role in renal cell carcinoma metastasis [30]. Thus, IGFBP2 may be a potential target for anticancer therapy. MIG6 is also named epidermal growth factor receptor (EGFR) feedback inhibitor 1, functioning as a negative regulator of EGFR signaling pathways [31]. Downregulation of MIG6 is found in a variety of cancers including breast cancer, lung cancer, papillary thyroid cancers and endometrial carcinoma and is closely related to poor prognosis [32][33][34][35]. Another study shows that MIG6 may suppress the epithelial proliferation through the inhibition of AKT activation [36]. Functional enrichment analysis in our study also revealed that PI3K-Akt signaling pathway was involved in KIRC, but the actual functions of MIG6 in KIRC remain unknown. PEA15 is a member of the anti-apoptotic family and it functions through ERK1/2 binding and deatheffector domain [37]. It is reported that PEA15 promotes liver metastasis of colorectal cancer by ERK/MAPK signaling pathway [38]. Meanwhile, PEA15 can contribute to the AKT-regulated cisplatin resistance and is associated with the survival in gastric cancer [39]. Whether PEA15 plays a similar role in KIRC remains unknown. RAD51, the central homologous recombination protein, is important for the cellular response to DNA damage [40]. It participates in several cellular processes including apoptosis which is the top 1 of GO enrichment analysis in our study [41]. Recent studies revealed that RAD51 was overexpression in many cancers including pancreatic adenocarcinoma, breast cancer and lung cancer [42][43][44]. Evidence also indicated that RAD51 was associated with tumor resistance to chemotherapy and survival [43,45]. Yet, the relationship between RAD51 and KIRC remain unexplored.
As the ve-protein model established in our study is based on protein expression level without any posttranscriptional regulation, it may be more cost-effective and reliable in clinical practice. Although our protein-based model shows a good performance in the survival prediction for KIRC, it is worth noting that several limitations still existed. Firstly, the ve-protein model was generated base on the data from TCPA, in which most patients were Asian or White. Secondly, our prognostic model was not validated in external cohort. Therefore, it is necessary to validate our model in multiple centers with different populations. Finally, since most of proteins contained in the model are rarely explored in KIRC, rigorous basic experiments are indispensable.
In conclusion, our study developed an effective protein-based model to predict the OS of KIRC, which may help clinicians to offer individual treatment.

Declarations
Ethics approval and consent to participate Not required.

Consent for publication
Not applicable.       Strati ed survival analysis of the ve-protein model and the association between the risk score and clinicopathological features. Strati ed survival analysis according to stage (a), grade (b), age (c) and gender (d). The association between the risk score and stage (e), grade (f), age (g) and gender (h).