Prognostic score model-based signature genes for predicting the prognosis of metastatic skin cutaneous melanoma

Purpose: Cutaneous melanoma (SKCM) is the most invasive malignancy of skin cancer. Metastasis to distant lymph nodes or other system is an indicator of poor prognosis in melanoma patients. The aim of this study was to identify reliable prognostic biomarkers for SKCMs. Methods: Four RNA-sequencing datasets associated with SKCMs were downloaded from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) database as well as corresponding clinical information. Differentially expressed genes (DEGs) were screened between primary and metastatic samples by using MetaDE tool. Weighted gene co-expression network analysis (WGCNA) was conducted to screen functional modules. A prognostic score (PS)-based predictive model and nomogram model were constructed to identify signature genes and independent clinicopathologic factors. Results: Based on MetaDE analysis and WGCNA, a total of 456 overlapped genes were identified as hub genes related to SKCMs progression. Functional enrichment analysis revealed these genes were mainly involved in the hippo signaling pathway, signaling pathways regulating pluripotency of stem cells, pathways in cancer. In addition, eight optimal DEGs (RFPL1S, CTSV, EGLN3, etc.) were identified as signature genes by using PS model. Cox regression analysis revealed that pathologic stage T, N and recurrence were independent prognostic factors. Three clinical factors and PS status were incorporated to construct a nomogram predictive model for estimating the three years and five-year survival probability of individuals. Conclusions: The prognosis prediction model of this study may provide a promising method for decision making in clinic and prognosis predicting of SKCM patients.


Introduction
Skin cutaneous melanoma (SKCM) is an aggressive malignancy with the highest mortality in skin cancer. In general, it is characterized by high grade malignancy, and most patients are diagnosed at an advanced stage, which leads to missed therapeutic opportunities [1]. According to cancer statistics in 2020, approximately 95,710 new cases of skin melanoma are diagnosed in US [2]. Although recent checkpoint blockade immunotherapy (such as PD-1/PD-L1 antibodies) and targeted therapy have contributed to medical breakthroughs, the 5-year overall survival rate of metastatic SKCM is still less than 5% due to local recurrence and metastasis [3,4]. Therefore, it is urgent demands to develop more superior reliable biomarkers for early detection and prognosis prediction in SKCMs.
Metastasis of melanoma was associated with activation of several pathways, such as epithelial-mesenchymal transition (EMT), angiogenesis, pericytic mimicry and extravascular migratory [5,6]. However, the genetic and molecular mechanism of metastasis remains unclear. Cutaneous melanoma can metastasize haematogenously or through the lymphatic system, and three exist predominant models can explain the progression, stepwise spread, simultaneous spread and differential spread model [7,8]. Brain metastases are particularly common in metastatic melanoma patients and they can also metastasize to liver, bones, or distant lymph nodes [9]. Nowadays, the increasing application of RNA-sequencing technologies has shown effective methods for understanding tumor metastasis. With the development of gene expression profiling, researchers have identified gene expression signatures that are associated with metastasis or survival outcomes [10,11]. Recently, bioinformatic analysis of public databases has been broadly used to explore prognostic biomarkers in disease progression, and the predictive models can also be utilized to assess prognosis in melanoma patients [12,13]. Thus, the identification of reliable signature genes and clinical factors will provide a guide for decision-making in clinical.
In this study, we comprehensively analyzed the RNA-sequencing profiles of SKCMs by using weighted gene co-expression network analysis (WGCNA) and bioinformatics analysis. Based on prognostic score (PS) model and nomogram model, we screened eight signature genes and three clinical characteristics for prognosis prediction of SKCMs. Our findings might provide reliable prognostic biomarkers in SKCMs.

Data resource
RNA-sequencing profiles of SKCMs were derived from The Cancer Genome Atlas (TCGA) (https: //gdc-portal.nci.nih.gov/) on September 9, 2020 and tested on Illumina HiSeq 2000 RNA sequencing platform. The profile included 473 SKCM tumor samples, of which 458 SKCM tumor samples with metastasis and clinical survival information were selected as training data set.
In addition, microarray data of SKCM were downloaded from the Gene Expression Omnibus (GEO) (http: //www.ncbi.nlm.nih.gov/geo/) based on following screening criteria: 1) transcriptional expression profiles; 2) skin solid tissue samples from SKCM patients; 3) human expression profiles; 4) individuals containing metastasis information; 5) samples counts should be more than fifty; 6) information with survival and prognosis. Finally, we obtained three datasets associated with SKCM, including GSE46517 [14], GSE7553 [15], and GSE65904 [16,17]. The whole testing platform and corresponding usage of these profiles were shown in Table 1. Moreover, annotation of each platform involved in Table 1 were download from Ensembl genome browser 96 databases [18] (http: //asia.ensembl.org/index.html), including RNA types, gene symbol and probe, etc. After re-annotate the detection probes in the expression profile dataset, the corresponding long non-coding RNAs (lncRNAs) and mRNA expression level were finally obtained.

Screening DEGs
MetaDE tool [19,20] in R3.6.1 software was utilized to screened significantly differentially expressed RNAs from three databases of TCGA, GSE46517 and GSE7553. Heterogeneity of gene expression value was tested with parameters of tau 2 , Q-value and Qpval. The tau 2 represents the amount of heterogeneity, and Qpval represents the heterogeneity of data set. If tau 2 = 0 and Qpval > 0.05, it means that the gene was homogeneous and unbiased. Therefore, false discovery rate (FDR) < 0.05, tau 2 = 0 and Qpval > 0.05 were set as thresholds to ensure homogeneity of signature genes. Combining with |log2 fold-change (FC)| parameter, the genes with the same direction of logFC in the three datasets is screened as candidate genes for further analysis.

WGCNA
WGCNA is a method for screening diseases-related modules from thousands of genes, and then incorporating these modules to clinical characteristics [21]. In this study, we used WGCNA version 1.61 packet [22] to screen disease-related modules. Profiles of TCGA were set as training set, while GSE46517 and GSE7553 were set as validation set. All algorithms were conducted following scale-free features. The correlation of co-expression matrix was analyzed and adjacency functions were defined. By the criteria of gene counts ≥ 50 and cut Height = 0.995, we identified the modules associated with disease traits. The stability of modules across differential datasets was calculated based on two validation sets.
The mRNAs screened from MetaDE analysis were corresponding to WGCNA modules to identify overlapped genes significantly associated with disease traits. Hypergeometric algorithm [23] were conducted to calculated enrichment parameter and significance P value of DEGs in each module. Fold enrichment > 1 and P < 0.05 were set as thresholds for hub gene screening. where "N" and "M" represent the whole gene cohorts and module-related genes screened by WGCNA. "n" and "k" represent the whole gene counts and module-related genes screened by MetaDE analysis.

Co-expression network construction
To identify crucial genes related to SKCM metastasis, we calculated the Pearson correlation coefficient (PCC) of DEGs in TCGA training set by using cor function. We constructed the co-expression network for the overlapped genes or lncRNAs screened by MetaDE analysis and WGCNA. Cytoscape 3.6.1 [24] were utilized to visualize the connection of gene pairs. GO and KEGG pathways analysis were conducted by using DAVID version 6.8 [25] tool, and p value less than 0.05 was selected as the threshold.

Identification optimal RNAs for prognostic risk prediction
Cox regression analysis were conducted to screen DEGs associated with survival time in TCGA dataset by using survival package [26]. Log-rank p < 0.05 was selected as a threshold.
Next, cox-proportional hazards (Cox-PH) model [27] in penalized package version 0.9-50 [28] were used to further screen the optimal RNAs associated with SKCM prognosis. The optimal parameter "lambda" in the screening model is obtained through cycling 1000 cross-validation likelihood calculation. PS model were constructed on the basis of least absolute shrinkage and selection operator (LASSO) coefficient and expression value of DEGs in training set. The PS calculation formula is as follows: Prognostic score (PS) = ∑βRNAs × Exp RNAs where βRNAs represents the LASSO prognostic coefficient, Exp RNAs refers to expression level of RNAs in training set.
To validate the performance ability of this predictive model, we calculated PS value of each sample, and divided samples of training set and validation set into high and low risk groups by setting median PS value as criteria. The correlation between risk grouping and overall survival time was assessed by using survival package.

Identify independent prognostic clinicopathologic factors
Univariate and multivariate cox regression analysis were conducted to screen the clinical factors correlated to survival times of SKCMs. Log-rank p value < 0.05 was selected as a threshold. Since then, to further explore the correlation of risk grouping and variables, individuals could be divided into differential subgroups on the basis of clinicopathologic parameters, such as pathologic T, N, recurrence, etc.
In addition, nomogram model incorporating three factors and PS status were constructed by using rms package [29] version 5.1-2 for prediction of 3-and 5-year survival probability of individuals. Nomogram is a graphic scale to visualize the contribution of each factor to survival probability prediction of an individual patient, and it has been widely applied in survival time prediction in multiple cancers [30,31], such as bladder cancer, lung cancer, renal cancer, etc.

Functional analysis of DEGs between high-risk group and low-risk group
The samples in the training set were divided into high-risk and low-risk two groups according to PS model. The DEGs between two groups were screened by using limma package version 3.34.7. FDR < 0.05 and |log2FC| > 1 were considered as screening threshold. Gene set enrichment analysis (GSEA) [32] was carried out to identify potential pathways related to risk, and P < 0.05 was set as criteria. According to gene expression analysis, we screened numerous of differential expressed RNAs from three datasets (TCGA, GSE46517 and GSE7553). After calculated parameters (pval, FDR, tau 2 , Qpval and Qval) of each gene, we finally identified 1463 differential expressed RNAs between primary and metastasis samples, including 11 lncRNAs and 1452 mRNAs. The clustering analysis results of heatmap showed these differential expressed RNAs could be clustered into the same column and rows, indicating these genes might be crucial genes related to metastasis of SKCMs ( Figure 1).

Identification functional modules
While the power value was set as 14, the median connectivity of RNAs satisfied the scale-free network (Figure 2A). CutHeight was chose as 0.995, and three gene hierarchical clustering trees (dendrogram) were generated based on TCGA, GSE46517, and GSE7553 dataset, including 12 co-expression modules ( Figure 2B). A heatmap were generated to visualize the associations of module and clinical traits ( Figure 2C). Four modules with preservation Z score > 5 and P value < 0.05 were identified as stable modules associated with clinical traits ( Table 2). We compared the differential expressed RNAs from MetaDE and WGCNA package, and found 456 overlapped genes with consistent expressed status ( Figure 2D). These genes were mainly enriched in brown and turquoise modules according to the preset thresholds (Fold enrichment > 1, P value < 0.05), including 40 and 50 mRNAs respectively ( Figure 2E). .

Co-expression network construction
We explore the correlation-ships of dysregulated mRNAs or lncRNAs related to SKCM prognosis. By the criteria of absolute PCC value higher than 0.4, we screened 458 pair of connections to construct network ( Figure 3A). These genes were mainly related to 12 BP and 6 KEGG pathways (Table 3, Figure 3B), such as process of transcription, DNA-templated, protein autophosphorylation, DNA-templated, anatomical structure morphogenesis, pathways in cancer, hippo signaling pathway, etc.

Prognostic model for establishing optimal RNAs
The whole samples in training set and validation set were divided into low-and high-risk group after calculating PS value. We screened 41 RNAs significantly correlated to prognosis of disease by univariate analysis. Multivariate analysis showed 14 RNAs with independent prognostic values were identified as hub genes, including 2 lncRNAs and 12 mRNAs. Using cox-proportional hazards model, we final selected 8 RNAs as optimal signature genes associated with SKCM prognosis, such as RFPL1S, CTSV, EGLN3, ESRP1, HESX1, MANEA, PKP1 and PRSS8 (Table 4).  PS value of each sample was calculated based on expression value and LASSO coefficient of eight signature genes. The solution paths and parameters of lasso regression model of 8-genes prognostic risk model were shown in Figure 4. Individuals of dataset can be divided into low-and high-risk groups. Accuracy of risk grouping was further validated by evaluating the correlation between actual survival rate and risk grouping. Patients in high risk groups exhibited a poor prognosis than those in low risk groups, indicating a significant consistency of disease grouping and actual disease prognosis ( Figure 5A,B).

Independent prognostic clinical factors
Three clinical factors (pathologic N, T, recurrence) and PS status were identified as independent prognosis-related factors ( Table 5, Table S1) according to regression analysis. Patients could be divided into differential groups according to three clinical factors. Kaplan-Meier curve demonstrated patients in groups of high-grade pathologic T, N and recurrence statue exhibited a poor prognosis. The results were consistent with actual disease progression ( Figure 6A). Furthermore, individuals could be divided into differential subgroups based on pathologic stage and recurrence status. The patients in differential subgroups were further divided into high and low risk groups according to PS value. Patients in low risk groups exhibited a better prognosis than those in high risk groups ( Figure 6B).

Nomogram model to predict survival probability of individuals
Nomogram model incorporating four variables (pathologic N, T, recurrence and PS status) were constructed to predict 3-and 5-year survival probability of individuals (Figure 7). The total score of each factor corresponded to an estimated survival rate in nomograms. "Total Points" axis in the first row incorporates the four variables to predict prognosis of patients. Samples of TCGA set were divided into two groups after calculating PS values. We finally obtained 662 DEGs (including 427 downregulated and 235 upregulated genes) between the high-risk and low-risk groups, and the volcanic diagram was shown in Figure 8A. The heatmap incorporate the gene expression level and Risk Score to visualize the distribution of crucial genes ( Figure 8B).

Screening of crucial genes DEGs between high-risk group and low-risk group
GSEA results revealed these DEGs were mainly related to eight pathways, such as melanogenesis (Table 6, Figure 9), JAK-STAT signaling pathway, cytokine receptor interaction, neuroactive ligand receptor interaction, etc.

Discussion
In this study, a total of 1463 DEGs were screened between primary tumor and metastatic samples based on three datasets, TCGA, GSE46517 and GSE7553. We conducted WGCNA and identified two stable functional modules. Functional enrichment analysis revealed module-related DEGs were associated with pathways in cancer, hippo signaling pathway, regulating pluripotency of stem cells. After PS model and nomogram model construction, we identified 8 optimal signature RNAs (such as lncRNA-RFPL1S, CTSV, ESRP1, etc.) and three independent clinical factors (Pathologic N, T and Recurrence) for prognosis prediction of individuals.
In this study, a lncRNA-mRNA network was constructed by differentially expressing lncRNAs and DEGs. In order to elucidate the underlying mechanisms, pathway enrichment analysis was performed on these lncRNAs and DERs. The results of the study showed that these DEGs were significantly associated with the Hippo signaling pathway, Signaling pathways regulating pluripotency of stem cells and Pathways in cancer. The Hippo pathway is a critical regulator of organ growth and cell fate that is dysregulated in many cancers [33]. Cancer malignancy has been linked to distinct subsets of stem-like cells, known as cancer stem cells, which persist during treatment and appear to lead to drug-resistant recurrence [34]. Besides, melanomas develop from melanoma-competent melanocyte stem cells in response to UVB stimulation, which causes melanocyte stem cell activation and translocation through an inflammation-dependent process [35]. These findings indicate that the Hippo signaling pathway, Signaling pathways regulating pluripotency of stem cells and Pathways in cancer are important for SKCM, and that disease-related RNAs RNAs are involved in the regulation of these pathways in SKCM.
We explored the potential biological function of eight signature genes by searching the published papers. IncRNAs are types of RNAs molecules without protein encoding function, and have been reported involved in the tumorgenesis and development. LncRNAs-RFPL1S is antisense to RFPL1 gene and may function in the post-transcriptional regulation of this gene [36]. RFPL transcripts encode proteins with tripartite structure and it has been reported regulating cell-cycle progression through cyclin B1/Cdc2 degradation [37]. A recent study showed RFPL1S were down-regulated in brain tumors samples and expression patterns of this lncRNA were correlated with malignancy grade in gliomas [38].
CTSV or cathepsin L2 is a lysosomal cysteine proteinase that associated with extracellular-matrix degradation and tumor progress. High expression of CTSV is reported to involved in metastasis and poor prognosis in breast ductal cancer, and it is a potential biomarker for prognosis prediction breast cancer [39]. Moreover, two gene CTSV and CTSC were involved in invasion of RCC, and a recent study showed praeruptorin B could promote the metastatic ability of RCC cells through targeting CTSC and CTSV expression [40]. In addition, ELGN3 or PHD3 is a member of PHDs, which induced in hypoxia. The biological role of PHD3 was related to HIF-1α hydroxylation, suppression of tumor angiogenesis under hypoxic conditions [41,42]. Further study demonstrated the tumor inhibition role of EGLN3, and it could regulates p53 stability by hydroxylating proline 359 and resulted to cell cycle arrest and apoptosis [43]. In murine melanoma model, treatment of PHD3 inhibitor could decreases tumor growth and angiogenesis through increasing sVEGFR-1 generation from tumor-associated macrophages, indicating a critical role PHD3 in tumor progression [44]. Whereas, PHD3 might function as the tumor promotion factor in some tumors. By integrative -omics and HLA-ligand omics analysis, Anna identified that EGLN3-derived peptides were associated with higher infiltration of tumor by CD8+ T cells; functional analyses revealed EGLN3 might play the role of pro-proliferative and anti-apoptotic in several RCC cell lines [45]. Moreover, EGLN3 were also elaborated that related to chemotherapy resistance of cancer, such as prostate cancer and pheochromocytoma [46,47].
Furthermore, protein of ESRP1 belong to RNA-binding proteins family and it involved in tumor progression through regulating of gene posttranscriptional process [48]. Studies have shown that downregulation of ESRPs was associated with tumor infiltration in various type of cancers, including prostate cancer, breast cancer and pancreatic cancers [49,50]. Recently, ESRP1 was identified as potential prognostic biomarker in cutaneous malignant melanoma [51], and patients with higher ESRP1 level had a poor overall survival rate than those with lower ESRP1 level; the further analysis showed ESRP1 was negatively associated with infiltration of DC and Treg cells. It is well known that DC can promote tumor metastasis by up-regulating Treg cells and down-regulating the cytotoxicity of CD8+ T cells [52]. In fact, data analysis in our study also identified dysregulated ESRP1 from metastatic SKCM sample, which consistent with previous studies. In addition, PKP1/2/3 played an potential role in tumor invasion and metastasis in various malignancy and PKP1 mutation could result in skin fragility syndrome [53,54]. Lee et al reported that phosphorylation of Pkp1 by RIPK4 (receptor-interacting serine-threonine kinase 4) regulated epidermal differentiation and skin tumorigenesis [55]. Moreover, the serine protease PRSS8 could suppress colorectal carcinogenesis and metastasis [56]. Hypermethylated PRSS8 in ESCC tissues was linked to the downregulation of PRSS8; The reduction of PRSS8 was well correlated with shorter survival time in ESCC patients [57]. High levels of PRSS8 and prostasin has been identified as potential clinical biomarkers for ovarian cancer early detection [58]. Taken together, these findings indicated that protein of CTSV, ESRP1, ELGN3 might be critical genes associated with tumor metastasis, and the eight genes may be reliable biomarkers for predictive of SKCM prognosis.
Univariate and multivariate cox regression analysis revealed three clinicopathological variables and PS status were correlated to prognosis of SKCMs. Nomogram model incorporating these factors to predict 3-and 5-year survival probability of individuals. Previous studies showed this model has been widely used to provide prognosis prediction for cancer patients. A recent study showed six clinical factors were significantly correlated with sentinel node status, which is major factor of melanoma prognosis; six variables-based nomogram model exhibited a good predictive performance and suggested to be a decision aid for T1 melanoma being considered sentinel node biopsy [59]. Using nomogram model, another paper identified five immune-related genes and several clinical parameters for prognosis prediction of SKCMs [60], such as pathological T, N, AJCC stage etc.. In our study, we selected common clinical factors of SKCMs, and these clinicopathologic parameters were relative easily available and comprehensive in clinical work. Moreover, our nomogram exhibited a better discrimination ability for predicting prognosis.
The potential limitation of this study should be enumerated. Firstly, individual number was small and the whole data analysis were conducted on the basis of public databases (TCGA and GEO), thus more external samples from multiple medical centers should be concerned validate the performance of prediction models. Secondly, the function of optimal RNAs also needed to be further investigated in SKCM not only by data analysis but also by molecular and cellular experiments.

Conclusions
In conclusion, we screened cohorts of overlapping genes related to SKCM metastasis by using DEGs analysis and WGCNA. On the basis of PS model, we identified eight signature genes and pathologic N, T and recurrence as prognostic factors for disease prediction. Furthermore, eight signature genes based predictive model and nomogram might be useful methods for prognostic prediction of SKCM patients.