Identification and Validation of Prognostically Relevant Gene Signature in Melanoma

Background. Currently, effective genetic markers are limited to predict the clinical outcome of melanoma. High-throughput multiomics sequencing data have provided a valuable approach for the identification of genes associated with cancer prognosis. Method. The multidimensional data of melanoma patients, including clinical, genomic, and transcriptomic data, were obtained from The Cancer Genome Atlas (TCGA). These samples were then randomly divided into two groups, one for training dataset and the other for validation dataset. In order to select reliable biomarkers, we screened prognosis-related genes, copy number variation genes, and SNP variation genes and integrated these genes to further select features using random forests in the training dataset. We screened for robust biomarkers and established a gene-related prognostic model. Finally, we verified the selected biomarkers in the test sets (GSE19234 and GSE65904) and on clinical samples extracted from melanoma patients using qRT-PCR and immunohistochemistry analysis. Results. We obtained 1569 prognostic-related genes and 1101 copy-amplification, 1093 copy-deletions, and 92 significant mutations in genomic variants. These genomic variant genes were closely related to the development of tumors and genes that integrate genomic variation. A total of 141 candidate genes were obtained from prognosis-related genes. Six characteristic genes (IQCE, RFX6, GPAA1, BAHCC1, CLEC2B, and AGAP2) were selected by random forest feature selection, many of which have been reported to be associated with tumor progression. Cox regression analysis was used to establish a 6-gene signature. Experimental verification with qRT-PCR and immunohistochemical staining proved that these selected genes were indeed expressed at a significantly higher level compared with the normal tissues. This signature comprised an independent prognostic factor for melanoma patients. Conclusions. We constructed a 6-gene signature (IQCE, RFX6, GPAA1, BAHCC1, CLEC2B, and AGAP2) as a novel prognostic marker for predicting the survival of melanoma patients.


Introduction
Among all newly diagnosed primary malignancies worldwide (excluding nonmelanoma skin cancer), 232,100 (1.7%) of which are cutaneous melanoma cases [1]. Skin melanoma causes approximately 55,500 annual deaths globally, accounting for 0.7% of all cancer deaths, which ultimately results from the metastasis of melanoma [2]. Metastatic melanoma in the small intestine is common, as the skin melanoma tends to metastasize to the gastrointestinal tract [3].
The morbidity and mortality of this disease vary according to the time of detection and accessibility to treatment. Melanoma can be roughly divided into chronic sun-damaged (CSD) or nonchronic sun-damaged (non-CSD) melanoma, which refers to long-term exposure to sunlight or nonlong-term damage, respectively. CSD melanoma is usually observed in the elderly (>55 years old) and located on the posterior area of the distal head/neck region. This condition is associated with neurofibrin (NF1), NRAS, BRAF, non-V600E, or KIT-associated mutations, which shows high rates of mutation, while non-CSD melanoma usually affects areas of the body that are more frequently exposed to intermittent sunlight (e.g., the torso), and it is observed in younger individuals (<55 years old) who do not show significant solar elastic tissue disease. Non-CSD melanoma is associated with moderate mutation burdens, including BRAF V600E mutations [4]. Various approaches have been applied in the clinical treatment of melanoma, including surgery, targeting agents, and immunotherapy [5]. Even though significant advances in these treatments have been made, there are still more than 95% of patients with melanoma metastases die within one year [6]. Therefore, there exists an urgent need to identify prognostic biomarkers which can aid clinicians to accurately predict clinical outcome of melanoma and provide a reference for personalized medicine.
In the past few decades, a number of genetic or epigenetic changes have been reported to be associated with the development and progression of melanoma. Multiple driver mutations, such as CDKN2A, BRAF, RAS, GNAQ, PTEN, and TP53 have also been related to the occurrence of melanoma [7]. Mutations in RAS can lead to activation of the receptor tyrosine kinase-MAPK pathway in cancer development and BRAF dysregulation which occurs in melanoma progression and shows a strong correlation with melanoma metastasis [8].
A number of studies have been directed towards identifying predictive survival biomarkers and establishing guidelines for the long-term prognosis of melanoma. These potential markers can mainly be divided into two categories: (1) individual molecules as independent prognostic indicators such as MCAM/MUC18 and/or other novel markers currently under study and (2) analyses of high-throughput gene expression profiles, involving several to dozens of prognostic genes for construction of gene signature [9,10]. There exist several biological methods that can be utilized to identify gene biomarkers associated with melanoma prognosis and construct gene features [11][12][13]. However, the prognosis, diagnosis, and treatment strategies of melanoma still need improving. Accordingly, the purpose of this study is to analyze biological functions of bioinformatics to identify gene signals associated with the prognosis of melanoma. Altogether, our findings will provide new prognostic biomarkers of melanoma.
In order to effectively identify a reliable melanoma prognosis-related gene signature, we obtained the large dataset from the TCGA and GEO databases of melanoma patients. Gene expression profiling, single nucleotide mutations, copy number variation data, and screening of prognostic markers by integrating genomics and transcriptomics data were used to create a 6-gene signature. Verification of survival predictions was achieved through internal test sets and external validation sets. We found that this 6-gene signature was involved with important biological processes and pathways in melanoma. Similar results were obtained from GSEA analysis, suggesting that this 6gene signature can effectively predict the prognosis risk of melanoma and provide a basis for a better understanding of the molecular mechanism of melanoma. In addition, the findings can improve the rational use of precise medications for melanoma.

Data Download and
Preprocessing. TCGA RNA-Seq data from the UCSC cancer browser (https://xenabrowser.net/ datapages/), clinical follow-up information, and copy number variation data for the SNP 6.0 chip were downloaded. A mutation comment file (MAF) was downloaded from the GDC client. GSE19234 and GES65094 expression profile data and clinical follow-up information were downloaded from the GEO database and processed them using the R package "GEOquery" to further standardize the data through scale. Initially, the RNA-Seq FPKM data from TCGA were downloaded. We selected half of the samples as the training set and the remainder as the test set. The random seed was set: seed (0). Both the training and the test sets were processed and using the R package "DESeq2" to further standardize the data through scale. The TCGA training set contained 231 samples, the test set contained 231 samples, the GSE19234 contained 44 samples, and the GSE65904 contained 214 samples (we excluded 4 samples out of 214 for missing information of survival). We collected four datasets on the specific distribution of patient age, survival status, gender, T stage, N stage, M stage, and tumor stage. Table 1 shows the demographic and clinical characteristics of the training and validation sets. The result was analyzed by Student's t-test or chi-square test.

Multigroup Data Preliminary Analysis to Obtain
Prognosis Genes. For TCGA training set samples, a univariate Cox regression analysis was used to establish the relationship between overall survival (OS) and gene expression. In this part, we identified 1569 univariate Cox regression genes with p value less than 0.01 as the candidate prognosis genes.

Copy Number Variation Data and Mutation Data
Analysis. For copy number variation data in TCGA, GISTIC 2.0 was used to identify genes with significant amplification or deletion, as based on parameter thresholds for fragments with amplification or deletion lengths greater than 0.1 and p < 0:05. Mutsig2 was used to identify genes with significant mutations, with the threshold required to be p < 0:05. This analysis resulted in a total of 92 genes with significant mutation frequencies. For amplified and deleted genes recognized by TCGA copy number variation, as well as mutated gene integration, we identified a total of 2286 genes involved in biological processes and pathways.

Determination of Best lncRNA Characteristics.
To identify a gene signature, we integrated 2286 genes with copy number amplification, deletion, and mutation and 1569 prognosis-related genes and then selected the intersection of the two groups as candidate genes, which yielded 141 genes ( Figure 1). The random survival forest algorithm was used to rank order prognosis genes (R package random survival forest). The parameters used were n rep = 100, n step = 5, representing the number of Monte Carlo iterations of 100, and the number of previous progressions was 5 [14]. Genes with relative importance greater than 0.6 were used as the final signature.
The scoring formula for each sample was the sum of the above gene expression values * coefficients. We then selected a sample scoring median of -0.03765742 as a cutoff and divided the samples into high-risk group and low-risk group.

Validation Using TCGA Test Sets and Independent Sets.
In order to determine the robustness of the model, we used the same model and the same cutoff as that used in the TCGA training set and validated these results in the TCGA test set and external independent dataset GSE19234 and GSE65904. Further, we assessed the robustness of the model in all samples of the training set and validation set. The same model and cutoff of the TCGA training set were used to verify these findings in all TCGA datasets.

2.7.
Analysis of the Clinical Independence of the 6-Gene Signature Model. To identify the independence of the 6gene signature model in clinical applications, we used the TCGA training set, the TCGA test set, and clinical information contained within the GSE 19234 and GSE65904 data. A univariate and multivariate Cox regression was used to analyze the relevant HR, 95% CI of HR, and p value. We systematically analyzed the clinical information of TCGA, GSE19234, and GSE65904 patient records, including age, gender, pathology T phase, N phase, M phase, tumor stage, and 6-gene signature grouping information.

Use of GSEA to Analyze Pathways Enriched in High-Risk
Group and Low-Risk Group. GSEA was used to determine significantly enriched pathways in the high-risk group and low-risk group of the TCGA training set. The selected gene set involved c2.cp. v6.2. symbols which contained the KEGG, BIOCARTA, and REACTOME pathways. The GSEA input file contained the expression spectrum data normalized by the TCGA training set and the sample label of the 6-gene signature. The sample label was used to mark the sample as high-risk or low-risk group. The threshold for enriched path selection was FDR q < 0:01, which then enabled the identification of significantly enriched paths as summarized in Table S4. Figure 1 is a flowchart of the model construction process.
2.9. Tissue Samples. The tumor and adjacent normal tissues of melanoma were collected from 10 patients (all participants were older than 16 years), immediately placed in liquid nitrogen, and preserved at -80°C. None of the melanoma patients received preoperative antitumor therapies. Patients and their    2.11. Immunohistochemistry. Each group of melanoma samples was fixed in 10% formalin, embedded in paraffin, and processed as 5 μm continuous sections. Samples were dewaxed with discontinuous concentrations of ethanol and blocked to inhibit endogenous peroxidase. They were then heated in a microwave to retrieve antigens, cooled to room temperature, and blocked by incubation in goat serum for 30 minutes at 37°C. Samples were incubated in rabbit anti-IQCE, anti-RFX6, anti-GPAA1, anti-BAHCC1, anti-CLEC2B, and anti-AGAP2 (Abcam, Cambridge, UK; 1 : 1, 200) overnight at 4°C, followed by incubation with horseradish peroxidase-coupled goat anti-rabbit secondary antibody at 37°C for 30 minutes, and stained by 3,3′-diaminobenzidine. The cell nucleus was stained blue by hematoxylin. Sections were then dehydrated, cleared by xylene, and mounted. IQCE, RFX6, GPAA1, BAHCC1, CLEC2B, and AGAP2 expression was detected by immunohistochemistry using a streptavidin peroxidase method. IQCE, RFX6, GPAA1, BAHCC1, CLEC2B, and AGAP2 expression in liver was taken  BioMed Research International as positive control. Samples incubated with PBS instead of IQCE, RFX6, GPAA1, BAHCC1, CLEC2B, and AGAP2 primary antibody were used as negative control. Positive and negative control groups were included for each batch of immunohistochemically stained sections. The experimental procedure was performed by following the manufacturer's instructions strictly.  Table 1. Following the analysis of the relationship between the patient's overall survival (OS) and gene expression by univariate Cox proportional hazard regression analysis, 1569 univariate Cox regression genes with a p value less than 0.01 were identified. Specific information regarding these candidate prognostic genes, HR of the 1569 candidate prognostic genes, coefficients, z-score, and p value used for determining the prognostic information on the upper 20 genes according to their p value is shown in Table 2 with detailed results presented in Table S1.      Figure 2(b) shows a fragment that was notably absent in the melanoma genome. Table S3 contains records of genes that were notably deleted on each fragment, for example, CDKN2A      BioMed Research International patient samples (Figure 3). Figure 3(a) shows the total number of synonymous and nonsynonymous mutations in 92 genes per patient, and Figure 3(b) shows the number of samples in which 92 genes were mutated. From this analysis, 92 genes were identified, some of which were closely related to the development of cancer, such as PTEN, NRAS, BRAF, TP53, and CDKN2A.

Pathways and Biological Processes Involved in Copy
Number Variant Genes and Mutant Genes. For amplified and deleted genes recognized by TCGA copy number variation, as well as mutated gene integration, we identified a total of 2286 genes involved in biological processes and pathways. As shown in Figure 4(a), 2286 genes were significantly enriched in melanoma, glioma, and breast cancer. PI3K-Akt signaling pathway and natural killer cell mediated cytotoxicity. Figure 4(b) displays the 2286 genes which were significantly enriched in biological processes of cancer development, such as system development, regulation of cellular protein metabolic process, regulation of cell proliferation, and cell-cell adhesion.

Random Survival Forests Rank Order of Prognostic Genes.
We integrated 2286 genes with copy number amplification,

12
BioMed Research International deletion, and mutation and 1569 prognosis-related genes and then selected the intersection of the two groups as candidate genes, which yielded 141 genes. The random survival forest algorithm was used to rank order the prognosis genes (R package random survival forest) using the parameters of n rep = 100 and n step = 5, representing that the number of Monte Carlo iterations was 100 and the number of preprogressions was 5. Genes with relative importance greater than 0.6 were identified as comprising the final signature. Figure 5(a) shows the relationship between the error rate and number of classification trees. Figure 5(b) shows the order of importance of the first six genes out-of-bag.
3.6. Creation of a 6-Gene Signature and Division of Samples in the TCGA Training Set. Six genes were identified and subse-quently used to construct a prognostic gene signature. For the 6-gene signature identified above, information regarding the importance of HR, z-score, and p value of the 6 genes and the relative importance is summarized in Table 3.
A multivariate Cox regression analysis method was then used to establish a 6-gene signature. The model used was  14 BioMed Research International The scoring formula for each sample was the sum of the above gene expression values * coefficients, with the sample scoring median of -0.03765742 selected as a cutoff to divide the samples into high-risk group and low-risk group. Figure 5 shows the classification effect in the TCGA training set. 116 patients were classified within the lowrisk group and 115 patients in the high-risk group (Figure 6(a)). Differences between the two groups were statistically significant (log − rank p = 6:861462E − 08). Figure 6(b) shows the ROC curves with AUCs of 0.79, 0.75, and 0.71 for one, three, and five years, respectively. Figure 6(c) shows that as the patient's risk score increased, their survival time significantly decreased, with this effect being more prevalent in the high-risk group.
As the risk value increased, the expression levels of the six different signature genes changed. According to the above results, high expression levels of IQCE, RFX6, GPAA1, and BAHCC1 were associated with increase in risk factors, while the high expression levels of CLEC2B and AGAP2 were related to protection factors.

18
BioMed Research International classified as low-risk and 117 comprising the high-risk group (Figure 7(a)), with differences between these groups achieving statistical significance (log − rank p = 0:029). AUC values from the ROC curves were 0.54, 0.58, and 0.61 for one, three, and five years, respectively (Figure 7(b)). Figure 7(c) shows that similar results obtained from the TCGA training set. As the risk value increased, survival time significantly decreased, with this effect being more predominant in the high-risk group. As the risk value increased, the expression levels of the six different signature genes changed. High expression levels of IQCE, RFX6, GPAA1, and BAHCC1 were associated with high risk factors, while high expression  We also assessed the robustness of the model (the training set and test set) in all samples. In this assessment, the same model as that of the TCGA training set and the same cutoff was used for verification in all TCGA datasets. The classification effect in the TCGA test set is presented in Figure 6. From this analysis, 230 patients were classified as low-risk and 232 as high-risk patients (Figure 8(a)), with differences between these two groups being statistically significant (log − rank p < 0:001). AUC values from ROC curves were 0.7, 0.72, 0.68, and 0.67 for one, two, three, and five years, respectively (Figure 8(b)). Similar results were obtained from the TCGA training set (Figure 8(c)). As the risk value increased, the survival time significantly decreased, with this effect being more predominant in the high-risk group. Moreover, increased expression of the six different signature genes was observed as the risk value increased. High expression levels of IQCE, RFX6, GPAA1, and BAHCC1 were associated with high-risk factors, while high expression levels of CLEC2B and AGAP2 with low risk and served as protective factors.

BioMed Research International
3.8. Verification of 6-Gene Signature Robustness in the External Independent Dataset GSE19234 and GSE65904. The classification effect in GSE19234 and GSE65904 is shown in Figure 9. 18 patients were classified as low risk and 26 patients as high risk (Figure 9(a)), with differences between these two groups being statistically significant (log − rank p = 0:013). AUC values from ROC curves were 0.54, 0.84, and 0.79 for one, three, and five years, respectively (Figure 9(b)). Similar results were obtained from the TCGA training set (Figure 9(c)). 124 patients were classified as low risk and 86 patients as high risk (Figure 9(d)), with differences between these two groups being statistically significant (log − rank p = 0:031). AUC values from ROC curves were 0.68, 0.59, and 0.59 for one, two, and three years, respectively (Figure 9(e)). Similar results were obtained from the TCGA training set (Figure 9(f)). As the risk value increased, survival time significantly decreased, with this effect being more predominant in the high-risk group. Moreover, increased expression of the six different signature genes was observed as the risk value increased. High expression levels of IQCE, RFX6, GPAA1, and BAHCC1 were associated with high risk factors, while high expression levels of CLEC2B and AGAP2 with low risk and served as protective factors.
3.9. Analysis of Clinical Independence of the 6-Gene Signature Model. We systematically analyzed the clinical information of TCGA, GSE19234, and GSE65904 patient records, including age, gender, pathology T phase, N phase, M phase, tumor stage, and our 6-gene signature grouping information as shown in Table 4.
In GSE65904, the univariate Cox proportional hazard regression analysis revealed that the high-risk group was  In addition, we compared some other known models, by studying three recently published lung cancer prognosis model, such as Wu et al. [15], Brunner et al. [16], and Yang et al. [17]. In order to make the model comparable, we carried out the following work. According to the corresponding gene in the three models, we used the same method to calculate the risk score of each sample in the TCGA and assessed the ROC of each model. In addition, we divided the sample into high-risk group and low-risk group according to the median risk score, and then we calculated the OS prognosis between the two groups. We found that the overall performance of our model was better than that of the above three models, as shown in Figures 10(a)-10(c). The restricted mean survival curves of these models were also compared, as shown in Figure 10(d), from which it can be seen that our model has the highest c-index among the four models. That means that our model has an advantage in long-term survival prediction. Meanwhile, we compared the prediction effect of the 6-gene signature with that of the three models through DCA curves, and the results showed that the performance of our model is better than that of the other three models as indicated in Figure 10(e).

Use of GSEA to Analyze Pathways Enriched in High-Risk
Group and Low-Risk Group. We obtained a significantly enriched path as indicated in Table S4. Some examples of significantly enriched pathways were presented in Figure 11, including cell adhesion molecules cams, JAK-STAT signaling pathway, natural killer cell-mediated cytotoxicity, and T cell receptor signaling pathway. All pathways were significantly related to the development and metastasis of melanoma.

Experimental Verification of the Biomarker Screening
Results with qRT-PCR and Immunohistochemistry Analysis. In order to verify whether IQCE, RFX6, GPAA1, BAHCC1, CLEC2B, and AGAP2 were highly expressed in melanoma tissues as predicted, we experimentally confirmed this by qRT-PCR and immunohistochemical staining using melanoma tissues extracted from 10 patients. The qRT-PCR result is shown in Figure 12; IQCE, RFX6, GPAA1, BAHCC1, CLEC2B, and AGAP2 were all highly expressed in melanoma tissues compared with normal healthy control (p < 0:05, Student's t-test, n = 10). In addition, immunohistochemistry analysis demonstrated that IQCE, RFX6, GPAA1, BAHCC1, CLEC2B, and AGAP2 were highly expressed in melanoma tissues compared with normal tissue (Figure 13).

Discussion
Melanoma leads to 90% of mortalities among all skin cancers [18]. The incidence of melanoma in people over 60 years of age has risen sharply, especially in Europe, and the incidence of melanoma continues to increase [19]. Melanoma is a genetically heterogeneous disorder, and the lesions located in different anatomical locations exhibit different molecular features. The latest research progress in melanoma provides possible method for discrete classifications of melanoma that would consider not only the epidemiology and pathology but also the mutational profiles and other novel biomarkers [20]. Melanoma is highly heterogeneous in terms of prognosis as melanoma patients with same TNM stages show different survival time. Moreover, as melanoma is increasingly detected and treated in the early stages, traditional   Figure 11: Enriched pathways in the high-risk group and low-risk group as obtained in the 6-gene signature. 25 BioMed Research International clinicopathological indicators, such as tumor size, vascular invasion, portal vein thrombosis, and TNM staging have become less effective in predicting individual outcomes. This is especially true for risk stratification, as no "one size fits all" treatment strategy has been proven to be effective [21]. It is clear that screening for prognostic molecular markers that fully reflect the biological characteristics of tumors is critical for individualized prevention and treatment of melanoma patients. In this study, we analyzed the expression profiles of 716 melanoma samples from TCGA and GEO databases as associated with OS. Based on this analysis, a related robust 6-gene signature which is independent of clinical factors was generated and verified.
We assessed the effectiveness of this 6-gene signature using multiomics data, including transcriptome, copy num-ber variation data, and mutation data, to identify diseaseassociated genes. Based on a TCGA dataset containing 462 samples, a potential prognostic six-marker lncRNA was identified. The signature included IQCE, RFX6, GPAA1, BAHCC1, CLEC2B, and AGAP2. Among these lncRNAs, high expression levels of IQCE, RFX6, GPAA1, and BAHCC1 were associated with risk factors. High expression levels of CLEC2B and AGAP2 were found to be protective factors. According to the previous reports, RFX6 can be used as a marker of prostate cancer [22,23]. GAPP1 is closely related to prognosis in gastric cancer, head and neck squamous carcinoma, and hepatocellular carcinoma [24][25][26]. BAHCC1 is closely related to prognosis in hepatocellular carcinoma [27]. CLEC2B is a marker of clear cell renal cell carcinoma and other tumors [28][29][30][31]. AGAP2 is closely related to  Figure 12: The expression of IQCE, RFX6, GPAA1, BAHCC1, CLEC2B, and AGAP2 significantly increased in melanoma tissues compared to normal tissues ( * p < 0:01, n = 10). Ca: melanoma tissues; Normal: normal healthy tissues. 26 BioMed Research International prognosis in gastric cancer and prostate cancer [32,33]. IQCE has not been previously reported to be related to cancer. Ours is the first study to suggest that it can be used as new prognostic markers of melanoma. The 6-gene signature of multiomics data recognition is robust and can achieve stable prediction performance in datasets of different platforms. We systematically analyzed the patient record and clinical information in TCGA, GSE19234, and GSE65904 datasets, including age, gender, pathology, T stage, N stage, M stage, tumor stage, and our 6-gene signature group information. Univariate Cox regression analysis and multifactor Cox regression analysis showed that our multiple omics data of 6-gene signature has strong clinical independence and can maintain stable under the influence of multiple clinical factors.
The classification ability of this characteristic lncRNA was verified on a TCGA test dataset and GSE19234 and GSE65904 datasets. Subsequent analysis supported the conclusion that this 6-lncRNA feature showed reliable prediction accuracy. The enriched pathways in the high-risk group and low-risk group obtained by GSEA analysis of 6-gene signature were significantly related to the occurrence and development of melanoma, suggesting its potential as a prognostic marker for clinical diagnosis. For example, cell adhesion molecules cams, JAK-STAT signaling pathway, natural killer cell-mediated cytotoxicity, and T cell receptor signaling pathway were all related to a variety of tumors [34][35][36][37]. To the best of our knowledge, the prognostic value of this multimarker feature in melanoma has not previously been reported. Therefore, our findings provide new insights into improving risk stratification and survival prediction for melanoma patients.
Although we identified potential candidate genes for melanoma prognosis in large samples through bioinformatic techniques, some limitations of this study should be noted. First, the sample lacks some clinical follow-up information. For example, we did not consider factors such as the presence of other health conditions within these patients to distinguish the prognostic biomarkers. Second, although we have carried out experimental verification, our sample size was not large enough. Therefore, further genetic and experimental studies involving larger sample size and experimental validation are required.

Conclusions
In summary, we constructed a 6-gene signature (IQCE, RFX6, GPAA1, BAHCC1, CLEC2B, and AGAP2) as a novel prognostic marker with respectable AUCs in both the training and validation sets in this study, which was independent of clinical features. Compared with clinical features, this gene classifier can improve survival risk prediction. Therefore, we recommend using this classifier to assess the prognostic risk of melanoma.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.