Six-gene signature for predicting survival in patients with head and neck squamous cell carcinoma

The prognosis of head and neck squamous cell carcinoma (HNSCC) patients remains poor. High-throughput sequencing data have laid a solid foundation for identifying genes related to cancer prognosis, but a gene marker is needed to predict clinical outcomes in HNSCC. In our study, we downloaded RNA Seq, single nucleotide polymorphism, copy number variation, and clinical follow-up data from TCGA. The samples were randomly divided into training and test. In the training set, we screened genes and used random forests for feature selection. Gene-related prognostic models were established and validated in a test set and GEO verification set. Six genes (PEX11A, NLRP2, SERPINE1, UPK, CTTN, D2HGDH) were ultimately obtained through random forest feature selection. Cox regression analysis confirmed the 6-gene signature is an independent prognostic factor in HNSCC patients. This signature effectively stratified samples in the training, test, and external verification sets (P < 0.01). The 5-year survival AUC in the training and verification sets was greater than 0.74. Thus, we have constructed a 6-gene signature as a new prognostic marker for predicting survival of HNSCC patients.


INTRODUCTION
Head and neck cancer is a highly heterogeneous malignant tumor, which can originate from various anatomical sites in the upper airway and digestive tract, including the mouth, larynx and pharynx. Most cases of head and neck cancer are squamous cell carcinoma (HNSCC), which accounts for about 4% of all new cancer diagnoses in the United States. Worldwide, there are about 600,000 new head and neck cancer patients each year [1], and the 5-year survival rate is only 40-50% [2]. The main reason for this high mortality is the high rate of diagnosis of advanced cancers, as the survival rate among advanced cancer patients is only 34.9% [3]. There is thus an urgent need for markers to help clinicians make accurate early HNSCC diagnoses, predict clinical outcomes, and provide reference for individualized medicine.
Many studies have been carried out in an effort to find predictive biomarkers to establish guidelines for the long-term prognosis of HNSCC patients. These biomarkers can be divided into two categories: 1) single molecules such as squamous cell carcinoma antigen (SCC-A), human papilloma virus (HPV), or any of the other new markers currently being studied; and 2) gene markers identified by analyzing high-throughput gene expression profiles and constructed using from several to dozens of prognostic genes. Several systems biology methods are currently being used to identify gene biomarkers related to the HNSCC prognosis and to characterize those genes [4][5][6]. For example, Tian et al. used weighted gene correlation network analysis and least absolute shrinkage and selection operator Cox regression to identify a 6-lncRNA signature within a gene expression profile. De et al. used gene expression meta-analysis to identify a 172-gene signature. And Zhao et al. used protein-protein interaction network analysis and Cox regression analysis to identify a 3gene signature. All three authors tested their genetic signature using independent external data sets. However, the AUC for Zhao et al's genetic signature is not high (AUC=0.6), which means that identifying robust lncRNA signatures is still a challenge, and more investigation will be required to verify signatures. In other words, there is still an important need to identify new gene signals related to the prognosis of HNSCC through bioinformatics analysis of their biological functions.
To effectively identify a reliable gene signature associated with prognosis in HNSCC, we proposed a systematic pipeline to identify HNSCC-related gene markers. This approach enabled us to identify a 6-gene signature that can be used to effectively predict prognostic risk in HNSCC patients and provide a basis for better understanding of the molecular mechanisms underlying the prognoses of HNSCC patients.

Identification of gene sets associated with total patient survival
For The Cancer Genome Atlas (TCGA) training set samples, we used single factor regression analysis to establish the relationship between overall survival (OS) and gene expression. We identified 425 single factor Cox regression logrank P values less than 0.01. Of those, 141 genes were associated a hazard ratio (HR) > 1, and 284 with a HR < 1. The 20 genes with the highest HRs are listed in Table 1.
Using TCGA mutation annotation data with Mutsig2, we identified genes with significant mutations. A total of 302 genes with significant mutation frequencies were detected (Supplementary Table 2). The most significant types mutations in the Top 50 genes were synonymous mutations, missense mutations, frame insertions or deletions, frame movement, nonsense mutations, distribution of shear sites, and other non-synonymous mutations ( Figure 1C). It was clear that there were differences in the mutations to different genes, including B2M, CDKN2A, PTEN, TP53 and FBXW7. A common feature, however, is that mutation of these genes is reported to be closely related to the occurrence and development of tumors [7][8][9][10].

Functional analysis of genome variant genes
To analyze the function of genomic variant genes, we integrated 1321 amplified or deleted genes and significantly mutated genes identified based on copy number variation. Gene Ontology (GO) biological process and Kyoto Encyclopedia of Genes and Genomes (KEGG) functional enrichment analyses were  Figure 2A). GO terms such as developmental process, positive regulation of cellular process, cell differentiation, and regulation of localization were mainly enriched in the "biological process" category ( Figure 2B). These terms were also closely related to the occurrence and development of cancer; that is, these genes exhibiting genomic variation are closely related to cancer.

Identification of a 6-gene signature for head and neck cancer survival
To identify a gene signature, we integrated genomic variant and prognosis-related genes, and then selected the intersection of the groups as candidate genes, which yielded 36 genes. We then used random forests for feature selection. The relationship between error rate and number of taxonomic trees was used to reveal genes with relative importance greater than 0.4 as the final signature ( Figure 3A). Ultimately, we identified 6 genes ( Table 2). The important order of the out-of-bag scores for the 6 genes is displayed in Figure 3B.
The risk score of each sample was calculated, and the samples were grouped according to the median risk score (cutoff = -0.0236503). The prognoses of the highrisk and low-risk groups significantly differed ( Figure  3C). The average 1-, 3-, and 5-year AUC for the 6-gene signature was 0.75 ( Figure 3D). High expression of PEX11A, NLRP2, SERPINE1, UPK2 and CTTN was associated with high risk, while high expression of D2HGDH was associated with low risk and was a protective factor.

Verification of the robustness of the 6-gene signature model
To verify the robustness of the 6-gene signature model, we first calculated a risk score for each sample in the AGING test set. Based on the threshold for the training set, the samples were divided into two groups with significantly different prognoses ( Figure 4A). The relationship between the expression of the 6 genes and the risk score was also consistent with the training set ( Figure 4C).
We similarly applied the model to all TCGA tumor samples and found that the low risk group fared significantly better than the high-risk group ( Figure 4B). The relationship between the expression of these 6 genes and the risk score was also consistent with the training set ( Figure 4D). Overall, the model effectively provided prognostic classifications with the TCGA datasets.
To verify the classification performance of the 6-gene signature model with different data platforms, we used GEO platform data as external datasets, used the model to calculate a risk score for each sample, and used the cutoff for the training set to divide the samples into high-risk and low-risk groups. The prognosis of the low-risk group was significantly better than that of the high-risk group ( Figure 5A). ROC analysis showed that the 5-year AUC was up to 0.74, compared with the training set. As in Figure 5B, the data in Figure 5C show that the relationship between the expression of the 6 genes and risk score is also consistent with the training set. Thus, the 6-gene signature model we selected was prognostic with both internal and external datasets.

Clinical independence of the 6-gene signature model
To assess the independence of the 6-gene signature model in clinical application, we used single-factor and multi-factor COX regression to analyze HRs, 95% CIs, and P values from TCGA training set, TCGA test set, and the GSE65858 data. We systematically analyzed the clinical information from the patients as recorded in TCGA and GSE65858 datasets, including their age, sex, disease stage, pathological TNM stage, and tumor stage, as well as our 6-gene signature (    These results show that our model 6-gene signature is a prognostic index independent of other clinical factors and exhibits independent predictive performance upon clinical application.

GSEA analysis of pathway differences enriched in the high-risk and low-risk groups
GSEA was used in TCGA training to analyze the pathways significantly enriched in the high-risk and low-risk groups. Twenty enriched pathways were detected (Supplementary Table 3), including focal adhesion, TGF-β signaling pathway, WNT signaling pathway, and ERBB signaling pathway, all of which are closely related to tumor occurrence, development and metastasis. Notably, these pathways are significantly enriched in the high-risk samples ( Figure 6).

DISCUSSION
In terms of prognosis, head and neck cancer is a highly heterogeneous disease in that survival times vary substantially among patients with similar TNM stages.
With the diagnosis and treatment of head and neck cancers at earlier stages, traditional clinicopathological indicators such as tumor size, vascular invasion, portal vein thrombus and TNM stage have proven inadequate for predicting individual outcomes, especially risk stratification, as no one-size-fits-all treatment strategy appears to be effective [11,12]. Consequently, screening prognostic molecular markers that adequately reflect the biological characteristics of tumors would be of great significance for individualized prevention and treatment of head and neck cancer patients. In the present study, we analyzed the expression profiles of 771 head and neck cancer samples from TCGA and the GEO and identified 6 genes robustly associated with OS. This signature is independent of other clinical factors.
Gene signatures are currently being used in clinical practice. Two examples are Oncotype DX [13][14][15], which provides a breast cancer recurrence score based on expression of 21 genes, and Coloprint, which provides a colon cancer recurrence score based on expression of 18 genes [16][17][18]. Results obtained with these assays have shown that screening new prognostic cancer markers based on gene expression profiles is a promising high-throughput molecular identification method. In that regard, Tian et al. [6] identified a 6-gene  ] identified a 172-gene signature through meta-analysis of gene expression. Although the AUC is high, the large number of genes that needs to be detected makes this analysis impractical for clinical use. By contrast, our 6gene signature has a high AUC using only 6 genes, which makes it conducive to clinical application.
The six genes in our signature include PEX11A, NLRP2, SERPINE1, UPK, and CTTN as risk factors, and D2HGDH as a protective factor. It has been reported that NLRP2 can be used as a marker of leukemia [19] and has an important impact on the prognosis after stem cell transplantation [20]. SERPINE1 is closely related to prognosis in ovarian cancer, gastric cancer, thyroid cancer and other tumors [21][22][23][24][25]. CTTN is closely related to prognosis in head and neck cancer, esophageal cancer, thyroid cancer, and glioma, among others [26][27][28][29]. D2HGDH is a marker of colorectal cancer, glioma, and prostate cancer [30][31][32][33]. PEX11A and UPK have not been previously reported to be related to cancer. Ours is the first study to suggest that they can be used as new prognostic markers of head and neck cancer. At the same time, our GSEA results show that the 6-gene signature enrichment significantly correlates with pathways and biological processes associated with the occurrence and development of HNSCC. This indicates that our model has potential clinical application value and could provide a potential target for diagnosis and for development of new targeted therapies that include, for example, novel alkylating agents [34,35].
Although we have identified potential candidate genes affecting tumor prognosis using bioinformatics technology with large samples, our study has limitations. First, the sample lacks some clinical followup information, so we did not consider factors such as the presence of other health conditions to differentiate prognostic biomarkers. Second, the results obtained using bioinformatics analysis alone are insufficient and need to be confirmed through experimental verification. AGING Therefore, further genetic and experimental studies with larger samples and experimental validation are needed.

CONCLUSIONS
In our study, we developed a 6-gene signature prognostic stratification system, which has good AUC values in both the training set and validation set, and is independent of other clinical features. Compared to clinical features, gene classifiers can improve survival risk prediction. We therefore recommend using this classifier as a molecular diagnostic test to assess prognostic risk in patients with head and neck cancer.

Data acquisition and processing
The  Tables 6 and 7). The GSE65858 data set was used as an external verification set for each group of sample information (Table 4).

Univariate Cox proportional risk regression analysis
As described previously by Jin-Cheng et al. [37], univariate Cox proportional hazard regression analysis was performed with each immune gene to screen out genes significantly associated with OS in the training data set. P < 0.01 was chosen as the threshold.

Analysis of copy number variation data
GISTIC software is widely used to detect both broad and focal (potentially overlapping) recurring events. We used GISTIC 2.0 [38] to identify genes exhibiting significant amplification or deletion. The parameter threshold was that the length of the amplification or deletion was more than 0.1, and the fragment was P < 0.05.

Gene mutation analysis
We used Mutsig 2.0 software with TCGA mutation data to identify genes with significant mutations in their MAF files. The threshold used was P < 0.05.

Construction of a prognostic gene signature
We chose the genes that were significantly related to OS and genes that were exhibited amplification, deletion or mutation. We further used the Random Survival Forest algorithm to rank the importance of prognostic genes. Like Jin et al. [39], we used the R package random Survival Forest to screen the prognostic genes. We set the number of Monte Carlo iterations to 100 and the number of steps forward to 5, and identified the genes whose relative importance as characteristic genes was greater than 0.4. In addition, we carried out a multivariate Cox regression analysis, and the following risk scoring model was constructed using the Equation 2: where N is the number of prognostic genes, k Exp is the expression value of the prognostic genes, and HR k e is the estimated regression coefficient of genes in the multivariate Cox regression analysis.

Functional enrichment analyses
GO and KEGG pathway enrichment analysis was performed using the R package cluster profiler for genes [40], to identify over-represented GO terms in three categories (biological processes, molecular function and cellular component) as well as over-represented KEGG pathway terms. For this analyses, a false discovery rate < 0.05 was considered to indicate statistical significance.
GSEA [41] was performed using the http://software. broadinstitute.org/gsea/downloads.jsp website with the MSigDB [42] C2 Canonical pathways gene set collection, which contains 1320 gene sets. Gene sets with a false discovery rate < 0.05 after performing 1000 permutations were considered to be significantly enriched.

Statistical analysis
When the median risk score in each data set was used as a cutoff to compare survival risk between high-risk and low-risk groups, a Kaplan-Meier (KM) curve was drawn. Multivariate Cox regression analysis was used to test whether gene markers were independent prognostic factors. Significance was defined as P < 0.05. All analyses were performed using R 3.4.3.

Abbreviations
HNSCC: head and neck squamous cell carcinoma; TCGA: the cancer genome atlas; GEO: group on earth observations; SNPs: single nucleotide polymorphisms; CNV: copy number variation; SCC-A: squamous cell carcinoma antigen; HPV: human papilloma virus; MFA: the mutation annotation information; OS: overall survival; GO: gene ontology; KEGG: kyoto encyclopedia of genes and genomes; KM: Kaplan-Meier; HR: higher risk; PPI: protein-protein interaction.

CONFLICTS OF INTEREST
The authors declare that they have no competing interests. Please browse Full Text version to see the data of Supplementary Tables 1-6 Supplementary