A prognostic model for oral squamous cell carcinoma using 7 genes related to tumor mutational burden

Oral squamous cell carcinoma (OSCC) is a rising problem in global public health. The traditional physical and imageological examinations are invasive and radioactive. There is a need for less harmful new biomarkers. Tumor mutational burden (TMB) is a novel prognostic biomarker for various cancers. We intended to explore the relationship between TMB-related genes and the prognosis of OSCC and to construct a prognostic model. TMB-related differential expressed genes (DEGs) were screened by differential analysis and optimized via the univariate Cox and LASSO Cox analyses. Risk Score model was constructed by expression values of screened genes multiplying coefficient of LASSO Cox. Seven TMB-related DEGs (CTSG, COL6A5, GRIA3, CCL21, ZNF662, TDRD5 and GSDMB) were screened. Patients in high-risk group (Risk Score >  − 0.684511507) had worse prognosis compared to the low-risk group (Risk Score <  − 0.684511507). Survival rates of patients in the high-risk group were lower in the gender, age and degrees of differentiation subgroups compared to the low-risk group. The Risk Score model constructed by 7 TMB-related genes may be a reliable biomarker for predicting the prognosis of OSCC patients.

. Novel indicators such as immune-related genes [14], systemic inflammatory biomarkers [15], ferroptosisrelated genes [16] are emerging as effective biomarkers to stratify patients with different prognosis. These identified biomarkers provide a relatively comprehensive understanding of prognosis in OSCC and provide an additional tool for selecting patients who need more aggressive treatment. In order to improve accuracy of the prediction, more biomarkers are urgently needed to be explored to provide an additional tool for prediction of prognosis for cancer patients [17,18].
Tumor mutational burden (TMB) is defined as the number of mutations existing within a tumor and is often reported as the number of mutations per DNA megabase of genomic territory [19]. Because of the development of next generation sequencing techniques, a cost-and time-effective sequencing of genes makes significant improvement in detecting gene mutations [20]. Growth and progression of cancers are reported to be related to the immune suppression, and in order to evade immunosurveillance and eradication of the host immune system, tumors often upregulate immune checkpoints [21,22]. Immunotherapies based on immune checkpoint inhibitors (ICIs) have emerged as a new treatment for many types of cancers [23]. High TMB levels is often connected with better prognosis and higher rates of treatment response after ICIs therapy which may attribute to higher potential immunogenic neoantigens facilitating anti-tumor immune response [24,25]. And TMB levels are emerging as a novel prognostic biomarker for the response to immunotherapy in oncology clinic [21,26,27]. Previous studies have reported that cancer patients with higher TMB levels have higher response rates following ICI therapy than those with lower TMB levels, for example non-small cell lung cancer (NSCLC) [28], melanoma [29] and breast cancer [26]. TMB levels are used for the prediction of the prognosis for cancer patients following immunotherapy in solid tumors such as breast cancer, lung cancer and gastrointestinal cancers [30]. And Kang et.al. have reported that TMB was also related to the prognosis of cutaneous melanoma and prognostic model constructed by TMB-related grenes might be used to predict prognosis of cancer patients [31]. These researches support that TMB has the potential as a promising biomarker for predicting the cancer patients with different prognosis [32]. Although previous studied have identified the prognostic signature constructed by TMB-related genes for patients with ovarian cancers [33] and the prognostic value of TMB for patients with head and neck squamous cell carcinoma has also been studied [34], there are few articles about the prognostic value of TMB-related genes for OSCC patients and the prognostic signature constructed by TMB-related genes for OSCC patients has not been throughly explored. We aimed to explore the prognostic value of TMB-related genes for and to build a prognostic signature for OSCC patients. Besides, Risk Score models constructed by molecular biomarkers utilizing LASSO Cox regression analysis have already been used to diagnose and to predict the prognosis of patents in solid tumors [35].
In this study, we explored the connection between TMB-related genes and the prognosis of OSCC patients through bioinformatic analysis. We hoped to construct a prognostic Risk Score model to be helpful in separating patients with different prognosis.

Data collection
Data of the mRNA expression and clinical information about 306 OSCC patients were downloaded from the Cancer Genome Atlas (TCGA, https:// tcga-data. nci. nih. gov/ tcga/) database. The full clinical information of 306 OSCC patients was shown in Table 1. Maf files of 311 OSCC patients were also downloaded from the TCGA database for further analysis and clinical information of 311 OSCC patients was listed in Table 2. Moreover, the dataset GSE41613 was downloaded from the Gene Expression Omnibus (GEO, https:// www. ncbi. nlm. nih. gov/ geo/) database. The dataset was comprised of 97 OSCC patients with complete survival information. The GEO dataset was derived from HPV-negative OSCC patients, while the HPV status of TCGA cohort was unknown. Patients' data were processed by the Affymetrix Human Genome U133 Plus 2.0 Array.

Differential analysis
Analysis of differential expression genes (DEGs) was based on limma function package [36] of R programming software (version4.1.0, the same below). The |Log 2 FC|> 1 and adjusted P value ≤ 0.01 were used to screen DEGs associated with TMB.

Functional enrichment analysis
Functional enrichment analysis was applied to the TMBrelated DEGs using the "clusterProfiler" package [37] of R programming software. The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO, including Biological Process, Molecular Function and Cellular Component) terms were used to examine the enriched GO terms and KEGG pathways. Threshold of adjusted P value < 0.05 of Benjamini and Hochberg (BH) method were used to screen significantly enriched GO terms and KEGG pathways.

Construction of prognostic Risk Score model
Univariate Cox regression analysis, based on the expression values of the 324 TMB-related DEGs, was performed on the 306 OSCC patients with the threshold of P < 0.01 to screen the DEGs which were associated with prognosis of OSCC. LASSO Cox regression analysis was performed on the screened DEGs to optimize the genes using the glmnet package [38] of R programming software. Then Risk Score of each patient was calculated using the following formula based on the screened TMB-related DEGs: In this formula, Coef i and X i are the coefficient calculated by LASSO Cox and expression value of each gene (the expression value of mRNA in this research) respectively. Survival, survminer and two-sided logrank test of R package were used to test the Risk Scores. Patients were assigned into low-risk and high-risk groups according to the median of Risk Score.

Survival analysis
We used the survival and survminer packages of R programming software to estimate the OS rates of different groups using Kaplan-Meier method. The significance of difference of OS rates between different groups was tested by log-rank test. The independence of Risk Score in predicting the prognosis of OSCC patients was examined by multivariate Cox regression analysis. Age and degree of differentiation were included in the study as the clinicopathological factors affecting the prognosis of many cancers. The TNM status was included in the multivariate Cox regression analysis as the reliable maker for treatment decision but the tumor size or/and extension (T) was excluded because of its inadaptability to the regression model. The four factors and Risk Score were included in the multivariate Cox regression analysis as variables.

Construction of nomogram model
Nomogram is often used to predict the prognosis of many types of cancers. In our study, nomogram constructed by the independent factors was built to predict 1-year, 3-year and 5-year OSs of patients. We used rms (https:// CRAN.R-proje ct. org/ packa ge= rms) package of R programming software to build nomogram. In order to observe the accuracy of the predicted probability, a calibration curve was drawn.

Calculation of immune cell infiltration proportion
CIBERSORT algorithm [39] was based on gene expression matrix. It used predisposed 547 barcode genes and employed deconvolution method to characterize the composition of immune infiltration cells. And CIBER-SORT was used to calculate the infiltration proportion of 22 immune cells of each patient. The sum of estimated proportions of all immune cells in each patient was 1. Figure 1A was a flow diagram of study to clarify the design of our study. We processed the maf files of 311 OSCC patients using maftools package of R programming software and the results showed TP53, TTN and FAT1 had the highest mutation rates (Fig. 1B). 161 patients with TMB values in the front 25% (≤ 1.16) and the back 25% (≥ 2.38) were divided into the low-TMB group and high-TMB group (Fig. 1C). Then 157 mRNA expression profiles were found from the 306 patients. 324 DEGs associated with TMB were screened out of the 157 patients, using the limma function package of R programming software. In the high-vs low-TMB group, up-regulated genes and down-regulated genes were 48 and 276 respectively (Fig. 1D). Expression levels of DEGs were significantly different between high-and low-TMB groups (Fig. 1E).

Go terms and KEGG pathway analysis of DEGs
GO and KEGG enrichment analyses were done to the 324 TMB-related DEGs. Significant enrichment terms were found among the muscle system process, contractile   Fig. 2A, B. Full results of Go and KEGG enrichment analyses were listed on Additional file 1: Table S1 and Additional file 2: Table S2.

Construction and validation of Risk Score model
Univariate Patients in the TCGA database and GEO validation sets were divided into high-risk and low-risk groups according to median of Risk Scores (-0.684511507). It was found that in TCGA database and GEO validation sets, patients in the high-risk group (patients with the Risk Score > 0.684511507) had lower OS rates than those in the low-risk group from the survival analysis (Fig. 3C,   D). Other than that, from time-dependent receiver operating characteristic (ROC) analysis, the area under the curve (AUC) of 1-year, 3-year, and 5-year survivals of patients in the TCGA database were 0.67, 0.67 and 0.64 (Fig. 3E); the AUC of 1-year, 3-year, and 5-year survivals of patients in the GEO validation sets were 0.64, 0.62 and 0.66 (Fig. 3F). It indicated that the Risk Score model was a reliable prognostic indicator of OSCC patients in both datasets.

Independence of the Risk Score as a prognostic indicator
As the tumor size or/and extension (T) did not fit to the regression model, five factors comprising of age, degree of differentiation, regional lymph node involvement (N), distant metastasis (M) and median of Risk Score were included in the multivariate Cox regression analysis (one non-differentiation sample was removed) to test whether the Risk Score was an independent prognostic indicator. The results (Fig. 4A) showed that the Risk Score and degree of differentiation were significantly associated with OS. The low-risk group had lower death risk and the low Risk Score was a reliable prognostic factor (HR = 0.46, 95%CI: 0.32-0.65, P < 0.001). The results on fulfillment of proportional hazard's assumption for all variables included in the multivariate Cox regression analysis were listed in Additional file 3: Table S3.
To further discuss the prognostic value of Risk Score for OSCC patients in various clinical pathological factors (including age, gender and degree of differentiation), the OSCC patients were regrouped according to age, gender and degree of differentiation to perform Kaplan-Meier survival analysis. Results showed that between female and male subgroups ( Fig. 4B-C), ≤ 61 years and > 61 years old subgroups (Fig. 4D, E), as well as G1, G2 and G3 degree of differentiation subgroups (Fig. 4F-H), the OS rates of patients in the high-risk groups were all lower than those of the low-risk groups. These results indicated that Risk Score was an independent prognostic indicator for OSCC patients.

Nomogram model predicts the prognosis of OSCC patients
The four independent risk factors including age, degree of differentiation, N status were used to construct nomogram model (Fig. 5A). For each patient, to obtain the actual point of Risk Score and degree of differentiation, two lines were drawn up to determine the points and the sum of the two points was located on the "Total Points" axis. Then a line was drawn down from the "Total Points" axis to the 1-year, 3-year and 5-year OS axes to predict the survival probability of OSCC patients. The 1-year and 3-year calibrated curves in adjusted diagram were close to the ideal curve (a 45 degree line through the origin of the coordinate axis with slope 1), which suggested that the predicted probabilities of the model at 1 year, 3 years and 5 years agreed well with the actual results ( Fig. 5B-D).

Immune infiltration of OSCC patients between the highand low-risk groups
CIBERSORT algorithms was used, in combination with LM22 eigenmatrix, to estimate the differences of immune infiltration of the 22 types of immune cells in high-and low-risk OSCC groups. After summarizing immune cell infiltration results of 306 OSCC patients (Fig. 6A)   could be divided into two categories by principal component analysis (PCA) (Fig. 6M). Immune checkpoint expression has become a biomarker for selective immunotherapy in OSCC patients. The correlation between Risk Score of OSCC patients and key immune checkpoints (CTLA4, PDL1, LAG3, TIGIT IDO1 and TDO2) was analyzed and Risk Score was found to be associated with all of them (Fig. 7A). Meanwhile the expression of CTLA4, TIGIT and TDO2 differed significantly between the high-and low-risk groups (Fig. 7B-D) and they were higher in the low-risk group than in the high-risk group. CTLA4 was used in immune-related genes prognostic models for OSCC patients [14,42].

Discussion
OSCC is one of the most common cancers in the world. It poses a great challenge to the medical industry because of the high death rate,. In this research, we constructed a prognostic model for OSCC patients based on the TMBrelated genes to predict the prognosis.
In this study, functional enrichment analysis was performed on the TMB-related DEGs. The focal adhesion was listed on the top 20 KEGG pathways. Focal adhesion kinase (FAK) is a non-receptor tyrosine kinase which is associated with poor prognosis and can promotes breast cancer cell migration and metastasis [43,44]. Overexpression and phosphorylation of FAK also correlate with invasion and metastasis therefore affect the prognosis [45,46]. FAK-mediated signaling and functions are involved in the development and progress of tumor [47]. Applying of the FAK inhibitor can effectively reduce the invasion and metastasis of tumor tissue [48]. These are in keeping with the our results, which indicating prognosis of OSCC is associated with the TMB-related DEGs we screened.
Seven TMB-related genes (CTSG, COL6A5, GRIA3, CCL21, ZNF662, TDRD5 and GSDMB) were selected via differential analysis, univariate Cox analysis and LASSO Cox analysis. Among the 7 genes, CTSG is confirmed as a potential biomarker in OSCC and NSCLC, and expression of CTSG is highest in adenocarcinoma [49,50]. The expression of CCL21 is related to the poor clinical outcomes in OSCC patients via CCL21/CCR7 axis by activating the JAK2/STAT3 signaling pathway [51,52]. ZNF662 gene caused by epigenetic changes through DNA methylation is also related to the progression of OSCC [53]. Moreover, a risk signature constructed by using COL6A5 performed well in stratifying OSCC patients with different prognosis and could distinguish survival status of OSCC patients [54]. GRIA3, as glutamate receptor, is involved in the process of tumor progression in pancreatic cancer [55]. The TDRD5 is involved in the DNA methylation and has prognostic value for patients with hepatocellular carcinoma [56]. The GSDMB is highly expressed in cancer tissues and is connected with poor prognosis by relapse-free survival, and therefore has been used as a potential novel prognostic marker [57]. These indicate that the TMB-related genes we screened may relate to the prognosis of cancer patients. The results agree with the researches that TMB-related genes have been identified in many types of cancers to help us understand progression of cancers and may assist clinical doctors to predict the prognosis of many types of cancers [58][59][60], which is in accord with our results.
Patients were then assigned to high-and low-risk groups according to the median of Risk Score. The results showed that patients in the high-risk group had lower OS than those in the low-risk group. The Risk Score model might be a reliable prognostic indicator for the OSCC patients. Even when taking into account other clinical variables, the Risk Score model had independent prognostic value. Head and neck squamous cell carcinoma patients with high TMB level have worse prognosis than those with low TMB [34]. And TMB-related genes have been described as a powerful prognostic biomarker for patients with bladder cancer [61]. TMB-related genes may also serve as a potential biomarker with clinical benefit in patients with NSCLC [62]. There are also many prognostic model characterizing TMB-related genes expression levels in other cancer like hepatocellular carcinoma, osteosarcoma and colon cancer [63][64][65]. Our results are consistent with these previous researches. It indicates that the Risk Score model constructed by 7 TMB-related genes may be helpful for the prediction of OSCC with different prognosis. Previous study has identified a 13-gene signature to predict survival of patients with OSCC [66], and the model in our study was simplified to 7 TMB-related genes. Nomogram model, built by using degree of differentiation and Risk Score, was also able to be reliable in predicting the OS of OSCC patients (See figure on next page.) Fig. 4 Risk Score as an independent prognostic marker for OSCC. A Forest plots of multivariate Cox analysis. Compared to the reference, Hazard ratio > 1 is considered to be a higher death risk while Hazard ratio < 1 is considered to be a lower death risk. at 1 year, 3 years and 5 years, which makes the prognostic value of Risk Score model more reliable. The immune cell infiltration results showed that the infiltration proportions of the native B cells, M2 macrophage, resting mast cells and CD4 memory resting T cells were higher in low-risk group compared to the high-risk group, while the infiltration proportions of eosinophils, activated NK cells and follicular helper T cells were lower in low-risk group compared to the high-risk group. The M2 macrophage is reported to promote cancer progress and to be connected with poor outcome in certain cancer types [67]. The activated NK cells may serve as anti-tumor therapy by secreting IFN-γ and TNF-α to suppress tumor cell cycle [68]. These articles are in line with our results, suggesting the Risk Score model was reliable to stratify patients by survival time.
However, there are some limitations of our study. Firstly, the main sources of our data were from public database and it was driven by statistics of retrospective data, so the best cutoff value is needed to be determined before clinical application. Secondly, the establishment  Fig. 6 Immune infiltration between high-and low-risk groups. A The relative proportion of immune infiltrates in all patients. B Box plots of immune cell differences between high-and low-risk groups. The horizontal axis is the immune cells and the vertical axis is the relative infiltration proportion of immune cells. P value was calculated by wilcoxon method. (P > 0.05, *: P ≤ 0.05, **: P ≤ 0.01, ***: P ≤ 0.001, ****: P ≤ 0.0001). C Box plots of significantly different immune cells in the high-and low-risk groups. The horizontal axis and the vertical axis are the groups and the relative infiltration proportion of immune cells. P value was calculated by wilcoxon method. D-L The correlation diagrams of 9 immune cell and Risk Score. and verification of the signature were based on the TCGA and GEO datasets. And the HPV status of the TCGA cohort was unknown, which makes the prognosis of the OSCC patients less reliable because the HPV status is an important risk factor affecting the prognosis of patients with head and neck cancer [69]. Thirdly, the inhomogeneity of data in the public databases also makes the Risk Score model less reliable regarding to the prognosis of cancer patients. Therefore, robustness of the signature will be necessary to be verified using larger external datasets in the future.
In conclusion, the prognostic signature reliably predicted the survival of patients with OSCC. Potential clinical use of the signature are driven by its strong prognostic performance, but the performance of the signature is still required to be verified in larger clinical samples.

Conclusion
Our study suggests that the 7 TMB-related genes are associated with the prognosis of OSCC and Risk Score model constructed using TMB-related genes (CTSG, COL6A5, GRIA3, CCL21, ZNF662, TDRD5 and GSDMB) might be a reliable biomarker for predicting the prognosis of OSCC patients. The prognostic signature may be helpful in designing the individualized treatment management and in making of medical decisions in the future but the