A gene-based risk score model for predicting recurrence-free survival in patients with hepatocellular carcinoma

Hepatocellular carcinoma (HCC) remains the most frequent liver cancer, accounting for approximately 90% of primary liver cancers worldwide. The recurrence-free survival (RFS) of HCC patients is a critical factor in devising a personal treatment plan. Thus, it is necessary to accurately forecast the prognosis of HCC patients in clinical practice. Using The Cancer Genome Atlas (TCGA) dataset, we identified genes associated with RFS. A robust likelihood-based survival modeling approach was used to select the best genes for the prognostic model. Then, the GSE76427 dataset was used to evaluate the prognostic model’s effectiveness. We identified 1331 differentially expressed genes associated with RFS. Seven of these genes were selected to generate the prognostic model. The validation in both the TCGA cohort and GEO cohort demonstrated that the 7-gene prognostic model can predict the RFS of HCC patients. Meanwhile, the results of the multivariate Cox regression analysis showed that the 7-gene risk score model could function as an independent prognostic factor. In addition, according to the time-dependent ROC curve, the 7-gene risk score model performed better in predicting the RFS of the training set and the external validation dataset than the classical TNM staging and BCLC. Furthermore, these seven genes were found to be related to the occurrence and development of liver cancer by exploring three other databases. Our study identified a seven-gene signature for HCC RFS prediction that can be used as a novel and convenient prognostic tool. These seven genes might be potential target genes for metabolic therapy and the treatment of HCC.


Background
In 2018, liver cancer remained among the top six prevalent carcinomas. There were 841,080 new patients, and 781,631 patients died of liver cancer according to the Global Cancer Statistics [1,2]. Hepatocellular carcinoma (HCC) is the most frequent liver cancer, accounting for approximately 90% of primary liver cancers [3]. Currently, Hepatectomy and Radiofrequency ablation are the main two ways to treat HCC [4,5]. Despite the continuous development of medical technology, the outcome of many patients who receive treatment and the prognosis of liver cancer remain poor with a 2-year recurrence rate of 76.9% [6][7][8]. And many studies have shown that HCC is the most difficult to cure cancer, and because of this, HCC has been described as a "chemoresistant" tumor [9]. Because of this, the prognosis of HCC is poor. The recurrence-free survival (RFS) of HCC patients is a critical factor in devising a personal treatment plan [10]. Thus, it is necessary to accurately forecast HCC patients' prognosis to improve the prognosis of HCC. Most previous studies constructed prognostic models using the Tumor-Node-Metastasis (TNM) Fig. 1 GO functional and KEGG pathway analyses. a Summary of the differentially expressed genes and GO pathway enrichment. Red, blue, and green bars represent the biological process, cellular component, and molecular function categories, respectively. The height of the bar represents the number of differentially expressed genes observed in each category. b The top 10 pathways of genes associated with RFS staging system to assess the prognosis of HCC patients [11]. However, the TNM staging system does not predict the prognosis of HCC. Therefore, it is important to develop a reliable tool for clinicians to predict the prognosis of patients with HCC.
Given the remarkable advances in high-throughput technologies, the development of The Cancer Genome Atlas (TCGA) (https://portal.gdc.cancer.gov/) and the intergovernmental Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/gds) database provides an abundance of high-quality information regarding HCC [12]. Hence, it is urgent to develop methods to identify reliable therapeutic gene targets that could enable earlier prognostic evaluation and better therapeutic strategies [13]. Therefore, we considered whether we could build a gene-based risk score model [14]. Our goal was to generate simple and effective prognostic tools based on several genes and other factors that may affect RFS [13,15]. Using the TCGA dataset, we selected 7 genes by robust likelihood-based survival modeling and built a risk score system [16,17]. We used an independent dataset (GSE76427) to validate the effectiveness of the risk score system and demonstrate that its clinical value in predicting RFS in HCC patients is better than that of the TNM staging system.

Data collection and survival analyses
First, we downloaded gene expression profiles and clinical information from The Cancer Genome Atlas-liver hepatocellular carcinoma (TCGA-LIHC) dataset, which included 334 HCC samples [18]. We used GSE76427, which contained the gene expression and clinical information of 115 HCC samples, as the validation group. The samples in TCGA-LIHC and GSE76427 that met the following inclusion criteria were included in this study: all samples had mRNA sequencing data and clinical information related to RFS [19].

Identification of genes associated with RFS
The raw count data were normalized with a log(a + 1) transformation. Then, using the "survfit" function in the "survival" package, we plotted Kaplan-Meier curves for the high and low expression groups of each gene. A log rank test with a p-value less than 0.05 was considered statistically significant [20].

Enrichment analysis of GO functions and KEGG pathways
For the selected genes, we used WebGestalt (http:// bioinfo.vanderbilt.edu/webgestalt) based on Gene Ontology (GO) functions and the Kyoto Encyclopedia of Genes and Genomes (KEGG) to understand the biological significance of the identified genes [21].

Identification of the best genes for modeling
A robust likelihood-based survival approach was used to identify the best genes for modeling after determining the genes associated with RFS [22]. We used the "rbsurv" package in R to complete this modeling process.
Construction and validation of the risk score system A multivariate Cox regression analysis and "rbsurv" analysis were performed to identify the genes related to RFS and construct the prognostic gene signature. The "survi-valROC" package in R was used to investigate the timedependent prognostic value. The optimal cut-off values based on ROC curves were obtained to classify the patients into low-risk groups and high-risk groups. A calibration curve and the concordance index (C-index) were used to evaluate the risk score system.

External validation of the risk score system
We calculated the risk score in the GSE76427 dataset. Then, the AUCs of the 12-month, 15-month, and 18month RFS and Kaplan-Meier curves were used to verify the risk score system. A calibration curve was used to validate the risk score system. In addition, the prognosis-related genes included in the risk score system were verified at the protein level by using The Human Protein Atlas database. The CBioPortal for cancer genomics was used to study genetic alterations in the risk score system [23].

Statistical analysis
The statistical tests were performed using R software and SPSS. Univariate and multivariate Cox regression analyses were performed using a forward stepwise procedure. A p-value less than 0.05 was considered statistically significant [23].

Acquisition of the gene expression and clinical data
We

Genes associated with RFS
We used the "survfit" function in the "survival" package and found 1331 genes associated with RFS. Then, to explore the genetic biological implications, we analyzed the 1331 genes through Gene Ontology (GO) functional and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. As shown in Fig. 1, in the KEGG analysis, we found that these genes are enriched in signaling pathways, such as the cell cycle, homologous recombination, DNA replication, the Fanconi anemia pathway, complement and coagulation cascades, and the T cell receptor signaling pathway.
In total, 334 patients were divided into two groups (134 high-risk patients and 200 low-risk patients) using a cut-off of 4.9798 for the risk score. Furthermore, the survival curve revealed that the RFS in the high-risk group was significantly poorer than that in the low-risk group (p < 0.0001; Fig. 2).

Validation of the prognostic model in GSE76427
We validated the risk score system in the GSE76427 cohort. In total, 108 patients were divided into two groups (45 high-risk patients and 63 low-risk patients) using a  cut-off of 3.4144 for the risk score. Furthermore, the survival curve revealed that the RFS in the high-risk group was significantly poorer than that in the low-risk group (p = 0.011; Fig. 3). In summary, these results indicate that the prognostic model has moderate sensitivity and specificity.
Association between the prognostic model and the clinical characteristics of the patients While assessing the correlation between the seven-gene signature and the clinical characteristics of the HCC patients, we found that a high risk score was significantly correlated with the TNM stage (p < 0.001), grade (p = 0.001), and AFP  (p = 0.014), but was not significantly associated with the gender, age, BMI, or Child-Pugh score of the patients with HCC (Table 2). In GSE76427, the results showed that the 7-gene signature was not significantly associated with gender, age, BCLC (Barcelona Clinic Liver Cancer) or the TNM stage (Table 3).

Independent prognostic role of the prognostic gene signature
Moreover, the results of the multivariate Cox regression analysis showed that the TNM stage (HR = 1.680, p < 0.001) and our prognostic model (HR = 3.607, p < 0.001) were both independent factors of RFS among the 334 TCGA-LIHC patients. However, among the 108 patients in the GSE76427 cohort, the TNM stage was not an independent prognostic factor for RFS [24]. The prognostic model (HR = 2.407, p = 0.014) was also an independent factor for RFS (Fig. 4). In addition, we performed univariate and multivariate Cox regression with other well-known pathological factors such as vascular   (Table 5).
Overall, our prognostic model showed a benefit in predicting the RFS, which might help doctors with targeted treatment (Fig. 6).

Development of the calibration curve
We calculated the C-index and drew calibration curves for the 12-, 15-and 18-month survival predictions to evaluate the calibration in the TCGA-LIHC dataset and the GSE76427 dataset. The C-index of the TCGA-LIHC dataset and GSE76427 dataset was 0.717 and 0.647, respectively, as shown in Figs. 7 and 8.

External validation in an online database
The representative protein expression levels of SLCO2A1, PPAT, GAS2L3, CD3EAP, and ACAT1 were explored in the Human Protein Profiles. Then, we explored the TTK, C16orf54, PPAT, CD3EAP, SLCO2A1, ACAT1, and GAS2L3 genes in the CBioPortal for cancer genomics. TTK exhibited the most frequent genetic alterations (3%), and deep deletion was the most frequent alteration. The second most altered gene was CD3EAP (1.3%), and the most frequent alterations were amplification mutations (Fig. 9). The expression levels of the seven genes in different cancers are shown in Fig. 10. In summary, the aberrant expression of these seven genes may explain some of the abnormal expression of these genes.

Discussion
In this study, we developed a risk score based on seven genes that has the ability to predict the probability of RFS in HCC patients and is more accurate than clinical indicators. Using this model, we can identify patients with HCC who have a higher risk of recurrence, indicating that these patients need more attention. In the TCGA-LIHC dataset, in total, 1331 genes were found to be associated with RFS in HCC patients. In the KEGG analysis, we found that the 1331 genes were enriched in signaling pathways, such as the cell cycle, homologous recombination, DNA replication, the Fanconi anemia pathway, complement and coagulation cascades, and the T cell receptor signaling pathway. This finding suggests that the 7-gene signature might affect the RFS of HCC patients through these pathways. Then, we selected the best 7 genes to develop the risk score model as follows: TTK, C16orf105, PPAT, CD3EAP, SLCO2A1, ACAT1, and GAS2L3. Additionally, our study showed that the TNM staging system is not an accurate indicator for the prediction of RFS in HCC patients, which is consistent with the results of other studies. According to the prognostic model, we divided the patients into low-and high-risk groups, which exhibited significant differences in RFS. This result indicated that the prognostic model could be used as a conventional tool for the prediction of the RFS of HCC patients.    The prognostic model was validated using another independent dataset, i.e., GSE76427. The area under the curve revealed the ability of the prognostic model to differentiate the patients' prognoses; the survival curve represents the survival of the high-risk group, which had a worse prognosis compared with that of the low-risk group. These findings demonstrate that the prognostic model has the ability to forecast RFS in HCC patients.
Most of the seven genes in our prognostic model have been reported to be involved in cancer. The TTK protein levels differ in human liver cancer between liver cancer cells and adjacent noncancerous liver cells [25]. This study also tested the utility of TTK-targeted inhibition and demonstrated its therapeutic potential in an experimental model of liver cancer in vivo. Furthermore, our study demonstrated its effectiveness and incorporated it into the prognostic model. PPAT, which a member of the purine/pyrimidine phosphoribosyl transferase family, regulates pyruvate kinase activity and cell proliferation and invasion and is a biomarker of lung adenocarcinoma. Acetyl-CoA acetyltransferase (ACAT) was recently reported to be elevated in human cancer cell lines [16]. ACAT1 exhibits acetyltransferase activity and can acetylate pyruvate dehydrogenase (PDH), which affects tumor growth [26].
In other scholars' prognostic analysis of HCC, CD3EAP is also a predictor, suggesting that CD3EAP is an important predictor of HCC prognosis, but the function of CD3EAP is not completely clear [27]. The function of GAS2L3 is still unknown, and GAS2L3 may be involved in mediating the absorption and clearance of prostaglandins, but its function in liver cancer has not been reported [19]. Moreover, SLCO2A1 and C16orf105 have not been reported in previous HCC studies, indicating that these genes may be potential factors in the treatment of HCC. Understanding the function of these genes may promote the development of HCC treatment.
However, despite the potential substantial clinical significance of our results, this study still has some limitations. One limitation is that although the calibration curve performance and AUC value were excellent in the validation group, multicenter clinical application is needed to further evaluate the external utility of the prognostic model [28]. Second, only 1331 genes were defined as genes associated with RFS and evaluated for the prognostic model construction. Some important genes could have been excluded before building the prognostic model [29]. In addition, knowledge regarding signaling pathways is urgently needed to reveal the functions of these genes in HCC. Finally, other well-known pathological factors, such as vascular invasion and hepatic virus infection status, should be key topics of our further studies. After collecting clinical tumor tissues with pathological information, we will find a way to combine our risk score with these clinical characteristics. Meanwhile, we have realized that many studies showed that different surgical methods had an impact on the prognosis of HCC patients. We will pay attention to distinguishing surgical methods when collecting clinical cases and compare the difference in the predictive effect of risk score on RFS in patients receiving different surgical methods in our future study.

Conclusions
In conclusion, we developed and validated a prognostic model for the prediction of the RFS probability of HCC patients. The simple prognostic model has the ability to predict RFS and could be a useful tool for doctors conducting an evaluation of HCC and selecting treatment plans for HCC patients.