Development of a Prognostic Gene Signature For Hepatocellular Carcinoma

Cuiyun Wu Shunde Hospital of Southern Medical University Yaosheng Luo First Peoples Hospital of Shunde: Shunde Hospital of Southern Medical University Yinghui Chen First Peoples Hospital of Shunde: Shunde Hospital of Southern Medical University Hongling Qu First Peoples Hospital of Shunde: Shunde Hospital of Southern Medical University Lin Zheng First Peoples Hospital of Shunde: Shunde Hospital of Southern Medical University Jie Yao (  jie.yao413@yahoo.com ) Shunde hospital


Introduction
Hepatocellular carcinoma (HCC) is the second most lethal type of cancer, possessing a 5-year survival rate of 18%and a multi-step development process that includesmultiple biological factors (1,2). The major risk factors for HCC development include gender (3), age (4,5), chronic infection with the hepatitis B virus (HBV) or hepatitis C virus (HCV)(6, 7), cirrhosis (8), alpha-fetoprotein (AFP) levels(9, 10), and various pathophysiological characteristics of the tumor (9,11). Despite several clinicopathological factors being associated with survival, accurate prognostic biomarkers for patients with HCC remain elusive.
Recently, large-scale cancer genomics projects, such as The Cancer Genome Atlas (TCGA), have provided pro les of many cancer types according to the majority of patients (12). The latest state-of-the-art highthroughput techniques can assess the expression levels of thousands of genes in a single assay, serving as a strong analytical tool for the capture of prognostic characteristics of cancers. Additionally, owing to its ability to detect key features among complex datasets, machine learning technology is becoming increasingly popular for modeling the progression and treatment of cancer (13). Machine learning can be employed to extract important features from a large number of genes with a view to providing potentially useful prognostic information. In most human cancers, the prognosis is closely related to pathological criteria, particularly the histological grade. In addition to these high-risk etiological factors, the latest insights into HCC biology also indicate that genetic and epigenetic abnormalities play a vital role in the occurrence of liver carcinogenesis, such as molecular changes at different stages of HCC progression, which are induced by speci c etiological factors in most cases (14)(15)(16)(17). The majority of cancer research is focused on the elucidation of driver genes since this helps to uncover predictive information and identify prognostic biomarkers that can serve as therapeutic targets. In comparison with individual analyses, the integration of a panel of biomarkers substantially improves the prognostic value and greatly bene ts the clinical management of cancer (18). For example, Liu et al. constructed a risk score model consisting of ve stage-related genes, including NUP62, EHMT2, RANBP1, MSH6, and FHL2, which can successfully predict the recurrence of HCC within 1 year (19). Moreover, Li et al. validated a CIMPassociated prognostic model with an optimal ability to independentlypredict common clinical patterns related to survival (C-index:0.68;HR: 1.61; 95% CI:1.08-2.40; P = 0.019) (20). Furthermore, Yang et al.
developed a four-long non-coding RNA signature as a novel candidate biomarker for prognosis in HCC patients (AUC: 0.689; 95% CI: 0.617-0.761; P < 0.01) (21). Nevertheless, due to the varying quality of prediction models, the construction of an optimal prognostic model remains a hot topic.
Owing to the development of various testing methods, an increasing number of biological markers have been reported. Due to the longer time period and higher cost involved in RNA sequencing or microarray, a faster, more economical, and feasible detection method is preferred for HCC patients. However, to the best of our knowledge, the current genome-wide models that focus on predicting the prognosis in HCC patients display varying success. Therefore, in the present study, we aimed to develop a multi-gene signature using the least absolute shrinkage and selection operator (LASSO) Cox regression model to predict the prognosis in a group of HCC patients obtained from TCGA. Through analysis of the areas classi ed as low-risk and high-risk by our model, we were able to obtain insight into the most relevant genes used for HCC patient strati cation. In particular, we sought to establish a nomogram based on a prognostic gene signature and baseline clinicopathological risk factors with a view to assessing the e cacy of our prediction model. We show that the robust prognostic mode based on LASSO Cox regression possesses an optimal prediction of prognosis independent of common clinical patterns and may act as a promising therapeutic biomarker for HCC.

Materials And Methods
Data collection and processing. Gene expression pro les and the corresponding clinical information were obtained from TCGA. The original mRNA expression count values were normalized using the Trimmed Mean of M (TMM) method by package "edgeR".Main exclusion criteria were shown as follows, without follow-up or survival status, no mRNA expression data and the expression level of mRNA is greater than zero and exists more than three databases.As a result, data derived from239 HCC samples and 23 adjacent noncancerous tissue samples were included in the present study.
Identi cation of differentially expressed genes.Differentially expressed genes(DEGs) were screened among the data using limma package. Adjusted P-values of each gene was calculated using falsediscovery rate (FDR). FDR < 0.01 and |log FC|>3 was set as the cut-off threshold.
Development of the prognostic model.A total of 115 up-regulated genes were matched and extracted for the subsequent analysis. Survival analysis was then performed through COX regression and log-rank tests, and P values 0.01 was de ned as statistically signi cant.Importantly, the LASSO Cox regression model is optimal selection for high-dimensional microarray data analysis, which shrinks the respective regression coe cients by a penalty proportional to the size of the variable, thereby screening the variables and nally providing a relatively small number of variables with a weights less than zero. Based on the prognostic related DEGs, which were signi cant in the univariate Cox regression analysis, further screening was extracted using the LASSO algorithm. Consequently, multivariate Cox regression analysis was performed,and nine mRNAs were ultimately included to construct a Prognosis model. Hazard ratios (HRs) were calculated for each gene. Moreover, we constructed a prognostic risk score model which was based on a linear combination of expression levels weighted by multivariate Cox regression analysis coe cients according the following formula: risk score = Exp gene1 × β gene1 + Exp gene2 × β gene2 + .. . + Exp genen × β genen; where β representsthe HR of each gene. Next,239 patients were classi ed into two groups(high-risk group and low-risk group)by median risk score.Kaplan-Meier survival analysis and log-rank test wereperformed to compare the overall survival (OS) time.Survival curves were draws using Surviminer package. Furthermore, Receiver operating characteristic (ROC) curve were conducted to analyze the prognostic model. P < 0.05 was considered statistically signi cant. Finally, univariate and multivariate Cox regression analysis were performed to further investigate the independence of the prognostic gene signature to predict the outcomes of HCC patients.
Create a nomogram for the overall survival prediction in HCC.The prognostic gene signature and the risk factors that we analyzed were included to generate a nomogram to investigate the probability of 1-, 3-, 5and 7-year overall survival of patients with HCC. Additionally, a calibration curve was plotted to evaluate whether the actual outcomes approximate to predicted outcomes. Subjects and samples.6 human HCC sampleswere obtainedconsecutively from patients who diagnosed with HCC needed segmental hepatectomy in ShundeHospitalof Southern Medical University. This study was approved by the Ethics Institutional Review Boardof Shunde Hospital of Southern Medical University.
Isolation of total RNA and quantitative real-time polymerase chain reaction.Total RNA was extracted from HCC patients' liver sample with the TRIzol (15596026, Invitrogen, USA). Then total RNA was reversely transcribed with a HiScript II Q RT SuperMix kit (R222-01,Vazyme, China) and PCR ampli cation was performed by ChamQ SYBR qPCR Master Mix kit (Q311-02,Vazyme), GAPDH was regarded as internal controls. Primers sequences are listed in Table 1, which were synthesized by BGI technology (Guangdong, China). These reactions were conducted according to the manufacturer's instructions with CFX96 Real Time System (Bio-Rad, USA). sampleswas also obtained.

Data preprocessing
Firstly, the gene expression pro le data and the corresponding clinical information were downloaded from TCGA. The gene expression pro le data were merged with clinical information into a data frame for the next step. Each of the molecular pro les consisted of RNA-Seq expression data of 60,488 genes. The imported dataset consisted of 'expression,' representing the gene expression pro les of patients in terms of counts value (described in the Methods), 'clinical data,' representing clinical information related to the patients, and 'merged data'.

Identi cation of DEGs
After data preprocessing and normalization, the limma package in R language analysis was used to identify the DEGs. According to the threshold criteria (FDR <0.01 and |log FC|>3), we obtained 563 DEGs, including 448 downregulated (Table S1) and 115 upregulated (Table S2) genes.

Identi cation of nine genes associated with prognosis in HCC
To identify hub DEGs with potential prognostic value, univariate Cox regression analysis was applied, revealing that 41 genes were signi cantly correlated with overall survival (OS) (P < 0.01) (Table S3). Subsequently, the LASSO Cox regression model was used to build a prognostic classi er, and a total of nine genes, IQGAP3, BIRC5, PTTG1, STC2, CDKN3, PBK, EXO1, NEIL3, and HOXD9, were con rmed to be signi cantly associated with survival ( Figure 1). The hazard ratio (HR) of each gene was calculated: IQGAP3, PTTG1, STC2, PBK, NEIL3, and HOXD9 had an HR > 0 and were considered high-risk factors associated with a short survival time; and BIRC5, CDKN3, and EXO1 had an HR < 0 and were considered protective factors associated with a long survival time ( Figure 1). The expression levels of these nine genes in liver samples from HCC patients were measured by RT-qPCR (Figure 2), seven of which trended in the same direction. In addition, three (NEIL3, STC2, IQGAP3) were statistically signi cant.Next,Kaplan-Meier survival curves were generated to explore the potential roles of individual genes in OS. Among the nine-gene prognostic signature, IQGAP3, STC2, PBK, NEIL3, and HOXD9 were shown to signi cantly predict poor OS in the log-rank test (P < 0.05, Figure 3).

Development of a prognostic gene signature for survival prediction
Nine screened genes were selected to formulate a prognostic gene signature. According to the risk score, patients were separated into high-risk and low-risk groups. The heat map shows that the genes were upregulated with increasing risk scores ( Figure 4A). The risk score distribution and survival status of each patient are shown in Figure 4B and 4C, respectively. The high-risk group had a signi cantly higher mortality rate (42.02%) than the low-risk group (12.50%); the median survival time in the high-risk group was 4.27 years and that in the low-risk group was 8.56 years. The log-rank test shows a signi cant difference in mortality between the two groups (HR: 4.86; 95% CI: 2.72-8.69; P = 1.01E-08), and the Kaplan-Meier curve reveals that patients in the high-risk group had a poorer prognosis (P < 0.0001; Figure 5A). The AUC of the ROC curves was used to determine the predictive power of the prognostic gene signature; the AUC at 1, 3, 5, and 7 years was 0.84, 0.711, 0.812, and 0.829, respectively ( Figure 5B). These results show that the prognostic gene signature can effectively predict OS in patients with HCC.
Furthermore, univariate and multivariate Cox regression analyses were performed to investigate whether the gene signature was an independent predictive factor for prognosis in HCC patients. Following multivariate adjustment for conventional clinical patterns, including gender, age at diagnosis, pathological TNM, pathological grade and stage, vascular invasion, and serum AFP level, the gene signature remained a powerful and independent factor, con rming its robust ability to predict the OS in HCC patients (HR: 4.70; 95% CI: 2.61-8.38; P = 2.06E-07; Figure 6). In addition, strati ed analysis was employed for further data mining, which found that the prognostic gene signature was a predictive marker for male patients with HCC ( Figure 7A). Following strati cation of the clinicopathological risk factors, the risk score based on the prognostic signature was an independent predictive indicator of pathological TNM, pathological grade and stage, vascular invasion, and serum AFP level, and patients with a high risk score had a poorer prognosis ( Figure 7B-H). Thus, the prognostic gene signature can add OS predictive value to clinicopathological features.
Construction of a nomogram to evaluate the predictive accuracy of the prognostic gene signature To provide the clinician with a quantitative method to predict an HCC patient's probability of OS, we constructed a nomogram by integrating both the prognostic gene signature and clinicopathological risk factors ( Figure 8A). The results of our nomogram show that the prognostic gene signature had the greatest weight among the total points, consistent with previous multivariate regression analysis. The 30sample bootstrapped calibration plots for the prediction of OS at 1, 3, 5, and 7 years are shown in Figure  8B-E, respectively, the results of which indicate good consistency between the predicted and observed outcomes. Consequently, our results suggest that the nomogram is an optimal model for the prediction of HCC prognosis as compared with individual risk factors.

Discussion
HCC involves a complex molecular landscape and has a 5-year OS of only 10-15%, mostly due to late diagnosis (22)(23)(24). Studies have shown that the current clinicopathological parameters, such as age at diagnosis, gender, viral infection, TNM stage, and serum AFP levels, cannot accurately predict prognosis in HCC patients (25)(26)(27)(28)(29)(30). Thus, it is imperative to elucidate novel biomarkers to predict survival and monitor treatment response, which may also provide a link to speci c therapeutic options for combating HCC in individual patients.
In the present study, we evaluated the DEGs using mRNA expression pro les, and 115 signi cantly upregulated genes were selected for further analysis. Subsequently, LASSO Cox regression analysis was used to develop a prognostic model for HCC, the results of which show that the prognostic gene signature was signi cantly associated with OS in HCC patients. Moreover, an ROC was used as a metric for assessing the predictive performance of our model; the AUC at 1, 5, and 7 years was greater than 0.8. To date, the TNM staging system is the globally recognized Gold Standard for classifying the extent of cancer metastasis and is widely accepted as a tool for predicting prognosis and guiding treatment options for HCC. Univariate and multivariate Cox regression analyses were performed to explore the predictive power of our prognostic gene signature in combination with the TNM staging system and certain important clinical features. The results indicate that the prognostic gene signature can act as an independent predictive factor of OS following adjustment for conventional clinical characteristics. Our prognostic gene signature appears to potentially possess greater predictive power than that of traditional prognostic patterns. Furthermore, a nomogram combining our model with other important clinical patterns was created. Indeed, our model outperformed all other common clinicopathological features in predicting survival. Based on the calibration plot, there was good consistency between the actual and predicted values for the 1-, 3-, and 5-year OS; therefore, our prognostic gene signature predicts survival in HCC and provides a personalized score for individual patients. Accordingly, our nomogram is a valuable novel prognostic method for clinicians in the future.
Previous studies have identi ed the biological function of multiple genes involved in the development of HCC, including those used in our model. STC2 overexpression can promote proliferation of HCC, ovarian cancer, and stem cells (31)(32)(33). Similarly, PBK promotes metastasis of HCC by activating the ETV4-uPAR signaling pathway (34). PTTG1 has been reported to be associated with poor prognosis and angiogenesis in HCC (35). IQGAP3 functions as a potential oncogene by activating metastasis and EMT in HCC (36). CDKN3 has been correlated with poor tumor differentiation and advanced tumor stage by promoting cell proliferation in HCC (37). HOXD9 acts as a tumor promoter to drive metastasis by regulating ZEB1 in HCC (38), and EXO1 overexpression is associated with poor prognosis in HCC patients (39); however, there have been few reports regarding the effect of NEIL3 on HCC. Two selected genes show different trends in comparison with TCGA data, likely due to race and sample size, which needs further con rmation using larger datasets. While these studies have profoundly improved our understanding of the progression of HCC, less is known regarding how such biomarkers impact patient outcome. It appears that such individual biomarkers are limited because they can be in uenced by many factors, rendering their predictive ability unstable; thus, other recent studies have shown that combined analysis of a panel of biomarkers is the most promising approach to change clinical management (40,41). With respect to HCC, since the accuracy of various prognostic models still varies, the eld currently warrants further research (19)(20)(21). In the present study, the LASSO Cox regression model was used to select the most useful prognostic markers from many genes related to HCC. Furthermore, the LASSO Cox regression model allowed us to integrate multiple genes into one signature, with the AUC at 1, 5, and 7 years exceeding 80%, displaying signi cantly greater prognostic accuracy than previous studies and indicating that it can be potentially applied as an adjuvant therapy. Indeed, we present an optimal prognostic model for the rst time; however, despite the fact that our research highlights novel ideas related to the prognostic assessment of HCC, it still has some limitations. Firstly, our research is based on a singleomics technique (RNA-Seq); therefore, it may have different heterogeneity due to other omics data platforms. Secondly, our model does not have external data veri cation. Thirdly, the biological functions and molecular mechanisms of the nine indicator genes have been further evaluated with RNA expression to accelerate their clinical veri cation in HCC.

Conclusion
In summary, our ndings indicate that the nine-gene prognostic signature can classify patients with HCC into low-risk and high-risk groups with respect to mortality. In comparison with other clinicopathological features, our prognostic gene signature proposes a more accurate assessment of prognosis. Moreover, we created a nomogram that numerically predicts the OS of individual HCC patients based on the prognostic gene signature and certain important clinicopathological features. This information can be used to predict patient prognosis and make individualized decisions regarding treatment and surveillance.

Declarations
Ethical Approval and Consent to participate Human Investigations Committee of Shundehospital (Fo Shan) approve to use thediscard tissue of liver cancer patients for qPCR. Informed consent for the procurement and analysis of these sampleswas also obtained.

Figure 1
Multivariate cox Regression analysis of the ninegenes.  Kaplan-Meier survival curves of genes.The yellow line represents the high-risk group and the blue represents the low risk-group.p<0.05 in Log-rank test.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. TableS1DownregualatedDEGsofHCCinTCGA.xls TableS2UpregulatedDEGsofHCCinTCGA.xls TableS3DEGscorrelatedwithOSinTCGA.xls