Identification of a five-gene signature in association with overall survival for hepatocellular carcinoma

Background Hepatocellular carcinoma (HCC) is considered to be a malignant tumor with a high incidence and a high mortality. Accurate prognostic models are urgently needed. The present study was aimed at screening the critical genes for prognosis of HCC. Methods The GSE25097, GSE14520, GSE36376 and GSE76427 datasets were obtained from Gene Expression Omnibus (GEO). We used GEO2R to screen differentially expressed genes (DEGs). A protein-protein interaction network of the DEGs was constructed by Cytoscape in order to find hub genes by module analysis. The Metascape was performed to discover biological functions and pathway enrichment of DEGs. MCODE components were calculated to construct a module complex of DEGs. Then, gene set enrichment analysis (GSEA) was used for gene enrichment analysis. ONCOMINE was employed to assess the mRNA expression levels of key genes in HCC, and the survival analysis was conducted using the array from The Cancer Genome Atlas (TCGA) of HCC. Then, the LASSO Cox regression model was performed to establish and identify the prognostic gene signature. We validated the prognostic value of the gene signature in the TCGA cohort. Results We screened out 10 hub genes which were all up-regulated in HCC tissue. They mainly enrich in mitotic cell cycle process. The GSEA results showed that these data sets had good enrichment score and significance in the cell cycle pathway. Each candidate gene may be an indicator of prognostic factors in the development of HCC. However, hub genes expression was weekly associated with overall survival in HCC patients. LASSO Cox regression analysis validated a five-gene signature (including CDC20, CCNB2, NCAPG, ASPM and NUSAP1). These results suggest that five-gene signature model may provide clues for clinical prognostic biomarker of HCC.


INTRODUCTION
The incidence rate of hepatocellular carcinoma (HCC) ranks sixth among all malignant tumors and the mortality rate ranks third (Bray et al., 2018). More than 580,000 new cases are expected in Asia every year (Siegel, Miller & Jemal, 2019). The genetic aberrations, cellular environment and environmental effects are considered as responsible for the development, progression and metastasis of HCC (Bray et al., 2018).
Genomic research has been the focus of hepatocellular carcinoma treatment (Yan et al., 2019). Recently, high-throughput platform microarrays for analyzing gene expression have been widely developed (Mari et al., 2019) as an effective tool for identifying general genetic changes during tumorigenesis (Wan et al., 2019). Microarray techniques can not only find related genes of diseases, targets of anti-tumor drugs, but also prognostic analysis of tumor patients, and can reveal the relationship between gene expression and regulation. In clinical research, they also play the role of providing ideas for the diagnosis and treatment of certain diseases (Szuhai & Vermeer, 2015). We found that there have been studies exploring the prognostic signatures of colon cancer and lung adenocarcinoma, but the prognostic signatures of hepatocellular carcinoma need to be supplemented (Cao et al., 2020;Wei et al., 2018).
In this study, we chose four GEO series (GSE25097, GSE14520, GSE36376 and GSE76427) including hepatocellular carcinoma tumor tissue and non-tumor tissue samples.
We use GEO2R to screen out differentially expressed genes (DEGs) and to find hub genes by constructing their protein interaction network. This study is aimed at validating some potential targets to effectively assist clinical workers to predict overall survival of HCC patients.

Data adoption criteria
We have looked for publicly available series from the GEO Repository browser. Using ''hepatocellular carcinoma'' as a keyword, a total of 450 series were retrieved. Sort these series acting in accordance with the number of samples, set study type to ''Expression profiling by array'' and organism to ''Homo sapiens'', and look for the series with normal tissue and hepatocellular carcinoma tissue control. Make sure that the normal samples are taken from adjacent tissues of HCC patients. In the end, four HCC gene expression profiles (GSE14520, GSE25097, GSE36376 and GSE76427) were selected because they have more and better-quality samples. The normal samples in these four series were all taken from adjacent tissues of HCC patients. Data were downloaded from the publicly available database hence it was not applicable for additional ethical approval.

DEGs analysis
GEO2R (http://www.ncbi.nlm.nih.gov/geo/geo2r/) was implemented to screen out DEGs between HCC tumor and non-tumor tissue samples. After we obtain the data, we used Bio Tools v5.0 (http://www.chrisapp.xyz:3838/R/AnnoE2/) to draw a volcano map to find statistically significant differentially expressed genes. The adjusted P < 0.01 and |logFC|≥ 1 were set as the threshold, so the false positive result was eliminated as much as possible. In order to eliminate the background error caused by different research units on different platforms, we use the Venny 2.1.0 (https://bioinfogp.cnb.csic.es/tools/venny) mapping to screen out the shared DEGs.

PPI network construction
We imported initially screened genes into the STRING database (http://string-db.org/), web-based software designed to calculate the integration of protein-protein interactions (Ashburner et al., 2000) to obtain the highest confident genes (0.900). Then, the MCC algorithm in Cytohubba which an APP in Cytoscape (Version: 3.7.2) was utilized to screen out the hub genes (Chin et al., 2014).

Functional and pathway enrichment analysis
Metascape (http://metascape.org) was performed to analyze process enrichment analysis and pathway analysis of neighbor genes of hub genes. On the basis of Metascape tool, The Gene Ontology (GO) terms, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Reactome Gene Sets can be analyzed. Terms with a P value<0.01, minimum count of 3, and an enrichment factor >1.5 are collected and grouped into clusters depended on their membership similarities. Moreover, the MCODE algorithm of the network is used to identify closely connected neighborhoods of proteins (The Gene Ontology Consortium, 2019;Kanehisa et al., 2017;Wang et al., 2013). GSEA (v4.0.3) was used to verify the results of Metascape analysis (Subramanian et al., 2007;Mardinoglu, Gatto & Nielsen, 2013). The gene sets file (C2 KEGG v7.0 symbols), phenotype labels file, and expression dataset file and chip annotation file were prepared and loaded into GSEA. P < 0.05 was considered statistically significant.

The hub genes' transcription level analysis in patients with HCC
ONCOMINE (http://www.oncomine.org) was performed to analyze the mRNA levels of hub genes in HCC (Rhodes et al., 2004;Rhodes et al., 2007). Threshold limits were as follows: the data type was mRNA, fold change = 2 and P = 0.01. Differential analysis between normal and tumor tissues was performed for each gene. Meanwhile, we compared the level of hub genes expression between normal and tumor tissues by Gene Expression Profiling Interactive Analysis (GEPIA 2) (Tang et al., 2019). We set |Log2FC| Cutoff to 1, Jitter Size to 0.4, P-value Cutoff to 0.01, and compared HCC tumor samples ( n = 369) from the TCGA database with normal samples (n = 160) from the TCGA databases.

Survival analysis of hub genes
The Kaplan-Meier plotter (http://www.kmplot.com) contains gene expression data from 364 clinical HCC patients derived from TCGA (Menyhárt, Nagy & Győrffy, 2018). According to the median expression, these samples were divided into a low expression group and a high expression group. The Kaplan-Meier plotters were used to calculate the relapse-free survival (RFS), progression-free survival (PFS) and overall survival (OS) of all liver cancer patient samples.

Establishment of the prognostic gene signature
The mRNA expression and clinical data were downloaded from TCGA-LIHC and cBioportal, including 374 TCGA-LIHC and 50 normal control samples. All patients with a follow-up period less than 60 days were excluded for survival analysis. A prognostic gene signature was constructed based on the results of the least absolute shrinkage and selection operator (LASSO). Cox regression model coefficients (β) multiplied with its mRNA expression level. The risk score = (βgene 1 * expression level of gene 1) + (βgene 2 * expression level of gene 2) + (βgene 3 * expression level of gene 3) + . . . + (βgene n * expression level of gene n) (Huitzil-Melendez et al., 2010). We used the Survminer R package to find the optimal cut-off values. Then the Kaplan-Meier survival curve combined with a log-rank test was performed to compare the difference in overall survival between the high-risk score group and low-risk score group.

Statistical analysis
GraphPad Prism version 8.0 and R software version 4.0.2 (GraphPad Software Inc., USA) was used for statistical analyses. All tests were two-sided, P < 0.05 was considered statistically significant.

Identification of DEGs
We did the research as described in the flow chart ( Fig. 1). In order to screen the difference of gene expression between HCC and normal liver tissue, four gene expression series (GSE14520, GSE25097, GSE36376 and GSE76427) were downloaded from the GEO database. The profile of GSE14502 includes 225 HCC tumor tissues and 220 non-tumor tissues. The profile of GSE25097 includes 268 HCC tumor tissues and 243 non-tumor tissues. The profile of GSE36376 includes 240 HCC tumor tissues and 193 non-tumor tissues. The profile of GSE76427 includes 115 HCC tumor tissues and 52 non-tumor tissues (Table 1). We found 1095 DEGs in GSE14520 ( Fig. 2A

PPI network construction and co-expression analysis in patients with HCC
Next, we imported 142 genes into the STRING webpage software for PPI network construction. After setting the minimum required interaction score to highest confidence (0.900), we obtained 121 nodes and 398 edges. Then, a cluster network was created by using the MCL cluster algorithm in the STRING website. The first cluster included 16 genes (AURKA, CDKN3, CCNB2, CDC20, PTTG1, MELK, RACGAP1, PRC1, TOP2A, NUSAP1, ASPM, NCAPG, RFC4, PHLDA1, MCM6 and MCM2 (Fig. 3A). Next, we applied Cytohubba's MCC algorithm to rank and obtained the top 10 central genes. They were CDC20, CCNB2, AURKA, ASPM, NCAPG, NUSAP1, CDKN3, PRC1, MELK and TOP2A (Fig. 3B). Interestingly, these ten genes were all within the first cluster created by the MCL clustering algorithm.

Functional and pathway enrichment analyses
We used the Benjamini and Yekutieli method to adjust the P value of the enrichment analysis results, and used the standard of adj. P < 0.05 to screen for significant enrichment regions. GO terminology enrichment analysis indicated that 10 genes were mainly enriched in cell division, mitotic cell cycle process, DNA conformation change, positive regulation of apoptotic process, DNA-dependent ATPase activity, spindle pole and neural precursor cell proliferation. On the other side, KEGG pathway and Reactome gene sets based analysis revealed that the genes were mainly enriched in the cell cycle (Figs. 4A-4C). We used PPI network construction in Metascape, and extracted the most important MCODE components from it, and performed functional and pathway enrichment analysis for each MCODE component. The results indicated that the candidate genes of the cell cycle pathway may be indicators of prognostic factors in patients with HCC (Fig. 4D). In order to validate  that the cell cycle pathway was related to HCC, we performed GSEA on four databases, GSE25097, GSE14520, GSE36376 and GSE76427. The results showed that these data sets had good enrichment score and significance in the cell cycle pathway (Figs. 4E-4H).

The transcription level of Hub genes in HCC
In order to further verify whether the differentially expressed genes were overexpressed in HCC patients, the ONCOMINE database was used to compare the expression of those genes in tumor tissues and normal tissues. The analysis results showed that these 10 hub genes are overexpressed in liver cancer (Fig. 5A). Then, we used GEPIA2 to draw Box Plots to visualize the different mRNA expression levels of 10 hub genes between cancer samples and normal samples. The result revealed that the expression levels of these 10 genes in HCC tumor samples were higher than in normal samples (Figs. 5B-5K).

Establishment of the five-gene-based prognostic gene signature
We used the results of survival analysis to discover the prognostic value of hub genes respectively. The results showed that the OS, PFS and RFS of HCC samples with high expression of those single genes were worse than those with low expression (Fig. 6) ( Table 2). Next, the HCC data set of cBioPortal online platform (TCGA, firehose legacy) was applied to get prognostic information of 10 hub genes. The results point out that there is no significant correlation between the hub genes alteration and decreased OS (Fig. 7A). So, we develop a five-gene prognostic signature by LASSO Cox regression analysis. The five genes identified were Cell Division Cycle 20 (CDC20), Non-SMC Condensin I Complex Subunit G (NCAPG), Cyclin B2 (CCNB2), Assembly Factor for Spindle Microtubules (ASPM) and Nucleolus and Spindle Associated Protein 1 (NUSAP1). Then, we validated the prognostic significance of the prognostic signature in HCC patients by the online cBioPortal platform. The results showed that 5 genes were altered in 93 (25%) of 372 samples (Fig. 7C) and those genes alterations were significantly associated with decreased OS (Fig. 7B). We also calculated the five-gene based risk score for each patient from the TCGA-LIHC cohort and the GSE14520 cohort. The risk score = 0.12 * ExpressionCDC20 + (−0.082) * ExpressionCCNB2 +0.039 * ExpressionNCAPG +0.014 * ExpressionASPM + (−0.04) * ExpressionNUSAP1. The Survminer R package was performed to find the optimal cut-off the risk score. Patients in the high-risk score group shown significantly poorer OS than ones in the low-risk score group (from TCGA p < 0.001 and from GSE14520 p = 0.0003) (Figs. 7D, 7E). Our results indicated a good performance of the five-gene signature for survival predication.

DISCUSSION
The outcome of HCC patients is not only determined by tumor stage, tumor size, serum markers and liver function, but also closely related to some gene's expression in tumor tissue (Huitzil-Melendez et al., 2010). Several researchers have focused on the potential role of gene-signatures based on aberrant mRNA in prognosis prediction of HCC (Cao et al., 2020;Long et al., 2018;Liu et al., 2018;Wang et al., 2018;Kong et al., 2019). In this study, we screened out 10 hub genes by constructing a PPI network and using cytohubba. Enrichment analysis revealed that most of them are involved in cell cycle progression and survival analysis indicates that HCC patients with the abnormal expression of these genes showed poor OS and PFS. We established a 5-gene signature (including CCNB2, CDC20, NUSAP1, ASPM, and NCAPG) for HCC prognosis prediction. The prognosis predictive performance of the signature was good not only in the TCGA HCC cohort but also in the GSE14520 cohort. All these results indicated that the risk model developed from the five genes could be a useful indicator for HCC survival and to supplement the gap of the clinical prognostic signature of HCC. The enrichment analysis yielded that the most significant enrichment term was the cell cycle and mitosis. CCNB2 and CDC20 participated in the mitotic cell cycle process. CCNB2, CDC20, NUSAP1, ASPM, and NCAPG involved in cell division. There are reports that CCNB2 is highly expressed in HCC (Li et al., 2019). CCNB2 is involved in the development of HCC, which may be a prognostic factor (Li et al., 2019). Regulatory protein encoded by CDC20 plays an important role in the occurrence and development of a variety of tumors, which may be related to its participation in the function of anaphase-promoting complex/cyclosome (APC/C) interaction in the cell cycle. Meanwhile, the increase in CDC20 expression has also been shown to be related to the occurrence and development of HCC (Li et al., 2014;Liu et al., 2015). Because these evidences show that CCNB2 and CDC20 are directly related to the occurrence and development of HCC, we believe that they may play the role of initiator and promoter in the process of HCC. NUSAP1 is one of the most critical microtubule and chromatin binding proteins, which can play a role during mitosis to crosslink microtubules. NUSAP1 expression levels in HCC tissues were higher than those in the adjacent tissues. With the high expression of NUSAP1 in HCC patients, the survival time of the patients also showed a significant decreasing trend (Wang et al., 2019b). Unlike CCNB2 and CDC20, although NUSAP1 is highly expressed in HCC tissues with abnormal cell division, we have no direct evidence to prove its effect on the HCC process. Therefore, we believe that the highly expressed NUSAP1 may be temporarily used as a hepatocellular carcinoma Signs of disease. Increased expression of ASPM has been found in HCC. In addition, the expression level of ASPM has an important impact on the biological behavior of cancer cells or the prognosis of patients. ASPM overexpression is a molecular marker predicting poor prognosis (Lin et al., 2008). It has been reported that knocking out NCAPG can induce the division of HCC cells and even inhibit its deterioration in an in vitro environment. In contrast, the overexpression of NCAPG is also related to the recurrence of HCC patients (Zhang et al., 2018). Studies have shown that the prognostic effect of NCAPG makes it a new biomarker for predicting whether recurrence will occur after surgical removal of the tumor (Wang et al., 2019a). These evidences indicate that ASPM and NCAPG may be closely related to the prognosis of HCC patients, including positive, worsening or recurring prognostic results. Inhibiting the proliferation of HCC by targeted drugs to cause cancer cell apoptosis and curing cancer has become a new method of current cancer clinical treatment, such as Compound Kushen, which inhibits the cell cycle (Feitelson et al., 2015;Cui et al., 2019). However, it not only provides new biomarkers for the target treatment of HCC, but also provides a new plan for the clinical management of HCC.
In summary, consistent with our results, these 5 genes are all related to the prognosis of liver cancer. To our knowledge, the prognostic model associated with the five-gene signature may be a useful prognostic tool for liver cancer clinically. The risk score can be based on the mRNA expression levels of the five prognostic genes. In clinical practice, it may be more routine and cost-effective for all HCC patients. However, some limitations of our research should be considered. Firstly, we need to use more databases to verify the accuracy of this model. When we validated the five-gene signature by GSE14520 cohort, the median survival time of low-risk score group was undefined. It indicated that the follow-up time was not long enough in an original study. Secondly, at present, there are several reports that have screened out different genes signature, which needed to have a better and more accurate method to verify the effectiveness. Thirdly, expression profiling can only detect the change of gene expression level in our study, subsequent experiments are required for providing the information of those protein expression levels. Finally, we lack the molecular mechanisms of interaction between these genes, and we will incorporate these for further exploration.

CONCLUSIONS
This study aims to perform a comprehensive bioinformatics analysis of DEGs in four hepatocellular carcinoma datasets to discover potential biomarkers and predict their clinical effects. We established a five-gene signature (CCNB2, CDC20, NUSAP1, ASPM,