Transcriptome analysis revealed key prognostic genes and microRNAs in hepatocellular carcinoma

Background Hepatocellular carcinoma (HCC) is one of the most common cancers worldwide. However, the molecular mechanisms involved in HCC remain unclear and are in urgent need of elucidation. Therefore, we sought to identify biomarkers in the prognosis of HCC through an integrated bioinformatics analysis. Methods Messenger RNA (mRNA) expression profiles were obtained from the Gene Expression Omnibus database and The Cancer Genome Atlas-Liver Hepatocellular Carcinoma (TCGA-LIHC) for the screening of common differentially expressed genes (DEGs). Function and pathway enrichment analysis, protein-protein interaction network construction and key gene identification were performed. The significance of key genes in HCC was validated by overall survival analysis and immunohistochemistry. Meanwhile, based on TCGA data, prognostic microRNAs (miRNAs) were decoded using univariable and multivariable Cox regression analysis, and their target genes were predicted by miRWalk. Results Eleven hub genes (upregulated ASPM, AURKA, CCNB2, CDC20, PRC1 and TOP2A and downregulated AOX1, CAT, CYP2E1, CYP3A4 and HP) with the most interactions were considered as potential biomarkers in HCC and confirmed by overall survival analysis. Moreover, AURKA, PRC1, TOP2A, AOX1, CYP2E1, and CYP3A4 were considered candidate liver-biopsy markers for high risk of developing HCC and poor prognosis in HCC. Upregulation of hsa-mir-1269b, hsa-mir-518d, hsa-mir-548aq, hsa-mir-548f-1, and hsa-mir-6728, and downregulation of hsa-mir-139 and hsa-mir-4800 were determined to be risk factors of poor prognosis, and most of these miRNAs have strong potential to help regulate the expression of key genes. Conclusions This study undertook the first large-scale integrated bioinformatics analysis of the data from Illumina BeadArray platforms and the TCGA database. With a comprehensive analysis of transcriptional alterations, including mRNAs and miRNAs, in HCC, our study presented candidate biomarkers for the surveillance and prognosis of the disease, and also identified novel therapeutic targets at the molecular and pathway levels.


INTRODUCTION
Liver cancers rank fourth in terms of cancer-related mortality and are the sixth leading cause of new incident cases worldwide (Villanueva, 2019). The 5-year survival rate of liver cancers is 18. 1% (Jemal et al., 2017) in the United States, while a worse outcome is reported in China, with a 5-year survival of 12.1% observed (Zheng et al., 2018). In China, approximately 466,100 new cases of liver cancers and approximately 422,100 liver cancer deaths occurred in 2015 (Chen et al., 2016b). Hepatocellular carcinoma (HCC), accounting for the 70-85% of liver cancer cases (Sia et al., 2017), mostly develops in the patients with chronic liver diseases, such as hepatitis B virus or hepatitis C virus (HBV or HCV) infection (Idilman et al., 1998), alcohol abuse (Morgan, Mandayam & Jamal, 2004) or non-alcoholic fatty liver disease (Dyson et al., 2014;Kanwal et al., 2016). Although resection, local ablation, transplantation, transarterial chemoembolization (TACE) and systemic therapies have been applied in the treatment of HCC (Forner, Reig & Bruix, 2018), the survival rate of HCC patients is still low, partly due to the high heterogeneity of HCC (Imamura et al., 2003). Thus, a comprehensive understanding of the transcriptional alterations may contribute to the development of preventive, diagnostic and therapeutic strategies for HCC.
With the advent of next-generation sequencing and microarray technologies at genome level, a large amount of RNA data has been generated and deposited in public databases, such as the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA), which enables investigators worldwide to identify differentially expressed genes (DEGs) and associated pathways involved in the onset and development of HCC and other diseases (Clough & Barrett, 2016;Lee, Tan & Chung, 2014). However, because of the heterogeneity of samples and microarray platforms, the results from different single cohort investigations may be inconsistent (Zhao et al., 2018). Therefore, an integrated bioinformatics analysis of multiple independent studies provides more reliable and robust results. Additionally, most analyses are based on the Affymetrix microarray profiles (Li et al., 2017), but considerably less is known about the datasets from Illumina microarray platforms.
Moreover, microRNAs (miRNAs) are short noncoding RNAs and regulate gene expression at the post-transcriptional level. They are involved in the regulation of key biological processes such as cell proliferation, metabolism, differentiation and apoptosis (Sticht et al., 2018). It has been reported that miRNAs contribute to the development of tumors (Iorio & Croce, 2017). However, associations of miRNAs with the dysregulation of gene expression in HCC and the prognosis of HCC patients require further investigation.
In this study, four transcriptome profiles from Illumina microarray platforms and one from TCGA database were collected to systematically identify a set of DEGs. Further analyses on gene ontology (GO), signaling pathways and protein-protein interactions (PPIs) were performed followed by key gene identification, overall survival (OS) analysis and immunohistochemistry validation. Meanwhile, miRNAs functioning as prognostic biomarkers were identified and the potential correlations of miRNA species with that of key genes were investigated.

Profile selection and Data Collection
The profiles in GEO databases were selected according to our selection criteria. The key words used were ''expression'', ''hepatocellular carcinoma'' and ''Illumina HumanHT-12''. For the 94 search results, the preliminary selection criteria were as follows: 1. Homo sapiens, 2. Gene expression profile; 3. All of the data were generated by an Illumina HumanHT-12 expression beadchip; 4. The control samples were collected from adjacent nontumor liver tissues; and 5. Samples were free of other therapies except radical resection. Among the eight profiles meeting the criteria, four datasets with the most samples were chosen for further analyses. The datasets were GSE36376, GSE39791, GSE57957, and GSE87630. The original Series Matrix Files of these four messenger RNA expression profiles were downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/). The microarray platform of the GSE36376, GSE39791, and GSE57957 datasets was GPL10558 (Illumina HumanHT-12 V4.0 expression beadchip, Illumina Inc.). The GSE87630 dataset was generated by GPL6947 (Illumina HumanHT-12 V3.0 expression beadchip, Illumina Inc.). A total of 415 HCC samples and 334 nontumor samples were included in these four datasets. The mRNA, miRNA and clinical data from TCGA-LIHC (https://cancergenome.nih.gov/) were also downloaded. The flow diagram of the study is shown in Fig. 1.

Data processing and identification of DEGs
For the mRNA data from GEO, background correction, quantile normalization and batch normalization were performed using R software (version 3.6.1, R Core Team, 2019). Based on the annotation information in the platforms, the probe sets were converted into the corresponding gene symbols. Probe sets without corresponding gene symbols were removed and the expression of genes with more than one probe set was averaged. The bioconductor (http://www.bioconductor.org) package ''limma'' was employed for DEG screening. P-value < 0.05 and |log 2 (Fold Change)|>1 were set as the cut-off criteria for the identification of DEGs in GEO datasets.
For TCGA data, the bioconductor package ''edgeR'' was applied to calculate expression alterations of genes and miRNAs. The cut-off criteria for differentially expressed genes and miRNAs were consistent with those for GEO data.
The common DEGs identified in both databases were chosen for further analyses. The Venn diagram was generated by FunRich (version 3.1.3, http://www.funrich.org/). Metascape (Zhou et al., 2019) (http://metascape.org) was utilized for molecular function (MF), cellular component (CC) and biological processes (BP) enrichment analyses of the common DEGs at the functional level. Pathway enrichment analysis was performed based on three databases in Metascape including KEGG Pathway, Reactome and BioCarta. A P-value < 0.05 was defined as significant in both GO and pathway enrichment analyses. The R package ''ggplot2 was employed to generate the bubble plots.

PPI network construction and cluster identification
The PPI network of DEGs was constructed using STRING (version 11.0, http://stringdb.org) and a combined score > 0.7 (high confidence) was set as the cut-off criterion. Cytoscape (version 3.7.2, https://cytoscape.org/), an open source platform, was applied to the visualization of molecular interaction networks based on the results from STRING. The Molecular Complex Detection (MCODE) plugin (version 1.5.1) in Cytoscape was used for identifying densely interconnected clusters. The selection criteria were as follows: degree ≥ 2, node score ≥ 0.2, K-core ≥ 2, and max depth = 100.

Key gene selection and further analyses
The top 11 hub genes in the PPI network with most connections were defined as key genes. The survival analyses of the key genes were performed with the aid of the Kaplan-Meier Plotter database (http://www.kmplot.com). Moreover, the protein expression and distribution in HCC and nontumor samples were displayed with the Human Protein Atlas (HPA) database (Uhlen et al., 2015) (https://www.proteinatlas.org/).

Prognostic signature of miRNAs and target key genes
The R package ''survival'' was applied for univariable analysis of differentially expressed miRNAs for overall survival. Then, miRNAs with a P-value < 0.1 in univariate analysis were enrolled in multivariable Cox regression analysis using IBM SPSS statistics 22 and a P-value < 0.05 was set as the selection criterion. The prognostic model was visualized by a nomogram. Next, Kaplan-Meier plotter analysis was performed to estimate the survival difference between the low-risk and high-risk groups, and 3-year and 5-year overall survival were displayed by ROC curves based on the prognostic model. Moreover, miRWalk (version 3.0, http://mirwalk.umm.uni-heidelberg.de/) was used to determine the overlap between predicted target genes of prognostic miRNAs and key genes identified previously.

DEG identification in HCC
After standardization of the data in GSE36376, GSE39791, GSE57957, and GSE87630, 344 DEGs (117 upregulated genes and 227 downregulated genes) were identified (Table S1). Meanwhile, 8976 DEGs (7425 upregulated genes and 1551 downregulated genes) were identified in the TCGA dataset (Table S2). The distribution patterns of expressed genes in GEO and TCGA samples are displayed in Figs

GO enrichment analysis of DEGs
Metascape was used for GO biological process, cellular component, and molecular function analyses. The top 10 GO terms with the lowest P-value in each group are exhibited in Fig. 3. In terms of molecular functions, the DEGs were significantly enriched in metabolic processes, including monocarboxylic acid, steroid and alcohol metabolic activities, as well as oxidoreductase activity and monooxygenase activity. For the cellular component group, ''blood microparticle,'' ''vesicle lumen'' and ''extracellular matrix'' were the main enriched categories. Specifically, changes in biological processes of the upregulated DEGs significantly focused on chromatid segregation, cell division and nuclear division while metabolic processes accounted for the majority of enriched categories with regard to the downregulated DEGs (Table 2).

Signaling pathway enrichment analysis of DEGs
According to the databases of KEGG Pathway, Reactome and BioCarta gathering in Metascape, biological oxidations and metabolism-related pathways were the main enriched pathways in DEGs (Fig. 4). As shown in Table 3, extracellular matrix organization, cell cycle and collagen formation were significantly enriched in upregulated DEGs. For downregulated DEGs, the top enriched signaling pathways were biological oxidations, retinol metabolism and metabolism of lipids.

PPI and cluster analysis
Based on the published literature collected by STRING, 188 DEGs were filtered in the PPI network, and 486 edges are displayed in Fig. 5A. Furthermore, the MCODE plugin was applied to investigate highly interconnected regions, known as clusters, in the PPI network. The three clusters with the highest scores are shown in Figs. 5B, 5C and 5D, respectively. Cluster 1, consisting of 13 nodes and 72 edges, was highly enriched in cell division, nuclear division, cell cycle, chromosome segregation and condensation (  Table S3). Twenty-three nodes and 77 edges were engaged in cluster 2 and were mainly associated with the epoxygenase P450 pathway and metabolic processes (Fig. 5C, Table  S4). For Cluster 3, metallothioneins bind metals, and the response to metal ions and detoxification were significantly enriched (Fig. 5D, Table S5).

DEGs
Gene names low abundance. Therefore, AURKA, PRC1, TOP2A, AOX1, CYP2E1, and CYP3A4 can be considered candidate liver-biopsy markers for high risk of developing HCC and poor prognosis in HCC.

Prognostic miRNA identification and survival analysis
To further investigate the connections between miRNAs and gene expression, prognostic analysis was also performed in miRNAs. A total of 409 miRNAs were identified as differentially expressed miRNAs (Table S6) and were included in univariate analysis. Forty-four miRNAs with a P-value < 0.1 in univariate analysis (Table S7) were selected as hits and were enrolled in further multivariate Cox regression analysis. Finally, seven miRNAs (hsa-mir-1269b, hsa-mir-139, hsa-mir-4800, hsa-mir-518d, hsa-mir-548aq, hsamir-548f-1, hsa-mir-6728) were identified as prognostic factors (P-value < 0.05) impacting the survival of HCC patients (Table 6). Based on the seven miRNAs, a prognostic nomogram predicting 3-year and 5-year survival is shown in Fig. 8A. The cut-off score of low and high  (Table S8). Meanwhile, the survival curves of patients with low risk and high risk, and the time-dependent ROC curves at 3 years (AUC = 0.76) and 5 years (AUC = 0.794) are plotted in Figs. 8B and 8C, respectively. Moreover, according to miRWalk, most key genes identified were potential target genes (score > 0.8) of these prognostic miRNAs (Table 6).

DISCUSSION
In the present study, 262 aberrantly expressed genes in HCC samples were disclosed by integrated bioinformatics analyses. The functional and signaling pathway enrichment analysis indicated the underlying mechanisms in HCC development. Among these genes, 11 key genes were identified through PPI analysis and their prognostic value in HCC patients was validated by OS analysis and immunohistochemistry. Meanwhile, univariate and multivariate Cox regression analysis revealed the prognostic application of seven miRNAs. Interestingly, according to miRWalk, most key genes were potentially regulated by these miRNAs. Genomic, epigenetic and transcriptional alterations, as well as subsequent multistep processes, play crucial roles in the occurrence, development, metastasis and prognosis of HCC (Dhanasekaran et al., 2019;Llovet et al., 2018). Combined with public biological databases (e.g., GO and KEGG), the development of high-throughput detection technology provides us with an opportunity to systematically explore ''interesting'' gene lists on a genome-wide scale and sort out correlative biological processes (Shan, Chen & Jia, 2019; Shen et al., 2019). To identify reproducible and robust genetic alterations, integrated analyses based on multiple datasets gradually replace single cohort analyses. Thus, in the present study, we recruited four datasets from GEO and one dataset from TCGA, including 789 HCC samples and 384 nontumor samples in total, to perform a meta-analysis and identify consensus alterations. Notably, all data from GEO in our study were generated by Illumina expression beadchip platforms, and our study was the first large-scale integrated bioinformatics analysis of the data from Illumina BeadArray platforms and TCGA database. According to the functional, signaling pathway and protein-protein interaction analyses, DEGs had noticeable impacts on multiple cellular activities, including DNA structure, cell division, metabolism, and microtubule cytoskeleton organization. Three clusters identified in PPI network mainly functioned in cell division and metabolic processes. These findings were consistent with our knowledge (Jiang et al., 2015;Kastan & Bartek, 2004;Liu et al., 2016;Loong & Yeo, 2014;Shoieb et al., 2019). Among 30 hub genes, upregulated ASPM, AURKA, CCNB2, CDC20, PRC1 and TOP2A and downregulated AOX1, CAT, CYP2E1, CYP3A4 and HP, were considered key biomarkers in HCC. Overall survival analysis indicated that the aberrant expression of these genes correlated with a poor prognosis. With the aid of immunohistochemistry, AURKA, PRC1, TOP2A, AOX1, CYP2E1, and CYP3A4 were considered candidate liver-biopsy markers for HCC. According to the instruction on HPA database, all images were manually annotated by a specialist followed by verification by a second specialist using fixed guidelines. Basic annotation parameters include an evaluation of staining intensity (negative, weak, moderate or strong), fraction of stained cells (<25%, 25-75% or >75%) and subcellular localization (nuclear and/or cytoplasmic/membranous). Thus, they were deemed reliable because of the homogeneous processes and fixing guidelines, although all immunohistochemistry images were manually annotated. However, the significance of these proteins in HCC requires further validation due to the limited number of samples in HPA, unpaired comparison and lack of statistical analysis. The overexpression of ASPM, which contributes to neurogenesis and cell proliferation, is an independent risk factor for early tumor recurrence regardless of p53 mutation status and poor prognosis of HCC (Lin et al., 2008). TOP2A, which is a mitotic gene and highly expressed at the G2/M phase, correlates with early age onset, shorter patient survival and chemoresistance (Wong et al., 2009). TRRAP/KAT5, which activates TOP2A, has been reported to inhibit HCC cell growth through induction of p53-independent and p21-independent senescence (Kwan et al., 2020). CCNB2, CDC20 and PRC1 are the three most commonly reported upregulated genes in HCC through bioinformatics analyses (Chen et al., 2016a;Gao et al., 2018;Li et al., 2014;Liu et al., 2018;Wang et al., 2019b). CCNB2, as a component of the cell cycle regulatory machinery, is associated with the Golgi region (Draviam et al., 2001;Jackman, Firth & Pines, 1995) and plays an important role in regulating the G2/M transition (Gui & Homer, 2013;Li et al., 2018a). CDC20 is essential to chromosome segregation and mitotic exit and regulates the cell cycle at multiple time points (Kapanidou, Curtis & Bolanos-Garcia, 2017). PRC1 exerts an impact on the formation of microtubule architectures and subsequent cell shape formation and cytokinesis regulation and promotes early recurrence of HCC associated with the Wnt/ β-catenin signaling pathway (Chen et al., 2016a). AURKA promotes cancer metastasis in HCC by inducing epithelial-mesenchymal transition and cancer stem cell behaviors through the PI3K/AKT pathway (Chen et al., 2017). Alisertib, an AURKA inhibitor, has been demonstrated to potently inhibit cell viability and induce apoptosis in HCC cells (Li et al., 2018b). The elevated expression levels of PRC1, TOP2A, and AURKA in HCC tissue were also confirmed by immunohistochemistry and appear to be candidate biopsy markers for high risk of developing HCC and poor prognosis in HCC patients. It has been reported that AOX1 interacts with ABCA1 to regulate ABCA1-related cellular functions such as lipid efflux and phagocytosis in hepatocytes and is highly expressed in normal human liver tissue (Sigruener et al., 2007). In contrast, reduced expression of AOX1 is detected in HCC cells and is highly correlated with higher tumor stage, distant metastases or positive lymph node status (Sigruener et al., 2007). Haptoglobin (HP) is an acidic glycoprotein tetramer that is mainly secreted in the liver. Its expression is low during the active proliferation of progenitor cells but increases when the cells reach confluency (Guillouzo et al., 2007). CAT, a well-described oxidative stress biomarker, plays an important role in inflammation and the prevention of apoptosis and serves as a growth promoting factor in a wide spectrum of tissues (Goyal & Basak, 2010;Miyamoto et al., 1996). Hence, it is reasonable for CAT to be a key biomarker in hepatocarcinogenesis and development. Both CYP2E1 and CYP3A4, as members of the cytochrome P450 family which contributes to drug metabolism, lipid synthesis and homeostasis (Manikandan & Nagini, 2018), have been reported to be less expressed in HCC (Chen et al., 2014;Hu et al., 2019). Multivariate Cox regression analysis offered a prognostic model involving seven miRNAs. By reviewing the published literature, we found that hsa-miR-1269b and hsa-miR-139 had been reported to be dysregulated in HCC. Hsa-miR-1269b promotes malignancy in HCC by upregulating cell division cycle 40 homolog (Kong et al., 2016). Downregulation of hsa-miR-139 is associated with poor outcome (Wang et al., 2019a) and the long noncoding RNA SNHG3/miR-139-5p/BMI1 axis may be one of the potential signaling pathways (Wu et al., 2019). However, the correlative mechanisms and pathways of hsa-miR-4800, hsa-miR-518d, hsa-mir-548aq, hsa-mir-548f-1 and hsa-mir-6728 in HCC have not been determined. By executing the TarPmiR algorithm for target site prediction in miRWalk, most key genes were considered candidate target genes of the indicated miRNAs and the same mRNA was likely to be controlled by various miRNAs, which is a common phenomenon. Currently, the interactions between these molecules lack solid supports, and experimental evidence is required to elucidate the underlying mechanisms. The emergence and advances in the bioinformatics field accelerate the development of biology. Bioinformatics tools provide opportunities for handling big data that are impossible to manage manually. However, the heterogeneity of how data are generated, assembled, annotated and displayed will substantially affect the results. It is a common limitation of bioinformatics analyses. Meanwhile, the clinical features of the patients such as HBV/HCV infection, age, alcohol consumption and cancer stages were not analysed in our study. These factors may have important impacts on gene and miRNA expression. The correlation between biomarkers and cancer stages also requires further research to provide useful references and insights into research and the clinic. Moreover, in-depth research should be conducted with the combination of traditional biology technologies to validate the findings from bioinformatics analyses, particularly the interactions between miRNAs and potential target mRNAs.

CONCLUSIONS
The present study systematically analyzed multiple transcriptome profiles and summarized a list of differentially expressed genes and miRNAs in HCC. To the best of our knowledge, no such large-scale investigation on Illumina BeadArray platforms was performed previously.
Eleven key genes and a 7-miRNA model provide biomarkers for the surveillance and prognosis of HCC. This study decoded the alterations in HCC at the molecular and functional levels and offered potential targets for therapy.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This study was supported by grants from the National Science and Technology Major Project (2017ZX10203205). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors: The National Science and Technology Major Project: 2017ZX10203205.