A twenty gene-based gene set variation score reflects the pathological progression from cirrhosis to hepatocellular carcinoma

The molecular mechanism of the pathological progression from cirrhosis to hepatocellular carcinoma (HCC) remains elusive. In the present study, tissue samples from normal liver, cirrhosis and HCC were subjected to differentially gene expression analysis, weighted gene correlation network analysis to identify the twenty hub genes (TOP2A, CDC20, PTTG1, CDCA5, CCNB2, PRC1, KIF20A, SF3B4, HSP90AB1, FOXD2, PLOD3, CCT3, SETDB1, VPS45, SPDL1, RACGAP1, MED24, KIAA0101, ZNF282, and USP21) in the pathological progression from cirrhosis to HCC. Each sample was calculated a hub gene set variation analysis (HGSVA) score using Gene Set Variation Analysis, The HGSVA score significantly increased with progression from cirrhosis to HCC, and this result was validated in two independent data sets. Moreover, this score may be used as a blood-based marker for HCC and is an independent prognostic factor of recurrence-free survival (RFS) and overall survival (OS). High expression of the hub genes may be driven by hypomethylation. The twenty gene-based gene set variation score may reflect the pathological progression from cirrhosis to HCC and is an independent prognostic factor for both OS and RFS.


INTRODUCTION
Hepatocellular carcinoma (HCC) is the most common type of primary liver cancer and the third leading cause of cancer-related death worldwide [1]. Its clinical characteristics show a gradually increasing incidence and a poor prognosis. The recurrence rate is 42.9% within two years after curative treatment [2]. Although many drugs have been tested or under exploration [3][4][5][6], none have been confirmed to increase recurrence-free survival (RFS). Therefore, it is very important to study the pathological mechanism of HCC.
The incidence of HCC, which can be caused by various factors, varies widely worldwide, and this disease is often associated with chronic HBV or HCV infection. The pathogenesis of HCC is very complex, comprising multiple genetic and epigenetic alterations, chromosomal aberrations, gene mutations, and altered molecular pathways [7]. Chronic liver injury, inflammation, hepatocyte degeneration/regeneration, cirrhosis, necrosis and small cell dysplasia can be observed during multistep hepatocarcinogenesis [8].
There are various causes of cirrhosis, such as excessive alcohol consumption, contact with or consumption of Aspergillus toxins, and various metabolic disorders [9], but the mechanism of progression from cirrhosis to HCC remains elusive. Therefore, there is an urgent need to identify the hub gene or gene set dictating the progression from cirrhosis to HCC and to develop potential therapeutic targets.
In the present study, twenty hub genes were identified by WGCNA. The hub gene set variation analysis (HGSVA) score increased significantly with the progression from cirrhosis to HCC and was validated in two independent data sets. In addition, the HGSVA score is an independent prognostic factor for overall survival (OS) and RFS.

Differentially expressed genes (DEGs) in cirrhosis and HCC
A total of 3525 DEGs, 1793 downregulated and 1732 upregulated, were identified in cirrhosis and HCC samples compared to control samples ( Figure 1A). Cluster analysis showed that these DEG patterns could basically distinguish cirrhosis and HCC tissues from normal liver tissues ( Figure 1B).

Module associated with the progression from cirrhosis to HCC
To find the key module most associated with the progression from cirrhosis to HCC, WGCNA was performed using the expression profile of the DEGs. The power was 7, which was the lowest value for the scale with an independence degree of up to 0.90 ( Figure  2A). Three modules were identified ( Figure 2B). The red module was positively correlated with phenotype (correlation coefficient = 0.86, P = 4E-26; Figure 2C), while the green module was negatively correlated with phenotype (correlation coefficient = -0.71, P = 1E-14; Figure 2C). This finding indicated that with the progression from cirrhosis to HCC, the expression patterns of the red module genes showed an increasing trend, while those of the green module genes showed a decreasing trend. The red module was selected as our candidate module in which to identify hub genes. The module membership (MM) and gene significance (GS) were highly associated in the red module ( Figure 2D).

Key pathways in the progression from cirrhosis to HCC
Red module genes were significantly involved in biological processes related to cell proliferation and differentiation ( Figure 3A) and in cirrhosis-or HCCrelated pathways, such as hepatitis B and the TNF and P53 signaling pathways ( Figure 3B). Green module genes were significantly enriched in biological processes related to T cells, leukocytes and lymphocytes ( Figure 3C) and in the MAPK signaling pathway ( Figure 3D).

The HGSVA score may be a blood-based marker and independent predictor of OS and RFS in HCC
The HGSVA score in peripheral blood samples was significantly higher for patients with HCC than for healthy controls ( Figure 4E). ROC curve analysis showed that the HGSVA score has potential as a bloodbased marker for HCC (AUC=0.850, Figure 4F). The HGSVA score was significantly associated with the OS ( Figure 5A) and RFS of patients with HCC ( Figure 5B). This result was consistent with that for GSE49515 ( Figure 5C). Among patients with HCC, those in the high HGSVA score group had a shorter OS and RFS. (B) The HGSVA score in GSE6764. (C) The HGSVA score in the TCGA HCC data set (phenotype defined by pathological stage). (D) The HGSVA score in the TCGA HCC data set (phenotype defined by tissue grade). (E) The HGSVA score in GSE6764. (F) ROC results: the AUC is significant, which indicates that the gene set can be used for the early diagnosis of HCC. AGING than those in the low score group. Furthermore, univariate and multivariate Cox analyses showed that the HGSVA score was a significantly independent factor for both OS (Table 1) and RFS (Table 2) after adjusting for routine clinicopathological features Hub genes are also highly expressed at the protein level in HCC Seventeen identified genes are included in The Human Protein Atlas, and all were highly expressed in HCC compared to normal liver ( Figure 6). The antibodies against these proteins are listed in Supplementary Table  1. These data provide strong support for our results. FOXD2, MED24 and KIAA0101 are not in The Human Protein Atlas.

The upregulation of hub genes by hypomethylation
The 20 hub genes were scanned for genetic alterations ( Figure 7A), and a few samples had mutations (9.34%). This finding indicated that mutation may not be the cause of the differential gene expression. Compared to normal liver tissue, HCC tissue harbored one or more differentially methylated sites in the 20 hub genes; therefore, the upregulation of the hub genes was driven by hypomethylation ( Figure 7B).

DISCUSSION
Cirrhosis of any etiology increases the risk of HCC [10][11][12]. The sequential progression of cirrhosis to HCC is characterized by nodular lesions, such as low-and high-grade dysplastic nodules, which progress to early HCC and then to intermediate/advanced HCC. The slow progression from cirrhosis to HCC may be the key to preventing HCC, but the mechanism underlying this progression is not well understood. In our present study, we screened DEGs between cirrhosis/HCC and normal liver to perform WGCNA. Two functional modules were identified: one module (red module) was significantly positively correlated with the progression from cirrhosis to HCC, and the other module (green module) was significantly negatively correlated with    this progression. This result suggests that liver disease progression is driven by multiple genes rather than a single gene. The red module genes were significantly involved in several cancer-related pathways, such as the cell cycle and the TNF and P53 signaling pathways, and in viral carcinogenesis and the hepatitis B pathway. HBV and HCV are the main viral pathogenic factors of cirrhosis and HCC.
Previous studies have focused on the role of a single gene or molecule. In the present study, twenty genes were identified as hub genes. Unsurprisingly, some have been associated with HCC, such as CD20 [13], PTTG1 [14], CDCA5 [15], HSP90AB1 [16], CCT3 [17], SF3B4 [16], SETDB1 [17], VPS45 [18], RACGAP1 [19], USP21 [20,21], and KIF20A [22]. In addition, we also found that MED24, FOXD2 and ZNF282 are associated with the progression from cirrhosis to HCC. However, no single molecule is currently widely identified or targeted in the clinic, perhaps due to the heterogeneity of HCC. Thus, we used GSVA to score a sample rather than a gene (molecule). The HGSVA score significantly increased with the progression from cirrhosis to HCC and was validated in two independent data sets. This increasing trend was more pronounced upon the formation of HCC. Notably, the trend was not obvious before HCC formation in GSE6764, potentially because of the etiology of HCV in GSE6764. This finding indicates that the pathological progression from cirrhosis to HCC is complex and related to the etiology.
ROC analysis revealed that the HGSVA score had a high AUC (0.850) and that it may be a blood-based marker for HCC. This score may help diagnose patients with ambiguous imaging results. The HGSVA score was associated with prognosis; a higher score indicated a shorter OS and RFS. The results of the univariate and multivariate Cox regression analyses showed that the HGSVA score is an independent prognostic factor after adjusting for routine clinicopathological features. Surveillance is cost-effective and dictated by the incidence of HCC [23,24]. High-score patients may be followed up more frequently than low-score patients based on the prognostic stratification system.
We identified mutations in the 20 hub genes in a few patients. Most of the genes had low methylation levels. Aberrant DNA methylation plays a crucial role in carcinogenesis [25,26]. Therefore, we speculate that hypomethylation may be one cause of the upregulation of the 20 hub genes. The specific mechanism should be further explored, and further experimental verification is needed.
Herein, we provide new insights into the progression from cirrhosis to HCC. This study may represent the first report of a score for cirrhosis and HCC samples using a hub gene set. However, several limitations of the present study should be noted. First, the molecular mechanisms of these 20 genes in multistep hepatocarcinogenesis are not yet clearly understood, and it is unknown whether HGSVA is a causal factor or merely a marker of the pathological progression from cirrhosis to HCC. Second, the HGSVA score still needs to be confirmed in prospective trials before clinical application. Third, GSVA is more expensive, so further cost reductions are necessary before it can be applied in the clinic. In conclusion, we identified and validated a hub gene set in the pathological progression from cirrhosis to HCC. The twenty genebased gene set variation score reflects the pathological progression from cirrhosis to HCC and is an independent prognostic factor for both OS and RFS.
The gene expression profiles in GSE6764 [28] are based on GPL570, including 10 normal controls, 10 cirrhotic liver tissues, 3 cirrhotic liver tissues from patients without HCC, 9 low-grade dysplastic liver tissues, 7 high-grade dysplastic liver tissues, 8 very early HCC, 10 early HCC, 6 advanced HCC, and 9 very advanced HCC. The whole blood gene expression profiles in GSE49515 [29] are based on GPL570, including 10 HCC and 10 normal controls. The gene expression profiles in GSE54236 [30] are based on GPL6480, including 81 HCC and 80 normal controls. The normalizeBetweenArrays function in the limma package [31] was used to normalize the gene expression profiles. If a gene corresponded to multiple probes, the average expression value of these probes was chosen as the expression value of the gene. Eight low-grade chronic hepatitis and 12 high-grade chronic hepatitis samples in GSE89377 and 3 cirrhotic liver tissues from patients without HCC in GSE6764 were removed from the analysis in the present study. RNA sequencing (displayed as read count) and clinical information of HCC were downloaded from The Cancer Genome Atlas (TCGA, https://www.cancer.gov/) [32]. The workflow of the present study is shown in Figure 8.

Differentially expressed gene (DEG) analysis
The DEGs in cirrhosis and HCC samples (cirrhosis, low-grade dysplastic nodules, high-grade dysplastic nodules, and HCC) compared to the normal liver samples were screened using the limma package in R and GSE89377. The fold changes (FCs) in the expression of individual genes were calculated, and genes with |log2FC| > 1 and P < 0.05 adjusted by the false discovery rate (FDR) were considered significant.

Weighted gene correlation network analysis (WGCNA) in GSE89377
We extracted the expression profile of DEGs in GSE89377 to perform WGCNA [33]. The phenotypes were converted to numbers for the analysis: 1 indicates normal liver, 2 indicates low-grade chronic hepatitis, 3 indicates high-grade chronic hepatitis, 4 indicates cirrhosis, 5 indicates low-grade dysplastic nodules, 6 indicates high-grade dysplastic nodules, 7 indicates early HCC, 8 indicates grade 1 HCC, 9 indicates grade 2 HCC, and 10 indicates grade 3 HCC. First, hierarchical clustering analysis was performed using the hclust function. Then, the soft thresholding power value was screened during module construction by the pickSoftThreshold function. Candidate power (1 to 30) was used to test the average connectivity degrees of different modules and their independence. A suitable power value was selected if the degree of independence was > 0.9. The WGCNA R package was used to construct coexpression networks (modules); the minimum module size was set to 30, and each module was assigned a unique color.

Functional enrichment analysis
To explore the biology of the gene modules, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed using the clusterProfiler package [34] in R. P < 0.05 was considered significant.

Identification of hub genes and calculation of the HGSVA score
In WGCNA, the module eigengene was the first principal component of the expression matrix for a module. The module eigengene was considered an average gene expression value for genes in a module. Phenotype was converted into a numerical value, and a regression analysis was performed between the module eigengene values and the phenotype. Module membership (MM) was defined as the association between a gene and its module, and gene significance (GS) was defined as the correlation of a gene with a phenotype. Genes with a high GS and MM were considered to be hub genes in the module. In the present study, a gene with GS > 0.7 and MM > 0.8 was considered a hub gene. Thus, multiple hub genes formed the hub gene set. Gene set variation analysis (GSVA) [35] was used to score individual samples against the hub gene set, and each sample received an HGSVA score.

Validation of the HGSVA score, ROC curve analysis and univariate/multivariate Cox proportional hazards analyses
The HGSVA scores of all HCC samples were calculated in GSE6764, GSE49515, GSE54236 and the TCGA as in GSE89377. The GSE6764 and TCGA data sets were used to validate the increasing trend in the HGSVA score with the progression from cirrhosis to HCC. In addition, ROC curve analysis was conducted using the pROC package [36] to assess the diagnostic value of the peripheral blood HGSVA score for HCC in GSE49515. Patients with HCC in the GSE54236 and TCGA data sets were separated into high-and low-score groups based on the median HGSVA score. Kaplan-Meier survival analysis with the log-rank method was performed to explore the association of HGSVA score with recurrence-free survival (RFS) or overall survival (OS) using the survival package (https://CRAN.Rproject.org/package=survival) in R. Moreover, univariate/multivariate Cox proportional hazards analyses were applied to compare the relative prognostic value of the HGSVA score with that of routine clinicopathological features in the TCGA.

Validation of the differential expression of hub genes at the protein level
The Human Protein Atlas (https://v15.proteinatlas.org/) [37] contains information for a large majority of human protein-coding genes, including their RNA and protein expression and localization. We scanned The Human Protein Atlas web tool to validate the differential expression of the hub genes at the protein level.

Mutation and methylation analyses
To explore the potential mechanism of the differential expression of the hub genes, we scanned these genes for mutations and methylation profiles. The TCGAbiolinks package [38] was used to download and scan the mutation status of the hub genes. In addition, we explored the DNA methylation of these genes using Wanderer (http://maplab.imppc.org/wanderer/) [39], which is an intuitive network tool that can be used to retrieve DNA methylation and gene expression data for different tumor types in TCGA databases.

CONFLICTS OF INTERESTS
The authors declare that there are no conflicts of interests.