The Gene Signature Associated with Hepatocellular Carcinoma in Patients with Nonalcoholic Fatty Liver Disease

Background Nonalcoholic fatty liver disease (NAFLD) is becoming a critical risk of hepatocellular carcinoma (HCC). As both NAFLD and HCC are heterogeneous diseases, this study aims to identify the similarity between the subtypes of NAFLD and HCC based on gene modules. Methods Coexpressed gene modules were extracted for both NAFLD and HCC. The association between the coexpressed gene modules of NAFLD and HCC was evaluated by Fisher's exact test. The overlapping coexpressed gene module was validated in three independent human NAFLD datasets. Furthermore, the preserved gene module was assessed in four independent NAFLD mouse datasets. The significantly enriched motifs within the gene module were inferred from upstream sequences. Results Four coexpressed gene modules were extracted from NAFLD. Of the four coexpressed gene modules, one was significantly overlapping with a module of HCC. This overlapping gene module was regarded as the HCC-associated NAFLD gene module (HANM). Enrichment analysis of biological processes revealed inflammatory response in HANM. Specifically, within the inflammatory response biological process, IL-17, TNF-α, and NF-κB signaling pathways were enriched. HANM was found to be strongly or moderately conserved across four mouse NAFLD datasets. Motif analysis of the upstream genomic sequences of HANM revealed nine transcription factors (FLI1, NRF1, ZBTB33, ELK1, YY1, ZNF143, TAF1, SF1, and E2F1), of which three transcription factors (YY1, E2F1, and ZNF143) were significantly highly expressed in the NAFLD patients and exhibited survival significance in HCC. Conclusion This study demonstrated a robust way to identify the sharing gene signature between subtypes of NAFLD and HCC, which contributed to a comprehensive understanding of pathogenesis from NAFLD to HCC.


Background
Nonalcoholic fatty liver disease (NAFLD) is a spectrum of chronic liver disease spanning excessive cytoplasmic retention of triglyceride, steatosis, nonalcoholic steatohepatitis (NASH), and hepatic fibrosis and cirrhosis [1]. With the global pandemic of obesity, NAFLD is becoming a critical issue [2]. e prevalence of NAFLD increased from 15% to 25% during 2005-2010 [3]. Of NAFLD patients, approximately 2.4-12.8% will develop into progressive liver disease [4] including hepatocellular carcinoma (HCC). However, NAFLD-associated HCC risk has been largely underestimated because of misdiagnosis [5]. Recent studies have recognized NAFLD as the most common risk for HCC [6]. A USA cohort study has reported that 59% of HCC patients were associated with NAFLD or NASH [7]. Other studies from Germany [8], Italy [9], and Japan [10] have reported that 41.7-49% of HCC patients without cirrhotic background were related to NAFLD.
us, it is urgent to understand the associated mechanisms behind them.
NAFLD is a disease associated with the necroinflammatory process [11], DNA damage response [12], immune responses [13], and oxidative stress (reactive oxygen species, oxidation, and endoplasmic reticulum) [14]. However, both NAFLD and HCC are highly heterogeneous. Direct comparison between them with regular statistical tests could be problematic. Alternatively, weighted correlation network analysis (WGCNA) can subtype HCC and NAFLD patients and identify the underlying coexpressed gene modules at the same time [15]. e underlying coexpressed gene module of the associated HCC and NAFLD subtypes, hereafter, was named HANM (the HCC-associated NAFLD module). e construction of HANM has multiple applications. First, it could improve our understanding of the NAFLD-to-HCC conversion mechanism. Second, the identification of HANM could help to identify the patients with HCC risk.
In this study, we first identified the coexpressed gene modules of both NAFLD and HCC. Module overlapping analysis was used to find HANM. HANM was then validated in three independent NAFLD human datasets and four mouse datasets. Functions of HANM were further characterized with gene ontology, KEGG, and motif analysis.

Coexpressed Gene Module
Construction. R package WGCNA [15] was used to construct the coexpressed gene modules according to its manual (https://horvath.genetics. ucla.edu). e parameter, soft power, was chosen by a scale-free topology model fit. An unsigned correlation coefficient is regarded as the distance between two samples. A topological network was thus constructed, hierarchically clustered, and dynamically cut into multiple gene modules. Gene modules were merged according to Pearson's correlation coefficient between their eigengenes.

Gene Module Preservation Analysis.
To evaluate the clinical significance of coexpressed gene modules, the overall survival proportion was regressed against survival time with the Cox proportional hazard model. Expression over the median was defined as the high expression and that below the median as low expression. e Cox proportional hazard model was built with the R package "survival." A log-likelihood test was used to assess the significance. Modules with a higher survival probability were further enriched with gene ontology biological processes.
Human genes were mapped to mouse genes (orthologous genes) according to the Mouse Genome Informatics database (http://www.informatics.jax.org). Gene module preservation was estimated by averaging the several preservation statistics generated through 1000 permutations of the original data. A Z summary value is calculated, which summarizes the evidence that a module is preserved and indicative of module robustness and reproducibility. According to WGCNA recommendation [20], the preservation was defined as strong for Z summary > 10, moderate for 2 < Z summary <10, and weak for Z summary <2.

Gene Function Analysis.
ClusterProfiler [21] was utilized to perform the enrichment analysis of biological processes of gene ontology and KEGG pathways (https://www.genome.jp).

Searching Shared Transcription
Factors. iRegulon (http://iregulon.aertslab.org) [22] was used to perform the enrichment analysis of shared transcription factors in a gene module. As suggested, the transcription factors with a normalized enrichment score >3.0 were considered significant.

Survival Analysis.
e survival analysis was conducted with a web service [23]. e parameter cutoff of gene expression was set as "optimal." Risk factors (alcohol consumption and hepatitis virus) were set none. Totally, 91 samples were used in this analysis.

Coexpressed Gene Modules in NAFLD Patients.
e whole analysis procedure is illustrated in Figure 1. First, gene expression of a NAFLD dataset, GSE89632, was directly parsed from the prepared series matrix file and hierarchically clustered. Dynamic cutting and merging of the hierarchical clustering tree generated six coexpressed gene modules. ey were named with different colors automatically by the WGCNA program (Supplementary Figure 1). Module similarity was estimated by Pearson's correlation coefficient between module eigengenes. Modules with a similarity >0.75 were merged. Pearson's correlation coefficient between the eigengenes of each merged module and the NAFLD status is shown in Figure 2(a). ree gene modules (blue, yellow, and brown) were significantly associated with NAFLD. Gene ontology analysis of biological processes found no enriched biological process in the blue module ( Figure 2(b)). e yellow module was found enriched in the inflammatory response, positive regulation of the metabolic process, positive regulation of the nitrogen compound metabolic process, and regulation of the apoptotic process ( Figure 2(c), Supplementary Table 1). Specifically, of the inflammatory response biological process, IL-17, TNF-α, and NF-κB signaling pathways were enriched (Supplementary Table 2). e brown gene module is found enriched in the regulation of the nitrogen compound metabolic process, regulation of the cellular biosynthetic process, and cellular response to stress ( Figure 2(d)).

HCC-Associated NAFLD Coexpressed Gene Modules.
An HCC gene expression dataset was downloaded from TCGA (https://portal.gdc.cancer.gov/). Gene modules in HCC were constructed as NAFLD (Supplementary ese gene modules were then compared to the modules from the NAFLD dataset. Differential significance was assessed by Fisher's exact test (Figure 3(a)). Of the six gene modules from the NAFLD dataset, the yellow gene module was significantly associated with the gene modules (turquoise and brown) from HCC (p value <0.001).
e two gene modules from HCC were functionally similar to the yellow module of NAFLD in the biological processes (

Gene Module Consistency in Human Patients and Mouse Models with NAFLD.
e identified gene modules associated with HCC were further validated in three independent human NAFLD datasets. e three datasets (GSE59045, GSE126848, and GSE83452) were downloaded from the GEO database. A permutation test was used to calculate the empirical p value of the statistics Z summary , which indicated the preservation of the coexpressed gene network in one dataset against another. e preservation was defined as strong for Z summary > 10, moderate for 2 < Z summary < 10, and weak for Z summary < 2. e yellow gene module was found strongly preserved in two datasets and moderately conserved in the other one (Figure 4(a)).
We further validate these gene modules in mouse models. Four datasets (GSE137449, GSE128940, GSE114261, and GSE83596) from mouse models were compiled. Z summary scores indicated that the yellow module was the most conserved gene module between NAFLD human patients and mouse models. e yellow module was strongly conserved in GSE137449, moderately conserved in GSE128940 and GSE83596, and weakly conserved in GSE114261 (Figure 4(b)). e yellow module was, hereafter, named HANM (the HCC-associated NAFLD module, Supplementary Table 1).
We selected the samples with high correlation to the eigengene of HANM (Pearson's correlation coefficient>0.9, p < 0.05). ese samples were then compared to healthy samples by each HANM gene with the limma package (https://bioconductor.org). Fold change >1.8 and false discovery rate (FDR) < 0.05 were used as a cutoff. Totally, 131 differentially expressed genes were found. Of the nine transcription factors, six (ZBTB33, YY1, TAF1, E2F1, SF1, and ZNF143) were significantly highly expressed in the NAFLD patients with FDR < 0.05 and fold change > 1.8.
We further investigated the survival effect of these six differentially expressed transcription factors. Patients were split into two groups by the optimal cutoff of each gene [23]. A log-rank test was used to assess the survival differences between the two groups. Of the six differentially expressed transcription factors, YY1, ZNF143, and E2F1 were significantly associated with survival with p value � 0.0015, 0.0051, and 0.0015, respectively.

Discussion
In this study, we had inferred multiple coexpressed gene modules in NAFLD and HCC patients. Association analysis between NAFLD and HCC identified three overlapping gene modules (blue, yellow, and brown gene modules). Only in a part of NAFLD patients, the gene expression was positively correlated with the eigengene of HANM, which was reasonable since not all HCC patients came from NAFLD. In this sense, direct NAFLD vs. HCC comparison was defective to identify the faithfully associated genes between NAFLD and HCC.
Comparing the gene modules of NAFLD to those of HCC, the brown and yellow gene modules of NAFLD were found associated with the turquoise and brown modules of HCC, respectively.
ough the brown and yellow gene modules were overlapped in NAFLD and HCC, only the yellow gene module showed consistency in the independent NAFLD human and mouse datasets. us, we only considered the yellow gene module as HANM. Besides, HANM has the highest Z summary score in the choline-deficient, L-amino acid-defined high-fat diet mouse model (GSE137449), which implies the usefulness of this mouse model in the study of pathogenesis from NAFLD to HCC.
HANM was enriched in immune and metabolism-related biological processes including inflammatory response, positive regulation of the metabolic process, positive regulation of the nitrogen compound metabolic process, and regulation of the apoptotic process. ese biological processes had a high association with NAFLD-related HCC. Specifically, of the inflammatory response biological process, IL-17, TNF-α, NF-κB signaling pathways were enriched     Table 2). A mechanism study demonstrated that TNF-alpha and IL-6 can enhance HCC in NAFLD or obesity population [24], which could be activated by the elevated ER stress in NAFLD patients [25].
To further characterize HANM, we investigated the upstream transcription factors in accord to the enriched motifs upstream of HANM genes. Nine transcription factors were retrieved. Of the nine transcription factors, three transcription factors (FLI1, NRF1, and ZBTB33) were conserved in the mouse models. e function of FLI1 and ZBTB33 in NAFLD is still unclear. FLI1 is a protooncogene and can promote metastasis by regulating MMP2 signaling [26]. ZBTB33 is also found to be associated with diabetes mellitus and hepatocellular carcinoma [27]. NRF1 is clearly a NAFLD pathogenic gene. As reported, NRF1 is involved in mediating the activation of oxidative stress response genes Kaplan-Meier plot for E2F1 in human liver cancer. [28]. After liver-specific deletion of NRF1, mice could develop all NAFLD features including steatosis, fibrosis, cirrhosis, and liver cancer [29,30].
Of the nine transcription factors, six (ZBTB33, YY1, TAF1, E2F1, SF1, and ZNF143) were significantly highly expressed in the NAFLD patients with FDR < 0.05. YY1, E2F1, and ZNF143 also have survival significance in HCC. e three genes have a strong association with NAFLD according to the literature. It has been reported that YY1 is associated with NAFLD progression undergoing bariatric surgery [31]. e expression level of YY1 is significantly correlated with NAFLD biomarkers including serum glucose, insulin, HDL, LAT, and AST. E2F1 was considered a novel regulator of metabolism [32]. E2F1 deletion can completely abrogate hepatic steatosis in different murine models [33]. ZNF143 in NAFLD has been identified in an independent study [34]. ZNF143 plays a key role in the regulation of the metabolic network about cell survival and differentiation [35,36]. ese results demonstrated the reliability of the identified gene module, HANM.
It was noteworthy that NAFLD would express higher alcohol-metabolizing genes including ADH, ALDH, CYP2E1, and CAT [37]. But we failed to find any of those genes in HANM, which suggested that the alcohol-metabolizing genes, as characteristics of NAFLD, could not be the cause of further development of HCC.
In summary, this study identified an HCC-associated NAFLD module. It was then validated in multiple datasets from human and mouse models. Its functions were characterized by gene ontology, KEGG pathway, motif, and survival analysis. e identification of this HCC-associated NAFLD module could help to understand the pathogenesis from NAFLD to HCC. e data used to support the findings of this study are  available at NCBI GEO, accession numbers: GSE89632,  GSE59045, GSE126848, GSE83452, GSE137449, GSE128940,  GSE114261, and GSE83596. Disclosure e funder had no role in writing, editing, approval, or decision to publish the article.

Conflicts of Interest
e authors declare that they have no conflicts of interest. this article.
is work was supported by the Medical Innovation Project of the Health and Family Planning Commission of Fujian Province (2018-CX-8).

Supplementary Materials
Supplementary Table 1   Journal of Oncology 7