Combined identification of ARID1A, CSMD1, and SENP3 as effective prognostic biomarkers for hepatocellular carcinoma

Background: The current study aimed to understand the genetic landscape and investigate the diagnostic and prognostic biomarkers of primary hepatocellular carcinoma (HCC). Methods: A cohort of 36 Chinese HCC samples with hepatitis B virus (HBV) infection was examined by whole-exome sequencing (WES). Prognosis-related alterations were identified and further verified in the TCGA database and GSE65372 profiles in the GEO database. A Chinese replication cohort of 180 HCC samples with HBV infection was collected to evaluate the candidate genes by immunohistochemical analysis. A receiver operating characteristic (ROC) curve analysis evaluated the prognostic power of candidate genes. Finally, EdU and transwell invasion assay were performed to detect the function of candidate genes. Results: A total of 11 novel genes showed a significant association with HCC in the discovery cohort. The data were verified using the GEO and TCGA databases, and the expression of ARID1A, CSMD1, and SENP was evaluated in the replication cohort. Furthermore, ARID1A, CSMD1, and SENP3 are effective prognostic biomarkers for HCC patients in the replication population. Conclusions: Molecular heterogeneity was detected in HCC patients, and ARID1A, CSMD1, and SENP3 were identified as effective HCC prognosis biomarkers. CSMD1 prevents HCC by suppressing cell invasion.

AGING considered an essential part of HCC treatment. Therefore, knowledge of the HCC gene landscape is not only imperative to our understanding of HCC heterogeneity but also to explore effective diagnostic and prognostic HCC biomarkers [3,6,7].
Although previous studies have identified specific HCC risk factors, such as CTNNB1, TP53, AXIN1, and CNKN2A [8,9], the most prevalent diagnostic and prognostic genes for HCC are yet unknown [3]. The present study investigated the diagnostic and prognostic biomarkers of HCC patients in the Chinese population by comprehensive analysis of gene variations and expression in a discovery cohort of 36 HCC samples with HBV-infection and a replication cohort of 180 HCC samples. The gene landscapes of HCC patients with HBV infection were identified, and the resulting prognostic genes were evaluated.

Genetic mutation landscape of HCC in the discovery population
Patient clinical characteristics are presented in Table 1. Briefly, all subjects were males with an average age of 48.33 ± 9.75 years. All patients had a history of HBV infection. The sample collection, sequencing, and data analysis are shown in Figure 1A. To identify the landscape of genetic mutations of HCC patients with HBV infection, WES was performed in 36 pairs of HCC samples; a total of 4231 somatic SNVs, 192 somatic indels, and 12 somatic CNVs were detected. The mean number of SNVs, indels, and CNVs in each patient was 117.5, 5.3, and 0.3, respectively ( Figure 1B). According to the frequency of sample mutations, 8 genes (TP53, MUC16, CTNNB1, TTN, ARID1A, PCLO, NBPF10, and CSMD1) were identified with a high mutation frequency (> 15%) in HCCs ( Figure 1C). TP53 is the most frequently occurring gene (61.1%). Compared to the most frequently mutated genes in HCC from the TCGA database, NBPF10 and CSMD1 were highly mutated genes in HCCs in the Chinese population. Furthermore, KEGG pathway analysis (P < 0.05, Figure  1D) identified representative HCC proteins and showed that the calcium signaling pathway, axon guidance, circadian entrainment, and nicotine addiction pathways were significantly enriched in HCC samples.

Prognosis-related alterations in the discovery population
To evaluate the prognostic impact of mutated genes HCC patients were divided into three groups: recurrence within 6 months after undergoing LT (group A, n = 10), non-recurrence for > 1 year after undergoing LT (group B, n = 10), and recurrence > 1 year after undergoing LT (group C, n = 8). The top 10 most frequently mutated genes of each group (A, B, and C) are presented in Figure 2A-2C. Compared to patients without tumor recurrence > 1 year after LT (Group B), patients with tumor recurrence within 6 months had significantly different gene landscapes. Only TP53 and CTNNB1 genes were highly and frequently mutated between the two groups. In addition, the types of variations were complicated in patients with early recurrence. These results demonstrated that significant genetic heterogeneity affected HCC prognosis.
To further identify the representative genes associated with HCC prognosis, alterations in the five most frequently mutated genes (TP53, CTNNB1, KMT2C, ARID1A, and PCLO) were detected between groups A and B. CTNNB1, KMT2C, ARID1A, and PCLO were significantly associated with HCC prognosis (likelihood ratio P = 0.018 for CTNNB1 and likelihood ratio P = 0.03 for KMT2C, ARID1A, and PCLO; Table 2). Furthermore, we also compared the expression of these five genes between tumor samples (n = 39) and nontumor samples (n = 15) based on the expression dataset GSE65372. These findings revealed that KMT2C and ARID1A were significantly and differentially expressed between the two groups (adjusted P-values = 0.046 and 0.039, respectively, Table 2). Therefore, it could be deduced that KMT2C and ARID1A are significantly associated with HCC prognosis.

New HCC driver genes identified in the discovery population
Herein, we attempted to discover new driver genes of HCC in patients with early and late tumor recurrence, using the CHASM software to identify the driver genes in HCC. Finally, except for the known driver genes of HCC, five new driver genes of HCC, including MAP4K3, COX5B, ACTN3, CFTR, and LRRC7, were associated with the early recurrence of HCC (recurrence within 6 months) and 2 genes, PRKCG and SENP3, were considered new driver genes associated with the late recurrence of HCC (recurrence after 1 year) (P < 0.005, false discovery rate (FDR) ≤ 0.1, Table 3). These results showed that the driver genes differed significantly between patients with early and late tumor recurrence.
ARID1A, CSMD1, and SENP3 are associated with poor prognosis in HCC from TCGA database As described above WES and comparative analysis identified 11 novel genes (KMT2C, ARID1A, NBPF10, CSMD1, MAP4K3, COX5B, ACTN3, CFTR, LRRC7, PRKCG, and SENP3) that were significantly associated with HCC in our Chinese population. To validate these AGING AGING findings, the expression (Supplementary Figure 1) and survival (Supplementary Figure 2) patterns of these genes were observed in the TCGA database. Compared to the control samples, three genes (ARID1A, CAMD1, and SENP3) showed significantly high expression in HCC patients (P < 0.05, Supplementary Figure 1B, 1E, 1J) and were associated with poor prognosis (P < 0.05, Supplementary Figure 2B, 2E, 2J). According to the TCGA database, the data of KMT2C expression were insufficient, while the remaining seven genes did not show any association with HCC prognosis (P > 0.05, Supplementary Figure 2A, 2C, 2D, 2F-2I).

ARID1A, CSMD1, and SENP3 are effective prognostic biomarkers for HBV-infected HCC patients in the replication population
Herein, we focused on ARID1A, CSMD1, and SENP3 during tissue microarray analysis of 180 HCC samples paired with tumor and paracarcinoma tissues as a AGING replication population. These three genes showed significantly high expression in HCC patients in the TCGA database. The clinical characteristics of these patients are presented in Table 4. The IHC analysis of ARID1A, CSMD1, and SENP3 proteins showed that the expression level of all the three genes was significantly different between tumor and paracarcinoma tissues (P < 0.05, Figure 3G-3I). The ARID1A expression was significantly increased in tumor tissues (P < 0.05, Figure 3A, 3B) consistent with the TCGA database. Surprisingly, the expression of CSMD1 and SENP3 was significantly decreased in tumor tissues (P-value < 0.05, Figure 3C, 3D for CSMD1, and Figure 3E, 3F for SENP3).
Furthermore, ROC curve analysis assessed the prognostic power of the three genes based on their expression levels: intensity × positive expression rate. Only SENP3 had significant diagnostic power (AUC = 0.609, P = 0.013, Table 5 and Figure 4), while ARID1A and CSMD1 showed weak diagnostic power (AUC = 0.489 for ARID1A and 0.573 for CSMD1; P > 0.05; Table 5 and Figure 4). Cox regression analysis evaluated the prognostic power of the three genes ( Table 6) and showed that SENP3 as a single gene, the combination of two genes (ARID1A and SENP3; CSMD1 and SENP3), and the combination of three genes (ARID1A, CSMD1, and SENP3) could be used as prognostic biomarkers for HCC (P < 0.05, Table 6 and Figure 5). Among these, the combination of CSMD1 and SENP3 genes was the optimal prognostic biomarker for HCC in our replication population (P = 0.006, Table   6 and Figure 5). Moreover, the expression level of SENP3 ≤ 1.2 and CSMD1 < 1.5 was significantly associated with poor HCC prognosis. Based on these results, ARID1A, CSMD1, and SENP3, especially the combination of CSMD1 and SENP3, are effective prognostic biomarkers for HCC individuals in the Chinese population.

CSMD1 prevented HCC by suppressing cell invasion
Finally, to verify the function of ARID1A, SENP3, and CSMD1 genes, we performed cell proliferation and invasion assays in the hepatic Hep3B cell line. In the EdU assay, after the suppression of ARID1A, SENP3, and CSMD1 genes by siRNAs, respectively, for 48 h, the cell proliferation rate was slightly increased by suppression of CSMD1 ( Figure 6A, 6D, P > 0.05), slightly reduced by suppression of SENP3 ( Figure 6A, 6C, P > 0.05), and almost unaltered by suppression of ARID1A ( Figure 6A, 6B, P > 0.05). On the other hand, transwell invasion assay revealed that the cell invasion rate was unaltered by the suppression of SENP3 ( Figure 7A, 7C, P > 0.05) but significantly increased by the suppression of CSMD1 ( Figure 7A, 7D, P < 0.05), which was consistent with the results of tissue microarray analysis. Also, the suppression of ARID1A elevated the cell invasion rate ( Figure 7A, 7B, P < 0.05), which was inconsistent with the results of tissue microarray analysis and TCGA database. Therefore, based on these functional assays, we deduced that CSMD1 prevented HCC by suppressing cell invasion.

DISCUSSION
Recent studies have focused on the genetic characteristics of HCC and several diagnostic and prognostic biomarkers, including genes, non-coding RNAs, and specific factors, such as telomere length [10][11][12]. Since the molecular characteristics of HCC are not clearly understood, the present study analyzed the genetic landscape of HCC cases with HBV infection in the Chinese population. This study identified several novel candidate and driver genes associated with HCC, and provided evidence that these genes are significantly related to the prognosis of early-recurrence and laterecurrence HCC using the GEO and TCGA databases.
We also determined that ARID1A, CSMD1, and SENP3 (especially CSMD1 and SENP3 combined) genes are effective prognostic biomarkers for HCC in an independent replication population.
Compared to a variety of carcinomas, such as lung carcinomas, a high level of characteristic molecular heterogeneity exists in HCC that could be attributed to complex genetic and epigenetic factors [13,14]. Based on the current results, the abundance and diversity of variants are observed in HCC patients (the mean number of SNVs, indels, and CNVs was 117.5, 5.3, and 0.3, respectively). Among these gene mutations, very few occurred frequently (TP53, MUC16, CTNNB1,  TTN, ARID1A, PCLO, NBPF10, and CSMD1), which might explain the relatively poor prognosis and insufficient effective target-drug treatment for HCC. TP53 is the most frequently mutated gene in cancer with tumor suppressor functions [15]. Moreover, NBPF10 and CSMD1 are novel and the most frequently mutated genes in HCC individuals with HBV infection, indicating that chronic liver injury, such as HBV infection, affects molecular heterogeneity by interacting with host DNA.
Driver mutations and genes in cancers confer a selective advantage, which differs from the coexistence of passenger mutations in successfully expanded clonal cell lines [16,17]. Thus, identifying new driver genes is one of the greatest challenges in cancer genetics [18]. Critical driver mutations and genes that contribute to the understanding of molecular pathogenesis of HCC have been identified previously [19]. Notably, many new driver genes were associated with the early (within 6 months) and late recurrence of HCCs (recurrence after 1 year) (P-value < 0.005, FDR ≤ 0.1, Table 3) in those with HCC and HBV infection, suggesting that HBVderived processes are associated with specific mutational signatures.
ARID1A has been reported in HCC in Asians [20]. Sun et al. speculated that ARID1A exerts a tumor-suppressive  (in progression and metastasis) and oncogenic (in primary tumors) role in HCC [21]. Herein, we also observed that high ARID1A expression was significantly associated with poor HCC prognosis. Surprisingly, the expression of CSMD1 and SENP3 was significantly reduced in tumor tissues (P-value < 0.05, Figure 3C, 3D for CSMD1 and Figure 3E, 3F for SENP3), which is opposite to the information provided by the TCGA database. Typically, CSMD1 is a tumor suppressor gene that encodes CUB and sushi domain-containing protein-1 (CSMD1). Zhu et al. [22] observed that decreased protein expression of CSMD1 significantly promoted HCC cell proliferation, migration, and invasion, suggesting its functional role as a tumor suppressor gene in HCC. Furthermore, we identified that CSMD1 prevents HCC by suppressing the cell invasion in vitro. Moreover, SENP3 plays a critical role in increasing the stability of tumor suppressor P53 protein by attenuating Mdm2-mediated p53 ubiquitination and degradation [23]. Furthermore, SENP3 gene-encoded stress-sensitive SUMO-2/3specific peptidase contributes to a host defense mechanism by restoring host protein translation and suppressing HBV gene expression in HBV infection [24]. Therefore, it could be hypothesized that the reduced expression of CSMD1 and SENP3 proteins is associated with HCC recurrence, progression, and poor prognosis. These findings need to be substantiated using large samples in future studies.

AGING
Nevertheless, the present study has several limitations. First, the number of sequenced samples was small (36 HCC samples), necessitating the sequencing of additional samples to establish a comprehensive genomic landscape for HCC patients with HBV infection in the Chinese population. Furthermore, we found that the effect of SENP3 and ARID1A genes was inconsistent in different assays. Therefore, additional functional studies are required to verify the molecular mechanisms underlying SENP3 and ARID1A genes in HCC.

Clinical samples
For the discovery population, a total of 36 HCC samples were collected from patients who underwent liver transplant (LT) in our center between 2017 and 2018. All liver grafts were voluntarily donated after death, and informed consent was obtained from all recipients before LT. Tumor tissues obtained from each patient were fixed with 4% paraformaldehyde and embedded in paraffin. Additionally, 10 mL peripheral blood (PB) was collected from each case-matched patient for paired-analysis, as described previously [25,26]. For the replication cohort, a tissue microarray of 180 HCC samples with HBV infection paired with tumor and paracarcinoma tissues (HLivH180Su08 and HLivH180Su15, Shanghai Outdo Biotech Co. Ltd, Shanghai, China) was performed for immunohistochemical (IHC) analysis. HCC samples were diagnosed by histological analysis.

RNAi, cell culture, and transfection
Hep3B cell line was purchased from ATCC (American Type Culture Collection, Rockville, MD, USA) and cultured in RPMI-1640 medium supplemented with 10% fetal bovine serum (FBS) at 37° C in a humidified incubator with 5% CO2. ARID1A, CSMD1, and SENP3 shRNAs and siRNC were designed and synthesized by Guangzhou RioboBio (Guangzhou, Guangdong, China). The target sequences are listed in Table 7. For transfection, the cells were seeded at a density of 1×10 5 cells/well for 24 h. Then, the cells were transfected with siRNA fragments and negative control siRNAs.

Whole-exome sequencing (WES) and discovery population analysis
Genomic DNA was extracted from tumor tissues and matched to the PB samples. Genomic DNA libraries were prepared using the protocols provided by Illumina HiSeq2000 platform (Genetron Health Co., Ltd, China  [27]. The tumor tissues and normal PB sequencing reads were compared to identify germline single nucleotide variants (SNVs)/insertions-deletions (indels) and somatic SNVs/indels/copy number variants (CNVs)/structural variations (SVs).

GEO dataset retrieval for HCC analysis
The HCC expression dataset GSE65372 [28] was downloaded from GEO (Gene Expression Omnibus) [29] and analyzed using the GEO2R online analyzer. Next, we identified DEGs and compared the cases (tumor samples, n = 39) and controls (non-tumor samples, n = 15) using the GEO2R online analyzer. Adjusted P-values < 0.05 were considered significant, using the Benjamini-Hochberg procedure.

Identification of new driver genes
In a recent study, we used CHASM software (https://wiki.chasmsoftware.org/) to identify the driver genes in HCC, as described previously [30]. Briefly, CHASM predicts the functional significance of somatic missense mutations, using a Random Forest classifier trained with 49 predictive features. This method classifies the predictive data and provides the corresponding scores. Furthermore, the CHASM method is used to test the hypothesis between the scores and the passenger genes in the random forest training set; finally, the P-value is adjusted by Benjamini-Hochberg correction.

TCGA dataset retrieval for HCC analysis
Gene expression and survival data of somatic mutations were downloaded from the TCGA database [31] and analyzed using the UALCAN analyzer (Analyze, Integrate, Discover; http://ualcan.path.uab.edu/index. html). DEGs and survival patterns with P-values < 0.05 between HCC patients and controls in the TCGA database were identified using UALCAN.

IHC analysis for the replication population
The tissue microarray (HLivH180Su08 and HLivH180Su15, Shanghai Outdo Biotech Co. Ltd, China) was incubated in a dry oven at 63° C for 1 h. Subsequent IHC analysis was performed, as described previously [32]. Briefly, the slides were incubated with AGING Table 7. siRNA sequences for ARID1A, SENP3, and CSMD1.

EdU assay
Hep3B cells (ATCC ® HB-8064) were cultured in a 6well plate and treated with 100 μL media containing 20 μM EdU. After continuous incubation at 37° C with 5% CO2 for 24 h and 48 h, the cells were fixed with 4% paraformaldehyde for 15 min and incubated with 0.5% Triton X-100 in phosphate-buffered saline (PBS) for 10-15 min. Fluorescence microscopy was employed to acquire and analyze the images. All the experiments were performed at least three times, and the data were represented as mean ± standard deviation (SD).

Transwell invasion assay
Transwell assays were performed using polyethylene terephthalate-based migration chambers and BD BioCoat Matrigel Invasion Chambers (Becton Dickinson Labware, USA). Hep3B cells were seeded on Matrigel-coated transwell inserts with 200 μL of serumfree medium. The lower chamber was filled with 500 μL medium containing 10% FBS. After incubation for 24 h, the cells remaining on the upper surface of transwell inserts were wiped with cotton wool. The invaded cells were stained with crystal violet for 10 min. Images were captured, and the cell number was counted.
All the experiments were performed at least three times, and the data are expressed as mean ± SD.

Statistical analysis
Statistical analysis was performed using SPSS version 17.0 software package (SPSS Inc., Chicago, IL, USA) and GraphPad Prism 7.0 (GraphPad Software, San Diego, CA, USA). Clinical data were presented as mean ± SD. A paired-samples t-test was performed to test the difference between tumor and paracarcinoma tissues for the replication population. Independent samples t-test was performed to test the cell proliferation and invasion rates between siRNA (siARID1A, siSENP3, and siCSMD1) and siRNC groups. Cox regression analysis was used to evaluate the prognostic value adjusted by patient sex, age, and AJCC stage (version VII). We divided the samples into two groups based on the expression level (intensity times positive expression rate) and compared the accumulated survival rate between the two groups for each gene and combination of two or three genes. A receiver operating characteristic (ROC) curve analysis was performed, and the area under the ROC curve (AUC) was calculated to evaluate the prognostic power. The Kaplan-Meier method was used to assess the survival of patients and compared using the log-rank test. P-value < 0.05 indicated statistical significance.

Ethics approval
The clinical HCC samples complied with the Declaration of Helsinki 1975, revised in 2008. This study was approved by the appropriate local institutional review boards on human research at the Huazhong University of Science and Technology (IRB Number: S104). Written consent was obtained from all subjects before participation in the study.

Availability of data and materials
The data have been deposited with links to BioProject accession number PRJNA607376 in the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/ bioproject/).

AUTHOR CONTRIBUTIONS
Zhao YY: data curation, formal analysis, investigation, writing -original draft preparation. Yang B: formal analysis, writing -original draft preparation. Chen D: investigation, software. Zhou XJ: data curation. Wang MX: data curation and validation. Tan RM: data curation, project administration. Wang GY: data curation. Yang HF: data curation. Wang JZ: data curation. Jiang JP: project administration, resources. Wei L: writing -review and editing, conceptualization. Chen ZS: writing -review and editing, conceptualization, resources.