Mining prognostic markers of Asian hepatocellular carcinoma patients based on the apoptosis-related genes

Apoptosis-related genes(Args)play an essential role in the occurrence and progression of hepatocellular carcinoma(HCC). However, few studies have focused on the prognostic significance of Args in HCC. In the study, we aim to explore an efficient prognostic model of Asian HCC patients based on the Args. We downloaded mRNA expression profiles and corresponding clinical data of Asian HCC patients from The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) databases. The Args were collected from Deathbase, a database related to cell death, combined with the research results of GeneCards、National Center for Biotechnology Information (NCBI) databases and a lot of literature. We used Wilcoxon-test and univariate Cox analysis to screen the differential expressed genes (DEGs) and the prognostic related genes (PRGs) of HCC. The intersection genes of DEGs and PGGs were seen as crucial Args of HCC. The prognostic model of Asian HCC patients was constructed by least absolute shrinkage and selection operator (lasso)- proportional hazards model (Cox) regression analysis. Kaplan-Meier curve, Principal Component Analysis (PCA) analysis, t-distributed Stochastic Neighbor Embedding (t-SNE) analysis, risk score curve, receiver operating characteristic (ROC) curve, and the HCC data of ICGC database and the data of Asian HCC patients of Kaplan-Meier plotter database were used to verify the model. A total of 20 of 56 Args were differentially expressed between HCC and adjacent normal tissues (p < 0.05). Univariate Cox regression analysis showed that 10 of 56 Args were associated with survival time and survival status of HCC patients (p < 0.05). There are seven overlapping genes of these 20 and 10 genes, including BAK1, BAX, BNIP3, CRADD, CSE1L, FAS, and SH3GLB1. Through Lasso-Cox analysis, an HCC prognostic model composed of BAK1, BNIP3, CSE1L, and FAS was constructed. Kaplan-Meier curve, PCA, t-SNE analysis, risk score curve, ROC curve, and secondary verification of ICGC database and Kaplan-Meier plotter database all support the reliability of the model. Lasso-Cox regression analysis identified a 4-gene prognostic model, which integrates clinical and gene expression and has a good effect. The expression of Args is related to the prognosis of HCC patients, but the specific mechanism remains to be further verified.


Background
At present, liver cancer has become one of the common causes of malignancy-related death globally [1], and its incidence is increasing year by year [2]. The World Health Organization (WHO) predicts that more than 1 million people will die of liver cancer by 2030 [3]. The incidence of liver cancer has remained high in Asia, as well [4]. HCC, as the main histological subtype of liver cancer, accounts for 90% of primary liver cancer. HCC is usually caused by various risk factors, including hepatitis B virus, hepatitis C virus, fatty liver, alcohol-related cirrhosis, smoking, obesity, different dietary exposures, etc. [5,6]. Because HCC is a highly heterogeneous disease [7,8], it is difficult to predict the prognosis. HCC is highly prevalent in Asia, and it is challenging to predict prognosis. Therefore, it is necessary to develop a useful prognostic model for Asian HCC patients.
Apoptosis is a way of programmed cell death related to the changes in cell morphology and structure [9]. Apoptosis mainly includes exogenous pathways caused by death receptors on the cell surface and endogenous apoptotic pathways caused by drugs, chemicals, endoplasmic reticulum stress, perforin, and granzyme [10]. Effector caspases activated by apoptosis signals will destroy the inhibitor of apoptosis, start the activity of caspase-activated deoxyribonuclease (CAD), and then destroy the structure of DNA, inhibit the activity of proteins regulating cell structure, destroy cell structure, transform cells into apoptotic bodies [11]. Apoptosis is closely related to HCC, the defect of apoptosis-inducing pathways will lead to the abnormal proliferation of tumor cells, and the resistance of cells to apoptosis will also increase the ability of tumor cells to evade immune system surveillance [12]. Args are closely related to the progression of HCC. Therefore, we intended to use the data of Asian HCC patients in TCGA database to construct a poly-genes prognostic model of Args and verify it by the data of ICGC database and Kaplan-Meier plotter database.
In the study, we downloaded the mRNA expression profiles and corresponding clinical data of Asian HCC patients from TCGA database. Japanese HCC patients' data from ICGC and Asian HCC patients' data from Kaplan-Meier plotter database were used for validation. Through Wilcoxon-test, we found the Args which were differentially expressed between HCC and adjacent normal tissues. Univariate Cox regression analysis showed some Args were associated with the survival time and survival status of HCC patients. By taking the intersection, we found that some Args were not only differentially expressed but also correlated with the prognosis of HCC. On this basis, we used Lasso-Cox analysis to mine a HCC prognostic model. Next, we used Kaplan-Meier curve, PCA, t-SNE analysis, risk score curve, ROC curve, and the HCC data from ICGC database and Kaplan-Meier plotter database to verify the prognosis effect of the model. The overall analysis flow was shown in Fig. 1.

Data preparation and processing
As of August 31, 2020, we downloaded the RNA-seq data and corresponding clinical characteristics of 158 Asian HCC patients from TCGA database (https:// portal.gdc.cancer.gov/), including 6 normal paracancerous samples and 160 tumor samples, and the RNA-seq data and corresponding clinical information of 231 HCC patients from ICGC database (https://dcc.icgc/). The data of the 231 HCC patients downloaded from ICGC database are mainly from Japan [13]. The data of 155 Asian HCC patients from Kaplan-Meier plotter database (http://kmplot.com/analysis/) was selected as well [14].
To obtain the gene expression and clinical traits, we used R package limma [15] to process and normalize the data. We also downloaded Args from Deathbase dataset (http://deathbase.org/) and referred to the research results of GeneCards database (https://www.genecards. org), NCBI database (https://www.ncbi.nlm.nih.gov/), literature.
Identification of prognostic apoptosis-related DEGs in the TCGA dataset The expression of Args was screened from the TCGA expression matrix. The average value of repetitive genes was taken; the abnormal values were deleted. Through Wilcoxon-test, the genes with p < 0.05 were selected as DEGs. We used Univariate Cox analysis to combine the expression of Args with survival time and survival status of HCC patients to screen out PRGs. The screening condition was p < 0.05. The DEGs and PRGs were intersected to select the Args, which were not only differentially expressed between HCC tissues and adjacent normal tissues but also related to the prognosis of HCC. The protein-protein interaction network was drawn by STRING database (version11.0) (https://stringdb.org/) to clarify the interaction of the proteins, and the expression correlation coefficient diagram was drawn by R package igraph to show the expression relationship of the genes. R package pheatmap was used to illustrate the expression heatmap to show the expression of the intersect ARGs. The correlation filtering threshold is 0.2. R package survival was used to clarify the relationship between the genes and the prognosis of HCC patients. The screening condition was p < 0.05, as well.

Construction an Args related prognostic model of Asian HCC patients
The Lasso-penalized Cox regression analysis was applied to construct a prognostic model [16]. We used R package glmnet and lasso algorithm to contract variables, screen the genes that are positively related to the prognosis of HCC, and delete the genes with high correlation, to effectively avoid over-fitting of the prognostic model. The risk score of patients was calculated according to the expression of each gene and its corresponding regression coefficient. The formula was: Risk score = e sum (expression of each gene * corresponding coefficient) . Next, the HCC patients of TCGA and ICGC were divided into the high-risk and the low-risk groups according to the median cut-off risk score of the TCGA dataset.

Validation of the prognostic model
According to the gene expression of the prognostic model, R package Stats and Rtsne were used for PCA analysis and t-SNE analysis to reduce the dimensionality of data, to explore the distribution of the highrisk and low-risk samples. R package survival and survminer were used to draw the Kaplan-Meier curve to show the overall survival (OS) of different groups. We also illustrated the risk curves to show the relationship between OS and risk scores of patients in two groups. Univariate and multivariate prognostic analysis, ROC curve drawn by R package timeROC were used to determine whether the model can be regarded as a useful biomarker relating to the prognosis of Asian HCC patients. To rule out accidental error, we used the expression data of Japanese HCC patients downloaded from ICGC database for secondary detection. We also used the data of 155 Asian HCC patients of Kaplan-Meier plotter database to draw the Kaplan-Meier curve to show the correlation between the expression of BAK1, BNIP3, CSE1L, FAS and overall survival (OS) to verify the reality of this prognostic model further.

Correlation analysis with clinical traits
At first, we classified the clinical characteristics of HCC patients. Next, the model gene expression and risk score were compared with the clinical traits. Ttest was used to determine whether the gene expression and risk score were associated with the clinical characteristics of HCC patients. The screening condition was p < 0.05. And the relationship between OS and clinical traits (Stage, Gender, Age) were displayed by the Kaplan-Meier curves.

Statistical analysis
Wilcoxon-test was used to screen DEGs. Kaplan-Meier analysis was used to compare the differences in OS among different groups. Cox regression analysis of univariate and multivariate variables was performed to determine the independent prognostic factors. All statistical analyses were carried out with R software (3.6.2). P < 0.05 is considered to be statistically significant. All the data were calculated to the nearest tenthousandth.

Results
The results of data preparation and processing A total of 158 Asian HCC patients from the TCGA-LIHC dataset, 231 Japanese HCC patients from the ICGC (LIRI-JP) dataset, and 155 Asian HCC patients from Kaplan-Meier plotter database were selected. We synthesized the data of Deathbase dataset, GeneCards dataset, NCBI database, and related literature. A total of 56 Args were selected (Table 1). The results of the screening of prognostic apoptosisrelated DEGs We found that a total of 20 Args were differentially expressed between HCC tissues and adjacent normal tissues (p < 0.05) ( Table 2). Through univariate Cox regression analysis, we found that a total of 10 Args genes were related to the prognosis of HCC patients (p < 0.05) ( Table 3). The intersection of the DEGs and PGRs was taken, seven apoptosis-related prognostic DEGs were screened out (Fig. 2). There are BAK1, BAX, BNIP3, CRADD, CSE1L, FAS, and SH3GLB1. Using the confirmed protein interaction relationship of String database [80], a PPI protein interaction network was constructed (Fig. 3a). The results showed that BAK1, BAX have the most edges in the network. The R package igraph was used to draw the correlation network according to the coefficient of the expression of the above gene. Figure 3b showed that according to the expression, BAK1, BAX, and CRAD D might be the crucial genes of the network. Figure Figure 4 showed among the 7 intersect Args, BAK1, BAX, CSE1L, and SH3GLB1 were high-risk prognosis-related genes of HCC, BNIP3, CRADD, FAS were low-risk prognosis-related genes.
The results of the construction of the prognostic signature Lasso-Cox regression analysis was applied to establish a prognostic model based on the expression of the 7 genes. According to the penalty parameter (Lambda) in the model, we constructed a prognostic model of HCC patients consisting of 4 genes (Fig. 5). These genes are BAK1, BNIP3, CSE1L, and FAS (Table 4). We then calculated the risk score based on the expression of the 4 genes in the TCGA dataset and corresponding coefficient (coef) (Risk score = e (the expression of BAK1*0.4252 + the expression of BNIP3*-0.0237 + the expression of CSE1L*0.6321 + the expression of FAS*-0.1360) ) and divided the patients into high-and low-risk groups according to the median cutoff value of TCGA (risk score < = 3.5614 was low-risk, > 3.5614 was high-risk). The HCC samples of ICGC were also divided into the high-and low-risk groups according to the same median cut-off value.

The results of the validation of the 4-gene prognostic model
We had classified the Asian HCC patients of TCGA database into the high-and low-risk groups according to the median cut-off value (Fig. 6a). Figure 6b showed that in HCC patients of TCGA, the OS of high-risk patients was significantly lower than that of low-risk patients, suggesting the probability of premature death in high-risk patients was higher than that in low-risk patients. What is more, there are significant differences in gene expression. The expression of BAK1, CSE1L in the high-risk group is higher than that in the low-risk group, but BNIP3, FAS was down-regulated in the high-risk group (Fig.  6c). We used PCA analysis and t-SNE analysis for data dimensionality reduction to observe a significant difference between the high-and low-risk groups.
The results (Fig. 7a, b) showed that high-risk and low-risk groups of TCGA were a two-way distribution. Kaplan-Meier curve (Fig. 7c) showed that the OS of high-risk patients was significantly lower than that of patients with low risk at the same timing (p < 0.001). ROC curves evaluated the predictive performance of the risk score for OS, and the area under the curve (AUC) reached 0.854 at 1 year, 0.809 at 2 years, and 0.785 at 3 years (Fig. 7d). The AUC of the three timings were all higher than 0.700, suggesting that the prognostic model can be regarded as a qualified prognostic biomarker of Asian HCC patients. Then we conducted univariate and multivariate Cox regression analysis to determine whether the risk score model could be used as an independent prognostic factor of OS. In univariate Cox regression analysis, the risk score was significantly associated with the OS of the TCGA HCC patients(p < 0.001) (Fig. 8a). After adjusting for other interfering factors, multivariate Cox regression analysis showed that the risk score was still an independent predictor of OS (p < 0.001) (Fig. 8b). In order to exclude the contingency, we used the Japanese HCC samples from ICGC database for secondary detection. The HCC samples of the ICGC dataset were also categorized into high-and low-risk groups by the same median value (Fig. 9a). The result showed that most patients with ICGC were at low-risk. Figure 9b showed that OS in the high-risk group of ICGC was also lower than that in the lowrisk group. The expression levels of BAK1, CSE1L, BNIP3, and FAS were consistent with those in the TCGA patients (Fig. 9c). The results of PCA analysis and t-SNE analysis showed that the patients in the high-risk group and the low-risk group in the ICGC dataset also showed a two-way distribution (Fig. 10a,  b). The Kaplan-Meier curve showed that the OS of HCC patients in the ICGC dataset was lower than that in the low-risk group (p < 0.001) (Fig. 10c) at the same timing. Figure 10d displayed the AUC reached 0.760 at 1 year, 0.738 at 2 years, and 0.721 at 3 years. The AUC of the three-time points were all higher than 0.700 as well. We performed univariate and multivariate Cox regression analysis to determine whether the risk score model was also an independent prognostic factor of OS in HCC patients of ICGC. Through the analysis, we found that risk score could be regarded as an independent prognostic factor of the OS of IGCG HCC patients as well (p < 0.001) (Fig. 11a, b). Moreover, to verify the relationship between the prognosis model and different clinical traits, we divided the Japanese HCC samples from ICGC database into a high-risk group and low-risk group according to the risk value and drew the Kaplan-Meier curves based on the gender, age, and HCC stage (Fig. 12). The results showed that the HCC prognostic model could also be used as the prognosis of HCC with different age, gender, and stage. Meanwhile, because only used the Fig. 6 The validation of the reliability of the 4-gene signature by analyzing the risk score and the TCGA dataset. a The distribution and the median value of the risk scores. b The distributions of OS status, OS and risk score. c The expression of four prognosis-related ARGs between high-risk and low-risk group Japanese HCC dataset for validation to reduce the regional influence of HCC disease, the data of Kaplan Meier plotter database was used to verify again. We selected HCC as the research disease and 155 Asian patients as the research objective. The expression of 4 genes (BAK1, BNIP3, CSE1L, FAS) used to construct the predictive model were selected, and the Kaplan-Meier curves were drawn, respectively (Fig. 13). The results suggested that the prognostic model can be seen as an effective prognostic factor for Asian HCC patients.

The results of correlation analysis with clinical traits
The TCGA-BC-A10W and TCGA-ZP-A9CZ cases without HCC stage information were deleted. T-test was used to determine whether the expression of the genes and the risk score were associated with the clinical characteristics of HCC patients. Some clinical features of HCC, including age, sex, grade, and stage, were assessed for their probable correlation with BAK1, BNIP3, CSE1L, FAS, and risk score, as shown in Table 5. We selected the correlation, which satisfies p < 0.05 to draw the box plots. The results  showed that the expression of BNIP3 and FAS were significantly different between male and female (Fig. 13a). BAK1 and CSE1L were all significantly differentially expressed between Stage I-II and Stage III-IV patients (Fig. 13b-d). The values of risk score were significantly different between the Grade 1-2 and Grade 3-4 (p < 0.05) (Fig. 13e). The values of risk score were different between Stage I-II and Stage III-IV, as well. (p<0.001) (Fig. 14f). Correlation analysis of clinical traits showed that risk score was near related to the grade and stage of HCC. We also used the clinical character data of Japanese HCC patients in the ICGC database. The relationship between different clinical traits (Age, Gender, Stage) and OS was showed by Kaplan-Meier curves, respectively. The results ( Fig. 14) showed that the prognosis of HCC was Significant different with age, gender, and cancer stage in both high-risk and low-risk groups divided according to the median risk value of the prognostic model (P < 0.05). The result further demonstrated the prognostic model we constructed for Asian HCC patients is reliable.

Discussion
Large geographic disparities in incidence and mortality of HCC exist. Asia has the highest incidence of HCC in the world because of the presence of multiple risk factors, such as hepatitis B and the contamination of aflatoxin [81]. Another risk factor for HCC, chronic hepatitis C infection, in Asia is most significant in Japan, the only Asian country with more hepatitis C virus (HCV) than hepatitis B virus (HBV)related hepatocellular carcinoma. To eliminate the cells infected by viruses, the body will activate apoptosis, but to reduce the death of cancer cells, HCC cells will inhibit apoptosis. The balance of survival and apoptosis are closely related to the progression of HCC [12]. The etiology of HCC in Asia is quite different from that in Europe, the US, and the balance of survival and apoptosis is crucial for the progression of HCC. Therefore, it is necessary to study the cases of Asian HCC patients and find a useful Args related prognostic model based on the genes of Asian populations.
In the study, we systematically studied the expression of 56 Args and the effect of prognosis. And we constructed a new prognostic model based on apoptosis-related DEGs and verified it by the data of TCGA and ICGC databases. The HCC prognostic model consists of four Args (BAK1, BNIP3, CSE1L, and FAS). Pro-Apoptotic Protein BAK (BAK1) is a pro-apoptotic protein [21], which plays a vital role in the process of mitochondrial apoptosis. When receiving an apoptosis signal, BAK1 can change the permeability of mitochondrial outer membrane (MOM), release apoptotic factors, and activate effector caspases to realize apoptosis [82,83]. BNIP3, a pro-apoptotic member of the Bcl-2 family of apoptotic proteins [84], can overcome the inhabitation of apoptosis caused by BCL2. However, some studies have found that BNIP3 has an inhibitory effect on cancer [85]. BNIP3 can delay the progression of primary breast cancer by preventing the accumulation of dysfunctional mitochondria and reducing the resulting excess reactive oxygen species (ROS) [86]. CSE1L, the cellular apoptosis susceptibility protein, is highly expressed in various cancers [87]. It has been found that CSE1L plays an essential role in regulating apoptosis induced by chemotherapeutic drugs [88]. CSE1L can inhibit paclitaxel-induced apoptosis by affecting G2/M phase cell cycle arrest and microtubule aster formation induced by paclitaxel [88,89]. The Fas-antigen is a cell surface receptor that transduces apoptotic signals into cells [90]. Fas/FasL signaling pathway will promote cell apoptosis.
In the study, we found that BAK1, CSE1L, BNIP3, and Fas were all related to the prognosis of HCC, but the expression changes were different. BAK1 and CSE1L were up-regulated in HCC tissues, while BNIP3 and Fas were down-regulated. We recognized that it is not clear whether these genes affect the prognosis of HCC patients mainly by affecting cancer cell apoptosis because these genes affect the progress of HCC in many ways, not only apoptosis. Therefore, further studies on these four genes are needed.

Conclusion
We successfully constructed a novel apoptosis generelated prognostic model of accurately predicting the prognosis of Asian HCC patients, with higher risk scores demonstrating adverse prognosis. Kaplan-Meier curve, PCA analysis, t-SNE analysis, risk score curve, ROC curve, and the data of ICGC were used to verify   The correlation between the expression of FAS and gender. c The correlation between the expression of BAK1 and HCC stage. d The correlation between the expression of CSE1L and HCC stage. e The correlation between risk score and HCC grade. f The correlation between risk score and HCC stage the reliability of the model. We believed this model could act as a useful independent prognostic predictor for Asian HCC patients.
Additional file 1.  Fig. 14 The Kaplan-Meier curves of clinical traits. a Showing the OS of younger (age < =60) in the high-risk group and low-risk group. b Showing the OS of older (age > 60) in the high-risk group and low-risk group. c Showing the OS of women in the high-risk group and low-risk group. d Showing the OS of men in the high-risk group and low-risk group. e Showing the OS of stageI-II in the high-risk group and low-risk group. f Showing the OS of stageIII-IV in the high-risk group and low-risk group