Machine Learning for Building Immune Genetic Model in Hepatocellular Carcinoma Patients

Background Hepatocellular carcinoma (HCC) is the leading liver cancer with special immune microenvironment, which played vital roles in tumor relapse and poor drug responses. In this study, we aimed to explore the prognostic immune signatures in HCC and tried to construct an immune-risk model for patient evaluation. Methods RNA sequencing profiles of HCC patients were collected from the cancer genome Atlas (TCGA), international cancer genome consortium (ICGC), and gene expression omnibus (GEO) databases (GSE14520). Differentially expressed immune genes, derived from ImmPort database and MSigDB signaling pathway lists, between tumor and normal tissues were analyzed with Limma package in R environment. Univariate Cox regression was performed to find survival-related immune genes in TCGA dataset, and in further random forest algorithm analysis, significantly changed immune genes were used to generate a multivariate Cox model to calculate the corresponding immune-risk score. The model was examined in the other two datasets with recipient operation curve (ROC) and survival analysis. Risk effects of immune-risk score and clinical characteristics of patients were individually evaluated, and significant factors were then used to generate a nomogram. Results There were 52 downregulated and 259 upregulated immune genes between tumor and relatively normal tissues, and the final immune-risk model (based on SPP1, BRD8, NDRG1, KITLG, HSPA4, TRAF3, ITGAV and MAP4K2) can better differentiate patients into high and low immune-risk subpopulations, in which high score patients showed worse outcomes after resection (p < 0.05). The differentially enriched pathways between the two groups were mainly about cell proliferation and cytokine production, and calculated immune-risk score was also highly correlated with immune infiltration levels. The nomogram, constructed with immune-risk score and tumor stages, showed high accuracy and clinical benefits in prediction of 1-, 3- and 5-year overall survival, which is useful in clinical practice. Conclusion The immune-risk model, based on expression of SPP1, BRD8, NDRG1, KITLG, HSPA4, TRAF3, ITGAV, and MAP4K2, can better differentiate patients into high and low immune-risk groups. Combined nomogram, using immune-risk score and tumor stages, could make accurate prediction of 1-, 3- and 5-year survival in HCC patients.


Introduction
Hepatocellular carcinoma (HCC) is one of the most malignant tumors around the world, causing the second highest mortal rate with poor responses to therapies [1,2]. e immune microenvironment of tumors has been testified to play vital roles in tumor progression and relapse, which incurred development of immune therapies, such as application of checkpoint inhibitors and Car-T transfusion [3].
ough these drugs, approved by the federal government in several countries, demonstrated efficacy in tumor regression and prolonged overall survival, overall response was not satisfying in patients, which may be related to tumor mutation burden and immune infiltration levels [4]. Understanding immune microenvironment within HCC can better predict patients' survival, which can also be used to guide drug usage or treating strategies.
ere have been some studies investigating prognostic signatures of HCC; however, high-precision prognostic models based on immune-related genes were few in HCC [5,6]. In this study, we managed to construct an immunerisk model to differentiate HCC patients into high-and lowrisk subgroups for survival prediction. High score patients had worse prognosis after resection, and the significantly changed immune genes between HCC tissues and normal tissues were highly enriched in pathways of cell growth and tyrosine kinase inhibitor resistance. Combining immunerisk score and tumor stage, we constructed a nomogram with high precision, which can help therapists make comprehensive clinical evaluation of patients in practice.

Analysis of Differentially Expressed Immune Genes.
RNA expression levels of different genes from patients in TCGA dataset were analyzed between the hepatocellular carcinoma tissues and the para-tumor tissues, using Limma package in R environment [9]. e genes expressing more than 1-fold change with adjusted p value under 0.05 were considered significant after normalization and background correction. e package of Heatmap was used to create the heatmap of significantly up-and downregulated genes.

Gene Ontology and Gene Enrichment.
To better understand the functional roles of the significantly changed genes, the package of clusterProfiler was used to demonstrate the gene ontology and enriched pathways, including the cellular compartment, biological process, molecular function, and KEGG pathways (Kyoto Encyclopedia of Genes and Genomes, https://www.kegg.jp) [10]. e further gene set enrichment analysis (GSEA) between high and low immune-score patients was performed with GSEA 4.0.1 software in order to locate the related gene sets [11,12].

Construction of Immune Prognostic Models and
Nomogram.
e RNA sequencing data in TCGA LIHC dataset were used to find the differentially expressed immune genes, and through univariate Cox regression, survival-related immune genes (overall survival and progression free survival) (p < 0.05) were selected for risk evaluation.
e overlapped gene set (68 signatures) was chosen to find the most relevant prognostic genes, and random forest algorithm was used to compress the gene list. e random forest algorithm was performed to determine the important of immune-related genes, and the relative importance >0.2 was identified as the final signature. e  finally yielded 8 immune genes (SPP1, BRD8, NDRG1,  KITLG, HSPA4, TRAF3, ITGAV, and MAP4K2) were also evaluated for their variable relative importance and were further put into multivariate Cox analysis to generate the immune-risk model. ese analyses were performed with R package of randomForestSRC and algorithm of random-SurvivalForest. In construction of the model, median risk score (0.961) was deployed to assign patients into high and low immune-risk groups, and, correspondingly, prognostic value of the model was examined in the training set of TCGA and the testing sets of ICGC and GSE14520 (immune-risk score � 0.115 * expSPP1 + 0.263 * expBRD8 + 0.125 * expNDRG1 + 0.233 * KITLG + 0.195 * HSPA4 + 0.319 * TRAF3 − 0.186 * expITGAV + 0.333expMAP4K2). Immune-risk score and clinical characteristics of patients, such as age, gender, tumor grades, and stages, were then put into Cox model to evaluate corresponding risk effect. Immune-risk score and tumor stages were significantly correlated with survival of patients before or after adjustment, which were then used to create a nomogram for prognosis prediction in 1-, 3-, and 5-year follow-up. C-index of the nomogram was calculated with bootstrap of 1000 resamples, ranging from 0.5 to 1.0. Precision of the model was examined by calibration graphs, in which the alignment of both lines was a fact of good performance, reflecting the actual probability. Recipient operation curves (ROCs) were used to demonstrate its specificity and sensitivity in comparison with other factors. Decision curve analysis (DCA) was also deployed to evaluate the potential clinical benefits of the model. ROC and DCA were performed with R packages of rms, survcomp, and survivalROC. Kaplan-Meier (K-M) method was used to estimate the accumulative incidence of survival between different subpopulations in each dataset. e survival curves were drawn through R package of survival and compared with log-rank test.

Analysis of the Infiltrated Immune Cells.
e infiltration of six common immune cells was estimated through online database of TIMER (https://cistrome.shinyapps.io/timer/), which is an open tool with data access to 32 types of cancers in TCGA database [13,14]. Infiltration levels of B cells, CD4+ T cells, CD8+ T cells, neutrophils, dendritic cells, and macrophages were calculated correspondingly, and the correlation between immune-risk score and infiltrated cell types was examined by Pearson's correlation test. Evaluation of immune status's difference between high and low immune-risk patients was performed with single sample gene set enrichment analysis (ssGSEA), using gene sets in MSigDB pathway lists. Expression of well-known immune checkpoints was also compared between groups. p value under 0.05 was considered significant.

Differentially Expressed Immune Genes between HCC and
Para-Tumor Tissues. Work flow of this study was shown in Figure 1. We found 311 differentially expressed immune genes between HCC and para-tumor tissues (log2-transformed fold change> 1, p < 0.05). In the list of significantly changed genes, 259 genes were upregulated and 52 genes were downregulated (Figures 2(a) and 2(b)). Gene ontology of those significantly changed genes was mainly involved in signal transduction and cytokine regulation (Figures 2(c) and 2(e)). e highly enriched pathways were MAPK signaling pathway, Rap1 signaling pathway, Ras signaling pathway, EGFR tyrosine kinase inhibitor resistance, ErbB signaling pathway, and so on, which were well-known signalings for tumor growths and metastasis, indicating the protumor immune microenvironment (Figures 2(d) and 2(f )). We also analyzed the differentially expressed transcription factors (TFs) between normal and tumor tissues in HCC patients (Supplemental Figures 1(a)  and 1(b)). It turned out most TFs were upregulated in tumor tissues (108 upregulated, 9 downregulated).
Constructed immune-risk model with eight signatures can better stratify patients into subpopulations.
We performed univariate Cox analysis to locate survivalrelated genes in the training set from TCGA database. In association with overall survival (OS), we located 93 signatures, while in association with progression free survival (PFS), we located 116 signatures. e overlapped 68 signatures were used to build immune-risk model (Figure 3(a)). e regulating network between differentially expressed TFs and survival-related immune genes was also examined (Supplemental Figure 1(c)).
Enriched pathways in high immune-risk score patients demonstrated unique biological behaviors of HCC cells.
rough gene set enrichment analysis (GSEA), we tried to understand the differentially enriched signals between high and low immune-risk patients. e results showed that signatures, related to pathways of cell cycle, apoptosis, NOD like receptor signaling, Notch signaling, and VEGF signaling, were significantly enriched in high immune-risk patients, indicating the proliferative status of HCC cells ( Figure 7, Table 2).
HCC infiltrated immune cell populations were also analyzed to find whether the infiltration patterns were related to immune-risk score. It turned out infiltration of six types of immune cells (B cell, CD4+ T cell, CD8+ T cell, dendritic cell, macrophage, and neutrophil) were positively related to immune-risk score (p < 0.05). High immune-risk score was correlated to high infiltration of immune cells, which could also be extrapolated as exhaustive immune status (Figure 8). Of the six cell types, macrophages and neutrophils showed the highest relevance and significance. Also, using ssGSEA analysis to evaluate immune status between groups, we found the overall immune activity score of different signals was relatively higher in high immune-risk patients. Expression levels of immune checkpoints were also higher in high immune-risk patients (Supplemental Figure 3).
We further analyzed the survival difference between immune cell infiltration groups, and there was no significant       difference between high and low infiltration groups of six immune cell types (Figure 9(a)). However, after assigning patients according to immune-score and each cell type infiltration score, we found high immune-risk score was related to worse prognosis consistently, which was independent on immune cell infiltration levels (Figures 9(b)-9(g)). e patients with high-risk score and low B cell, CD4 + T cell, CD8 + T cell, or dendritic cell showed the worst OS, especially high-risk score combined with low B cell or CD 8+ T cell (Figures 9(b)-9(e)). However, HCC patients with high-risk score and high macrophage presented the worst OS, while the patients with low-risk score and low macrophage had the best OS (Figure 9(f )). is is perhaps because that tumor-associated macrophage can promote cancer-related malignancy. HCC patients with high-risk score had worse OS than those with low-risk score, regardless of neutrophil level (Figure 9(g)).

Combined Nomogram Can Better Guide Treating Strategies in Clinical Practice.
In consideration of all the survivalrelated factors, we generated a nomogram with tumor stages and immune-risk score to evaluate 1-, 3-and 5-year OS of HCC patients (Figure 10(a)). Calibration curves for the 3 models were also generated to show the consistence between estimation and actual probability (Figure 10(b)). e AUCs for the combined model in prediction of 1-, 3-, and 5-year OS were all over 0.7, which were better than the other two single-factor models (Figure 11(a)). DCAs of the three models also showed superiority of the combined model in prediction of 1-, 3-, and 5-year OS at various threshold probabilities (Figure 11(b)).

Discussion
In this study, we focused on the immune status of HCC to find the changed immune genes between HCC and relatively normal liver tissues, and through regression analysis, we further located eight survival-related signatures to construct an immune-risk score model for prognostic prediction. High score patients shared a worse outcome in comparison to the low-risk ones significantly, which was independent on immune infiltration levels, though high immune-risk score positively correlated with immune cell infiltration. e further constructed nomogram, using immune-risk score and tumor stages, could better predict the overall survival of patients in 1, 3, and 5 years than single factor, which is very useful for patient monitoring and instructive for choice of therapy in clinical practice. ough, many previous studies tried to find the differentially expressed prognostic biomarkers in HCC, immune-related models were seldom used to evaluate patients. However, with the development of immune therapy in cancer treatment, such as immune checkpoint blockade (ICB) and Car-T transfusion, immunerisk evaluation of cancer patients may better predict patients' survival and the following response to ICB or Car-T treatment.
In our study, MAPK signaling, Ras signaling, ErbB signaling, and EGFR tyrosine kinase inhibitor resistance pathways were enriched in high immune-risk patients, specifying the highly proliferative status of cancer cells in high immune-risk patients. Immune microenvironment of high immune-risk patients was more in favor of tumor growth. Also, though the immune-risk score was positively related to tumor infiltrating immune cells, high immunerisk patients tended to share a worse outcome after HCC resection, which may be due to exhaustive immune status. Former studies have shown that immune cell function depression was related to cancer progression, and we thought those high immune-risk patients could have better treating effect with ICB or Car-T therapy.
In the immune-risk model, secreted phosphoprotein 1 (SPP1 or OPN) is a widely studied signature in tumors, which is normally involved in the process of osteoclasts' attachment to mineralized bone matrix. It is elevated in tumors for progression and metastasis, and its alternatively spliced variants are related to many malignant traits in cancers, such as epithelial-mesenchymal plasticity, cancer cell stemness, chemoresistance, and radioresistance [15][16][17][18][19][20]. However, the variants of SPP1 due to aberrantly processing may cause autoimmune reactions in tumors, which can lead to better outcomes in partial patients [21,22]. In HCC, SPP1 is a validated prognostic biomarker, and it may induce chemoresistance through regulation of autophagy in HCC cells [23,24]. e next step of exploration shall focus on the involved regulatory pathways in order to find the potential drugs and inhibitors. Two of the signatures in our immune-risk model were related to NF-kB and TGF-b signal pathways. Heat shock protein family A (Hsp70) member 4 (HSPA4) was formerly reported to involve inflammation responses and could be used as a biomarker for early lung cancer diagnosis and glioma outcome evaluation [25,26]. It has been found extracellular HSP70 could activate ERK1/2, NF-kB, and proinflammatory genes transcription in lung cancer cell, and in breast cancer, tumor-educated B cells could target HSPA4 to promote lymph node metastasis through Src/NF-kB pathway [27,28]. Also, 2 cells in allergic disease could increase Hsp70, which was involved in pathogenesis [29]. N-myc downregulated 1 (NDRG1) was a tumor suppressor, which has been reported in various studies; it could attenuate NF-kB and TGF-b pathways in pancreatic cancer cells, and downregulated NDRG1 was related to increased proliferation, invasion, and migration of digestive cancers [30][31][32][33]. It additionally involves apoptosis, glycolytic, and lipid metabolism in cancer cells, and virus infection process was also related to NDRG1 [34][35][36][37][38][39][40]. Another signature, mitogen-activated protein kinase 2 (MAP4K2), was involved in MAPK signaling pathway, which corroborated with the enrichment results. We thought these pathways regulated by the three signatures involved crucial proliferation and progression process of tumors, and they all influence HCC tumor immune microenvironment through unrevealed mechanisms.
In the immune-risk model, bromodomain containing 8 (BRD8) was an androgen receptor coactivator, while TNF receptor associated factor 3 (TRAF3) was also involved in functioning of a variety of receptors. e knowledge of BRD8 in immune regulation was rare; however, it was related to p53-dependent apoptosis and could be a chemosensitizing target in colorectal cancer [41][42][43]. TRAF3 has been found to interplay with toll like receptors and TNF receptors in lymphocytes, and previous studies have found TRAF3 could attenuate noncanonical NF-kB pathway, influencing B cell and T cell development and recruitment through chemokine regulation [44][45][46][47][48][49]. Kit ligan (KITLG or stem cell factor, SCF) is a cytokine, which has been testified to be expressed by both cancer cells and immune cells and related to tumor growth, metastasis, and stemness [50,51]. In HCC, KITLG has been found to be an independent prognostic factor, and it can bind to c-kit receptor expressed by various immune cell types, leading to pathological process of allergy [52][53][54][55]. High expressions of BRD8, TRAF3, and KITLG were related to hyperfunction of receptors in tumor or immune cell populations, which could in turn influence the survival of patients. We thought high expression of the three markers in HCC was related to exhaustive immune status, indicating highly communicative signals in tumor microenvironment, in which ICB treatment could yield more benefits.
Overall, the immune-risk model and the combined nomogram are of great value in clinical practice for prognostic prediction. e immune microenvironment difference between high and low immune-risk score patients may decide patients' responses to immune treatment, and understanding the immunogenetic changes or patterns of immune infiltration in HCC can optimize treating strategies and drug application [56].
ere are several limitations about our investigation. Firstly, expression analysis through bioinformatic methods still needs tissue sample confirmation in consideration of other potential confounding factors, such as ethic bias. Also, the immune status behind immunogenetic changes requires further exploration of infiltrated immune cells, which were simply estimated in our study.

Conclusion
e immune-risk model, based on expression of SPP1, BRD8, NDRG1, KITLG, HSPA4, TRAF3, ITGAV, and MAP4K2, can efficiently differentiate HCC patients into high and low immune-risk subpopulations, and in combination with tumor stages, the derived nomogram can precisely predict the 1-, 3-, and 5-year overall survival among HCC patients, providing a tool for prognostic prediction.

Additional Points
Summary. Hepatocellular carcinoma has been one of the most malignant tumors, conflicting a large number of patients worldwide with poor responses to drugs. e high recurrent rate after resection makes outcomes of patients even worse. ough new immune targeting drugs approved by federal government showed some light in treatment, the overall responses among patients are still unsatisfactory. As a result, the prognostic prediction of patients is very important in consideration of treating strategies and patient monitoring. Former studies tried to find the biomarkers in HCC for diagnosis and prognosis, the scope of immunogenetic changes have seldom been investigated. Our study managed to find the significantly changed immune signatures in HCC, and based on survival-related immune genes, we constructed an immune-risk model for prognostic prediction. e highrisk patients shared worse outcomes, and the enriched pathways demonstrated the protumor growth immune microenvironment. Also, the prognostic value of immunerisk score in HCC patients was independent on immune infiltration levels, while it did positively correlate with infiltrating immune cells, indicating high-risk patients may have better ICB therapy efficacy with immune exhaustive status. In combination with tumor stages, the nomogram showed high accuracy in prediction of 1-, 3-, and 5-year overall survival, which is valuable for clinical evaluation in HCC treatment.

Conflicts of Interest
e authors declare that they have no conflicts of interest.

Authors' Contributions
Jun Liu and Zheng Chen were both involved in analysis and interpretation of the study and drafted the manuscript equally. Wenli Li took part in the design and data collection process of the study and is responsible for the content of the manuscript. All authors read and approved the final manuscript.

Supplementary Materials
Supplemental Figure 1. Differentially expressed transcription factors in HCC tissues and the regulation network between transcription factors and immune signatures. Supplemental Figure 2. PFS difference between high and low immune-risk score patients in TCGA dataset and the respective prognostic values for 0.5-, 1-, 2-, 3-, and 5-year survival. Supplemental Figure 3. Comprehensive immune status evaluation of high and low immune-risk HCC patients. (Supplementary Materials)