Prediction of hepatocellular carcinoma prognosis based on expression of an immune-related gene set

Hepatocellular carcinoma (HCC) is a common type of malignant tumor with an extremely poor prognosis. Because many HCC patients are diagnosed with advanced disease, surgical treatment is typically not possible, and other currently available treatments are often ineffective. Immunotherapy is being explored as a new treatment method for a variety of cancers, including HCC. However, there have been no systematic reports about the relationship between immune-related genes and HCC patient prognosis. In this study, we established and verified a gene set-based model to examine the relationship between immune-related genes and prognosis in HCC patients. The model was based on a dataset from The Cancer Genome Atlas (TCGA), and its stability and reliability was confirmed in four verification datasets. In addition, we performed multivariate Cox regression analyses to identify the independent risk factors affecting HCC patient prognoses. We found that this new model based on immune-related genes was effective for predicting prognosis, evaluating disease state, and identifying treatment options for HCC patients.


INTRODUCTION
HCC, one of the most prevalent and life-threatening malignancies in the world, progresses rapidly and is difficult to treat. Most HCC patients are diagnosed with late-stage disease and present with distant metastases, portal vein tumor thrombus, and other morbid conditions, resulting in extremely poor prognoses [1][2][3]. Although chemotherapy, radiotherapy, and other treatment methods have modestly improved HCC patient survival rates in recent years, therapeutic outcomes are still largely unsatisfactory [4][5][6]. New treatment methods for advanced HCC are therefore needed to improve overall survival rates.
Immunotherapy has emerged as a promising potential treatment for a variety of cancers, including HCC [7,8]. Previous research revealed that reactivation of NK cells and their cytotoxic activity against tumor cells can enhance anti-HCC effects [9]. In addition, immunestimulating cytokines such as IFNG γ can inhibit HCC progression by inducing apoptosis or autophagy of HCC cells. However, due to inherent cancer variability, outcomes after immunotherapy are often unsatisfactory [10][11][12][13]. Increasing evidence indicates that expression of immune-related genes can be associated with tumor prognosis, and prognostic signatures based on these genes might help identify effective treatments for HCC patients [14]. The relationship between immune-related genes and prognosis therefore deserves further investigation.
The immune system plays an important role in the development and progression of HCC. The liver's immunosuppressive microenvironment allows it to tolerate numerous exogenous intestinal bacteria and antigens that arrive via the portal vein. However, the liver is also unable to attack malignant tumor cells as a consequence. Combined immunotherapies are therefore often necessary to alter this pro-tumor microenvironment. Studies have shown that inhibiting PD-1 promotes vascular normalization and anti-tumor immune response in HCC [10,15]. However, these studies involved small numbers of HCC patients, and the relationship between HCC immunotherapy and patient prognosis requires further investigation. A systematic characterization and analysis of the tumor immune microenvironment and its impact on prognosis is needed.
In this study, we integrated a multi-gene expression cohort of 903 cases to establish and validate an individualized immune-related gene set based on HCC prognostic signatures. Four independent datasets were also evaluated to verify the stability and reliability of our models. In addition, we performed a comprehensive analysis that incorporated clinical characteristic information to improve the accuracy of overall survival rate predictions.

Defining the single-sample immune gene set enrichment analysis
In total, 903 HCC patients from five datasets were included in the immune-based prognostic signature HCC (IPSHCC) analysis (Supplementary Table 1). In the training set, 1,810 genes representing 17 immune categories were identified; 196 of these genes representing 15 immune categories were related to overall survival. Those 196 genes were used as probes in single-sample gene set enrichment analysis (ssGSEA) of HCC patients to determine enrichment scores for each immune category ( Figure 1). IPSHCC was defined as the comprehensive influence of coefficients generated by a multivariate Cox regression model on scores of different categories ( Table 1). The median score of the training set patients (-0.0087) served as a cutoff value for dividing patients among low and high immunity risk groups in all datasets.  (Table 2). In the multivariate Cox model, even after controlling for age, stage, gender, and tumor invasion, immunity risk was still an independent prognostic factor (Figure 2A-2E). In an integrated analysis of all datasets, the probability of survival for the high immunity risk group was 2.6416 times lower than that of the low immunity risk group (HR = 2.6416; 95% CI: 2.053-3.399; p = 4.34 × 10 -14 ) ( Figure 3A). The distribution of IPSHCC with survival state in the composite dataset is shown in Figure 3B.

IPSHCC typing and sensitivity analysis
We analyzed the sensitivity of our model depending on age, gender, stage, and tumor invasion to examine its stability in different clinical subgroups. IPSHCC was significant for all subgroups (Supplementary Figure 1), suggesting that it may be independent of clinical characteristics. In addition, we identified 196 genes associated with immune processes, including antimicrobials (38.27%), cytokines (26.53%), and cytokine receptors (25.51%) ( Table 3). We calculated the immune score of the different subgroups for each immune process using the ssGSEA method. Patients with high immune scores had significantly longer median survival times for each process ( Figure 4A). To test the robustness of IPSHCC, we randomly resampled 500 cases 10,000 times from the consolidated datasets. P-values of all samples were less than 0.05 in each re-sampling instance ( Figure 4B and 4C). The median C-index value was 0.6819 and the standard deviation (SD) was 0.0091, demonstrating robust predictive ability.

Pathway enrichment analysis
Enrichment analysis for the 196 unique immune genes identified 69 related KEGG pathways (p < 0.05); for example, cytokines and cytokine receptors interacted with the MAPK, RAS, B/T-cell receptor, and PI3K/AKT signaling pathways. The PI3K/AKT signaling pathway regulates proliferation and survival of hepatoma cells, and abnormal activity in this pathway is associated with malignant transformation of hepatocytes, migration, adhesion, tumor angiogenesis, and degradation of the extracellular matrix. Tumor therapy strategies targeting the key molecules of the PI3K/AKT signaling pathway are currently being developed. In total, pathway analysis based on gene ontology identified 205 biological processes, 57 molecular functions, and 30 cellular constituent pathways representing a diverse spectrum of biological activities. AGING

Comparison with other prognostic signatures and clinical characteristics
After evaluating the accuracy and clinical consistency of IPSHCC modeling for predicting HCC, we calculated and compared continuous prediction scores according to five other disease prognostic signatures in different datasets using a univariate Cox model. Among 10 survival predictors, IPSHCC had the highest mean C-index (0.709) compared to age (0.526), stage (0.673), invasion (0.571), and gender (0.542) ( Table 4, Figure 5A). The p-value of the IPSHCC prediction score was also the lowest among the survival predictors (p = 9.22 × 10 -7 ) across datasets. (Table 4, Figure 5B).

Integrating IPSHCC and clinical characteristics
Besides IPSHCC, clinical characteristics such as age, gender, stage, and invasion were independent but complementary prognostic factors. To further enhance the predictive accuracy of IPSHCC, we integrated coefficients generated in the multivariate Cox regression model from the training set with IPSHCC (continuous score) as follows: (integrated model = 0.776924010 × IPSHCC + 0.004843653 × age + 0.625080315 × stage + 0.061769897 × gender -0.319739104 × invasion). We then validated the integrated model using the HCCDB18 verification set, for which complete clinical information was available. The continuous score of the integrated model was only Figure 1. Flowchart of the study. A total of 903 HCC patients from five separate datasets were included in the analysis. We developed the immune-based prognostic signature for HCC (IPSHCC) using the training dataset and validated it in five independent validation subsets. We also integrated IPSHCC with stage, invasion, age, and gender to improve its prognostic value.  Figure 6A and 6B).

DISCUSSION
Hepatocellular carcinoma (HCC) is the fourth leading cause (about 8.2%) of cancer-related death among men and women globally [16]. Despite the tremendous progress made recently in therapeutic strategies for HCC, outcomes for patients with advanced disease remain poor [2]. Identification of novel mechanisms of HCC progression and of effective targets is therefore crucial for improving HCC prognosis.
Changes to the tumor microenvironment are closely associated with alterations in the immune system, and immune checkpoint inhibitors might help counteract the immunosuppressive effects of the HCC microenvironment [21][22][23][24]. Immunotherapies such as PD-1 [25] and CAR-T cells [26] have already been used to treat advanced HCC, and additional immunotherapies with novel targets might prove even more effective.
For the first time, we systematically studied the relationship between immune-related genes and prognosis in HCC patients. We confirmed that our IPSHCC model, together with clinical characteristics such as age, stage, gender, and tumor invasion, are independent prognostic factors for HCC. The immunerelated genes included in the IPSHCC have been well characterized and have improved clinical adjunctive therapy and understanding of staging and progression in HCC [27,28]. Xu et al. found that gene signatures associated with prognosis and immune infiltration in the renal cell carcinoma microenvironment might aid in the identification of effective immunotherapies [29]. Additionally, Shen et al. developed a promising prognostic signature and a method for evaluating clinical immunotherapies based on an immune gene set in ovarian cancer [30]. IPSHCC might aid in the discovery of new targets for molecular immunotherapies similar to those that have already proven effective in previous basic research and clinical studies. AGING    Because IPSHCC is based on large sets of sequencing results obtained using different platforms and representing many patients, it generated more reliable prognostic predictions than other clinical characteristics. This allowed the classification of patients into different subgroups that might benefit from different personalized treatments based on their immune classification. However, IPSHCC only describes biomarkers; the biological mechanisms by which these biomarkers affect the development and progression of HCC require further investigation. Incorporation of additional patient follow-up data would also help improve the accuracy of this model for prognostic prediction.
In summary, our results demonstrate that IPSHCC is a promising model for predicting prognosis in HCC patients based on immune gene sets. Furthermore, this model might help identify novel therapeutic targets for advanced HCC.

Gene expression profiles and study objectives
HCC gene expression profiles were retrospectively collected from the following five datasets: one from the Cancer Genome Atlas-Liver Hepatocellular Carcinoma (TCGA-LIHC) cohort, three from Gene Expression Omnibus (GEO), and one from the Liver Cancer Institute (LIRI-JP cohort). All patients had undergone a primary surgical excision of the tumor, and all pertinent clinical information, such as follow-up time, survival state, and gene expression levels were available. Our main objectives of this study were to compare overall survival of HCC patients with different immune characteristics.

Processing of gene expression data
Only GEO cohorts with ≥ 80 subjects were used. Because a large number of samples in the TCGA-LIHC cohort had been characterized by RNA-seq gene expression (n = 308), we specifically examined microarray data from the Affymetrix Human Genome U133A array (n = 209) as described previously [29,31]. Finally, data from the different GEO and TCGA-LIHC platforms were grouped into five independent datasets (Supplementary Table 1). An Entrez ID was utilized to represent each gene. The independent verification phase was executed as described previously [30]. All gene expression probes were individually adjusted in each dataset. Gene expression levels were logarithmically transformed before adjusting the batch effect.

Identification of immune-related genes
We constructed a predictive signal gene set for immunerelated genes which was divided into 17 categories based on molecular function, such as antimicrobials, cytokines, interleukins, T-cell receptor signaling, B-cell receptor signaling, and TNF family receptors. The details of genes in each category have been reported previously [32].

Establishment of Immune-Based Prognostic Signature HCC (IPSHCC)
The immune-based prognostic signature HCC (IPSHCC) model was established as follows. We utilized the TCGA-LIHC cohort as the training set to screen genes associated with overall survival. The immune-related gene set included 1,810 genes, of which 1,356 were detected in the TCGA-LIHC cohort. We utilized the Cox proportional hazards model to evaluate the effect of each gene in combination with age, stage, invasion, and gender on overall survival. Genes with p-values greater than 0.05 were excluded. Next, we adopted single-sample gene set enrichment analysis (ssGSEA) to define an enrichment score representing the absolute enrichment of a gene in each sample of a given dataset as described previously [33]. Standardized enrichment scores were calculated for each immune category, and ssGSEA was conducted using the GSVA package for R. Finally, we established the IPSHCC model by combining the effect of every immune category in the training set. Multivariate Cox regression analysis was used to determine the coefficient of each category. Model: where Si is the ssGSEA score of i th immune category.

Verification of IPSHCC
For a unified cutoff value, we divided patients into high-and low-risk groups. Gene expression levels were standardized in each dataset (the average value is 0, SD is 1). IPSHCC prognostic scores obtained from the training dataset were further analyzed using four verification datasets. For the multivariate Cox regression, age, stage, gender, and tumor invasion were all covariates.

Pathway enrichment analysis
To further understand the function of the genes in IPSHCC, pathway enrichment analysis was conducted based on KEGG and GO databases as described previously [34]. Biological processes, molecular functions, and cellular constituents were included in the analysis. Multiple comparisons of p-values were made using the false discovery rate method, and all analysis was conducted in the R package cluster analysis program.

Comparison with existing prognostic signatures
We collected five public prognostic signatures for comparison, including three to nine genes, to explore the survival classifications and predictive ability of IPSHCC. Continuous prognostic scores were calculated for each signature. Differences in continuous score pvalues and population Cointegration statistics (C-index) from the univariate Cox model were compared in the five data sets.

Statistical analysis
All data are shown as mean ± standard deviation (SD). For survival analysis, the Cox proportional hazards model was used to evaluate the relationship between gene signature and overall survival. Kaplan-Meier survival curves were plotted for each subgroup and compared with the Log-Rank Test. The R packages AGING survival and survrm2 were used to estimate C-index values and mean survival rate curve limits, and the R package prelim was used to compare C-index values.

AUTHOR CONTRIBUTIONS
All of the authors worked collaboratively on the work presented here. HYT and GWZ defined the research theme and discussed analyses, interpretation, and presentation. HYT and DQ drafted the manuscript and analyzed the data. LJ and YX developed the algorithm, interpreted the results, and helped with statistical analysis and reference collection. ZQY and XMM collected data and helped draft the manuscript. All authors read and approved the final manuscript. AGING