TCGA and ESTIMATE data mining to identify potential prognostic biomarkers in HCC patients

Hepatocellular carcinoma (HCC) is an aggressive form of cancer characterized by a high recurrence rate following resection. Studies have implicated stromal and immune cells, which form part of the tumor microenvironment, as significant contributors to the poor prognoses of HCC patients. In the present study, we first downloaded gene expression datasets for HCC patients from The Cancer Genome Atlas database and categorized the patients into low and high stromal or immune score groups. By comparing those groups, we identified differentially expressed genes significantly associated with HCC prognosis. The Gene Ontology database was then used to perform functional enrichment analysis, and the STRING network database was used to construct protein-protein interaction networks. Our results show that most of the differentially expressed genes were involved in immune processes and responses and the plasma membrane. Those results were then validated using another a dataset from a HCC cohort in the Gene Expression Omnibus database and in 10 pairs of HCC tumor tissue and adjacent nontumor tissue. These findings enabled us to identify several tumor microenvironment-related genes that associate with HCC prognosis, and some those appear to have the potential to serve as HCC biomarkers.


INTRODUCTION
Hepatocellular carcinoma (HCC) is a prevalent malignant tumor and a leading cause of cancer-associated death globally [1]. More than 50% of the world's HCC occurs in China, and the incidence rate has been increased yearly [2]. The 5-year survival rate among advanced HCC patients is less than 5% due to the disease's high recurrence and metastasis rates [3]. The occurrence and development of HCC is a multifactorial, multistage process that involves the hepatocytes themselves as well as their microenvironment [4].
Previous studies have shown that the tumor microenvironment not only influences gene expression in HCC, but also the clinical outcomes of the patients [5][6][7]. Immune and stromal cells are vital nontumor components of the tumor microenvironment and can be used for diagnostic and prognostic evaluation. The immune score is a standard method for quantifying T cell and cytotoxic T cell density within a tumor microenvironment and provides data that are predictive of patient outcomes [8]. Immune and matrix scores are calculated using the ESTIMATE (Estimation of STromal and Immune cells in MAlignant Tumor tissues AGING using Expression data) algorithm. This method facilitates quantification of immune and matrix constituents within tumors [9], and there is increasing evidence that an inflammatory environment is predictive of clinical outcome [10][11][12][13].
To better understand the impact of stromal and immune cell-related genes on prognosis, in the present study we systematically analyzed tumor expression profiles and explored tumor microenvironment-related genes associated with a poor prognosis and their potential regulatory mechanisms. We initially analyzed HCC cohorts in The Cancer Genome Atlas database and used ESTIMATE immune scores to predicted genes that significantly affect outcomes in HCC patients. We then used the string database to perform functional annotation of these genes, after which we used qRT-PCR to verify expression of eight genes of interest in clinical samples and TCGA database. Finally, we performed a survival analysis to verify the impact of the eight genes of interest using a different HCC dataset from the Gene Expression Omnibus database (GSE14520).

Immune scores are significantly related to diseasefree survival in HCC
Gene expression profiles with pathologic diagnosis and clinical data from 319 HCC patients were retrieved from TCGA database. Based on the ESTIMATE algorithm, stromal scores ranged from -1,741.56 to 1,195.07, and immune scores ranged from -1,209.16 to 2,934.36 ( Figure 1A). The patients were then divided into high and low score (divided based on median score) groups to determine the potential correlation between disease-free survival and immune or stromal scores. Kaplan-Meier survival curves revealed that patients in the high immune score group had a longer median disease-free survival time than those in the low score group (p = 0.0081 in log-rank test). ( Figure 1C and 1D).

Correlation between gene expression and immune scores
We next evaluated the 319 HCC cases from TCGA database to assess the relationship between gene expression and immune scores. Using a fold change of |logFC |> 1 and significance threshold of P <0.05, we selected 1195 differentially expressed genes (DEGs), of which 1144 were overexpressed and 51 were underexpressed. A heat map of the DEGs is shown in Figure 1B. Each row represents one gene, and each column represents one sample. The samples are sorted from left to right based on the immune score; the blue group on the left are the samples from the high score group, while the pink group on the right are the samples from the low score group. The genes were ranked according to the P-value of the differential expression analysis from smaller to larger. Red indicates overexpression, while green indicates underexpression, and the darker the red or green color, the greater the difference in expression.
Functional enrichment analysis was performed to study the function of the DEGs. Gene Ontology (GO) analysis revealed that the DEGs were involved with "plasma membrane," "immune process," "immune response," and "signaling receptor" activities. In addition, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis showed significant involvement of the chemokine signaling pathway and cytokine-cytokine receptor interaction pathway. Figure 2 shows the top ten enrichment results for GO annotation and KEGG pathway enrichment, and Supplementary Tables 1-4 list the specific enrichment information.
We performed a survival analysis and generated Kaplan-Meier survival plots to examine the possible impact of each DEG on disease-free survival. Of the 1195 DEGs, 214 predicted poor or good disease-free survival in the log-rank test (Supplementary File 1 list the total 214 genes, Figure 3 list 8 of 214 DEGs) (p<0.05).

Protein-protein interactions network and functional enrichment analysis from genes of significant value
We used the STRING database to generate proteinprotein interaction (PPI) networks to explore the interactions among the 214 predictive DEGs. We detected 7 modules in a network that included 156 nodes and 1,049 edges ( Figure 4A). The two most significant modules were selected for subsequent analysis. In module 1 ( Figure  Functional enrichment clustering of these genes exhibited a strong association with immune responses. Top 10 GO terms included "signaling receptor activity," "immune response," "immune system process," and "cell periphery, extracellular region." In addition, KEGG pathway analysis showed all the pathways were AGING Figure 1. Immune scores and stromal scores are associated with HCC disease-free survival. (A) TCGA liver cancer expression profile data using ESTIMATE method to calculate immune score and matrix score. Box-plot shows that the level of Immune scores and stromal scores. (B) Heatmap of the DEGs of immune scores of top half (high score) vs. bottom half (low score). p<0.05, fold change >1). Genes with higher expression are shown in red, lower expression are shown in green, genes with same expression level are in black. (C) HCC cases were divided into two groups based on their immune scores. Median disease-free survival of the high score group is longer than low score group (log-rank test, p<0.05). (D) Similarly, HCC cases were divided into two groups based on their stromal scores. The median disease-free survival of the low score group is longer than the high score group (log-rank test p=0.43), however, it is not statistically different.  associated with immune responses (Figure 5A-5D). Significant GO terms identified include 5 for "molecular function," 12 for "cellular component" and 30 for "biological process." Supplementary Tables 5-8 lists the specific enrichment information.

GEO database and clinical sample validation
To determine whether the aforementioned genes are also of prognostic significance in other HCC cases, a gene expression dataset (GSE14520) from a different HCC cohort (221 cases) were downloaded and analyzed. Among the 214 predictive DEGs, 13 were confirmed to be significantly associated with clinical prognosis. Of those, eight genes (TNFSF8, CD3E, ITK, KLRD1, PRKCQ, TRAF3IP3, PHLDA2, C11orf21) were of particular interest because their differential expression had not been previously reported in HCC patients ( Figure 6). We used qRT-PCR to assess expression of these eight genes of interest in 10 pairs of fresh HCC and adjacent nontumor tissues. The results showed that levels of the transcripts of these eight genes were frequently higher (p<0.05) in the corresponding nontumor tissues than to the HCC tissues (Figure 7).

Correlation between expression of genes of interested and expression of immune checkpoint gene
Using TCGA datasets, we found that expression of the eight genes of interest correlated positively with the mRNA expression of the immune checkpoint gene PDCD1 (Figure 8) (p<0.05). Of the eight genes, the strongest correlation was between CD3E and PDCD1.

DISCUSSION
In the present study, we used a dataset from TCGA to identify genes related to the tumor microenvironment that influenced disease-free survival. By comparing between groups with high or low immune scores and  AGING performing a GO term functional analysis, we initially extracted 1195 differentially expressed genes involved in the tumor microenvironment. We then carried out a disease-free survival analysis, which revealed that 214 of the genes were related to poor disease outcomes in HCC patients. Finally, using a validation cohort from the GEO database (GSE14520), we detected 13 genes related to tumor microenvironment that correlated significantly with prognosis. Of those five (GZMA, CD79A, IGJ, CYP3A4, SPP1) are reportedly predictive of overall survival or are involved in HCC pathogenesis (Supplementary Table 9). The remaining eight genes, including TNFSF8, CD3E, ITK, KLRD1, PRKCQ, TRAF3IP3, PHLDA2, and C11orf21, have not previously been reported to impact the clinical prognosis of HCC patients, and may have the potential to serve as new biomarkers for HCC. There was also a strong relationship between these eight genes, especially CD3E, ITK and TRAF3IP3, and the immune checkpoint gene PDCD1, which suggests HCC patients showing high expression of CD3E, ITK and TRAF3IP3 may respond well to immunotherapy.
PPI network analysis revealed interleukin-7 receptor (IL7R) and killer cell lectin-like receptor K1 (KLRK1) to be highly interconnected nodes (Figure 4). IL7R plays a vital role in lymphocyte development. For example, IL7R-deficiency may contribute to severe AGING combined immunodeficiency (SCID). Moreover, it was recently reported that IL7R is associated with the risk of HCC [14][15][16]. KLRK1, also called NKG2D (NKG2-Dactivating NK receptor), binds noncovalently with the DAP10 signaling protein to deliver costimulatory or activating signals to T and NK cells. Sheppard et al. suggested NKG2D promotes tumor growth in a chronic inflammation model of HCC [17,18].
Statistics from several large-scale clinical trials of cancer patients demonstrate that the number, type, and region of infiltrating lymphocytes within tumor tissue are decisive for predicting clinical outcomes [8, [19][20][21]. Previous studies of colon cancer [22], ovarian cancer [23,24], lung cancer [25], and melanoma [26] showed that immune cell activation-related genes are upregulated in patients with good prognoses. Pages et al. [27] used immunological scoring methods to follow up the survival and recurrence in stage I and II colon cancer patients. They found that both overall and disease-free survival were significantly better in patients with higher immune scores than in those with lower immune scores. Patients with an immune score of 4 had more prolonged overall survival, and 95% of these patients had no tumor recurrence within 18 years after surgery. On the other hand, 50% of patients with an immune score of 0 had tumor recurrence within 2 years after surgery. In the present study, higher immune scores associated with better prognosis in HCC patients, which is consistent with that earlier report. These findings may offer a different perspective on the complex interaction between tumors and their immune environment in HCC. However, there remains a need for further studies on these genes, which we anticipate will provide additional new insight into the impact of the tumor microenvironment in HCC.

ESTIMATE (Estimation of STromal and Immune cells in MAlignant
Tumour tissues using Expression data) is an algorithmic tool. The detail algorithm can be seen at Supplementary File 3. Yoshihara and his colleagues [28] calculate stromal and immune scores to predict the level of infiltrating stromal and immune cells through single-sample gene set-enrichment analysis (ssGSEA), and these scores make up the basis of the ESTIMATE score to infer tumour purity in tumour tissue.

Clinical samples
Ten pairs of fresh HCC specimens and adjacent nontumor tissue were collected from Southern Medical University, Zhujiang Hospital (Guangzhou, China). The human specimens used for validation in this study were collected between May and June 2020. Use of these specimens was approved by the local ethics committee. The adjacent samples were taken at a distance of at least 5 cm from the tumor, and all tissues were examined histologically. None of the patients had received preoperative chemotherapy or radiotherapy.

RNA extraction and quantitative real-time polymerase chain reaction (qRT-PCR)
Total RNA was extracted from the tissue specimens using TRIzol (Invitrogen), and qRT-PCR was performed with SYBR Green Dye (Takara, Dalian, China) according to the manufacturer's instructions. The primer sequences are listed in Supplementary File 2.

Identification of DEGs and Heatmap and clustering analysis
Data were analyzed using R software (version 3.6.2) and its package Limma. |logFC|>1 and p < 0.05 were set as the cut-offs to screen for DEGs. Heatmaps and clustering were generated using the R software.

PPI network construction
The STRING network database (https://string-db.org/) was employed for construction of PPI (protein-protein interaction) networks using the Cytoscape app. We then selected specific networks with ten or more nodes for subsequent analysis, and the degree of connectivity for each network node was calculated. To detect densely connected regions, we used the Cytoscape plug-in clustering algorithm called Molecular COmplex DEtection (MCODE) to identify clusters based on their topological features.

Recurrence-free survival curve analysis
We carried out a Kaplan-Meier analysis using survival R package to assess the association between recurrence-free survival among patients and the levels of DEG expression. The constructed curves were compared using the log-rank test.

Enrichment analysis of DEGs
We used the STRING network database to perform functional enrichment and pathway enrichment analyses of the DEGs. The enrichment content included GO, Reactome Pathways, Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways, UniProt Keywords, PFAM Protein Domains, SMART Protein Domains, INTERPRO Protein Domains and Features, and other databases. The pathway enrichment analysis was based on references from the KEGG pathways. A false discovery rate (FDR) < 0.05 was used as the cut-off.

Correlation analysis
To explore the relationship of the 8 novel genes identified through the data mining with immune checkpoint gene (PDCD1) expression, We used TCGA database and R software to perform correlation analyses.

ACKNOWLEDGMENTS
The authors thank Aging Editing Service for editing the manuscript.

CONFLICTS OF INTEREST
The authors declare there is no conflicts of interest regarding the publication of this paper.