A novel 10 glycolysis-related genes signature could predict overall survival for clear cell renal cell carcinoma

The role of glycolysis in tumorigenesis has received increasing attention and multiple glycolysis-related genes (GRGs) have been proven to be associated with tumor metastasis. Hence, we aimed to construct a prognostic signature based on GRGs for clear cell renal cell carcinoma (ccRCC) and to explore its relationships with immune infiltration. Clinical information and RNA-sequencing data of ccRCC were obtained from The Cancer Genome Atlas (TCGA) and ArrayExpress datasets. Key GRGs were finally selected through univariate COX, LASSO and multivariate COX regression analyses. External and internal verifications were further carried out to verify our established signature. Finally, 10 GRGs including ANKZF1, CD44, CHST6, HS6ST2, IDUA, KIF20A, NDST3, PLOD2, VCAN, FBP1 were selected out and utilized to establish a novel signature. Compared with the low-risk group, ccRCC patients in high-risk groups showed a lower overall survival (OS) rate (P = 5.548Ee-13) and its AUCs based on our established signature were all above 0.70. Univariate/multivariate Cox regression analyses further proved that this signature could serve as an independent prognostic factor (all P < 0.05). Moreover, prognostic nomograms were also created to find out the associations between the established signature, clinical factors and OS for ccRCC in both the TCGA and ArrayExpress cohorts. All results remained consistent after external and internal verification. Besides, nine out of 21 tumor-infiltrating immune cells (TIICs) were highly related to high- and low- risk ccRCC patients stratified by our established signature. A novel signature based on 10 prognostic GRGs was successfully established and verified externally and internally for predicting OS of ccRCC, helping clinicians better and more intuitively predict patients’ survival.

development of effective early screening and diagnosis methods are essential for improving the treatment effect and life quality for these patients.
In recent years, there have been more and more researches on the metabolic changes of tumor cells. Warburg effect is the most common and widely studied metabolic change in cancer cells, that is, in the presence of oxygen, malignant tumor cells have an inherent tendency to incompletely oxidize glucose [4]. Studies have revealed that many tumors have enhanced glucose uptake in adjacent tissues, and high glucose uptake rates are simultaneously present with suppressed glucose oxidation [4]. Glucose is converted into lactic acid during glycolysis, and cancer cells obtain maximum energy in this way. This phenomenon is ubiquitous in tumors [5], and this metabolic change is particularly noticeable in kidney cancer [6].
Accumulating evidence have reported that glycolysis-related genes (GRGs) are differently expressed in various malignant cancers and play important roles in tumorigenesis and progress. For example, studies have presented that about 90% of patients with sporadic ccRCC have mutations of VHL gene [7]. When this gene is deleted, HIF-α accumulates, and genes such as VEGF, PDGF, TGF-α, and MMP are activated to participate in neovascularization formation, cell proliferation, infiltration and distant metastasis promote the development of tumors [8]. One of the key enzymes of glycolysis, hexokinase 2 (HK2), has turned out to be abnormally expressed in ccRCC and can promote cell proliferation and invasion [9]. Multiple genes, such as FBP1, PLOD2, VCAN, and CD44 have been demonstrated to participate in epithelial-mesenchymal transition (EMT) promotion of tumor metastasis [10][11][12][13]. More and more researches on the relationships between glycolysis and tumor occurrence and development have emerged. Glycolysis roles in the progress of RCC have also been confirmed. However, there is currently no prognostic model for ccRCC based on GRGs. Hence, this article is committed to establish a novel GRGs-related prognostic signature and to explore its associations with immune infiltration for predicting ccRCC patients' overall survival (OS).

Data collection and identification of differentially expressed GRGs
Clinical information and RNA-sequencing FPKM values of 539 ccRCC tumor samples and 72 normal tissues were obtained from The Cancer Genome Atlas (TCGA, https://portal.gdc.cancer.gov/) dataset, with a mixture of different histologic types of ccRCC included in this study. All raw data were preprocessed by normalization, log 2 transformation and excluding average count values of genes < 1. Differentially expressed GRGs were identified by the "Limma" package, in setting the cut-off values of |log2 fold change |>1 and false discovery rate (FDR) < 0.05. In addition, the E-MTAB-1980 dataset from the ArrayExpress database (https://www.ebi.ac.uk/ arrayexpress/) including 99 ccRCC tumor samples were served as an external verification cohort.
Gene ontology (GO) and Kyoto encyclopedia of genes and genome (KEGG) pathway enrichment analyses Through the enrichment of GO and KEGG pathway enrichment analyses, differently expressed GRGs' biological functions were comprehensively evaluated. Therein, GO analysis terms included molecular functions (MF), cellular components (CC) and biological processes (BP). All functional annotation and pathway enrichment analyses were conducted using the R package of "clusterProfiler".

Protein-protein interaction (PPI) network and related module screening
In order to evaluate the PPI network, we submitted these differentially expressed GRGs to the online tool STRING (http://www.string-db.org/) and visualized the PPI network by Cytoscape 3.7.0 software. By means of the Molecular Complex Detection (MCODE) plug-in, we screened out the top three hub modules identified from the PPI network with the cut-off criteria of a node degree > 3, a combined score > 0.9 and P < 0.05.

Prognostic model construction and validation
We utilized data in the TCGA database as the training queue and the data in the ArrayExpress database as the external verification queue. Univariate Cox regression was performed to analyze associations between differently expressed GRGs and OS for ccRCC in the TCGA database, and 40 candidate genes were selected. The least absolute contraction and selection operator (LASSO) method was used to improve accuracy and to select out the optimal gene combination. Finally, 10 genes were screened out by multivariate Cox regression analysis to establish a signature. According to the genes' expression levels and their regression coefficients, a risk signature is established. The riskscore formula for each ccRCC patient was displayed as following: Therein, β represented the regression coefficient and exp. was the genes' expression levels. Taking the median riskScore in the TCGA training database as the threshold, the eligible ccRCC patients were classified into a high-risk and a low-risk group, and Kaplan-Meier survival curve was drawn to show the prognosis difference between the two groups of patients. Besides, our established signature was carefully evaluated in three validations sets including the external validation dataset (ArrayExpress), the internal validation dataset 1 (test 1) as well as the internal validation dataset 2 (test 2).

Univariate/multivariate cox regression analyses and nomogram construction
Univariate and multivariate Cox regression analyses were applied to evaluate whether our established signature and various clinical parameters could be served as an independent prognostic factor for ccRCC. Six clinical factors and riskscore were utilized to establish the prognostic nomogram, evaluating the ccRCC patients' 1-, 3-and 5-year OS. C-index and the area under the curve (AUC) were calculated to assess nomogram accuracy. Calibration charts were applied to intuitively assess nomogram prognostic ability. In addition, the prognosis nomogram was also verified in external validation cohort from ArrayExpress database.
Verification of the GRGs' protein expression in HPA database and survival analysis in Kaplan-Meier plotter website By means of the Human Protein Atlas (HPA) dataset (http://www.proteinatlas.org/), we validated the 10 hub GRGs' protein expression levels in ccRCC by immunohistochemical staining. Kaplan-Meier plotter website (http://kmplot.com/analysis/) was also applied to assess the prognosis of these 10 hub GRGs in ccRCC by survival analysis.

Estimation of tumor-infiltrating immune cells (TIICs)
The TIICs expression levels in all ccRCC samples were calculated by the "limma" package of R software. Then, algorithm with LM22 gene signature and 1000 permutations were calculated to find whether these TIICs are highly related to high-and low-groups stratified by our established signature [14]. P-value below 0.05 above mentioned was set as threshold.

Statistical analysis
R software 3.6.3 was utilized to calculate all statistical analyses. Survival curves were drawn by the Kaplan-Meier method with log-rank test. Univariate COX, LASSO and multivariate COX regression analyses were utilized to calculate the regression coefficient and to establish a signature. All statistical tests are bilateral. P-value below 0.05 was regarded to be statistically significant.

Identification of differently expressed GRGs
The workflow of our study was presented in Supplement Figure S1. A total of 611 patients were extracted from the TCGA database, containing 539 renal cancer samples and 72 normal specimens with a mixture of different histologic types of ccRCC included in this study. The R package of "Limma" was applied to discover differentially expressed mRNAs (DE-mRNAs) with a thresholds of |log 2 (FC)| > 1 as well as FDR less than 0.05, 113 differently expressed GRGs were screened from the GRG list, including 44 down-regulated and 69 up-regulated GRGs. Their expression heatmap and volcano plot were shown in Fig. 1a-b.

Functional annotation and pathway enrichment
Results of GO and KEGG analyses presented that differently expressed GRGs were remarkably enriched in the biological processes (BP) analysis associated with carbohydrate catabolic process, pyruvate metabolic process, glycolytic process, hexose metabolic process, ATP generation from ADP, pyridine nucleotide metabolic process, glucose metabolic process, pyridine−containing compound metabolic process, nicotinamide nucleotide metabolic process pyruvate biosynthetic process. Through the molecular function (MF) analysis, they were significantly enriched in oxidoreductase activity, monosaccharide binding, coenzyme binding, carbohydrate phosphatase activity, organic acid binding, sugar−phosphatase activity carboxylic acid binding, oxidoreductase activity, NAD or NADP as acceptor, carbohydrate kinase activity, vitamin binding. Concerning the cellular component (CC), differently expressed GRGs were notably enriched in lysosomal lumen, Golgi lumen, oxidoreductase complex, vacuolar lumen, mitochondrial matrix, proteinaceous extracellular matrix, endoplasmic reticulum lumen, extracellular matrix. KEGG pathway analysis is significantly enriched in HIF-1 signaling pathway, Glycolysis/Gluconeogenesis, Carbon metabolism, Glucagon signaling pathway, Pyruvate metabolism, Fructose and mannose metabolism, Tyrosine metabolism, AMPK signaling pathway, Biosynthesis of amino acids, Galactose metabolism, Insulin signaling pathway, suggesting that these prognostic genes are indeed associated with glycolysis ( Fig. 1c-d).

PPI network integration and the top 3 key modules selection
In order to further study their roles of differentially expressed GRGs in RCC, we used the STRING database to reveal the relationships between differentially expressed GRGs and visualized by Cytoscape software (Fig. 2a). We also used Cytoscape's MODE tool to process the co-expression network and to identify the top three key modules in the PPI network ( Fig.  2b-d).

Construction a prognostic model
To construct a prognostic model, univariate regression analysis was done to find out 40 candidate GRGs (Fig. 3a). The LASSO Cox regression model was applied to avoid overfitting of the model (Fig. 3b-c). Finally, 10 genes were screened out by multivariate Cox regression analysis to establish a signature ( Fig. 3d and Table 1). The riskscores of each patient were shown as following: Fig. 1 The differentially expressed GRGs identified in ccRCC. a Heatmap; b Volcano plot. The red nodes represent the up-regulated genes Green nodes represent the down-regulated genes (P value < 0.05 and |log2(FC)| > 1); c GO enrichment of differently expressed GRGs; d KEGG pathway analysis of differently expressed GRGs

Validation the expression and prognosis of key GRGs
Based on our established glycolysis signature, 539 ccRCC cases were subdivided into a high-risk and a low-risk group, accordingly to the median riskscore. Kaplan-Meier survival analysis indicated that ccRCC patients in low-risk groups could have a markedly longer OS than patients in high-risk groups (P = 5.548e− 13, Fig. 4a). The signature showed superior predictive veracity of patients' OS with the 1-, 3-and 5-year AUC values of 0.724, 0.716 and 0.741, separately ( Fig. 4b-d, Table 2). Besides, the riskscore for each sample was also calculated and ranked and the heatmap showed the expression value of ten significant GRGs between highand low-risk groups. As the risk score increased, ccRCC patients would have a shorter survival time and more dead events (Fig. 4i). Three validation datasets containing the external validation dataset, the internal validation dataset 1 as well as the internal validation dataset 2 were used to verify the risk score signature had similar predictive values in different populations. Survival analysis showed that all three testing cohorts had similar outcomes (Fig. 4e Fig. 5a and e). The 1-, 3-and 5-year AUC values of OS in three testing cohorts were all above 0.70, also showed a favorable predictive ability (Figs. 4f-h, Fig.  5, Table 2). The expression heatmap of ten hub GRGs, survival status and risk score of signature of ccRCC patients were displayed in Fig. 5.

Determination of the GRGs risk model as an independent prognostic factor
Univariate and multivariate Cox regression analyses were applied to evaluate whether our established signature and various clinical parameters could be served as an independent prognostic factor for ccRCC. In the TCGA cohort, univariate analysis showed the glycolysis signature, age, grade, stage, T and M were all remarkably related to OS (all P < 0.001). Multivariate analysis    indicated that the glycolysis signature, age, grade and stage were all marked correlated with OS (all P < 0.01). Therefore, the prognostic glycolysis signature constructed by the TCGA training set was an independent prognostic factor for ccRCC (Fig. 6, Table 3).
Nomogram construction based on the clinical characteristic and ten GRGs' signature To predict ccRCC patients' prognosis, a prognostic nomogram was constructed by TCGA dataset to predict the 1-, 3-and 5-year OS for ccRCC. Severn prognostic parameters were included in the prediction model, including the riskScore, age, gender, grade, T, M and N (Fig. 7a). The nomogram showed good predictive power of 1-, 3-and 5-year OS rates for ccRCC. The 1-, 3-and 5-year AUC and their C-index were 0.836, 0.799, 0.765 and 0.781 in the TCGAs database respectively (Table 4, Supplement Figure S2). Calibration charts (Fig. 7c) showed that the 1-, 3-and 5-year survival rates of the TCGA cohort predicted were in excellently consistent with actual observations. In addition, we constructed another prognosis nomogram as an external verification set to verify the previous results in ArrayExpress dataset (Fig. 7b). Its C-index was 0.86 and the AUC values were 0.888, 0.915, and 0.902 (Table 4, Supplement Figure S2) respectively, indicating that the nomogram also has good predictive power for OS rates in the testing set. The calibration chart shows excellent agreement between the 1-, 3-and 5-year OS predictions and actual observations of ccRCC patients in the external testing set (Fig. 7d).

Prognostic value of 10 key GRGs and their associations between our established signature and clinical factors
Based on the median expression, survival analyses of 10 GRGs (ANKZF1, CD44, CHST6, HS6ST2, IDUA, KIF20A, NDST3, PLOD2, VCAN, FBP1) by Kaplan-Meier Plotter website were displayed in Fig. 8. The correlations between our established and clinical factors were also analyzed and revealed that the established signature was firmly associated with grade (P = 0.030), stage (P = 0.036) and staged T (P = 0.042) ( Table 5, Supplementary Figure S3).

Validation of the expression of 10 critical GRGs in ccRCC from HPA database
As detailed in Supplementary Figure S4, immunohistochemical staining of these 10 critical GRGs from HPA database were utilized to verify their protein expressions. Compared with normal kidney tissues, antibody stainings for ANKZF1, CD44, IDUA, KIF20A, PLOD2, and VCAN were high in ccRCC tumor tissues, whereas they were low for CHST6, HS6ST2, NDST3 and FBP1.

Clinical factors stratified by the riskScore for OS
Our results indicated that our established riskScore was capable of predicting OS in age ≤ 65, age > 65, Female, Male, Grade 1-2, Grade 3-4, M0, N0, White, Stage I-II, Stage III-IV, T1-2 stage, T3-4 stage (all P < 0.01; Fig. 9).  Associations between TIICs and our established signature for ccRCC As displayed in Fig. 10a-i, nine out of 21 TIICs (T cells CD4 memory activated, Plasma cells, T cells gamma delta, T cells regulatory (Tregs), Macrophages M0, Monocytes, Macrophages M1, Mast cells resting, Dendritic cells resting) were highly associated with high-and low-risk ccRCC patients stratified by our established riskScore (all P < 0.05). As detailed in Fig. 10j, radar chart showed the relationships of 21 TIICs between  high-risk and low-risk ccRCC patients and nine out of these 21 TIICs were statistically significant (all P < 0.05).

Discussion
According to GLOBOCAN statistics, there were approximately 403,262 new cases of kidney cancers and 175,098 death worldwide in 2018 [15]. Of all human malignancies, kidney cancer accounted for about 2-3% [2].
At present, few studies had focused on the expression pattern of GRGs and their roles in predicting ccRCC survival. Hence, we identified differentially expressed GRGs between tumor samples and normal tissues based on TCGA raw data to construct a PPI network and to perform GO and KEGG pathway enrichment analysis. In addition, we also conducted univariate COX, LASSO and multivariate COX regression analyses to establish a signature for predicting the ccRCC patients' prognosis based on 10 prognostic-related GRGs and further  Table 5 Clinical correlation analysis between 10 prognostic GRGs, riskscore and clinical features  explored its associations with immune infiltration. These works may help to identify novel effective biomarkers for ccRCC prognosis. According to reports, there are metabolic changes in a variety of cancers, and glycolysis is the most common and widely studied metabolic change [16]. It was the first time for us to identify differently expressed GRGs and then screen out 10 prognostic-related GRGs (ANKZF1, CD44, CHST6, HS6ST2, IDUA, KIF20A, NDST3, PLOD2, VCAN, FBP1), based on TCGA, by univariate/ multivariate Cox regression and LASSO regression analyses. Some studies have shown that these GRGs play important roles in the tumorigenesis and progress of various tumors, including renal clear cell carcinoma. For example, the expression of HS6ST2 was up-regulated in tumors of gastric cancer, and related to the poor prognosis [17]. Liep J et al. found that overexpression of miR-145-5p and miR-141-3p could inhibit the migration and invasion of RCC cells by influencing the HS6ST2 expression [18]. Regarding KIF20A, it was reported to be overexpressed in various tumors such as bladder cancer, cervical cancer, glioma, lung adenocarcinoma, ovarian clear cell carcinoma, colorectal cancer, liver cancer, prostate cancer, gastric cancer, etc., and often indicates a poor prognosis and poor clinicopathological features [19][20][21][22][23][24][25][26][27][28]. Studies had shown that KIF20A might promote the proliferation, invasion and migration of tumor cells by activating the JAK2/STAT3 pathway [23,26]. Asahara S et al. has also developed cancer vaccine reagent KIF20-66 for the treatment of pancreatic cancer, which has a beneficial therapeutic effect on advanced pancreatic cancer [29]. Epithelial mesenchymal transformation (EMT) is one of the key steps that cause distant metastasis of tumors [30]. FBP1, PLOD2, VCAN, CD44 are all proved to relate to EMT. Studies have shown that PLOD2 is regulated by many factors, such as HIF-1α, TGF-β and microRNA [31]. FBP1 is a rate-limiting enzyme for gluconeogenesis and has recently been considered as a tumor suppressor for various cancers [32]. It is proved that FBP1 interacts with HIF to inhibit the function of nuclear HIF, inhibit glycolysis and pentose phosphate pathway and inhibit the proliferation of renal cancer cells [10]. FBP1 overexpression inhibits the proliferation, migration, invasion and tumorigenesis of cholangiocarcinoma cells by inhibiting the Wnt/β pathway [33]. FBP1 gene silencing could activate the MAPK pathway and then promote cell EMT, invasion and metastasis in prostate cancer [34]. Several studies have found that PLOD2 induces epithelial-mesenchymal transition (EMT) mainly through the PI3K / AKT signaling pathway [11]. Mitsui Y et al. found that VCAN promotes tumorigenesis and metastasis of ccRCC [35]. VCAN knockdown significantly reduced the proliferation of renal cancer cells and increased apoptosis, which is linked to the changes of several TNF signaling related genes such as TNFα, BID and BAK [35]. TGF-β had the ability to enhance the invasiveness of ovarian cancer cells by up-regulating VCAN in fibroblasts (CAF) [36]. Up-regulated VCAN could also promote the invasion and movement of ovarian cancer cells by up-regulating CD44 and activating the NF-κB signaling pathway, matrix metalloproteinase 9 and hyaluronan-mediated movement receptors [36]. Wu et al. have shown that CD44 + bladder cancer cells have a higher invasive ability [13].
Current research shows that clinical and pathological features such as age and metastasis are not sufficient to accurately evaluate cancer patients' survival. At present, we are looking for effective biomarkers to predict the survival of cancer patients. However, single gene expression levels will be affected by many factors, making these markers unreliable to be independent prognostic indicators. Therefore, a statistical model composed of multiple related genes is much more accurate in evaluating the prognosis of tumor patients than using a single biomarker, so this model has been extensively used. Few studies have concentrated on the role of glycolysis in the prediction of ccRCC prognosis. We established a new prognostic signature based on the expression of 10 GRGs. Based on this risk scoring model, ccRCC patients were classified into a high-group and low-risk group. Patients with high-risk scores had significantly lower OS compared to those with low-risk scores. In addition, we combined the established riskscore and multiple clinical Fig. 10 Associations between tumor-infiltrating immune cells (TIICs) and GRGs based riskscore signature in ccRCC; a Dendritic cells resting between high-and low-risk ccRCC patients; b Macrophages M0 between high-and low-risk ccRCC patients; c Macrophages M1 between highand low-risk ccRCC patients; d Mast cells resting between high-and low-risk ccRCC patients; e Monocytes between high-and low-risk ccRCC patients; f Plasma cells between high-and low-risk ccRCC patients; g T cells CD4 memory activated between high-and low-risk ccRCC patients; h T cells gamma delta between high-and low-risk ccRCC patients; i T cells regulatory (Tregs) between high-and low-risk ccRCC patients; j Radar chart showed the difference of immune cell infiltration abundances in ccRCC subtypes parameters to construct a nomogram to predict 1-, 3-, and 5-year OS in ccRCC patients. The calibration chart based on the TCGA database shows that the predicted value and the observed value are very close, indicating that the prediction performance of nomogram is very good. Similarly, it is checked in the external verification set and the two internal verification sets. Therefore, our new prognosis nomogram may be better than the original clinical factors to help clinicians predict the survival status for ccRCC and provide specific individualized treatment.
As far as we knew, this was the first signature to predict the OS of ccRCC patients based on GRGs. We successfully established a risk signature based on GRGs and verified it in three verification sets including one external verification data set (ArrayExpress) and two internal verification data sets (test 1 and test 2). Our results remained stable both internally and externally. Of course, this study also had several limitations. Firstly, as a bioinformatics analysis article, our research is retrospective and the clinical information downloaded from the online database was incomplete and limited. Secondly, 10 glycolysis-related genes had not been experimentally verified and our constructed nomogram had not been validated by our own data. Thirdly, due to limited clinical information, we had difficulties in comparing our established signature with other available tools in the clinic, such as the Heng score (IMDC).

Conclusions
Taken together, 10 GRGs including ANKZF1, CD44, CHST6, HS6ST2, IDUA, KIF20A, NDST3, PLOD2, VCAN, FBP1 were selected out and utilized to establish a novel signature. The GRGs based signature was successfully established and verified externally and internally for predicting OS, helping clinicians better and more intuitively predict ccRCC patients' survival. As an independent prognostic factor, our established signature showed excellent predictive efficacy for ccRCC and significantly associated with immune infiltration. Further researches were required to verify our findings.