Identification of DEERGs and functional enrichment analysis
By comparing tumor and normal tissue samples, there were 2906 genes differentially expressed, where 1384 DEGs up-regulated and 1522 DEGs down-regulated (Fig .1A). The heat map shows the expression of the first 15 up-regulated and down-regulated genes (Fig. 1B). After overlapping DEGs with 720 ERGs, we obtained 56 DEERGs (Fig .1C). In tumor samples, 36 of 56 DEERGs were up-regulated and 20 were down-regulated (Fig. 1D). The locations of the 56 DEERGs on chromosomes are shown in (Fig. 2).
To obtain the functions of these 56 DEERGs, GO function analysis of these 56 genes showed that they were involved in histone modification, chromatin organization and peptidyl-lysine modification (Fig. 3A-1). KEGG pathway analysis showed that they were involved in viral carcinogenesis, homologous recombination, cell cycle and Fanconi anemia pathway (Fig .3B-1). GO function analysis of these 56 genes showed that they were involved in histone modification, chromatin organization and peptidyl-lysine modification (Fig .3A-2). KEGG pathway analysis showed that they were involved in viral carcinogenesis, homologous recombination, cell cycle and Fanconi anemia pathway (Fig .3B-2).
Establishment and validation of the prognostic model
To construct epigenetic-related signature for survival prediction, we conducted univariate cox regression on the 56 DEERGs and selected 19 genes that were significantly linked with OS in training set (Fig. 4A). Inputting 19 genes into the LASSO model, eight genes were identified (Fig .4B, C). Among them, PHF19, AURKA, CHAF1B and AURKB were up-regulated in the tumor group, NAP1L2, TONSL, SATB2 and HDAC9 were down-regulated in the tumor group (Fig .4D). Furthermore, we determined the formula of risk score: (-0.047 × expression value of SATB2) + (0.058 × expression value of HDAC9) +(0.153 × expression value of NAP1L2) + (-0.024× expression value of PHF19) + (-0.004 × expression value of AURKB) + (-0.052 × expression value of TONSL) +(-0.159 × expression value of AURKA)+(-0.138 × expression value of CHAF1B). Then CRC patients were classified as the high- and low-risk groups according to the median value of risk scores in the GSE40967.
(Fig. 5A ) and(Fig. 5B) demonstrated the risk scores and survival status between the high and low risk groups. Obviously, the high-risk group had poor prognosis of GC compared with low-risk group in the GSE40967 (Fig .5C). ROC curve showed the AUC of risk score for 1-, 2-, 3- year survival status prediction was 0.72, 0.68, 0.66, indicated that risk score had moderate performance in predicting patient’s survival status (Fig .5D-F). In the validation set, Kaplan–Meier analysis also showed a significant difference of overall survival (OS) (Fig. 5G-I) between two groups (high-risk and low-risk). AUC values of the risk model for 1–3 years in all the three cohorts were also greater than 0.6 (Fig .5J-L).
Clinical feature analysis and GSVA analysis
We assessed the relevance between the clinicopathological traits and risk score, including gender and TNM stage. The risk score was significantly increased in advanced TNM stage cases (Fig. 6A-C) and the risk score was not significantly different in gender (Fig. 6D). The results showed that there was a powerful correlation between risk score and TNM stage.
We performed GSVA analysis with annotations of GO and KEGG gene sets to examine the potential biological functions between risk groups of CRC patients. The gene sets involved in hypertrophic cardiomyopathy HCM, negative regulation of leukocyte migration, sarcolemma and phosphatidylinositol 3 kinase binding were enriched in the high-risk group, while those related to DNA replication, DNA strand elongation involved in DNA replication, chromosome passenger complex and snoRNA binding were enriched in the low-risk group (Fig .7A-D).
Immune analysis of the high and low risk groups
We calculated immune/stromal scores and their correlation with risk scores. The results revealed that both the immunity score (cor = 0.414) and the stroma score (cor = 0.437) were significantly and positively correlated with the risk score (p < 0.05). (Fig .8A, B).
Then we used CIBERSORT databases to assess the percentage of immune infiltrating cells in patients (Fig. 9A). Then we obtained 5 differential immune cells by CIBERSORT. The main differential immune cells between the risk groups (high and low) included NK cells resting, eosinophils, mast cells resting, T cells CD4 memory activated and mast cells active (Fig. 9B).
Furthermore, the expression of immune checkpoints were compared between the risk groups (high and low), the results showed that the expressions of CDK4, CD48, CD155, B7H5, GEM, CD134L, CD27, CD86, FAS, TIM3, TIGIT, BTLA, CD160, PDL2, CD28, CD244, PDL1 and CD137L were found to be significantly different between the two groups (Fig .9C).
Correlations of risk model genes with m6A and m5C associated genes
We analyzed the expression patterns of 19 m6A regulators in CRC (Fig. 10A), and the results revealed that CBLL1, ELAVL1, FMR1, HNRNPA2B1, IGF2BP2, RBM15 AND YTHDF1 was significantly altered between the risk groups (high and low) (Fig .10B). Then, correlation analysis was performed on the expression of 19 m6A-related genes and risk model genes (Fig. 10C), and we found AURKA had the most correlation to YTHDF1(cor = 0.67). The correlation between other model genes and m6A-related genes were less than 0.5.
Then we evaluated the expressions of 20 m5C-related genes in CRC (Fig. 10D). The results revealed that MBD1, DNMT1, MBD3, SMUG1, ZBTB4, TET2, DNMT3A, TET3, UHRF1, DNMT3B, UNG and NTHL1 were significant difference between the risk groups (high and low). We detected the correlation analysis between risk model genes and 20 m5C-related genes (Fig. 10E), and we found that AURKB was positively correlated with DNMT1(cor = 0.67), UHRF1 (cor = 0.65) and UNG (cor = 0.5). PHF19 was significantly positively correlated with DNMT1 (cor = 0.55) and UHRF1 (cor = 0.53), AURKA was significantly positively correlated with DNMT3B (cor = 0.58) and DNMT1 (cor = 0.51), CHAF18 was significantly positively correlated with DNMT1 (cor = 0.56), UHFR1 (cor = 0.56) and UNG (cor = 0.51) (Fig. 10F).
Prediction of Targeted Drugs for AURKA, AURKB and HDAC9
By means of eight model genes, we prediction of potential drugs for the treatment of CRC (Fig. 11). Only three genes, AURKA, AURKB and HDAC9, received the predicted drugs. A total of 137 drug-gene interaction pairs including 103 drugs and 3 model genes were found to have interactions. Among them, AURKA, AURKB and HDAC9 targeted by 47, 58, 32 drugs, respectively. Among them, pazopanib, danusertib, entrectinib and sorafenib targeted AURKA and AURKB. Givnostat, apicidin, belinostat and largazole targeted HDAC9.
Analyses of Independent Prognostic and Construction of the nomogram in CRC
Importantly, TNM stage, age and risk score were significantly associated with prognosis in both univariate Cox analysis and mutivariate Cox analysis. Risk score, age, gender, TNM stage were included into univariate analysis (Fig. 12A), and risk score, age, T stage, N stage and M stage were used for multivariate analysis. The result indicated that risk score, age and N stage and M stage were independent prognostic factors in CRC (Fig. 12B). Thereafter, we constructed a nomogram to predict the 1-, 2-, and 3-year survival of CRC patients by using risk score, age N stage and M stage (Fig .12C). The calibration curves for 1-, 2-, and 3-year (Fig. 12D-F) showed that the nomogram-predicted probability of survival was close to the actual survival.
Experimental Verification of model genes
The expressions of the 5 prognostic epigenetic-related genes were validated by quantitative real-time polymerase chain reaction (qRT-PCR) using 20 pairs of CRC and adjacent tissues. PCR experiments were conducted in which the expressions of HDAC9, NAP1L2, SATB2 and TONSL were significantly downregulated in CRC, but the expression of CHAF1B was significantly upregulated in CRC (Fig .13), which were all consistent with previous screening results.