Methylation-Based Classification of Cervical Squamous Cell Carcinoma into Two New Subclasses Differing in Immune-Related Gene Expression

Cervical cancer is traditionally classified into two major histological subtypes, cervical squamous cell carcinoma (CSCC) and cervical adenocarcinoma (CA). However, heterogeneity exists among patients, comprising possible subpopulations with distinct molecular profiles. We applied consensus clustering to 307 methylation samples with cervical cancer from The Cancer Genome Atlas (TCGA). Fisher’s exact test was used to perform transcription factors (TFs) and genomic region enrichment. Gene expression profiles were downloaded from TCGA to assess expression differences. Immune cell fraction was calculated to quantify the immune cells infiltration. Putative neo-epitopes were predicted from somatic mutations. Three subclasses were identified: Class 1 correlating with the CA subtype and Classes 2 and 3 dividing the CSCC subtype into two subclasses. We found the hypomethylated probes in Class 3 exhibited strong enrichment in promoter region as compared with Class 2. Five TFs significantly enriched in the hypomethylated promoters and their highly expressed target genes in Class 3 functionally involved in the immune pathway. Gene function analysis revealed that immune-related genes were significantly increased in Class 3, and a higher level of immune cell infiltration was estimated. High expression of 24 immune genes exhibited a better overall survival and correlated with neo-epitope burden. Additionally, we found only two immune-related driver genes, CARD11 and JAK3, to be significantly increased in Class 3. Our analyses provide a classification of the largest CSCC subtype into two new subclasses, revealing they harbored differences in immune-related gene expression.


Introduction
Cervical cancer is the second most prevalent cancer in females worldwide and can be traditionally classified into two common histological subtypes: cervical squamous cell carcinoma (CSCC) and cervical adenocarcinoma (CA). Among these, CSCC accounts for approximately 80% of all cervical cancer cases [1], with most of the remaining cases being CA. Evidences have shown that differences exist between these two subtypes, including risk factors [2], incidence rates [3], clinical features [4], and mutations [5]. The treatment adopted for these two subtypes is typically similar [6], while previous studies have reported relatively different outcomes [1].
In addition to the histological difference, cervical cancer also shows heterogeneity related to microenvironments such as hypoxia, variation in response to treatment, risk of metastasis, and gene expression [7][8][9][10]. Intratumoral metabolic and gene mutation heterogeneities are also observed in cervical cancer [11,12]. It is possible that the subpopulations present will have distinct molecular profiles with different levels of intrinsic response to therapy. Accurate subclass identification and characterization of the underlying mechanism are pivotal for our understanding of the disease and for the guiding of personalized therapies.
With the emergence of genome wide profiling techniques, large genomic datasets have become available for the discovery of new cancer subclasses. DNA methylation arrays measure the methylation status of thousands of CpG sites (or CG nucleotide) across the genome [6], and can also be used for cancer classification. Increasing evidence has shown that methylation profiling may reveal additional complexity that is not captured at the expression level or through genetic profiling [13][14][15], being able to delineate biologically relevant tumor subgroups [13,16,17]. In cervical cancer, changes in DNA methylation have been reported to play a critical role during cervical tumorigenesis [18][19][20]. In particular, hypermethylation of promoters is associated with the silencing of tumor suppressor genes, such as apoptosis-related genes and those involved in the cell cycle, DNA repair, and WNT (Wnt is an acronym in the field of genetics that stands for 'Wingless/Integrated') pathways [18]. The changes in methylation may result in aberrant gene expression, which consequently modifies the biological characteristics of the cancer. This prompted us to assess whether the DNA methylation profiles could divide cervical cancer patients into any new subgroups.
The large cervical cancer dataset generated by high-throughput genomic technologies and provided by The Cancer Genome Atlas (TCGA) offers a rich resource and a new opportunity to decipher the biological variability of tumors. In the present study, we performed clustering analysis using 307 cervical cancer patients' DNA methylation data from TCGA. By combining the clustering results with the clinical information, we found that the largest cervical cancer subtype CSCC could be divided into two new subclasses, while the CA subtype stayed a single separate subclass. Based on this classification, we moved further to a detailed characterization of their differences by integrating gene expression analysis. We showed that these three subclasses displayed intrinsic differences in the methylation level and gene expression. Interestingly, we observed that for the CSCC subclasses the hypomethylation differences mainly occurred in the promoter regions. In one CSCC subclass, five transcription factors (TFs) showed enrichment in lowly methylated promoters and 28 of their highly expressed target genes were functionally enriched in the immune pathway. What's more, gene function analysis revealed the differentially expressed genes between the two CSCC subclasses to be enriched in several immune response pathways. High expression of 24 genes involved in these pathways exhibited a better overall survival and correlated with predicted neo-epitope burdens. Finally, when assessing the differentially expressed TFs and driver genes, we noticed that only two immune-related driver genes were differentially expressed between the two CSCC subclasses.

Analysis of DNA Methylation Identifies Two New Subclasses of CSCC
To identify subgroups of samples, we performed unsupervised hierarchical clustering based on DNA methylation data. We selected the top 30,000 most variable probes that showed the highest median absolute deviation (MAD) across beta values for clustering ( Figure S1a). Alternatively, the top 20,000 and 40,000 probes were also chosen for clustering, respectively. We observed the clustering based on the top 30,000 and 40,000 probes generated similar results, and the separation of patients was more distinct than the clustering based on the top 20,000 probes ( Figure S1b). In addition, we also chose the probes located in the promoter regions for clustering. As we observed the promoter probes showed relatively lower MAD ( Figure S1a), the top 10,000, 20,000, and 30,000 most variable promoter probes were used. Similarly, we noticed that the clustering results with the top 20,000 and 30,000 promoter probes performed better than the one with top 10,000 promoter probes, but achieved similar clustering structures ( Figure S1b). By comparing the clusters generated from all probes versus promoter probes, we found that both results achieved three major clusters consistently. However, we noticed that there existed three subgroups in Cluster 1 that failed to be divided by promoter-probe clustering, while they could be identified by all-probe clustering ( Figure S1b). Taken these results, we considered the clustering based on the top 30,000 probes, and used the three distinct clusters for subgroup separation. Using the clinical information, surprisingly, we found the derived clusters exhibited strong association with histological status (Figure 1). The patients assigned to Class 1 were mainly those with the CA subtype. Interestingly, Class 2 and Class 3 divided the CSCC subtype into two separate subclasses. The clusters derived from promoter probes also showed similar histological associations ( Figure S1b). We calculated the mean beta values for all the probes in each subclass and noticed Class 1 showed a higher methylation level (Student's t-test, p-value < 0.001; Figure S2). Although in the same CSCC subtype, Class 3 showed a lower global methylation level as compared with Class 2 (Student's t-test, p-value < 0.001; Figure S2). divided by promoter-probe clustering, while they could be identified by all-probe clustering ( Figure  S1b). Taken these results, we considered the clustering based on the top 30,000 probes, and used the three distinct clusters for subgroup separation. Using the clinical information, surprisingly, we found the derived clusters exhibited strong association with histological status (Figure 1). The patients assigned to Class 1 were mainly those with the CA subtype. Interestingly, Class 2 and Class 3 divided the CSCC subtype into two separate subclasses. The clusters derived from promoter probes also showed similar histological associations ( Figure S1b). We calculated the mean beta values for all the probes in each subclass and noticed Class 1 showed a higher methylation level (Student's t-test, p-value < 0.001; Figure S2). Although in the same CSCC subtype, Class 3 showed a lower global methylation level as compared with Class 2 (Student's t-test, p-value < 0.001; Figure S2). K is the number of clusters generated. The heat map is a visual representation of the consensus matrix, which is a matrix of sample pairs. Each matrix entry measures the proportion of times the pair's samples are clustered together across resampling iterations. In the heat map, values ranging from 0 (corresponding to two samples that are never clustered together) to 1 (corresponding to two samples that are always clustered together) are represented by white to dark blue, respectively. Samples in the matrix are ordered according to their cluster, resulting in a block-diagonal matrix. A dendrogram atop the heatmap is shown, which represents each cluster 1, 2, and 3, respectively; (b) bar plots show the number of patients in each subclass and their histological status.

Lowly Methylated Promoters in Class 3 Show Enrichment in the Binding of 5 TFs
Based on the methylome of each subclass, we then analyzed the differentially methylated probes between each of them ( Figure 2a). In contrast to Class 1, we observed there were more lowly methylated probes in Class 2 and Class 3 ( Figure 2b). This is consistent with the global methylation difference in which Class 1 showed a higher methylation level. However, there were more differentially highly methylated probes in Class 3 as compared with Class 2 (Figure 2b). Identification of subclasses and association with histological status. (a) Consensus clustering of DNA methylation reveals three distinct subclasses of cervical cancer. K is the number of clusters generated. The heat map is a visual representation of the consensus matrix, which is a matrix of sample pairs. Each matrix entry measures the proportion of times the pair's samples are clustered together across resampling iterations. In the heat map, values ranging from 0 (corresponding to two samples that are never clustered together) to 1 (corresponding to two samples that are always clustered together) are represented by white to dark blue, respectively. Samples in the matrix are ordered according to their cluster, resulting in a block-diagonal matrix. A dendrogram atop the heatmap is shown, which represents each cluster 1, 2, and 3, respectively; (b) bar plots show the number of patients in each subclass and their histological status.

Lowly Methylated Promoters in Class 3 Show Enrichment in the Binding of 5 TFs
Based on the methylome of each subclass, we then analyzed the differentially methylated probes between each of them ( Figure 2a). In contrast to Class 1, we observed there were more lowly methylated probes in Class 2 and Class 3 ( Figure 2b). This is consistent with the global methylation difference in which Class 1 showed a higher methylation level. However, there were more differentially highly methylated probes in Class 3 as compared with Class 2 (Figure 2b).  (c) The genomic region enrichment of differentially methylated probes between subclasses. We calculated the number of differentially methylated probes, and the number of all probes on the bead array located in each genomic region. Fisher's exact test was used to test the enrichment. Heat maps show the odds ratio and adjusted p-value for each category. "1_2_2LowerMethy" represents the lowly methylated probes in Class 2 in comparison with Class 1, "1_2_2HigherMethy" represents the highly methylated probes in Class 2 in comparison with Class 1, and so on.
By taking all the probes in the Infinium HumanMethylation 450 BeadChip array from TCGA methylation profiles, we performed genomic enrichment analysis for the differentially methylated probes (Figure 2c). Interestingly, the differentially methylated probes between CSCC and CA type mainly occurred in the intergenic and intron regions. Notably, even in the same CSCC type, we found the differentially methylated probes between Class 2 and Class 3 displayed opposite enrichment. The highly methylated probes in Class 3 mainly enriched in the intergenic and intron regions. However, the lowly methylated probes displayed strong enrichment in the promoter region (Odds ratio: 1.299963; q-value: 1.79 × 10 −124 ), followed by the intergenic and 5′ untranslated regions (UTR). Thus, different from other methylation changes that occurred in the intergenic and intron regions, the hypomethylation of the promoter and 5′ UTR regions may contribute to the separation of Class 3 from Class 2. In addition, these observations can also explain why we failed to identify some subgroups with promoter-probe clustering, since the methylation changes on the majority of probes mainly occurred in the intergenic and intron regions As the methylation of gene's promoter is usually linked to gene expression [21], we extracted probes located in the promoter region from all differentially methylated probes, and subsequently obtained their regulated target genes. On the other hand, the methylation level of promoters can affect TFs' binding, which could also regulate gene expression. Thus we then evaluated which TFs Probes with a q-value less than 0.01 are marked in red and the remaining in blue. Horizontal lines in red represent fold change = 1.05 and fold change = 0.95, respectively. Probes with a fold change > 1.05 and a q-value < 0.01 were selected as highly methylated, and a fold change < 0.95 and a q-value < 0.01 as lowly methylated. (b) Distribution of the number of highly and lowly methylated probes between each of the subclasses. (c) The genomic region enrichment of differentially methylated probes between subclasses. We calculated the number of differentially methylated probes, and the number of all probes on the bead array located in each genomic region. Fisher's exact test was used to test the enrichment. Heat maps show the odds ratio and adjusted p-value for each category. "1_2_2LowerMethy" represents the lowly methylated probes in Class 2 in comparison with Class 1, "1_2_2HigherMethy" represents the highly methylated probes in Class 2 in comparison with Class 1, and so on.
By taking all the probes in the Infinium HumanMethylation 450 BeadChip array from TCGA methylation profiles, we performed genomic enrichment analysis for the differentially methylated probes (Figure 2c). Interestingly, the differentially methylated probes between CSCC and CA type mainly occurred in the intergenic and intron regions. Notably, even in the same CSCC type, we found the differentially methylated probes between Class 2 and Class 3 displayed opposite enrichment. The highly methylated probes in Class 3 mainly enriched in the intergenic and intron regions. However, the lowly methylated probes displayed strong enrichment in the promoter region (Odds ratio: 1.299963; q-value: 1.79 × 10 −124 ), followed by the intergenic and 5 untranslated regions (UTR). Thus, different from other methylation changes that occurred in the intergenic and intron regions, the hypomethylation of the promoter and 5 UTR regions may contribute to the separation of Class 3 from Class 2. In addition, these observations can also explain why we failed to identify some subgroups with promoter-probe clustering, since the methylation changes on the majority of probes mainly occurred in the intergenic and intron regions As the methylation of gene's promoter is usually linked to gene expression [21], we extracted probes located in the promoter region from all differentially methylated probes, and subsequently obtained their regulated target genes. On the other hand, the methylation level of promoters can affect TFs' binding, which could also regulate gene expression. Thus we then evaluated which TFs could bind to the promoter region of these genes. Among the target gene sets of those differentially highly or lowly methylated promoters in each subclass, only the lowly methylated promoters in Class 3 showed significant enrichment in TF binding (Fisher's exact test, adjusted p-value < 0.05; Figure 3). This indicated that these TFs' binding promoter regions were lowly methylated in Class 3. In comparison with Class 1, there were only two TFs that were significantly enriched, NRSF (Neuron-Restrictive Silencer Factor) (Fisher's exact test, adjusted p-value = 2.9 × 10 −4 ) and one unknown TF (Fisher's exact test, adjusted p-value = 0.023). In contrast, five TFs showed enrichment in the lowly methylated promoter regions in Class 3 as compared with Class 2, especially NRSF (Fisher's exact test, adjusted p-value = 1.9 × 10 −9 ) and OCT1 (Fisher's exact test, adjusted p-value = 7.99 × 10 −4 ). Another unknown TF (Fisher's exact test, adjusted p-value = 0.0084) also significantly enriched. could bind to the promoter region of these genes. Among the target gene sets of those differentially highly or lowly methylated promoters in each subclass, only the lowly methylated promoters in Class 3 showed significant enrichment in TF binding (Fisher's exact test, adjusted p-value < 0.05; Figure 3). This indicated that these TFs' binding promoter regions were lowly methylated in Class 3.
In comparison with Class 1, there were only two TFs that were significantly enriched, NRSF (Neuron-Restrictive Silencer Factor) (Fisher's exact test, adjusted p-value = 2.9 × 10 −4 ) and one unknown TF (Fisher's exact test, adjusted p-value = 0.023). In contrast, five TFs showed enrichment in the lowly methylated promoter regions in Class 3 as compared with Class 2, especially NRSF (Fisher's exact test, adjusted p-value = 1.9 × 10 −9 ) and OCT1 (Fisher's exact test, adjusted p-value = 7.99 × 10 −4 ). Another unknown TF (Fisher's exact test, adjusted p-value = 0.0084) also significantly enriched. Figure 3. TFs binding enrichment around the lowly methylated promoters in Class 3. We found only these lowly methylated promoters in Class 3 showed significant TFs binding enrichment (Fisher's exact test, adjusted p-value < 0.05). Heat maps show the log2 fold enrichment and significance of each TF binding around these promoters. The high value of the log2 fold enrichment means the TF is highly enriched. The significance of enrichment is marked in red (q-value less than 0.05) in the right heat map. All TFs in the right side of the heat map were obtained from the Molecular Signatures Database (MSigDB). "1_3_3LowerMethy" represents the lowly methylated probes in Class 3 in comparison with Class 1; "2_3_3LowerMethy" represents the lowly methylated probes in Class 3 in comparison with Class 2. FC, fold change (see Methods).

Analysis of Differentially Expressed Genes between Subclasses and Their Correlation with Methylation
Using the RNA-seq data, we performed differential gene expression analysis between the subclasses (Figure 4a). In total, there were more lowly expressed genes in Class 2 as compared with Class 1, and more highly expressed genes in Class 3 as compared with Class 2 ( Figure S3a). Class 3 showed a comparable number of differentially expressed genes with Class 1. However, as we have shown above, Class 2 harbored more lowly methylated probes as compared with Class1, while Class 3 was more highly methylated than Class 2, which is inconsistent with the number of differentially expressed genes between them. By examining the overlap of genes between these differentially methylated genes and differentially expressed genes, we observed that the majority of differentially methylated genes did not lead to differential gene expression ( Figure S3b-d). However, the differentially expressed genes did show a correlation with the differentially methylated genes. As we observed, there were more lowly methylated genes with high expression, and more highly methylated genes with low expression.
For these differentially expressed genes, we calculated the mean beta values of their promoter probes and the results showed that the gene expression of these differentially expressed genes was negatively correlated with the methylation level of their promoters (Figure 4b). Thus, these results . TFs binding enrichment around the lowly methylated promoters in Class 3. We found only these lowly methylated promoters in Class 3 showed significant TFs binding enrichment (Fisher's exact test, adjusted p-value < 0.05). Heat maps show the log2 fold enrichment and significance of each TF binding around these promoters. The high value of the log2 fold enrichment means the TF is highly enriched. The significance of enrichment is marked in red (q-value less than 0.05) in the right heat map. All TFs in the right side of the heat map were obtained from the Molecular Signatures Database (MSigDB). "1_3_3LowerMethy" represents the lowly methylated probes in Class 3 in comparison with Class 1; "2_3_3LowerMethy" represents the lowly methylated probes in Class 3 in comparison with Class 2. FC, fold change (see Methods).

Analysis of Differentially Expressed Genes between Subclasses and Their Correlation with Methylation
Using the RNA-seq data, we performed differential gene expression analysis between the subclasses (Figure 4a). In total, there were more lowly expressed genes in Class 2 as compared with Class 1, and more highly expressed genes in Class 3 as compared with Class 2 ( Figure S3a). Class 3 showed a comparable number of differentially expressed genes with Class 1. However, as we have shown above, Class 2 harbored more lowly methylated probes as compared with Class1, while Class 3 was more highly methylated than Class 2, which is inconsistent with the number of differentially expressed genes between them. By examining the overlap of genes between these differentially methylated genes and differentially expressed genes, we observed that the majority of differentially methylated genes did not lead to differential gene expression ( Figure S3b-d). However, the differentially expressed genes did show a correlation with the differentially methylated genes. As we observed, there were more lowly methylated genes with high expression, and more highly methylated genes with low expression.
For these differentially expressed genes, we calculated the mean beta values of their promoter probes and the results showed that the gene expression of these differentially expressed genes was negatively correlated with the methylation level of their promoters (Figure 4b). Thus, these results indicated that the differential gene expression changes were regulated by the methylation levels of their promoters. indicated that the differential gene expression changes were regulated by the methylation levels of their promoters.

Immune-Related Genes are Highly Expressed in Class 3
For each set of differentially expressed genes between each subclass, gene function analysis using Database for Annotation, Visualization, and Integrated Discovery (DAVID) [22] showed that only those highly expressed genes in Class 3 were significantly enriched in the immune response system (q-value < 0.001; Figure 5a). Based on each class's methylation profiles, we then quantified (a) Significance of gene expression differences between each subclass. Each dot represents one gene. The x axis shows the gene expression difference by a log2 transformed fold change while the y axis shows significance by a −log 10 transformed p-value. Vertical lines in red represent log 2 FoldChange = −1 and log 2 FoldChange = 1, respectively. Horizontal line in red represents p-value = 0.001. We defined differentially expressed genes if their absolute values of log 2 FoldChange larger than 1 and p-value less than 0.001 (dot in red means high expression and blue means low expression). (b) Boxplot shows the distribution of the mean beta values of differentially expressed gene sets' promoters probes. The small black circle represents outlier. The black rod from top to bottom represents the maximum, upper quartile, median, lower quartile, and minimum value, respectively. The black dotted line represents the values between the intevals. p-value was calculated using a two-sided Student's t-test.

Immune-Related Genes are Highly Expressed in Class 3
For each set of differentially expressed genes between each subclass, gene function analysis using Database for Annotation, Visualization, and Integrated Discovery (DAVID) [22] showed that only those highly expressed genes in Class 3 were significantly enriched in the immune response system (q-value < 0.001; Figure 5a). Based on each class's methylation profiles, we then quantified the level of immune cells infiltration by calculating the immune cell fraction in each sample by EpiDISH (https://bioconductor.org/packages/release/bioc/html/EpiDISH.html) [23]. As shown in Figure 5b, the immune cell fraction in patients of Class 3 ranked the highest of all three groups, followed by that of Class 2 and Class 1. We also used the beta values of the top 30,000 most variable probes, and the differentially methylated probes between each subclass to calculate the immune cell fraction, and the results remained the same ( Figure S4). the level of immune cells infiltration by calculating the immune cell fraction in each sample by EpiDISH (https://bioconductor.org/packages/release/bioc/html/EpiDISH.html) [23]. As shown in Figure 5b, the immune cell fraction in patients of Class 3 ranked the highest of all three groups, followed by that of Class 2 and Class 1. We also used the beta values of the top 30,000 most variable probes, and the differentially methylated probes between each subclass to calculate the immune cell fraction, and the results remained the same ( Figure S4).
We above showed that five TFs were significantly enriched in the lowly methylated promoter regions in Class 3. Among those TFs' target genes, 28 were highly expressed in Class 3 ( Figure S5). In addition, we listed the expression of them in normal cervix tissue and cancer tissue (Table S1). We even observed their high expression in Class 3; some of them were also highly expressed in cervix tissue. Gene function analysis showed they were significantly enriched in the immune pathway (category: Kyoto Encyclopedia of Genes and Genomes (KEGG) PATHWAY, term: T cell receptor signaling pathway, q-value: 0.035; Fisher's exact test, p = 2.1 × 10 −5 ). This suggested that the hypomethylation of the promoters of these genes might recruit the five TFs' binding, leading to increased expression of the 28 target genes.  The difference in the immune cell fraction between different group was performed by two-sided Student's t-test. (c) Venn representation of overlaps among highly expressed genes in Class 3 as compared with Class 1 ("1_3_3HighExp"), highly expressed genes in Class 3 as compared with Class 2 ("2_3_3HighExp"), and all genes involved in immune pathways ("Immune_Genes") in (a).
We above showed that five TFs were significantly enriched in the lowly methylated promoter regions in Class 3. Among those TFs' target genes, 28 were highly expressed in Class 3 ( Figure S5). In addition, we listed the expression of them in normal cervix tissue and cancer tissue (Table S1). We even observed their high expression in Class 3; some of them were also highly expressed in cervix tissue. Gene function analysis showed they were significantly enriched in the immune pathway (category: Kyoto Encyclopedia of Genes and Genomes (KEGG) PATHWAY, term: T cell receptor signaling pathway, q-value: 0.035; Fisher's exact test, p = 2.1 × 10 −5 ). This suggested that the hypomethylation of the promoters of these genes might recruit the five TFs' binding, leading to increased expression of the 28 target genes.
Of those highly expressed genes in Class 3, a total of 117 genes were highly expressed as compared with both Class 1 and Class 2. Notably, among them, 57 genes were involved in the immune system (Figure 5c). This gene list represented one immune-related gene signature with high expression in Class 3. We next asked whether these genes had any prognostic relevance. Interestingly, of these 57 genes, survival analysis showed that patients with high expression of 24 genes exhibited a better overall survival ( Table 1). The proteins encoded by CD3E, CD247, and CD3D are components of the T cell receptor. Four chemokine receptors (CCR5, CXCR3, and CCR2), one interleukin-10 family (IL10RA), one interleukin-1 family (IL18RAP), and one chemokine (CXCL9) are involved in the cytokine-cytokine receptor interaction. In addition, we also checked the clinical relevance of the 24 genes from the Human Protein Atlas [24], and found high expression of them was also reported to exhibit better survival. Notably, four genes (CD3D, CD3E, CD7, and SELL) were reported to be associated with favorable prognostic value (Table 1). In addition, we observed that a total of 84 immune-related genes were highly expressed in Class 3 as compared with Class 2 (no immune genes were lowly expressed between these two classes) (Figure 5c). Thus we made a clinical comparison of the patients in CSCC based on the 84 genes' expression. Two groups were divided, and survival analysis showed the group with high expression of the 84 genes displayed better overall survival (logrank (Mantel-Cox) test, p = 0.01, Figure S6).

Correlation of 24 Immune-Related Genes' Expression with Predicted Neo-Epitope Burden
We above showed the 24 immune-related genes displayed potential prognostic characteristics since their high expression was associated with better survival. This indicates patients with a stronger immune system are more likely to live longer. On the other hand, the neo-epitopes, which could be derived from mutations in patients, if presented on major histocompatibility complex class I molecules (MHC-I), may render the tumor more susceptible to the immune system as they would be recognized as "nonself" neo-antigens. In our recent work, we demonstrated that a higher number of neo-epitopes in these same cervical cancers exhibited an association with better survival [25]. Using those predicted neo-epitopes from our previous work, we therefore investigated whether higher expression levels of the 24 genes were correlated with the number of neo-epitopes. Spearman correlation was calculated to assess the association of the 24 genes' expression with the number of neo-epitopes across patients. With the exception of ITK, PIK3CG, SELL, and BTK, we detected a significant positive correlation between the rest genes' expression and neo-epitope burden (Table 2). For comparison, we also randomly selected 24 genes across the genome to assess the association of their expression with the number of neo-epitopes. Furthermore, we repeated the random gene selection and the correlation significance calculation for 1000 times. Among each of the 1000 times' computing, we compared the difference in p-values between the 24 prognostic genes' and randomly selected genes', and calculated the false discovery rate. A false discovery rate of 0.005 indicated the significant correlation between high expression of those prognostic genes and the neo-epitope burden did not occur by chance. We also calculated the number of Spearman's rho among the 24 genes higher than the 24 randomly selected genes' in each sampling. It turned out that the majority of the 24 genes had a higher rho score against the random genes in the random sampling ( Figure S7).

Differentially Expressed TFs and Driver Genes between Each Subclass
In addition to the roles of TFs in the regulation of gene expression through their binding (or not) around promoter regions, the high or low expression of TFs can also influence gene expression. Thus we examined whether any TFs were differentially expressed between each subclass. Surprisingly, a total of 15 TFs were significantly differentially expressed in Class 2 and Class 3 as compared with Class 1 (Figure 6a). However, none of them were found to show differential expression between Class 2 and Class 3. Among them, the low expression of NFE2 and PITX2, and high expression of HLF on patients displayed better survival ( Figure S8).
Driver genes play important roles in cancer development. We next examined whether these three subclasses showed significant differential gene expression in driver genes. We obtained 138 driver genes from a previous report [26], and found a total of 15 driver genes were significantly differentially expressed between each subclass (Figure 6b). Among them, only two driver genes CARD11 and JAK3 showed significant differential expression between Class 2 and Class 3, both with high expression in Class 3. Interestingly, these two genes were also related to the immune response system. CARD11 is involved in both the T cell and B cell receptor signaling pathways. JAK3 is commonly expressed on T cells [27] and is also involved in the JAK (Janus kinases)/STAT (Signal Transducer and Activator of Transcription proteins) signaling pathway. Survival analysis showed that high expression of IKZF1, FGFR3, NFE2L2, and JAK3, while low expression of EGFR, exhibited better survival ( Figure S8).
In addition, we extracted the promoter probes for these differentially expressed TFs and driver genes and calculated the mean bata values in each subclass ( Figure S9). In general, the average methylation level in these genes' promoters was also negatively correlated with their gene expression.
around promoter regions, the high or low expression of TFs can also influence gene expression. Thus we examined whether any TFs were differentially expressed between each subclass. Surprisingly, a total of 15 TFs were significantly differentially expressed in Class 2 and Class 3 as compared with Class 1 (Figure 6a). However, none of them were found to show differential expression between Class 2 and Class 3. Among them, the low expression of NFE2 and PITX2, and high expression of HLF on patients displayed better survival ( Figure S8). Figure 6. Significantly expressed TFs and driver genes. Bar plots of log2 fold change in differentially expressed TFs (a) and driver genes (b) between each subclass. The bar in red means high expression and blue means low expression. Gene marked with an asterisk indicates that its high or low expression was associated with patient survival ( Figure S8). Figure 6. Significantly expressed TFs and driver genes. Bar plots of log 2 fold change in differentially expressed TFs (a) and driver genes (b) between each subclass. The bar in red means high expression and blue means low expression. Gene marked with an asterisk indicates that its high or low expression was associated with patient survival ( Figure S8).

Discussion
In this study, we applied unsupervised analysis to cervical cancer samples using methylation profiles to reveal new subclasses that correlated with histological status. Specifically, we revealed two subpopulations existing in the CSCC subtype. In addition to the difference in methylation level, these two subclasses also showed differences in TFs binding around the promoter regions and in gene expression. Gene function assessment revealed the two subclasses harbored major differences in the immune-related gene expression. The differences in the methylation level, together with the TFs binding around the promoters, might play roles in inducing and maintaining the different phenotypes. Our findings suggest high interpatient heterogeneity in cervical cancer, and are useful for cervical cancer classification and prediction of prognosis.
Additionally, integrative clustering based on multiple omics data could also be applied to decipher subpopulations among patients [28]. One recent study of 228 cervical cancers from TCGA integrated various data types, including copy number, DNA methylation, mRNA, and microRNA data, and also revealed the molecular heterogeneity of cervival carcinomas [29]. Interestingly, they also identified three clusters: two squamous clusters and an adenocarcinoma-rich cluster, which agreed with our findings. However, we obtained the histological associations directly from the clustering based on single methylation data. Unlike the immune gene expression differences between the two CSCC characterized in our study, they showed the two squamous clusters differed in the expression of keratin gene family members. This inconsistence might be due to the different sample size used. Moreover, we also observed the high immune gene expression CSCC subclass displayed high level of immune cell estimate, which further supported the existence of immune subtype in CSCC. They also performed unsupervised hierarchical clustering based on single DNA methylation data. Also, three clusters were identified: a small 'CpG island hypermethylated' (CpG island methylator phenotype (CIMP)-high) cluster, a CIMP-intermediate cluster and a CIMP-low cluster. By comparing with the integrative clusters, they found most of the patients in the adenocarcinoma cluster were CIMP-high, whereas the two squamous clusters contained a mixture of CIMP-intermediate and CIMP-low patients. In our study, we also showed the CA subclass Class 1 displayed a higher methylation level and harbored more differentially hypermethylated probes as compared with the two CSCC subclasses. Again, these results were also consistent. Furthermore, we revealed the hypomethylation of the promoter and 5 UTR regions may contribute to the separation of the two CSCC subclasses. Thus, even based on different sample size and clustering approaches, our study agreed with the main findings of the TCGA paper. However, our study provided more detailed molecular characterizations that have not been extensively explored in cervical cancer.
Some types of human papilloma virus (HPV) infection, especially HPV 16 and HPV 18, present the greatest risk factor for cervical cancer. HPV infection has been reported to be associated with the regulation of DNA methylation. For instance, the HPV 16 E7 oncoprotein is associated both in vitro and in vivo with the DNA methyltransferase DNMT1 and stimulates its enzymatic activity [30]. On the other hand, it has been suggested that HPV infection could alter the immune response in the pathogenesis of cervical cancer [31]. In the present study, we observed the methylation profiles divided the CSCC subtype into two separate subclasses that were different in immune-related gene expression. Among those immune response pathways, we observed the cytokine-cytokine receptor interaction pathway was highly expressed in Class 3. Interestingly, previous studies have shown that the HPV E6 and E7 proteins can directly interact with cytokines that are induced following infection [32][33][34]. Consequently, this results in the blockade of apoptosis and the continued acquirement of proliferation ability. In addition, we noticed STAT1 was highly expressed in Class 3. It is a key TF that regulates the interferon response which is also activated following viral infection. However, it has been reported that this activity could also be inhibited by HPV proteins [17]. It appears that HPV infection may induce the immune system response, but on the other hand, those HPV oncoproteins may act at several levels to interfere with this response. In this study, we only observed 15 patients had HPV infection data. Due to the lack of enough information in the TCGA regarding HPV infection, we were unable to examine the link between the subclasses and HPV infection status. Further detailed investigation into the molecular mechanisms is warranted.
Among those significantly expressed TFs and driver genes, we observed that the expression levels of eight genes were associated with patient survival. An early report suggested that NFE2 could play a role in megakaryocyte transformation [35], and that knockdown of NFE2-related factor 2 (NRF2) in cervical cancer could enhance the efficacy of anticancer drugs [36]. This suggests the ability of NFE2 to promote tumorigenesis. Here we showed that the low expression of NFE2 was associated with better survival, which revealed its similar role in cervical cancer. Furthermore, it was particularly significantly lowly expressed in Class 2 and highly expressed in Class 1. Increased expression of PITX2 has a critical function in ovarian cancer progression [37], while in our data, we observed it was highly expressed in Class 2 and lowly expressed in Class 1. Previous study demonstrated PITX2 serves as one promising predictive biomarker in esophageal squamous cell carcinoma prognosis [38]. We observed that low expression of PITX2 was associated with better survival, which also displayed its prognostic characteristic in cervical cancer. HLF is one hypoxia response regulator, and its transcriptional role varies among tumor types [39]. In our data, it was highly expressed in Class 2 and lowly expressed in Class 1, and its high expression displayed better survival. IKZF1, one critical regulator of lymphocyte development [40], was highly expressed in Class 3 while lowly expressed in Class 1. This is also one immune related gene and its high expression was also associated with better survival. Our observation that the high expression of EGFR predicted poor survival in cervical cancer has been confirmed in a previous report [41]. More precisely, we reported EGFR was extremely highly expressed in the CSCC subclass Class 3 and lowly expressed in the CA subclass Class 1. The association of the high expression of FGFR3 with better survival in cervical cancer was also confirmed by a recent study [42]. Here, we observed its high expression in the CSCC subtype (both Class 2 and Class 3) and also its association with better survival. NFE2L2, previously identified as a recurrently mutated gene in cervical cancer [31], was highly expressed in Class 2 and lowly expressed in Class 1. Also, its high expression was associated with better survival in our data. JAK3, one of two significantly highly expressed driver genes between the two CSCC subclasses, displayed better survival.
In addition to those eight genes displaying prognostic characteristics, we also identified 24 immune-related genes in Class 3 and their high expression was associated with better survival. Consistently, among these, a previous study using squamous cell cervical cancer samples demonstrated that the high expression of the T cell receptor component, CD3E, is correlated with improved patient survival [43]. From the Human Protein Atlas, it was shown that the high expression of CD3E was also associated with patients' long-term survival in other types of cancer, such as endometrial cancer, melanoma, head and neck cancer, and breast cancer. For other immune-related genes, it will be interesting to investigate their involvement in other cancer types. It should be noted that we divided patients into high/low expression group based on the fourth/second quantile value. Thus other kind of group division method should give different clinical significance. In this study, we also checked these genes' clinical relevance from the Human Protein Atlas where gene expression values from the 20th to 80th percentiles were used to group the patients. Consistently, the prognostic value of these genes remained the same as our examination. Specially, four genes (CD3D, CD3E, CD7, and SELL) were reported to be associated with favorable prognostic value in the Human Protein Atlas. It was noteworthy that a larger patient sample size should make the prognostic gene list more stable. For the immune-related genes, when the sample size is increased, there may be more genes included in the list, since other immune pathways would be identified as significantly enriched. More cervical cancer genomic data are expected in the future; thus we can stabilize the gene list, and a follow-up study for validation will also be feasible. Confirmation of those prognostic genes could represent biomarker signatures for each subclass, which will be helpful for large-scale classification and improvements in prognosis prediction. Nevertheless, in this study, we were still able to achieve reliable gene lists, paving the way for future exploration.

Summary of Samples
We downloaded methylation data for a total of 307 cervical cancer samples (Illumina Infinium Human DNA Methylation 450 platform, beta values) and clinical information for all patients from TCGA under Genomic Data Commons (GDC) (Bethesda, MD, USA). In total, 485,577 methylation probes were used to explore the DNA methylation profile on the genome scale. Beta values that ranging between 0 and 1 were used to represent the relative methylation level, which were measured as the ratio of the methylated probe intensity over all methylation probe intensities.

Consensus Clustering
We performed consensus clustering using the ConsensusClusterPlus [44] R package (R Core Team, Vienna, Austria). The top 30,000 most variable probes that showed the highest MAD across the beta values were selected for clustering. Alternatively, the top 20,000 and 40,000 probes were also chosen for clustering, respectively. In addition, we defined the promoter probe if the probe located in the region of 3000 bp around the transcription start site with 1500 bp upstream and 1500 bp downstream. The top 10,000, 20,000, and 30,000 most variable promoter probes were also used to perform hierarchical clusterings. The following settings were used in the consensus clustering: Number of resamplings: 1000; pItem = 0.9 (resampling frequency samples); pFeature = 0.9 (resampling frequency); Pearson distance metric; Ward linkage clustering method. We analyzed consensus matrices for the number of clusters k from 2 to 6 and found the most robust result with a 3-cluster solution.
Based on the methylome of each subclass, we calculated the mean beta values for all probes in each subclass. The two-sided Student's t-test was applied to compare the global methylation difference between each subclass. A p-value less than 0.001 was considered to indicate significance.

Differential Methylation Analysis
We performed differentially methylation analysis for each probe based on the beta value [45] using the Samr [46] package in the R software (R Core Team, Vienna, Austria). The significance of differentially methylated probes between each subclass was performed. Probes with a fold change > 1.05 and a q-value < 0.01 were selected as highly methylated ones, and those with a fold change < 0.95 and a q-value < 0.01 as lowly methylated.

Genomic Region Enrichment
The genomic region, including intergenetic regions, the 5 UTR, whole exon regions, whole intron regions, and 3 UTR were obtained from the University of California, Santa Cruz (UCSC) genome browser [47]. The promoter region was defined as 3000 bp around the transcription start site with 1500 bp upstream and 1500 bp downstream. The probe was taken as being located in each genomic region when its location overlapped with the corresponding region. We calculated the number of differentially methylated probes located in each genomic region. All the probes in the Infinium HumanMethylation 450 BeadChip array from TCGA methylation profiles were extracted, and the number of these probes located in each genomic region was also calculated. Fisher's exact test was used to test the enrichment. Odds ratio and p-value were obtained, and the p-value was adjusted using the Benjamini and Hochberg method.

TFs Binding Enrichment and Target Gene Function
We extracted the differentially methylated promoter probes by checking whether those differentially methylated probes were located in the promoter regions. We then defined the highly or lowly methylated genes by way of evaluating whether their promoter contained differentially highly or lowly methylated probes (as described above). We excluded genes when their promoter harbored both highly and lowly methylated probes.
Thus, for comparison of each pair of subclasses, there were two sets of differentially methylated genes: highly methylated and lowly methylated genes. Next, we examined whether these differentially methylated genes' promoters enriched in any TFs binding. We downloaded a total of 283 TFs' target genes from MSigDB. For each set of differentially methylated genes c, we defined N as the total number of nonredundant target genes in MsigDB, n c as the number of differentially methylated genes, K s as the number of target genes for each TF s, and k sc as the number of differentially methylated genes that was found in each TF's target genes. Fisher's exact test was then performed to test whether each TF's target genes were significantly enriched in those differentially methylated genes. The fold enrichment was defined as log 2 ([(k sc + 1)/(K s + 1)]/[(n c + 1)/(N + 1)]), similar to one previous approach [48]. The adjusted p-value for each TF was calculated using the Benjamini and Hochberg method. The significantly enriched TFs binding was considered if the adjusted p-value less than 0.05.
For those significantly enriched TFs' target genes with high expression in Class 3, we used DAVID [22] to perform gene function analysis. A q-value less than 0.05 was considered to indicate statistical significance.

Analysis of Differentially Expressed Genes and Gene Function
RNA sequencing raw reads count data were downloaded from IlluminaHiSeq_RNASeqV2 (Level 3) in TCGA under GDC. The DESeq [49] package in the R software (R Core Team, Vienna, Austria). was applied to the identification of differentially expressed genes between each subclass. The p-value was adjusted using the Benjamini and Hochberg method. We defined genes as differentially expressed when their absolute log 2 FoldChange was larger than 1 and the adjusted p-value was less than 0.001.
For each differentially expressed gene set in each subclass pair comparison, we extracted the probes of the promoter, and calculated the mean beta values of the probes in each subclass. A two-sided Student's t-test was used to compare the differences in the methylation levels of these promoters between each pair of two subclasses. A p-value less than 0.001 was considered to indicate significance.
Gene function analysis of these differentially expressed genes was performed using DAVID. The significantly enriched pathway was considered if the q-value less than 0.001.

Immune Cell Fraction Calculation
Based on DNA methylation data, immune cell fraction was predicted by EpiDISH algorithm [23]. In general, the value represented the level of immune cells infiltration in tumor. We also performed immune cell fraction calculation based on the beta values of the top 30,000 most variable probes, and the differentially methylated probes between each subclass.

Correlation Analysis of Gene Expression with Predicted Neo-Epitope Burden
Using the exome-sequencing data of the same cervical cancer patients from TCGA, we have previously predicted the neo-epitopes based on somatic mutations [25]. In general, the neo-epitopes were obtained if the mutant peptides showed strong binding affinity with MHC-I. Based on these results, we calculated the number of neo-epitopes in each patient. The normalized read counts obtained from IlluminaHiSeq_RNASeqV2 (Level 3) in TCGA were taken as the gene expression values. For each of the 24 prognostic genes in our study, Spearman correlation was calculated to assess the association of gene expression with the number of neo-epitopes across patients. A two-tailed p-value of less than 0.05 was considered to indicate statistical significance. We randomly sampled the same number of genes from all the genes in the human genome without replacement. For each of those randomly selected genes, we also computed the association of gene expression with number of neo-epitopes. We performed 1000 such random samplings and calculations. In each sampling, we compared the p-values of those 24 prognostic genes and the randomly sampled genes using a two-sided Student's t-test. If the mean of those 24 prognostic genes' p-values was less than the mean of the random genes', and the significance of difference satisfied a p-value less than 0.001, we considered the correlation of those 24 prognostic genes' expression with the neo-epitope burdens was significant in that sampling. A false discovery rate was then calculated based on those 1000 times' comparison.

Survival Analysis
Survival curves were generated using the Kaplan-Meier method. For each gene, a patient was classified as high expression when the expression value was above the fourth quantile, and low expression when below the second quantile. Differences were evaluated using the logrank (Mantel-Cox) test. Overall survival was calculated from the time of initial diagnosis to death or censored to the time at which the patient was last known to be alive. A p-value less than 0.1 was considered statistically significant. Hazard ratios and associated 95% confidence intervals were calculated with the use of the Cox proportional-hazards model. All tests were two-sided and all calculations were performed with the R Version 3.3.1 statistical software (R Core Team, Vienna, Austria).

Conclusions
In conclusion, the present study investigated the methylation data obtained from TCGA to revisit the classification of cervical cancer subtypes, and identified two new subclasses in the CSCC subtype. By an integrative analysis of gene expression data, our results revealed major differences in immune-related gene expression among these two subclasses. Our results provide important insight into interpatient heterogeneity among cervical cancer, which improves our ability to classify these tumors and contributes to prognostic and diagnostic use in clinics.