Integrative analysis of DNA methylation and gene expression through machine learning identifies stomach cancer diagnostic and prognostic biomarkers

Abstract DNA methylation is an early event in tumorigenesis. Here, by integrative analysis of DNA methylation and gene expression and utilizing machine learning approaches, we introduced potential diagnostic and prognostic methylation signatures for stomach cancer. Differentially‐methylated positions (DMPs) and differentially‐expressed genes (DEGs) were identified using The Cancer Genome Atlas (TCGA) stomach adenocarcinoma (STAD) data. A total of 256 DMPs consisting of 140 and 116 hyper‐ and hypomethylated positions were identified between 443 tumour and 27 nontumour STAD samples. Gene expression analysis revealed a total of 2821 DEGs with 1247 upregulated and 1574 downregulated genes. By analysing the impact of cis and trans regulation of methylation on gene expression, a dominant negative correlation between methylation and expression was observed, while for trans regulation, in hypermethylated and hypomethylated genes, there was mainly a negative and positive correlation with gene expression, respectively. To find diagnostic biomarkers, we used 28 hypermethylated probes locating in the promoter of 27 downregulated genes. By implementing a feature selection approach, eight probes were selected and then used to build a support vector machine diagnostic model, which had an area under the curve of 0.99 and 0.97 in the training and validation (GSE30601 with 203 tumour and 94 nontumour samples) cohorts, respectively. Using 412 TCGA‐STAD samples with both methylation and clinical data, we also identified four prognostic probes by implementing univariate and multivariate Cox regression analysis. In summary, our study introduced potential diagnostic and prognostic biomarkers for STAD, which demands further validation.


| INTRODUC TI ON
Stomach cancer is the 5th most frequently diagnosed, the 7th most prevalent cancer and the third cause of cancer-related deaths, worldwide. 1,2 The majority of stomach cancer patients are diagnosed at an advanced stage, resulting in a poor prognosis. 3 Early diagnosis of this cancer is very challenging, because of its rare initial symptoms. 4 Although some biomarkers have been proposed for detecting the early stage of stomach cancer or its progression, there is still a significant gap in identifying biomarkers with high sensitivity and specificity for effective diagnosis and prognosis of this cancer. 3 DNA methylation is an epigenetic event that involves adding a methyl group to 5th carbon of cytosine to form 5-methylcytosine. 5 DNA methylation is involved in a variety of biological processes, including gene expression regulation, 6 alternative splicing, 7 genomic imprinting, 8 cell differentiation, 9 development 10 and inflammation. 11 Methylation abnormalities have been linked to a variety of diseases, including cancer. 12  We furthermore introduced four methylation probes as prognostic biomarkers for STAD and checked their performance for predicting high-and low-risk patients independently from clinicopathological features.

| Data sources and preparation
DNA methylation and gene expression profiles for STAD were downloaded from TCGA database using TCGAbiolinks R package. 15,16 The training set was the methylation data that consisted of a dataset including 48 STAD tumour and 25 nontumour samples assayed using Illumina HumanMethylation27 BeadChip platform (27 K array) and a dataset of 395 tumour and 2 nontumour STAD samples assayed using Illumina HumanMethylation450 BeadChip platform (450 K array). Although using 450 K arrays would increase the number of methylation probes, TCGA-STAD has only 2 nontumour samples profiled with 450 K array. By combining two different array platforms (27 K and 450 K arrays), a number of nontumour samples will be increased to 27, which is more rational in statistical analyses. So, we analysed the 25,978 CpG sites, which were common between 450 K and 27 K platforms. The methylation level of CpGs was described as beta values (β), which are the ratio of the intensities between methylated and total signals, ranging from 0 to 1. Annotations for CpG probes were retrieved using a R package IlluminaHumanMethylation450kanno.ilmn12.hg19 according to human reference genome assembly GRCh37 (hg19). Level 3 Illumina HiSeq_RNASeqV2 data of 450 samples (415 STAD and 35 nontumour samples) were downloaded as a source of transcriptome data.
Moreover, an independent DNA methylation dataset, i.e., GSE30601 was collected from the GEO as a validation set. This dataset consisted of 203 gastric tumours and 94 matched gastric nonmalignant samples, which was profiled using an Illumina HumanMethylation27 BeadChip platform (GPL8490).

| Differential analysis of gene expression
We used TCGA-STAD transcriptome data for the identification of differentially-expressed genes (DEGs) utilizing DESeq2 R package. 19 After normalizing the data, those genes satisfied the threshold of |log 2 fold change (FC)| > 1.5 and adjusted p-values < 0.05 were considered to be statistically significant differentially-expressed.

| Correlation analysis between DNA methylation and gene expression
In this study, we examined the impact of DNA methylation on both local (cis) and distant (trans) regulation of gene expression. For cisregulation, the Pearson correlation was calculated between β values of those CpGs, which resided in the promoter regions and normalized expression values of their corresponding genes. The criteria to screen out the significant correlations were |correlation coeffi-cient| > 0.3 and adjusted p-values < 0.05. Since there can be multiple CpGs in one promoter, the relationship between each CpG-gene pair was independently evaluated in Pearson correlation calculations.
In distant regulation, the Pearson correlation between β values of differentially-expressed and -methylated genes and normalized expression of differentially-expressed genes were explored.

| Prognostic biomarker selection
At first, those methylation probes with negative correlation with normalized expression values of genes were selected for subsequent analysis (correlation coefficient < −0.3 and adjusted p-values < 0.05).
As the negative correlation between methylation and gene expression is much well-understood, 24 so just those methylation probes with negative correlation with normalized expression values of genes were used in the analyses. 412 TCGA-STAD samples with both methylation and clinical data were used for prognosis analysis.
They were randomly divided into 60% training and 40% validation sets. Univariate Cox proportional hazard regression analysis was performed to select those methylation probes, which were significantly associated with the overall survival (OS) of patients with p-values < 0.01. Selected CpG probes were then subjected to multivariate Cox stepwise regression analysis using a MASS R package.
In the multivariate analysis, both forward and backward directions were used and the selection was based on the Akaike information criterion (AIC). 25 A risk score formula was then constructed based on a linear combination of methylation levels of probes multiplied by their coefficients in multivariate Cox regression analysis, as follows 26,27 : Where: • k is the number of candidate probes • i is the coefficient index of candidate probe • S i is the methylation level of candidate probe Patients were then divided into high-and low-risk groups based on the median of calculated risk score and their OS was compared by applying Kaplan-Meier analysis in an R-based survminer package.

| Gene set enrichment analysis
Gene set enrichment analysis (GSEA) 28 was performed for the highrisk and low-risk groups in order to explore the biological pathways of CpG biomarkers. The annotated c2.cp.kegg.v7.5.1.symbols.
gmt gene set was regarded as the gene set database. The criteria to screen out significant GSEA results were false discovery rate (FDR) q-value < 0.05, p-value < 0.01, normalized enrichment score (NES) > 1 and enrichment score (ES) > 0.6.

| Statistical analysis
The calculations were performed in R software version 4.1.2.
Machine learning sections were implemented in Python version 3.9.7 on anaconda 4.10.3. All p-values were adjusted with Benjamini/ Hochberg method.

| Identification of STAD DMPs and DEGs
The study procedures for the identification of STAD diagnostic biomarkers are presented in Figure 1. By performing preprocessing steps, from 48,5577 probes in the primary data, 25,978 of them were identified to be common between 27 K and 450 K platforms and re- and gene body. Hypermethylated probes were mainly observed in TSS1500 while hypomethylated sites were predominantly located in the 5′UTR and first exon ( Figure 2C). In order to find differentiallyexpressed genes, we next analysed gene expression data of 415 tumour and 35 nontumour TCGA-STAD samples. Analysis of these data revealed a total of 2821 DEGs (|log 2 FC| > 1.5 and adjusted pvalues < 0.05), which were divided into 1247 upregulated and 1574 downregulated genes ( Figure 2D).

| Identification of relevant DNA methylation changes associated with gene expression
We next investigated the intersection between differentiallymethylated genes (DMGs) and DEGs. Genes were considered to be DMGs, if there was at least one differentially-methylated probe in their promoter region, including TSS1500, TSS200, 5′UTR and first exon. The common genes were classified into four groups named: hypermethylated-upregulated (hyper-up), hypermethylateddownregulated (hyper-down), hypomethylated-upregulated (hypoup) and hypomethylated-downregulated (hypo-down) ( Figure 3A and Table S1). Among 142 hypermethylated genes, 27 and 7 of them showed a downregulation and upregulation pattern, respectively, and from 96 hypomethylated genes, 10 and 8 of them were down-and upregulated, respectively ( Figure 3B). Because aberrant hypermethylation of promoter is a critical epigenetic phenomenon in cancer and it impacts important genes and mechanisms related to cancer pathogenesis, like inactivation of tumour suppressor genes, 29,30 in this study we just included those genes with simultaneous hypermethylated and downregulated pattern for subsequent analysis and biomarker identification.
Correlation analysis is a common approach to study the relationship between DNA methylation and gene expression. 31,32 Here, we investigated the impact of DNA methylation on both local (cis) and distant (trans) regulation of gene expression. In cisregulation, the main goal was to explore the impact of promoter methylation of one gene on its own expression. In total, 18,459 F I G U R E 1 The workflow for identification of STAD diagnostic biomarkers. Hypermethylated probes locating in the promoter region of downregulated genes were used for biomarker selection. After removing correlated features, eight final methylation probe candidates were identified using RFECV. SMOTETomek method was applied to address the imbalance issue of training set, which increased the number of nontumour samples to 443. Different algorithms were compared based on their AUCs and SVM was selected as the best final model. SVM's hyperparameters were tuned and its performance was evaluated in an external validation set. AUC, area under the curve; DEGs, differentiallyexpressed genes; DMGs, differentiallymethylated genes; DMPs, differentiallymethylated positions (probes); RFECV, recursive feature elimination with cross-validation; SMOTETomek, synthetic minority oversampling with Tomek link; STAD, stomach adenocarcinoma; SVM, support vector machine.
CpG-gene pairs were analysed in which 207 among them showed a significant positive correlation, 1413 showed a significant negative correlation, and as multiple CpGs can be found in the promoter of one gene, 13 of genes showed both significant positive and negative correlations. A dominant negative correlation was also observed in correlation analysis between DEGs, DMGs and genes with simultaneous differential expression and methylation and CpGs ( Figure 3C).
In order to assess the association between the expression of a gene and the promoter methylation status of other genes, a Pearson correlation analysis was conducted between 51 CpG probes associated with genes showing simultaneous differential expression and methylation and 2821 DEGs. Results revealed that hypermethylated genes had predominantly a negative correlation with gene expression while hypomethylated genes were more likely to be positively correlated with the expression of DEGs ( Figure 3D).

| Identification of diagnostic biomarkers for STAD
In order to find diagnostic biomarkers for STAD, we started with 28 hypermethylated probes locating in the promoter region of 27 downregulated genes. Hierarchical clustering of these 28 probes from both TCGA and GSE30601 methylation data, showed that tumour and nontumour samples could be well-separated from each other based on these probes ( Figure 4A). Of these 28 candidate probes, we selected those which had the Pearson correlation coefficient of less than 0.7. This correlation-based filtration resulted in removing 11 probes out of those first 28 candidate probes. The recursive feature elimination with the 10-fold cross-validation (RFECV) method was then used to further filter out and evaluate the discriminative power of those 17 remaining candidate probes. RFECV selected 8 probes as final candidate probes ( Figure 4B). Table 1 shows the features of these 8 final probes.

F I G U R E 2
DMPs and DEGs analysis. Volcano plot for DMPs. DMPs were found with the criteria of |∆β| > 0.25 and adjusted pvalue < 0.05. Red and blue dots are hypermethyalted and hypomethylated DMPs, respectively, and the gray ones were not significant according to the above-mentioned criteria (A). Distribution of DMPs in different genomic regions defined by the distance from CGI. Red and dark blue parts stand for hypermethylated and hypomethylated DMPs, respectively (B). Distribution of DMPs in different genomic regions defined by the distance from TSS. Red and dark blue parts represent for hypermethylated and hypomethylated DMPs, respectively (C). Volcano plot for DEGs. DEGs were found with the criteria of |log 2 FC| > 1.5 and adjusted p-value < 0.05. Red and blue dots represent upregulated and downregulated DEGs, respectively, and the grey ones were not significant according to the defined criteria (D). 1stExon, the first exon; 5′UTR, 5′ untranslated region; adj-p-value, Benjamini/Hochberg adjusted p-value; CGI, CpG island; DEGs, differentially-expressed genes; DMPs, differentially-methylated positions (probes); Island, CpG island; N_Shelf, North Shelf (2-4 kb upstream of CGI); N_Shore, North Shore (0-2 kb upstream of CGI); Open Sea, further than 4 kb from CGI; S_Shelf, South Shelf (2-4 kb downstream of CGI); S_Shore, South Shore (0-2 kb downstream of CGI); TSS, transcription start site; TSS1500, 200-1500 nucleotides upstream of TSS); TSS200, 0-200 nucleotides upstream of TSS. F I G U R E 3 Integrative analysis of gene expression and DNA methylation. Scatterplot of mean methylation differences versus log 2 expression fold change. Each point belongs to a specific CpG-gene pair. Based on the mean methylation differences and log 2 FCs, genes were classified into four groups including: hypomethylated-downregulated (yellow), hypomethylated-upregulated (blue), not significant (gray), hypermethylated-downregulated (red) and hypermethylated-upregulated (green) (A). Venn diagram showing the intersection between hypermethylated genes and DEGs (top) and hypomethylated genes and DEGs (bottom) (B). Local regulation of a gene's expression by its own promoter methylation. Pearson correlation was investigated for all genes, DEGs, DMGs and differentially-expressed and -methylated genes. Those relationships with |correlation coefficient| > 0.3 and adjusted p-value < 0.05 were considered as significant. In all cases a dominant negative correlation was observed (C) Trans regulation of gene expression by promoter methylation. Pearson correlation was calculated between 51 differentially-expressed and -methylated genes and 2821 DEGs. Hypermethylated and hypomethylated genes showed a predominantly negative and positive correlation with gene expression of DEGs (D). DE DM genes, differentially-expressed and -methylated genes; DE genes, differentially-expressed genes; DEGs, differentially-expressed genes; DM genes, differentially-methylated genes; DMGs, differentially-methylated genes.

| Model construction and evaluation
were the selected SVM hyperparameters for the final model, based on RandomizedSearchCV method in python. 33 Model performance was evaluated on the training set with 10-fold cross-validation. In 10-fold cross-validation, data is randomly divided into 10 partitions. Each time, nine partitions will be used for training the model and one for checking its performance. Based on this approach, the AUC of the model was 0.99, which shows its great ability to distinguish STAD tumour and nontumour samples ( Figure 5B). To investigate the reproducibility of the results, we furthermore checked the model's performance on an external validation set (GSE30601) with 10-fold cross-validation ( Figure 5C). Results of this analysis were consistent with the training set and the AUC of the model on validation set was 0.97 showing that these 8 final candidate probes and the constructed SVM model have the ability to split STAD samples into tumour and nontumour ones. Besides AUC, other performance metrics including F1-score, precision, specificity, sensitivity (recall) and accuracy are also provided in Table 2. To make sure that these potential biomarkers are suitable for the early detection of STAD, we compared their methylation levels between different tumour stages and nontumour samples. All of them were significantly hypermethylated in stage I tumours compared with nontumour samples ( Figure 5D). Figure 6A shows the workflow of the current study to find STAD prognostic biomarkers. A total of 1605 methylation probes were screened out for the next steps, based on the criteria previously mentioned. TCGA-STAD methylation data were partitioned into training and validation cohorts at a ratio of 0.6:0.4 (248 training and 164 validation samples). Univariate Cox hazard regression analysis was implemented on training data, which showed 40 probes were significantly (p-value < 0.01) associated with OS of TCGA-STAD patients. These 40 probes were used as inputs for multivariate Cox stepwise regression, and finally, four probes were selected to be in the final model ( Table 3). The risk scoring formula for the prognostic model is presented as Equation 1:

| Identification of prognostic biomarkers in STAD
According to the risk score formula, hypermethylation of all these probes except cg25932713 was associated with longer OS in TCGA-STAD patients. At the same time, we divided the training cohort

| Assessing the prognostic potential of four candidate probes in the training and validation sets
To assess the discriminative power and accuracy of these four prognostic biomarkers, first Kaplan-Meier plots were generated for training and validation sets and patients were classified as low-and high-risk groups according to the median of the calculated risk score ( Figure 7A).
In both training and validation cohorts, low-risk patients had longer OS than the high-risk group and this difference was statistically significant (p-values < 0.01). We furthermore evaluated the performance of the combined model via the generation of time-dependent ROC curves ( Figure 7B). The AUC values at different times (1, 3 and 5 years follow-ups), for training and validation sets were greater than 0.67 and 0.57, respectively.

| Independence of the STAD prognostic model in OS prediction
In order to evaluate the independence of four introduced prognostic biomarkers from clinicopathological features like age, gender and American Joint Committee on Cancer (AJCC) stage (according to versions 6 and 7, provided in TCGA clinical data), we stratified patients based on these characteristics. Moreover, for assessing the independence from the stage, patients were grouped as an early stage if they had stage I or II and late stage if they had stage III or IV. For all these clinical features, Kaplan-Meier analysis showed a significant extension in overall survival between high-and low-risk groups, and AUC values of time-dependent ROC curve analysis also showed a distinction between these two groups as well ( Figure S1).

| GSEA analysis
In order to evaluate the mechanisms by which these four introduced prognostic biomarkers working throw, we conducted GESA analysis on the samples, which were divided into low-and high-risk groups based on the median of the risk score. We found that pathways like the extracellular matrix (ECM) receptor interaction, glycosaminoglycan biosynthesis chondroitin sulfate and glycosphingolipid biosynthesis ganglio-series were significantly correlated with gene expression data of the high-risk group ( Figure S2).

| DISCUSS ION
In the present study, we systematically analysed DNA methylation and gene expression data of STAD from TCGA cohort. We first identified differentially-methylated and -expressed genes and then examined the correlation between DNA methylation and gene expression. We observed that there was a general negative correlation between methylation and gene expression in the context of cis-regulation. Abnormal increased DNA methylation in normally unmethylated gene promoter CpG islands and associated gene silencing are the most well-characterized epigenetic alterations in cancer. 29,34 Our findings were consistent with these general characteristics of tumour DNA methylation pattern. and hepatocellular carcinoma 32 in the distant (trans) regulation of gene expression. In agreement with these studies, 31,32 we also observed that in the trans regulation, hypermethylated genes mainly had a negative correlation with gene expression, but on the contrary, hypomethylated genes were more likely to be positively correlated with gene expression. The exact mechanism behind the distant impact of promoter methylation on gene expression is not yet fully elucidated. 31 To find diagnostic biomarkers for STAD, we selected hypermethylated probes locating in the promoter region of downregulated genes for further analysis. We then removed highly correlated features, as having correlated features can make it difficult for an algorithm to generalize and this can result in biomarker redundancy. 35 RFECV method resulted in the selection of eight probes with potential diagnostic abilities. For model construction, TCGA-STAD methylation data (including 443 tumour and 27 nontumour samples) was used as the training set. Imbalanced data sets cause several challenges in the process of machine learning including making the model to be biased towards the majority class or even completely ignoring the minor class. 36   Hypermethylation of BMP3 promoter has been reported in GC. 43 Analysis of the RNA sequencing data of the TCGA-STAD cohort showed that the expression of BMP3 in gastric cancer tissues is significantly decreased compared with normal tissue and its higher expression in the primary GC tumours is associated with poorer overall survival. BMP3 is negatively correlated with cell proliferation and cell cycle-promoting factors suggesting its inhibitory role in the proliferation of GC cells. 44 For the three remaining probes (genes), there are not much evidence about their role in GC. However, their involvement in cancer pathogenesis has been reported in other neoplasms. [45][46][47] Four probes (cg25932713, cg25239996, cg26556719 and cg02968557) were introduced as potential STAD prognostic biomarkers by implementing univariate and multivariate Cox regression analysis. cg25932713 is mapped to two DNA repair-associated genes named PARP9 and DTX3L, which were among the 10 most significantly downregulated genes in CDH1 mutated diffuse GC. 48 Our Glycosphingolipids are a subtype of glycolipids, which are the major components of surface membranes in all animal cells. They are classified into three groups named ganglio-series, lacto-series and globo-series based on the type of their oligosaccharide chains.
Their abnormal expression has been observed in different cancer types including gastric cancer. 53 It should be noted that the current study has a number of limitations. First, the majority of the population in the TCGA database is Latino and Caucasian, hence evidence supporting the generalization of the present study's findings to other ethnic groups is required.
Second, the prognostic analyses of this study were based on a retrospective cohort, so a prospective multicenter validation is needed before using the introduced prognostic probes as biomarkers. Third, additional in vivo and in vitro experiments are required to clarify the molecular mechanisms behind the aetiology of STAD with respect to those eight diagnostic and four prognostic methylation probes, introduced in the current study.
In summary, DNA methylation plays crucial roles in cancer development. In the current study, utilizing machine learning approaches on TCGA-STAD data, we identified a set of eight methylation probes, being differentially-methylated in tumour compared with nontumour tissues, which can be used in future for the early detection of STAD. We furthermore found four prognostic probes including writing -review and editing (lead).

CO N FLI C T O F I NTER E S T S TATEM ENT
The authors confirm that there are no conflicts of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
The codes used to perform the analysis are available in the Mendeley