Data-Driven-Based Approach to Identifying Differentially Methylated Regions Using Modified 1D Ising Model

Background DNA methylation is essential for regulating gene expression, and the changes of DNA methylation status are commonly discovered in disease. Therefore, identification of differentially methylation patterns, especially differentially methylated regions (DMRs), in two different groups is important for understanding the mechanism of complex diseases. Few tools exist for DMR identification through considering features of methylation data, but there is no comprehensive integration of the characteristics of DNA methylation data in current methods. Results Accounting for the characteristics of methylation data, such as the correlation characteristics of neighboring CpG sites and the high heterogeneity of DNA methylation data, we propose a data-driven approach for DMR identification through evaluating the energy of single site using modified 1D Ising model. Applied to both simulated and publicly available datasets, our approach is compared with other popular methods in terms of performance. Simulated results show that our method is more sensitive than competing methods. Applied to the real data, our method can identify more common DMRs than DMRcate, ProbeLasso, and Wang's methods with a high overlapping ratio. Also, the necessity of integrating the heterogeneity and correlation characteristics in identifying DMR is shown through comparing results with only considering mean or variance signals and without considering relationship of neighboring CpG sites, respectively. Through analyzing the number of DMRs identified in real data located in different genomic regions, we find that about 90% DMRs are located in CGI which always regulates the expression of genes. It may help us understand the functional effect of DNA methylation on disease.


Introduction
DNA methylation is an important epigenetic modification which plays an essential role in gene expression [1,2] and cancers [3][4][5]. Aberrant methylation status, such as hypermethylation in promoter, often leads to gene silencing. It is an important mechanism of antioncogene inactivation [6]. Global hypomethylation always leads to the emergence of cancers through affecting the stability of chromatin [7]. There are pieces of evidence showing that abnormal methylation patterns are related to many cancers and other diseases [8][9][10][11][12][13]. Also, some genomic regions have been found instable in methylation, which increases methylation variability in cancer and then causes cancer heterogeneity [14][15][16]. Therefore, identification of aberrant methylation patterns is important to understand the pathogenesis of diseases.
With the development of high-throughput technologies, there are two main technologies to quantify genomewide DNA methylation, bisulfite microarray, and sequencing which provide great opportunities for revealing the epigenetic mechanisms of diseases. Array technologies, Illumina Infinium HumanMethylation 27 K and 450 K, are often used to study complex diseases owing to their low cost and highresolution ratio popularly. There are two designs of data form, -values and M-values, used in identifying aberrant methylated patterns. -value measures the proportion of methylated intensities out of total intensities, and M-value is calculated as the log2 ratio of the intensities of methylated 2 BioMed Research International probe versus unmethylated probe. The relationship between -value and M-value is shown as the following equation: It is shown that M-value may have more statistically valid for differential analysis of methylation [17]. Nowadays, various approaches have been proposed on the basis of microarray data to extract DNA methylation patterns, including differentially methylated loci [18][19][20][21][22][23][24][25][26] and differentially methylated region (DMR) [27][28][29][30][31][32][33][34][35][36]. Existing DMR detecting methods always consider some data characteristics to develop different assumptions. Bump hunter [27], DMRcate [28], and ProbeLasso [29] were developed through hypothesizing that the mean difference in methylation level of different groups is a primary cause in DMR identification; therefore, DMRs are identified through considering difference of mean signal between normal and cancer samples. Considering the heterogeneity of cancer samples [37], Wang et al. [30] developed an approach based on integrating mean and variance signal to identify DMRs. It is noteworthy that DMR methods integrating more information, such as mean and variance signals, always have better performance than those integrating less information, such as considering only mean signal [30]. Therefore, considering that additional characteristic, highly correlated neighboring CpG sites [38] in methylation data are rarely integrated in existing DMR methods; a data-driven method is developed based on integrating more information of data. Motivated by Ising model which describes matter phase transition considering the strong interaction among neighbor molecular, we consider DNA methylation in genome by 1D (one-dimension) Ising model which can integrate the effect of neighbor sites. For each site, the status depends on its differential significance (p-value) and that of their neighbors with correlation characteristic. Generally, if the status of the site is significant, the more the accordance of the neighbor sites with the site, the lower the energy. If the status of the site is nonsignificant and those of its neighbors are significant, there are strong correlation between the site and its neighbors; we think that the site may also have low energy by integrating all information. The reason is that methylation level of a site is affected by multiple factors expect for disease; the information of neighbor sites can amend the bias of the site caused by other confounders. DMRs are identified as regions with low energy.

Material and Methods
We develop a data-driven approach to detecting DMRs (see Figure 1) which considers the data characteristics, correlation of neighboring CpG sites, and the high heterogeneity.
Step 1 (calculate site-level energy). Motivated by the principle of 1D Ising model, we define the site-level energy as follows: where represents the energy of site in feature ; represents the correlation of sites and in normal samples; and describes the signal value of which represents the difference between tumor and normal samples in feature .
V in function (2) is used to describe whether the site is significantly different between two groups under the feature . Here, if p-value is less than 0.05, we believe that this site can provide energy to distinguish the two types of samples under this feature. Otherwise, no energy is provided. The smaller the p-value is, the more the energy it provides. Therefore, we use the negative log of V for energy when p-value is less than 0.05; otherwise, zero is used.
V is calculated using two paired t-tests and onesided Pitman-Morgan test to describe the mean and variance signals, respectively, which are often used in other works [30]. Considering high heterogeneity of methylation in cancers, we integrate the mean and variance signal to define sitelevel energy by parameter ; that is, ∈ { , V }. For mean and variance signals, the statuses are denoted by and V , respectively. Therefore, the site-level energy is denoted as follows: where represents the weight of mean signal to total signals. For each site, the lower the energy is, the more possible the difference site between case and control is.
Step 2 (identify candidate DMRs). To identify candidate DMRs, we define the total energy of region as follows: We use a greedy algorithm with the following steps to identify candidate DMRs. Considering the question of what conditions of a site are required to add candidate regions, we use a permuted method. First, we permute the sample labels times and calculate the permuted energy of each site using (3). Second, permuted energy of all sites is sequenced in ascending order and the value at 5% is selected as the threshold . Therefore, the greedy algorithm is executed in , (e 1 e 2 , · · · , e n ) …… Figure 1: Flowchart of the proposed approach.
Step 1, single site-level energy is calculated based on modified 1D Ising model.
Step 2, candidate DMRs are identified using a greedy algorithm.
Step 3, for each candidate DMR, the significance is assessed through permuting the sample labels.
the following steps: (1) Select the CpG site with the lowest energy as a seed site if its energy is smaller than , which is considered as the starting regions .
(2) Select the neighbor site of the current region with the lowest energy and add the site to if energy of the site is smaller than .
(3) Continue with the neighbors of and keep adding the site to until energy of the neighbor sites of is greater than or equal to . (4) Choose another site with a seed one in the remaining sites except for and repeat the above steps which can obtain another candidate region until the energy of any site of the rest is greater than .
Step 3 (assess significance of candidate DMRs). To assess significance of candidate DMRs, we need to calculate p-value for each candidate region. We make the null hypothesis that a candidate region is not a DMR. If p-value is less than a significance level, the null hypothesis is rejected. The identification of DMR is equivalent to determining whether a candidate region is associated with the sample label. Therefore, pvalue of a candidate region is calculated through permuting sample label. We complete the process of Step 2 for each permutation. For the -th permutation, we can obtain permuted DMRs. The energy of permuted DMR is denoted by , , = 1, . . . , . For each candidate DMR , the significance is measured by the p-value which is calculated as where ( ) is an indicator function and ( ) represents the numbers of CpG sites in region . To consider the multiple testing, we use a function p.adjust in R and compare the results of different parameters; the results obtained with Bonferroni were the most conservative. Therefore, the pvalues are multiplied by the number of comparisons using Bonferroni correction. The candidate DMR is considered significant if the p-value corrected by Bonferroni is smaller than 0.05.

Results and Discussion
In calculating the energy of each site, we hypothesized that a locus was associated with only two sites adjacent to its left and right for simplifying the correlation characteristic. To show the performance of the proposed method, we compare the method with bump hunting [27], DMRcate [28], ProbeLasso [29], and Wang's method [30] in simulation data. Applying to the real data, we compare the DMRs identified by our method and Wang's method [30] based on -value and Mvalue, respectively. Also, based on M-value, we compare our method with DMRcate [28] and ProbeLasso [29]. Finally, the necessity of integrating characteristics of methylation data is analyzed.

Simulation Study.
We generate simulation data referencing Wang's method [30] which consider the real characteristics of methylation data based on M-value. For case-control design, like Wang's method, we consider methylation measure following a conditional scaled normal distribution: where ∼ (1, 1), = 1 and = 0 represent tumor and matched-control samples, respectively; the vector = ( 1 , 2 , . . . , ℎ ) and Δ = (V 1 , V 2 , . . . , V ℎ ) represent mean and variance signals, respectively; an element Σ in matrix Σ ℎ×ℎ is × | − | which describes the correlation characteristic of neighboring sites and ; and ℎ is the number of consecutive sites in a defined cluster. In each simulation, we generate 100  tumor and control samples with 10000 methylation sites. The genomic positions of the 10000 sites are simulated by that of the first 10000 chromosome 1 on 450k array [30]. The clusters are obtained based on the genomic position of sites using R package "bump hunter". To show the performance of different methods, we select ten clusters randomly and set them as real DMRs as follows: for tumor samples, the first three are simulated mean signal only through setting in vector which follows the uniform distribution ( , ); the next three are simulated variance signal only through setting V = + for tumor samples, where is a basic value and is a random value following (0, 0.5); the last four are both simulated mean and variance signals through adding mean and variance signal in tumor samples by and V . The sensitivity (SE) and specificity (SP) are defined according to the confusion matrix shown in Table 1: We compare the performance of our method with bump hunting [27] and Wang's method [30] in simulated data based on different values of different parameters (see Table 2). To improve the significance, we implement 10 times for each set of parameters and calculate the mean values of specificity and sensitivity. It is shown that our method has higher sensitivity than other methods (see Table 3). Furthermore, to avoid the deviation of the numbers of identified DMRs, the numbers of true positive (TP) and false positive (FP) DMRs are compared through calculating the overlap between identified DMRs and embedded true DMRs (see Table 3).
In this study, an identified DMR is known as a true positive one if the intersection of the DMR and some true DMR contains more than CpG sites. is defined by multiplying the length of the true DMR by . The greater the is, the higher the degree of overlap between the true positive DMR and the real area is. It is also shown that our method has higher matching degree with true DMRs when either = 0.2 or = 0.5. More simulations studies, changing simulation parameters, are described in supplementary material (see Tables S1-S3 for comprehensive analysis). It is shown that our method had better performance than other methods in identifying DMRs when there are high heterogeneity and correlation characteristics of methylation data.

Real Data Application.
The real data we used is breast cancer (BRCA) which is available at TCGA. Preprocessing of DNA methylation data is implemented with reference to [35]. We permute sample labels 500 times and calculate the permuted energy for each site for each time. The value in descending order of 5% is obtained for each time and is obtained averaging these 500 values.
To compare the performance of our method, we apply the method and Wang's method to DNA methylation based on -value and M-value, respectively, in our experiment. The thresholds are set as -2 and -5 for -value and M-value data, respectively. The results show that our method can identify more DMRs based on either -value or M-value data with a high overlapping ratio (see Table 4), especially that based on M-values. Therefore, we implement the next analysis based on the results of M-values data.
Based on M-value, we also compare our method with DMRcate [28] and ProbeLasso [29] and calculate the overlap of different methods (see Table 5). It is shown that our method has more common DMRs than any of the other methods.
Through analyzing the location in genome of these DMRs, we find that about 90% DMRs (7081 in 7871) are located in CGI which are CpG enrichment regions. This is consistent with the fact that aberrant methylation in CGI always influences the expression of genes. To understand the possible functions of DMRs, we evaluate the enrichment of these DMRs according to the location relative to genes (see Figure 2; numbers 1 to 6 represent different regions TSS1500, TSS200, 5'UTR, 1stExon, gene body, and 3'UTR, respectively). It is shown that most DMRs are enriched in gene body which is consistent with the recent studies reporting that aberrant methylation in gene body has essential role in cancer occurrence and development [39]. One example which is identified based on M-value but not for -value is     shown in Figure 3. It has obvious significant difference in variance signal but not in mean signal.

Validation of Identified DMRs.
To illuminate the necessity of integrating the characteristics of DNA methylation data, we compare the results when only considering one signal (mean or variance) and without considering relationship of neighboring sites, respectively. Firstly, we calculate the energy of each site in (3) when = 1 (only mean signal) and when = 0 (only variance signal), respectively, and identify DMRs. Secondly, we do not consider the correlation of neighboring sites; the energy of each site is calculated when = and V = V in (3). Take chromosome 1 as an example, the identified DMRs are shown in Table 6. It is shown that there is prominence in identifying DMRs when integrating mean, variance, and correlation characteristics more than when only considering variance signal ( =0) and not considering correlation characteristic. Although there are more DMRs when =1 than those of integrated method, most of these DMRs contain fewer number of sites. Therefore, we think that the mean signal may be more effective in identifying differentially methylated loci than DMRs.

Conclusions
In this paper, we proposed a data-driven method to identify DMRs through integrating the characteristics of methylation data. Simulation study has shown that our method is more sensitive than the two alternative methods. Through application to real data, we compared the results of DMRs identified based on -value and M-value, respectively, and our method showed further better performance than Wang's method. Based on M-value data, the necessity of integrating all characteristics of data is shown through comparing the DMRs identified by different measures. It is also shown that the integration of multiple information is effective in identifying DMR.
Currently, the available DMR identification methods are insufficient in integrating data characteristics. Most methods only consider mean signal, and high heterogeneity of methylation data is not considered. The recent work by Wang et al. is developed through accounting for high heterogeneity, and they obtained some meaningful results. Therefore, integrating more information to identify DMR is required.
Although our method integrates the characteristics of high heterogeneity and correlation of neighbor sites of methylation data, we only consider the correlation of the two neighbors of the site limited by the Ising model. Therefore, first, we hope to integrate more comprehensive information based on biological a priori knowledge to build an appropriate model in the further work; second, in view of the strong relationships of CpG sites, we hope to identify DNA methylation patterns based on building methylation network which has been widely used in identification of disease-related genes [40][41][42][43].

Data Availability
The real data we used is breast cancer (BRCA) which is available at TCGA (https://cancergenome.nih.gov/).

Conflicts of Interest
There are no conflicts of interest to declare.