Identifying hypermethylated CpG islands using a quantile regression model

Background DNA methylation has been shown to play an important role in the silencing of tumor suppressor genes in various tumor types. In order to have a system-wide understanding of the methylation changes that occur in tumors, we have developed a differential methylation hybridization (DMH) protocol that can simultaneously assay the methylation status of all known CpG islands (CGIs) using microarray technologies. A large percentage of signals obtained from microarrays can be attributed to various measurable and unmeasurable confounding factors unrelated to the biological question at hand. In order to correct the bias due to noise, we first implemented a quantile regression model, with a quantile level equal to 75%, to identify hypermethylated CGIs in an earlier work. As a proof of concept, we applied this model to methylation microarray data generated from breast cancer cell lines. However, we were unsure whether 75% was the best quantile level for identifying hypermethylated CGIs. In this paper, we attempt to determine which quantile level should be used to identify hypermethylated CGIs and their associated genes. Results We introduce three statistical measurements to compare the performance of the proposed quantile regression model at different quantile levels (95%, 90%, 85%, 80%, 75%, 70%, 65%, 60%), using known methylated genes and unmethylated housekeeping genes reported in breast cancer cell lines and ovarian cancer patients. Our results show that the quantile levels ranging from 80% to 90% are better at identifying known methylated and unmethylated genes. Conclusions In this paper, we propose to use a quantile regression model to identify hypermethylated CGIs by incorporating probe effects to account for noise due to unmeasurable factors. Our model can efficiently identify hypermethylated CGIs in both breast and ovarian cancer data.


Background
Epigenetic changes are one of the most common molecular modifications in cells [1][2][3][4]. Among different epigenetic changes, DNA methylation, the addition of a methyl group (CH 3 ) to the 5's cytosine (C) at a CpG site, plays an important role in gene expression regulation, transposons silencing, and transcription factor binding inhibition [5][6][7][8][9][10][11][12][13]. Therefore, DNA methylation has significant implications in both normal biology and complex diseases, such as cancer. In fact, DNA methylation patterns change during tumor growth. These changes may include regional or genome-wide gain or loss of methylation [14]. The gain of methylation in cancer is called hypermethylation, that is, there are more methylation signals in cancerous cells than in normal cells. On the other hand, the loss of methylation in cancer is called hypomethylation, that is, there are fewer methylation signals in cancerous cells than in normal cells. Numerous studies have reported that DNA hypermethylation may cause tumor suppressor gene silencing [15,16]. These abnormal DNA methylations usually occur at CGIs, genomic regions rich in CpG sites.
In order to gain an understanding of how genome-wide (especially CGI) methylation changes affect tumor growth, numerous microarray protocols have been developed to simultaneously assay the methylation status of all or partial regions in the whole genome. Most of these microarray protocols are developed based upon one of the following three methods of methylation-dependent treatment of DNA, each with its advantages and disadvantages [16]: (1) using methylation sensitive enzymes (such as HpaII and HinpI) to digest DNA, (2) using specific antibodies or methyl-binding proteins to obtain DNA fragments enriched with methylation signals, and (3) using sodium bisulfite to treat denatured DNA to convert unmethylated cytosine (C) to thymine (T).
In our group, the DMH protocol has been developed to simultaneously assay the methylation status of all known CGIs [17,18] using methylation sensitive enzymes HpaII and HinpI to digest DNA. As opposed to the earlier DMH protocol in which interrogated samples were hybridized onto CGI clone arrays with printed probes averaging 870 bp in length, the current DMH method assays the sample using CGI tiling arrays with much shorter probes (45 -60 bp). Probe affinity, PCR effects, and many other measurable and unmeasurable confounding factors due to shorter probe length affect the observed methylation signals [19].
Previous DMH methylation microarray data analysis methods either propose an arbitrary log ratio cut off of 1.5 to detect differential methylation [20] or focus on modeling differential methylation at the probe level [21]. Due to the large impact of probe affinity and many confounding factors, a single high log ratio probe may not represent true biological signals. Furthermore, it can be misleading to select differentially methylated promoter regions based on independent probe signals. In addition, we are more interested in identifying hypermethylated regions as opposed to local changes detected by a difference in a single probe. To meet these biological interests and needs, we propose the use of a quantile regression model [22] in order to aggregate CGI probe signals for the identification of hypermethylated regions. Probe effects are directly incorporated into this proposed model. Genes with hypermethylated promoters can be easily selected according to their associated CGIs. The idea of using a quantile regression model to identify methylated CGIs was originally presented by our group as a poster at the 4th International Symposium on Bioinformatics Research and Applications. In that poster, we used a quantile regression model at 75% quantile. Although known methylated and unmethylated genes can be identified, we were unsure whether 75% would turn out to be the best quantile level.
In the following sections, we first give a brief introduction to our breast cancer cell line and ovarian cancer microarray data. We then explain how to use a quantile regression model to identify hypermethylated CGIs. Finally, we implement quantile regression models at different quantile levels and compare the performance of these models using three statistical measurements.

Methods
We use two DMH microarray datasets generated from 40 breast cancer cell lines and 26 ovarian cancer patients. In particular, we use the 2-color 244K Agilent arrays hybridized with the test samples (e.g. the breast cancer cell lines) dye coupled with Cy5 (red) and a common normal reference dye coupled with Cy3 (green). The base two log ratio of red over green intensity, log 2 (Cy5/Cy3), is used as the observed methylation signal at each probe. For each array, dye effects are corrected using the standard within array LOESS normalization in the Bioconductor package "limma" [23]. We have explored several normalization methods and found that the standard LOESS normalization produces more consistent and reliable results than the others (data not shown).
In a common DMH experiment, it is desirable to identify CGIs that are hypermethylated in a large percentage of the total N samples (e.g., N cancer patients or N cancer cell lines). Therefore, one important goal of our DMH microarray study is to identify the CGIs that are commonly methylated in N samples (N = 40 for breast cancer data and N = 26 for ovarian cancer data). In order to control the noise due to measured and unmeasured factors such as GC content, scanner effects, and PCR effects that may affect the signals, we apply the following quantile regression model to each CGI: where Q Ysp (τ|sample s , probe p ) is the τ-th conditional quantile of the observed probe log ratio of sample s at probe p, sample s represents the expected signal from the sample, and probe p denotes the probe effect. In the above quantile regression model, error terms are assumed to be independent and distribution-free. The regression coefficients, especially sample s and probe p , are estimated by formulating the quantile regression problem as a linear program [22]. In fact, both parameter estimation and inference are conducted using the R package "quantreg" [22]. An example of using this package to fit a quantile regression model for one CpG island has been provided (see Additional file 1).
In the above regression model, we let τ = 95%, 90%, 85%, 80%, 75%, 70%, 65% and 60%. We choose quantile levels over 50% because we are interested in identifying hypermethylated regions. In particular, for each sample (or cell line) effect from the quantile regression output, there is a p-value indicating whether a sample (or cell line) shows significant methylation signals at one particular CGI under the null hypothesis that sample s (τ) = 0. The methylation level at each CGI is taken as the number of samples for which their associated p-values are less than a certain cutoff value p 0 where we let p 0 = 0.05, 0.04, 0.03, 0.02, and 0.01. For example, if a CGI has p-values less than 0.01 in 38 out of 40 breast cancer arrays, this indicates that this CGI may have very strong methylation signals across many samples.
In order to verify that our quantile regression model can identify the real methylation signals and to compare the results of our regression models at different quantile levels, we use known methylated and housekeeping genes as "positive" and "negative" controls respectively. In fact, 30 known hypermethylated genes [24][25][26][27] have been reported for breast cancer, and 32 known hypermethylated genes have been reported for ovarian cancer [28]. For both breast and ovarian cancer, 47 housekeeping genes [29] are selected as "negative" control (i.e., known unmethylated genes) due to their low methylation signals. Recall that the methylation score given to each CGI is the count of samples with p-value less than a cutoff point. At each p-value cutoff point p 0 , we have a methylation score for each CGI. Then, there are N m and N HK methylation scores with N m = 30 for breast cancer data, N m = 32 for ovarian cancer data, and N HK = 47 for unmethylated housekeeping genes. We choose these N m and N HK genes because each of them is associated with at least one CGI. Therefore, this paper will refer to these genes as N m methylated and N HK unmethylated CGIs.
In order to determine if known methylated and unmethylated CGIs are identified correctly, we use three different statistical measurements for known methylated and unmethylated CGIs. The first measurement is the area under a Receiver Operating Characteristic (ROC) curve, which we call "AUC" (Area Under Curve). A ROC is a graphical plot of the sensitivity vs. (1 -specificity) for a binary classifier system as its discrimination threshold varies. The ROC can also be represented equivalently by plotting true positive rates (TPR) vs. false positive rates (FPR). In this paper, the TPR is the fraction of known methylated CGIs that are correctly classified as methylated CGIs at a specific methylation score level C 0 (0 ≤ C 0 ≤ N). The FPR is the fraction of known unmethylated housekeeping CGIs that are incorrectly classified as methylated CGIs at a specific methylation score level C 0 . The second measurement is the mean difference of methylation scores of two groups. We call this measurement mean.diff, that is x x s Hk 2 are the mean and variance of methylation scores for known methylated and housekeeping CGIs respectively, we call this measurement "T.stat". At each quantile level τ, the larger a statistical measurement is, the more evident that this quantile level is better at identifying methylated and unmethylated CGIs.

Results
Using both breast and ovarian cancer data sets, we compare the performance of the proposed quantile regressions using three different measurements: AUC, mean. diff and T.stat. All comparison results are listed in Tables 1, 2, 3, 4, 5 and 6 with Tables 1, 2 and 3 for  breast cancer data and Tables 4, 5 and 6 for ovarian cancer data. To have a clear view of these tables, we have plotted the summary result for each table in Figure  1, where the top three plots are for the three measurement results based on breast cancer data, and the bottom three plots are for the three measurements based on ovarian cancer data. For all three measurements, the larger a statistical measurement is, the better that a quantile regression model is at identifying the two different groups of CGIs (methylated and unmethylated). In Figure 1, we can see consistent patterns in all three measurements for both breast and ovarian cancer data. That is, 90% (cyan), the 85% (dark green), and 80% (red) are the top 3 lines and these three lines have relatively small variations across different p-values. Therefore, we can conclude that any τ between 80% and 90% could serve well to identify two different groups of CGIs (methylated and unmethylated). We recommend 85% for convenience.
In order to determine if our quantile regression model is better than other available methods, we compare our method with the previous one that uses a 1.5 cutoff value at a probe level [20] using our breast cancer data. A single probe with a large log ratio is not reliable, so we consider the following cases in each CGI: (1) at least 30% of probes with log ratios greater than 1.5, (2) at least 50% of probes with log ratios greater than 1.5, and (3) 100% probes (that is, all probes) with log ratios greater than 1.5. For the above three cases, the AUC is 0.51, 0.52, and 0.50. These small AUCs are mainly due to the fact that some methylated CGIs or genes do not necessarily have many probes with log ratios greater than 1.5. In fact, they are more likely to have several probes with relatively large but less than 1.5 log ratios. We see this pattern very often in our data. As for the first case, only 3 out of 30 known methylated genes and 4 out of 47 HK genes have at least one cell line with more than 30% probes that have log ratios greater than 1.5. As for the second case, only 2 out of 30 known methylated genes and 1 out of 47 HK genes have at least one cell line with more than 50% probes that have log ratios greater than 1.5. As for the third case, 0 out of 30 known methylated genes and 0 out of 47 HK genes have at least one cell line with 100% (i.e., all) probes that have log ratios larger than 1.5. Therefore, our quantile regression method is certainly much better than the one that uses 1.5 as a cutoff. In addition, the 1.5 cutoff method may work well for our previous version DMH protocol that has longer printed probes (about 870 bp). However, this arbitrary cutoff method does not work for the current protocol that uses much shorter probes (45~60 bp).

Discussion
The three measurement plots of breast and ovarian cancer data have slightly different patterns. This may be due to the sample differences between the two datasets. Breast cancer data are generated from cell lines while ovarian cancer data are generated from patients. The breast cancer cell line samples are more homogeneous than ovarian patient samples. In addition, cancer cell lines appear to have more methylation than cancer patients. Furthermore, breast cancer data have 40 arrays and ovarian cancer data have 26 arrays. This sample size difference may also explain some inconsistencies between breast and ovarian cancer data at different quantile levels due to random variability. We also observe that the results of the three proposed measurements show slightly inconsistent patterns. This may be due to the definition of the three measurements. AUC and T.stat both consider the variations of methylation scores. However, mean.diff only considers the difference of mean methylation scores between methylated CGIs and unmethylated housekeeping CGIs. Therefore, the result of AUC and T.stat may be more reliable.

Conclusions
In this paper, we have proposed to use a quantile regression model to identify hypermethylated CGIs. In particular, we have incorporated probe effects to take    into consideration the noises from unmeasurable factors.
stat. These measurements show that the quantile level between 80% and 90% might serve well for identifying methylated and unmethylated CGIs. Although this paper has only demonstrated how to identify hypermethylated CGIs by setting quantiles larger than 50%, our quantile regression model can also be used to identify hypomethylated CGIs with quantiles smaller than 50%, if desired.

Additional material
Additional file 1: R code for fitting a quantile regression model. This file gives an example of using the R package "quantreg" to fit a quantile regression model to identify methylation signals in one CpG island.   Figure 1 Comparisons of quantile regression models at different quantile levels. We compare results of different quantiles by studying their performances on identifying two different groups of CGIs (methylated and unmethylated). The legend is: "brown" for τ = 95%, "cyan" for τ = 90%, "dark green" for τ = 85%, "red" for τ = 80%, "green" for τ = 75%, "blue" for τ = 70%, "black" for τ = 65%, and "purple" for τ = 60%. The top panel contains three plots for breast cancer data while the bottom panel contains three plots for ovarian cancer data.