Stochastic epigenetic outliers can define field defects in cancer

Background: There is growing evidence that DNA methylation alterations may contribute to carcinogenesis. Recent data also suggest that DNA methylation field defects in normal pre-neoplastic tissue represent infrequent stochastic "outlier" events. This presents a statistical challenge for standard feature selection algorithms, which assume frequent alterations in a disease phenotype. Although differential variability has emerged as a novel feature selection paradigm for the discovery of outliers, a growing concern is that such outliers could result from technical confounders, which would in principle favour algorithms that are robust to outliers.
Results: Here we evaluate five differential variability algorithms in over 700 DNA methylomes, including two of the largest cohorts profiling precursor cancer lesions, and demonstrate that most of the recently proposed algorithms lack the sensitivity to detect epigenetic field defects at genome-wide significance. In contrast, algorithms which recognise heterogeneous outlier DNA methylation patterns are able to identify many sites in pre-neoplastic lesions which display progressive changes in invasive cancer. Thus, we show that many DNA methylation outliers are not technical artefacts, but define epigenetic field defects which are selected for during cancer progression.
Conclusions: Given that cancer studies aiming to find epigenetic field defects are likely to be limited by sample size, adopting the novel feature selection paradigm advocated here will be critical to increase assay sensitivity.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1056-z) contains supplementary material, which is available to authorized users.

from GSE30758, i.e. CpGs that correlate with the risk of CIN2+, show more progressive changes in CIN2+ and cervical cancer, we used three other data sets (GSE20080, GSE37020, GSE30759) profiling normal cervical and CIN2+ or cervical cancer samples (SI table S1).
All of these datasets were generated using Illumina Infinium 27k beadarrays and we used the normalized data, as described by us previously [1,2].
Dataset GSE69914 was generated using Illumina Infinium 450k beadarrays and consists of 50 normal breast tissue samples from healthy women, a set of 42 matched normal-adjacent breast cancer pairs (a total of 84 samples), and a further 263 unmatched breast cancers. Raw data were processed (intra-sample normalization) with minfi [3] and corrected for type-2 probe bias using BMIQ [4]. Subsequently, we tested for batch effects by performing an SVD on the intra-sample normalized data matrix and checking which factors (biological or technical) correlated with the top components of variation. The top components of variation in this data matrix correlated with biological factors, notably normal-cancer status.
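The SVD-based batch-effect check described above can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: function names are ours, and a Kruskal-Wallis test stands in for whichever association test was used to link components of variation to categorical factors.

```python
import numpy as np
from scipy import stats

def svd_factor_check(data, factor, n_comp=5):
    """Sketch of an SVD-based batch-effect check: compute the top
    right-singular vectors of the feature-centred data matrix (the top
    components of variation across samples) and test each for association
    with a categorical factor, e.g. normal/cancer status or a technical
    batch label, via a Kruskal-Wallis test."""
    centred = data - data.mean(axis=1, keepdims=True)  # centre each feature
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    pvals = []
    for k in range(n_comp):
        groups = [vt[k][factor == g] for g in np.unique(factor)]
        pvals.append(stats.kruskal(*groups).pvalue)
    return np.array(pvals)  # one P-value per tested component
```

A small P-value linking a top component to a technical factor would flag a batch effect; in the data set above, the top components associated with biological factors instead.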

Statistical algorithms for Differential Variability (DV)
We compared a total of 5 algorithms/statistical tests, aimed at identifying differentially variable features. The five DV algorithms/tests are (i) Bartlett's test [10], (ii) a novel DV algorithm, which we call "iEVORA" (similar to the original EVORA-Epigenetic Variable Outliers for Risk prediction Analysis algorithm [1,2] ), (iii) a joint test for differential means and differential variance in DNA methylation ("J-DMDV") [11], (iv) an empirical Bayes Levene-type test ("DiffVar") [12] and (v) a test based on a generalized additive model for location and scale ("GAMLSS") [13]. With the exception of iEVORA, which we present here for the first time, all other DV algorithms (i.e. BT/EVORA, J-DMDV, DiffVar, GAMLSS) have been previously used in cancer epigenome or EWAS studies [1,13,14].
BT & iEVORA: Briefly, Bartlett's test (BT) is similar to an F-test for homoscedasticity, and is well known to be sensitive to single outliers. Because of this, we also consider a regularized version of it, which we call iEVORA, whereby features deemed significant by Bartlett's test are re-ranked according to an ordinary differential methylation statistic (e.g. the statistic from a t-test). Thus, in iEVORA, significance is assessed at the level of differential variability using Bartlett's test, but significant DV features with larger changes in average DNA methylation are favored over those with smaller shifts. This re-ranking strategy therefore ensures that DV features driven by a single outlier, or a few outliers, are ranked highly only if there are no features which are differentially methylated in terms of mean DNAm levels.
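The iEVORA strategy described above can be sketched as follows. This is a simplified illustration, not the released implementation: a raw Bartlett P-value cut-off stands in for the q-value (FDR) threshold, and function names are ours.

```python
import numpy as np
from scipy import stats

def ievora(beta, pheno, dv_thresh=0.001):
    """Simplified iEVORA sketch. Rows of `beta` are CpGs (beta-values),
    `pheno` is a 0/1 phenotype vector. Significance is assessed with
    Bartlett's test for differential variance; the significant features
    are then re-ranked by the t-test P-value, so that DV features with
    larger mean shifts rank first and purely outlier-driven DV features
    rank last."""
    grp0 = beta[:, pheno == 0]
    grp1 = beta[:, pheno == 1]
    bt_p = np.array([stats.bartlett(r0, r1).pvalue
                     for r0, r1 in zip(grp0, grp1)])
    # the real algorithm applies an FDR (q-value) threshold here;
    # a raw P-value cut-off is a stand-in for this sketch
    sig = np.where(bt_p < dv_thresh)[0]
    tt_p = np.array([stats.ttest_ind(grp1[i], grp0[i]).pvalue
                     for i in sig])
    return sig[np.argsort(tt_p)]  # significant DVCs, re-ranked by mean shift
```

For example, a CpG whose disease samples all shift in mean DNAm would be ranked above a CpG whose variance difference is driven by only three outlier samples, even though both pass the Bartlett significance step.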
J-DMDV, DiffVar and GAMLSS: The third algorithm ("J-DMDV"), proposed by Wang and Ahn [11], works in the M-value basis (M = log2[β/(1−β)]) and uses a joint score test for mean and variance within a linear regression framework. The fourth algorithm ("DiffVar") is based on an empirical Bayes extension of the Levene test [12]. Briefly, this algorithm first computes the squared (or absolute) deviations of samples within a phenotype from the corresponding group (phenotype) mean in the M-value basis. It then uses the framework of moderated t-tests [15] to compare the distributions of these deviations between the two phenotypes. The final algorithm ("GAMLSS") was developed by Wahl et al. [13] within the GAMLSS (Generalized Additive Models for Location, Scale and Shape) framework. This algorithm also works in the M-value basis, and here we adapt it to run 3 separate generalized additive models within a nested framework: a null model with neither mean nor variance terms, a regression model for the mean only, and a model for both mean and variance.
Two likelihood ratio tests are then constructed by comparing the log-likelihoods of the mean-only model to the null model, and of the mean+variance model to the mean-only model. This yields two P-values for each feature, and features are deemed significant if at least one of these two P-values is less than a nominal threshold (after adjustment for multiple testing). Thus, GAMLSS will yield significant hits even if there are differences only in terms of mean DNAm. We also note that our implementation of GAMLSS does not compare a variance-only model to the null, since the algorithm aims to identify additional features where variance "adds predictive value" over a model which only includes the mean.
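As a simplified illustration of the Levene-type strategy underlying DiffVar, the following sketch computes absolute deviations from the group means in the M-value basis and compares them with an ordinary t-test; DiffVar itself replaces this last step with the moderated t-test of [15]. Names are ours.

```python
import numpy as np
from scipy import stats

def mvalue(beta, eps=1e-6):
    """Convert beta-values to M-values, M = log2(beta / (1 - beta))."""
    b = np.clip(beta, eps, 1 - eps)
    return np.log2(b / (1 - b))

def diffvar_like(beta, pheno):
    """Levene-type DV test: per feature, compute each sample's absolute
    deviation from its own group mean (M-value basis), then compare the
    deviations of the two phenotypes with a t-test per feature. A positive
    t-statistic indicates higher variability in the disease group."""
    m = mvalue(beta)
    g0, g1 = m[:, pheno == 0], m[:, pheno == 1]
    d0 = np.abs(g0 - g0.mean(axis=1, keepdims=True))
    d1 = np.abs(g1 - g1.mean(axis=1, keepdims=True))
    t, p = stats.ttest_ind(d1, d0, axis=1)
    return t, p
```

Because the test operates on deviations from each group's own mean, it is insensitive to a pure shift in mean DNAm, which is precisely what distinguishes it from an ordinary differential methylation test.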

Evaluation of DV algorithms to detect true DVCs on simulated data
In order to compare the DV algorithms to each other, we devised a simulation framework allowing for different types of differential variability. In each simulation run we generated an artificial DNA methylation data matrix consisting of 6000 CpGs and 100 samples. Samples were subdivided into two phenotypes, a "normal" and a "disease" state, each comprising 50 samples. We declared 600 CpGs to be truly differentially variable, allowing for 3 different types of DV, with 200 CpGs of each type. The remaining 5400 CpGs are not differentially variable. These are modelled from a beta-value distribution B(a1,b1) with a1=10 and b1=90, i.e. we assume that these CpGs are generally unmethylated, with a mean beta-value of 0.1 and a standard deviation of approximately 0.03. For the 600 true positives, a proportion of the samples in the "disease" phenotype are modelled from a beta-value distribution B(a2,b2) with a2=6 and b2=4, i.e. a distribution with mean value 0.6 and a standard deviation of approximately 0.15. We note that although in this simulation we consider all CpGs to be unmethylated in the normal state, there is no loss of generality, since mathematically there is complete symmetry between unmethylated and methylated CpGs.
Thus, for the 600 true DVCs and for a number of samples in the disease phenotype, there will be an average increase in DNAm of ~0.5. The 600 true DVCs however fall into 3 categories of DV. For 200 of these CpGs, we model all samples in the disease phenotype from B(a2,b2).
Thus, these DVCs will typically also differ in terms of the mean level of DNA methylation, and will in fact exhibit stronger differences in terms of mean DNAm than in terms of differential variance. Hence, these 200 DVCs are of "type-1a" DV. For another 200 CpGs, we only allow 20 of the 50 disease phenotype samples to be modelled from B(a2,b2), with the rest of the samples being modelled from B(a1,b1). Thus, for these DVCs, 40 % of the disease samples exhibit increases in DNAm, with the rest being indistinguishable from the normal phenotype. For these CpGs, differential variance is the key discriminatory characteristic, although they will still exhibit significant differences in terms of mean DNAm, since a reasonable fraction of the disease samples deviates from the normal state. These DVCs are of "type-1b" differential variability. Finally, for the last set of 200 true positives, we only allow 3 disease samples to differ from the normal state. For these DVCs, there is therefore no significant difference in terms of average DNA methylation between the two phenotypes. However, the variance will differ owing to the outliers in the disease phenotype.
These DVCs are defined as being of "type-2".
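The simulation design described above can be sketched as follows (illustrative code; function names, the seed handling and the parameterization are ours):

```python
import numpy as np

def simulate_dv(n_cpg=6000, n_per_group=50, n_dv=600, seed=0):
    """One simulated run: background CpGs ~ Beta(10, 90) (unmethylated,
    mean 0.1); for true DVCs a subset of the disease samples is instead
    drawn from Beta(6, 4) (mean 0.6). The true DVCs split equally into
    three types, defined by how many disease samples are shifted:
    all of them (type-1a), 20 of 50 (type-1b), or only 3 (type-2)."""
    rng = np.random.default_rng(seed)
    data = rng.beta(10, 90, size=(n_cpg, 2 * n_per_group))
    pheno = np.repeat([0, 1], n_per_group)
    n_shift = {0: n_per_group, 1: 20, 2: 3}  # type-1a, type-1b, type-2
    per_type = n_dv // 3
    for t in range(3):
        for r in range(t * per_type, (t + 1) * per_type):
            # pick which disease samples deviate from the normal state
            idx = n_per_group + rng.choice(n_per_group, size=n_shift[t],
                                           replace=False)
            data[r, idx] = rng.beta(6, 4, size=n_shift[t])
    return data, pheno
```

Each run thus yields a matrix with rows 0 to n_dv−1 as true DVCs (in type order) and the remaining rows as true negatives, against which sensitivity and FDR can be scored.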
We performed a total of 100 Monte Carlo runs, in each run recording 5 performance measures for each of the five DV algorithms: (1) the overall sensitivity of the DV algorithm at an FDR (false discovery rate) corrected threshold of 0.05, (2) the true FDR at the estimated FDR < 0.05 threshold, where the FDR estimate was obtained using Q-values [16], and (3)–(5) the sensitivities to detect type-1a, type-1b and type-2 differentially variable CpGs, respectively. We focused on the FDR and not the FPR (false positive rate), since it is the FDR which gives us the confidence that a given positive is a true positive, i.e. the FDR is related to the positive predictive value (PPV) through the relation FDR = 1 − PPV.
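These performance measures can be sketched as follows. As an assumption for this sketch, the Benjamini-Hochberg adjustment is used as a simpler, more conservative stand-in for the Storey q-values [16] used in the text; function names are ours.

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (a conservative stand-in for
    Storey q-values)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest P-value downwards
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adj, 0, 1)
    return out

def sens_and_fdr(p, true_pos, thresh=0.05):
    """Sensitivity and true FDR of a test at an estimated FDR < `thresh`
    cut-off, given P-values and a boolean true-positive indicator."""
    called = bh_adjust(p) < thresh
    tp = np.sum(called & true_pos)
    sens = tp / true_pos.sum()
    fdr = 0.0 if called.sum() == 0 else 1 - tp / called.sum()
    return sens, fdr
```

Applied per DV type (restricting `true_pos` to type-1a, type-1b or type-2 rows of the simulated matrix), this yields measures (3)–(5) in the same way.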

Evaluation of DV algorithms on real DNA methylation data
Initially, we compared the algorithms in their ability to detect DVCs between normal samples from healthy women and normal samples from women who developed neoplasia or who had cancer (see section on DNAm data sets for details), without considering the likelihood of these DVCs being true positives. Thus, for each of the DV algorithms and each CpG site, we estimated P-values, and from these, Q-values (FDR) [16]. In the case of BT and iEVORA, P-values came from Bartlett's test. In the case of DiffVar and J-DMDV, both tests provide P-values, as described in the respective publications. Features were deemed significant if Q < 0.05. In the case of GAMLSS, we obtained two P-values, one assessing whether the mean is associated with the phenotype, and another assessing whether the variance adds predictive value over the mean. Both sets of P-values were transformed into Q-values, and features with at least one of these Q-values below 0.05 were selected and deemed statistically significant.
To enable a more formal comparison of the DV algorithms, we devised a strategy that would allow us to estimate the positive predictive value (PPV) of each test. The key insight, or hypothesis, is that DVCs obtained from the previous step are more likely to be biological true positives if they exhibit progressive changes in DNA methylation in either neoplastic or invasive cancer tissue. A feature detected in pre-neoplastic lesions that is biologically relevant is more likely to mark cells which become neoplastic, and therefore one would expect enrichment of these marks in neoplasia and invasive cancer. This means that if a given "true positive" CpG site exhibits higher DNAm levels (perhaps only marginally so) in precursor cancer lesions, then this same site will undergo larger and more frequent changes in DNAm when the cells are neoplastic or invasive. Statistically, this "progression effect" can be measured using t-statistics from a t-test, since the t-statistic is proportional to the average deviation in DNAm from the normal state. A similar argument applies to CpG sites that undergo marginal hypomethylation in precursor cancer lesions.
In the context of cervical carcinogenesis, we thus applied the DV algorithms to identify DVCs hypervariable in the 75 normal samples which 3 years later progressed to CIN2+ status, compared to the 77 normal samples from women who remained healthy (the "ART" data set in SI table S1). The DNAm data for this set were generated on Illumina 27k beadarrays, and so, because of the design of the 27k array, we only focused on DVCs which exhibited increases in DNAm in the precursor lesions. We considered the top-ranked 500, 1000 and 5000 DVCs (irrespective of whether their FDR values attained statistical significance). In the case of GAMLSS, which provides two P-values per feature, we ranked the selected features according to the significance of the DV statistic. In each case, we then computed and compared t-statistics of these top-ranked DVCs in two independent Illumina 27k data sets profiling normal and CIN2+ samples, and another 27k data set of normal cervix and cervical cancers ("CIN2+(A)&(B)" and "CC" in SI table S1). The fraction of DVCs attaining t-statistics larger than 1.96 (P < 0.05) and preserving the same directionality of change in the independent data was used as the PPV estimate.
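The PPV estimate described above can be sketched as follows. For simplicity this sketch handles only the hypermethylation direction relevant to the 27k analysis (t > 1.96 with disease > normal); names are ours.

```python
import numpy as np
from scipy import stats

def ppv_estimate(beta_indep, pheno_indep, top_dvc_idx):
    """PPV estimate: the fraction of top-ranked (hypermethylated) DVCs
    whose t-statistic in an independent normal-vs-lesion data set exceeds
    1.96 (P < 0.05) with the same direction of change (disease > normal)."""
    sub = beta_indep[top_dvc_idx]
    g0 = sub[:, pheno_indep == 0]  # normal samples
    g1 = sub[:, pheno_indep == 1]  # CIN2+ / cancer samples
    t, _ = stats.ttest_ind(g1, g0, axis=1)
    return np.mean(t > 1.96)
```

A DVC list enriched for biological true positives would thus yield a high PPV in the independent CIN2+ and cervical cancer data, whereas outlier artefacts would not replicate.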
In the context of breast carcinogenesis, we applied the DV algorithms to identify DVCs hypervariable in the 42 normal-adjacent samples compared to the 50 normal samples from healthy women. Because of the design of the 450k array, we now considered DVCs which exhibited either increases or decreases in DNAm in the normal-adjacent samples. We