The correlation between CpG methylation and gene expression is driven by sequence variants

Gene promoter and enhancer sequences are bound by transcription factors and are depleted of methylated CpG sites (cytosines preceding guanines in DNA). The absence of methylated CpGs in these sequences typically correlates with increased gene expression, indicating a regulatory role for methylation. We used nanopore sequencing to determine haplotype-specific methylation rates of 15.3 million CpG units in 7,179 whole-blood genomes. We identified 189,178 methylation depleted sequences where three or more proximal CpGs were unmethylated on at least one haplotype. A total of 77,789 methylation depleted sequences (~41%) associated with 80,503 cis-acting sequence variants, which we termed allele-specific methylation quantitative trait loci (ASM-QTLs). RNA sequencing of 896 samples from the same blood draws used to perform nanopore sequencing showed that the ASM-QTL, that is, DNA sequence variability, drives most of the correlation found between gene expression and CpG methylation. ASM-QTLs were enriched 40.2-fold (95% confidence interval 32.2, 49.9) among sequence variants associating with hematological traits, demonstrating that ASM-QTLs are important functional units in the noncoding genome.

Other supplementary materials for this manuscript include the following summary data which can be downloaded from our website (www.decode.com/summarydata/):Data-S1.tab† : Summary of sequence variants associated with 5-mCpG rates of CpG units Data-S2.tab† : Coordinates of MDSs Data-S3.tab† : Summary of sequence variants associated with 5-mCpG rates of MDSs Data-S4.tab† : Summary of MDSs associated with expression of mRNA isoforms Data-S5.tab† : Summary of CpG units associated with expression of mRNA isoforms 1. Supplementary Notes 1.1.Nanopore sequencing and oxBS performed in the same DNA samples We performed a comparison between nanopore sequenced DNA samples isolated from 132 individuals that had previously been analysed by oxidative whole-genome bisulfite sequencing (oxBS) 32 .We showed that the average coverage of nanopore sequenced DNA samples is indicative of how consistent the 5-mCpG rates are with CpG methylation measured by oxBS (Supplementary Figure 2A&B).We apply this result in our study by excluding individuals whose DNA samples have been sequenced to less than 10x average coverage.
We binned each CpG according to the number of sequences used to compute its 5-mCpG rate in each sample analyzed by nanopore sequencing.Based on the Pearson's correlation coefficient we then determined how consistent the 5-mCpG rates in each such bin were to 5-mCpG rates measured with high coverage (>25 sequences for each CpG site) by oxBS (Supplementary Figure 2C).The results indicated that, for each DNA sample, the 5-mCpG rate was more reliably estimated when more than 9 sequences were used in the estimate.
For each CpG-unit, we calculated the average 5-mCpG rate across DNA samples from individuals measured by nanopore sequencing (n=7,179) and oxBS (n=132) to then compute the Pearson's correlation coefficient across averaged 5-mCpG rates.We refer to this correlation as per CpG average Pearson correlation (PCAPC).By looking at all CpGs detected by both methods, we found strong correlation i.e., PCAPC=0.9594(95%CI:0.9594,0.9595).
We identified five different attributes of CpG units (see methods section entitled: CpG methylation analysis by nanopore sequencing) that negatively affected the PCAPC (Supplementary Figure 3A), and we removed those CpG units from the study (Supplementary Figure 3B).

Long range phasing of sequences detected by Nanopolish and oxBS
Phasing of the sequences is important for resolving heterogeneity in the data as phasing enables us to quantify 5-mCpG rates of CpG sites separately for each of the two parental chromosomes.For arrays such as HumanMethylation BeadChip developed by Illumina, phasing is not possible and phasing of short sequences from whole-genome bisulfite sequencing is also problematic as, due to the short lengths of sequenced DNA, it is less likely that the sequences contain a known sequence variant that is heterozygous in the individual.Commonly used long-read phasing tools include Margin (github.com/UCSC-nanoporecgl/margin) and LongPhase.We used a different approach, as we have nearly complete parent of origin information for our genotype data 56 , we could assign each sequence to a parental haplotype based on agreement with the haplotypes (see methods section entitled: Assignment of 5-mCpG status to parental haplotypes).
Short-read sequencing reads were assigned to parental haplotypes by examining the phasing status of 17,649,586 sequence variants and one heterozygous variant needed per sequence 32 .7.6% of sequences detected by oxBS were successfully phased whereas 39.5% of sequences detected by nanopore sequencing were successfully phased.The main limitation for the long-read phasing is the set of heterozygous SNPs per individual, as 97.6% of the nanopore sequences covering at least 1 heterozygous SNP, were phased.
On average, 21.87% of the CpGs were phased for oxBS per sample, but 81.05% for nanopore sequencing.
Additionally, we found that 8.61% of aligned bases were phased for oxBS, but 71.20% for nanopore sequencing.
The phasing is affected by sequence length which explains the fact that despite applying stricter conditions on the phasing for long-read sequencing we were still able to phase a higher fraction of sequences and more CpGs on average per individuals than for short-reads.Average sequence length (N50) for phased sequences was 15,798 bp for nanopore sequencing, while average length for sequences that were not phased was 9,732 bp.
When working with phased data, the set of high-quality CpGs should be refined to exclude CpGs where the phasing is often impossible.We calculated the fraction of phased sequences out of all sequences covering SNPs that we are able to phase for each CpG and suggest excluding CpGs where the fraction of phased sequences is less than 0.3 as performed in this study (see methods section entitled: CpG methylation analysis by nanopore sequencing).

Replication analysis: GoDMC study
The GoDMC 40 study is currently the largest study on the effect of sequence variants on 5-mCpG although testing less than 450 thousand CpGs (versus 15.3 million CpG units measured in our study).152,794 of the sequence variants identified as cis-methylation QTLs by GoDMC were mapped to our high-quality set of ~34.4 million sequence variants according to chromosome position and alleles.Note, in line with our analysis, we restricted to GoDMC sequence variants located within 100kb of the affected CpG.112,193   out of the 152,794 (73.4%) sequence variants identified as cis-methylation QTLs by GoDMC were in high LD (r2>0.80) to those identified in our study in association with 5-mCpG rates of individual CpG units.
To determine whether the observed overlap with GoDMC sequence variants is higher than expected by chance alone, we binned the ~1 million 5-mCpG associated sequence variants identified in our study, hereafter referred to as the observed sequence variants, into seven MAF bins (0-0.01,>0.01-0.05,>0.05-0.1,>0.1-0.2,>0.2-0.3,>0.3-0.4,>0.4-0.5).We counted the number of observed sequence variants in each MAF bin and sampled the same number of sequence variants from the 34.4 million sequence variants (located within 100kb from each of the 15.3 million measured CpG units).Hence, we obtained ~1 million sampled sequence variants matched to the observed sequence variants with respect to MAF.We found that approximately 24.4% of the ~1 million sampled sequence variants were in high LD to at least one sequence variant identified by GoDMC as a cis-methylation QTL.The 73.4% observed overlap with our findings is significantly higher than the 24.4% expected overlap (Binomial test, P <0.05).
Further, we tested the GoDMC cis-methylation QTLs in our cohort using the same sequence variants and CpG positions as reported by GoDMC.The effect sizes observed in our cohort were consistent with those reported by GoDMC (Pearson's r =0.736, 95%CI:0.734,0.739).
We note, however, that the effect sizes in our cohort tended to be lower than those reported by GoDMC; the regression effect on GoDMC effect sizes for deCODE effect sizes was  =1.57(97%CI:1.56,1.58).5-mCpG rates measured using >12 sequences were highly consistent with rates measured by bisulfite sequencing (Supplementary Figure 2C).By restricting to haplotypes where >12 sequences were available for measuring the 5-mCpG rate we demonstrated that both the coverage/read depth of the CpG unit and the number of observations, i.e. the number of haplotypes measured with >12x coverage/read depth, were likely important contributors to the observed bias towards lower effect sizes in our cohort (Supplementary Figure 7).

Enrichment analysis: ASM-QTLs stratified by TSS proximity
We find that 38.4% (38.0,38.7%) of ASM-QTLs (i.e., 30,889 out of 80,503) reside within 25kb from the TSS of a protein coding gene.53.1% (95%CI:52.4,53.8%) of GWA signals reside within 25kb from the TSS of a protein coding gene.Hence, in comparison to index variants of GWA signals, the index variants of ASM-QTLs are less frequently found near TSSs (<25kb).
We therefore sought to explore the question of whether proximity to protein coding TSSs influences the enrichment of ASM-QTLs among GWA signals.
To this end we binned sequence variants (other than missense and loss of function) according to distance to the nearest TSS as follows: >25kb (distal), 10-25kb, 5-10kb, <5kb.We then determined the enrichment of these non-coding sequence variants depending on whether or not they are ASM-QTLs.
Note, in this model, non-coding sequence variants located >25kb from a protein coding TSSs were used as the baseline annotation (neutral enrichment).The corresponding category of ASM-QTLs, i.e. those located >25kb from protein coding TSSs, were enriched by 19.3-fold (95%CI:13.3,26)among GWA signals (Supplementary Figure 8).This estimate is similar to the 23.2-fold enrichment of ASM-QTLs among GWA signals that we computed in a model without considering TSS proximity (Extended Data Figure 7).
Hence, we show here that non-coding sequence variants located close (<5kb) to protein coding TSSs are enriched among GWA signals.ASM-QTLs, however, are far more enriched among GWA signals regardless of whether or not they are located close to protein coding TSSs.

Summary of main limitations and strengths of 5-mCpG detection by nanopore sequencing
Currently, the main limitations of using nanopore sequencing for 5-mCpG detection are as follows: 1) sequencing coverage in the DNA sample as low coverage is indicative of unreliable measurements of 5-mCpG rates in the sample.2) The presence of a sequence variant near or within a CpG site is problematic as this may cause erroneous 5-mCpG detection by some algorithms, like Nanopolish. 3) The observation of strand bias in 5-mCpG detection at a CpG site is fairly frequent and we demonstrated that this is indicative of erroneous 5-mCpG detection.
The main strengths are as follows: 1) Nanopore sequencing delivers longer sequences than other sequencing methods, including whole genome bisulfite sequencing.This property of nanopore sequencing considerably enhances the efficiency of phasing the sequences.Phasing is important for resolving the heterogenity in the measurements of 5-mCpG rates, a task that cannot be accomplished with array-based methods.2) The large number of CpG sites detected by nanopore sequencing allows us to access regions in the genome that are not accessible using array-based methods, thereby opening avenues for making novel discoveries.
In our previous work 64 we compared, in more detail, different algorithms and long read sequencing technologies for detecting 5-mCpG status in DNA samples.We note that Remora (of Guppy) exhibits less sensitivity to sequence variation in close proximity to the CpG site thereby enabling high-quality measurements of more CpGs than can be reliably measured by Nanopolish.

Legends for Extended Data Figures
Extended Data Figure 1: Methylation depleted sequences.Lines on the y-axis represent unmethylated haplotypes found in different individuals that overlap in position, with chromosomal position on the x axis.(a) For each such "cluster of overlapping unmethylated haplotypes", as shown here, we look for the most frequently occurring unmethylated haplotype, shown in red.This "dominant" form is then taken as the representative for the cluster of unmethylated haplotypes, which then enables comparison in 5-mCpG haplotype rates over the same coordinates (those of the representative) across all individuals in the cohort.(b) An example cluster where two (or more) representatives were found which we identify, and measure separately as distinct entities.
Extended Data Figure 2: Validation of ASM-QTL influences in oxBS-seq data.ASM-QTLs validated using oxBS in independent DNA samples derived from whole blood of forty-five individuals.Note, these forty-five individuals were not included in the larger cohort of 7,179 nanopore sequenced individuals.The effect size of each ASM-QTL on the 5-mCpG haplotype rates of MDSs in oxBS sequenced individuals, y-axis, are plotted against the effect size of the same ASM-QTLs on 5-mCpG haplotype rates of the corresponding MDS in nanopore sequenced samples, x-axis.We binned the validation tests according to the number of haplotypes (n) that were available for validation testing in oxBS individuals as indicated on the top of each figure.A least squares regression line, where the effect sizes in the oxBS validation cohort are used as the outcome and those in nanopore sequenced cohort as the predictor, is shown in each figure, blue color, along with a diagonal line of identity, red color, as we expect the regression coefficient to be equal to one (and an intercept of zero) if the effect sizes are identical in the oxBS and nanopore sequenced individuals.The regression coefficients and their 95% confidence intervals are shown at the top left-hand corner of each figure.Note that as the number of informative haplotypes (n) in the oxBS validation cohort increases, the regression coefficient approaches that of a diagonal line indicating that the effect sizes become more similar as the number of haplotypes available for testing (n) in the oxBS cohort increases.Extended Data Figure 4: Functional attributes of ASM-QTLs.ASM-QTLs were more likely than expected by chance, to be found in high LD (r 2 >0.80) to A) sequence variants that influence the DNA binding affinity to transcription factors SPI1, CTCF and EBF1, and cohesin protein STAG1 and B) sequence variants located within regulatory sequences defined by ENCODE 33 , i.e., candidate cis-regulatory elements (version 3).Note, sequence variants that influence ASBs were identified in Chen et al 42 .

Extended Data
The P-values shown, unadjusted for multiple comparison, were computed using our permutation-based method described under methods section "ASM-QTL in functional annotation maps".

Figure 3 :
MDSs associated with gene expression.Distance, x-axis, from MDS midpoints to the TSS locations of the associated mRNA transcript isoforms versus the effect size on expression, y-axis.MDS midpoints are represented as black dots, and the MDSs are represented as ranges i.e., horizontal lines.(a) Distal associations i.e., where the MDS does not contain the TSS of the associated mRNA.(b) Promoter associations i.e., where the MDS contains the TSS of the associated mRNA, where the MDS is represented by a line (left to right) indicating the position relative to the TSS of the associated mRNA.(a) and (b) TSS is represented as the origin of the x-axis (x=0), shown as a vertical dashed line, red.Note, we modify the sign of the distance such that negative distances reflect upstream positions of the MDS midpoint relative to the TSS of the associated mRNA.Hence, the upstream positions indicate that the MDS midpoint does not intersect with the direction of mRNA transcription.In contrast, downstream positions indicate that the MDS midpoint does intersect with the direction of mRNA transcription.

Extended
Data Figure 5: ASM-QTLs dominate in correlations between MDSs and mRNA.(a) Example of an association found between expression of VAMP5-201 (ENST00000306384), y-axis, and 5-mCpG haplotype rates, x-axis, of an MDS (chr2:85584168-85584976) located within the CpG island promoter of the TSS; i.e., an example of CpG island promoter methylation.(b) The fraction of variance in VAMP5-201 mRNA isoform expression explained by (top) ASM-QTL found in association with the MDS (middle) CpG methylation of the MDS and (bottom) CpG methylation of the MDS after correction for the ASM-QTL.This same association between MDS at chr2:85584168-85584976 and mRNA expression of VAMP5-201 is then shown separately on (c) the reference-and (d) the alternative alleles of ASM-QTL chr2:85580659:IG.0:0,respectively.Least squares regression line is shown within each of three scatter plots, in blue color, and the estimated regression coefficient and their 95% confidence intervals are shown at the bottom, along with the corresponding two-sided P-values.Extended Data Figure 6: Modeling the impact of ASM-QTL on methylation and expression.Model diagrams (leftmost column) describing the four different hypotheses on the mechanism by which an ASM-QTL genotype (G) influences CpG methylation (M) and gene expression (E).The outcomes of two different tests (MR-Steiger and our method of comparing   and   ) are shown; "v" shape indicates that the model is supported by the test, whereas "x" indicates that the model is not supported by the test and empty cells indicates that the test is unable to consider whether or not the model is supported.For clarity, each model is labelled with a number which are then used as reference to each diagram in the main text and in Figure 3. Extended Data Figure 7: Varied contribution of sequence variants to human trait variability.Enrichment, x-axis, of ASM-QTLs and other sequence variant annotations, y-axis, among GWA signals found in various human diseases and other traits.The solid points (black) represent the measure of center, i.e., the enrichment point estimates and the horizontal lines (black) represent their 95%CIs.The number of sequence variants in each annotation is shown within parentheses on the y-axis.Shown are the enrichment point estimates, points, and their 95% confidence intervals, horizontal lines, for each sequence variant annotation, and the point of neutral enrichment on the x-axis as vertical line, blue.UTR=Untranslated Region, DHS=DNase hypersensitivity site, MDSs=Methylation depleted sequences.6.List of Source Data 6.1.Source data files for the main figures