DNA co-methylation has a stable structure and is related to specific aspects of genome regulation

DNA methylation (DNAm) is influenced by genetic and environmental factors, and can be used to understand interindividual variability in genomic regulation. Co-methylation between DNAm sites is a known phenomenon, but the architecture of relationships between the approximately 450,000 (450k) sites commonly measured in epidemiological studies has not been described. We investigate whether interindividual co-methylation structure amongst the 450k sites changes with age, whether it differs between UK-born White (n=849, 910, 921 and 424) and Pakistani ancestry (n=439) individuals, and how it relates to genome regulation. We find stability between birth and adolescence, across cohorts, and between two ethnic groups. Highly correlated DNAm sites in close proximity are heritable, but these relationships are weakly influenced by nearby genetic variants, and are enriched for transcription factor (TF) binding sites related to regulation of short RNAs transcribed by RNA polymerase III. Highly correlated sites that are distant, or on different chromosomes (in trans), are driven by common and unique environmental factors, with methylation at these sites less likely to be driven by genotype. Trans co-methylated DNAm sites are enriched for multiple TF binding sites and for inter-chromosomal chromatin contact sites, suggesting DNA co-methylation of distant sites may relate to long-range cooperative TF interactions. We conclude that DNA co-methylation has a stable structure from birth to adolescence, and between UK-born White and Pakistani individuals. This stable structure might have implications for future design and interpretation of epigenetic studies. We hypothesise that co-methylation may have roles in genome regulation in humans, including 3D chromatin architecture.


Introduction
DNA methylation (DNAm) is an epigenetic modification to DNA that can be influenced by both environmental and genetic factors (1)(2)(3).It has roles in a variety of genomic functions, including a complex and context-dependent alteration of gene expression (4-7); both altering and being altered by transcription factor binding (8)(9)(10)(11)(12)(13); and interaction with chromatin states (14)(15)(16)(17).Standardized DNAm assays have enabled epigenome wide association studies (EWAS), where each assayed DNAm site is tested independently for association with a variable of interest.These studies have identified DNAm sites associated with many exposures and diseases (18)(19)(20)(21)(22)(23).Networks of co-methylated DNAm sites have also been described -these utilise the relationships between DNAm sites to infer regulatory pathways by which DNAm might relate to health or disease (24)(25)(26)(27).In contrast to genetic architecture, our knowledge of relationships between these commonly assayed DNAm sites at the population level is still limited, constraining our understanding of the principles of epigenetic epidemiology underlying co-methylation between and across populations, and whether comethylation has a role in genome regulation.In addition to this need for greater biological understanding, the co-methylation structure is an important consideration for the statistical design of EWAS and differentially methylated region (DMR) studies; no equivalent of genetic linkage disequilibrium (LD) matrices yet exists for imputation in DNAm analyses.A deeper understanding of DNAm correlation structure, and its stability within and between individuals at the population level, might help identify whether this structure can be used for tagging, pruning and imputation in EWAS in a similar way to how LD enables these approaches to be used in genome-wide association studies (GWAS).
Both whole genome bisulfite sequencing (WGBS) and array-based studies highlighted that DNA methylation forms local correlation structures, with DNAm sites within 1-2kb often having correlated methylation states (28)(29)(30)(31)(32)(33)(34); WGBS data shows that immediately adjacent sites almost always have the same methylation state (28).Correlations between close DNAm sites (referred to here as cis correlations) may be driven by genetic variants, as regions of highly correlating DNAm sites have been associated with nearby SNPs (30,33); although other studies have shown that correlations between DNAm sites can be driven by environmental exposures (35).Co-methylation structure does not mirror the large blocks that linkage disequilibrium (LD) forms (30), and it appears to be consistent across ethnic groups with different genetic architecture (32); this is because comethylation depends on DNA methyltransferases and demethylases, whereas LD is determined by demographic history, recombination and mutation rate.There have been conflicting reports of whether DNAm correlation structure is related to genomic annotations such as CpG islands (30,34), but recent work has demonstrated that DNAm sites within 2Mb at which methylation level co-varies due to the same causal genetic variants can be used to predict contact sites for chromatin loops (36).
Co-methylation between distant DNAm sites on the same chromosome, and sites on different chromosomes (referred to here as trans correlations), are less well described.Correlation of a limited number of trans-chromosomal DNAm sites with highly variable DNAm levels showed that DNAm sites related to HOX genes were highly correlated, suggesting trans-chromosomal correlations might be related to biological pathways (35).Another recent paper using mouse tissue has shown that DNAm sites around inter-and intra-chromosomal chromatin contact points have correlated methylation states, with those within the same topologically associating domains and with the same chromatin states having more correlated methylation states (16).This suggests correlations between DNAm sites are likely to be relevant to genome regulation; however this has yet to be shown in human studies for inter-chromosomal contacts.
DNAm is a dynamic epigenetic mark, so it is unsurprising that measurements at individual sites are not always stable.Sites influenced by genotype tend to be more stable than those that are not; twin models have found that the heritability of DNAm is on average 19% (1,37,38), with a subsequent study showing the reliability of each of DNAm probe measurement is associated with the heritability of the probe (39).DNAm changes with age (40)(41)(42) and in response to environmental exposures such as smoking (18,43), but environmental and genetic constraints have been shown to contribute to the stability of DNAm over the lifecourse (37).As yet there is no indication of how stable relationships between DNAm sites might be; a lack of stability might indicate co-methylation has changing functions or changing environmental influences, and would limit the utility of correlated methylation states in inferring genome regulation over different datasets.It is therefore important to assess correlation structure over time in the same individuals, and across datasets (to ensure replication).An important consideration when investigating stability is utilising datasets that include diverse social groups; DNA methylation is associated with adversity (44,45), racial discrimination (46,47), social inequalities (19,48), and environmental exposures such as air pollution (49,50).
Consequently, co-methylation structure is more likely to reflect stable biological processes if observations persist across diverse social groups (51).
In this paper, we outline DNA co-methylation structure in blood across DNAm sites featuring on the widely used Illumina 450k Beadchip (450k array) which mainly covers promoters, TSS, and coding transcripts across 1.5% of the methylome (52,53).We use data from two large UK birth cohorts -the Accessible Resource for Integrated Epigenomic studies (ARIES), is a sub study of the Avon Longitudinal Study of Parents and Children (ALSPAC), which recruited pregnant women in the South West of England with predicted delivery dates between April 1991 and December 1992; and Born in Bradford (BiB), a birth cohort recruited in a city in the North of England that recruited pregnant women between with predicted delivery dates between March 2007 and November 2010.ARIES has longitudinal DNAm data from 849, 910, and 921 White British participants, at birth (cord blood), 7 and 15-17 years, respectively; BiB recruited in a city with high levels of deprivation, where 55% of the obstetric population are South Asian (mostly Pakistani).The BiB subcohort with DNAm data is approximately evenly split between two UK-born ethnic groups, with 424 white British and 439 Pakistani participants, with DNAm measured at birth (in cord blood).In our analyses we split BiB by ethnic group (where membership of ethnic groups was obtained through maternal self-report, and we conceptualise ethnicity as a social construct that is associated with differing social and environmental exposures, and therefore potentially differing effects on the methylome).We assess the stability of the co-methylation structure between DNAm sites over time, across ARIES and BiB, and between two ethnic groups born in the same geographic area.We detail genetic and environmental influences on this correlation structure, and provide the first comprehensive analysis of strong cis and trans co-methylation between DNAm sites across the 450k array at the population level, outlining their distinct potential roles in genome regulation.We also provide a resource which can be used by the scientific community [DOI 10.5523/bris.31uze72mt042g2ticr0w6z6v8y].

Results
Co-methylation structure and stability, across the 450k array

Overall co-methylation structure
To assess co-methylation structure, we adjusted DNAm data for age, sex, cell counts and batch effects and created a correlation matrix (see Methods, Figure 11) between all remaining possible pairs of sites on the 450k array in ARIES (n sites=394,842), and all sites on the EPIC array that also feature on the 450k array in BiB (n sites=369,796).To assess features of correlations of different strengths, co-methylated pairs were aggregated into bands of correlation, ranging from -1 to 1, in increments of 0.1 (see Methods for details).For all 5 datasets, the distributions of all possible pairwise correlations are positively skewed, and 83-87% of correlations are between -0.2 and 0.2 (Figure 1; see supplementary figure 1 for the plot split by cis and trans correlations).To investigate whether physical proximity has an influence on co-methylation, we defined cis as within 1Mb and trans as over 1Mb.On average, just 12.6% of cis and 14.5% of trans pairs had a correlation above 0.2 or below -0.2.Physical proximity between DNAm sites influences the likelihood of them having highly correlated methylation states (Table 1) -for the strongest positive correlations (R=0.9-1)we see a greater proportion of cis than trans correlations (p=<2.2e-16 in all datasets using a chi squared test).We also see a trend of stronger correlations in BiB than in ARIES -for the strongest positive correlations (R=0.9-1)we see a greater proportion of trans correlations in the BiB cohort than in ARIES (p=<2.2e-16using a chi squared test at birth between ARIES and the BiB white British individuals), and for strong positive correlations (R>0.5)we see a greater proportion of both cis and trans correlations in BiB than in ARIES (p=<2.2e-16using a chi squared test at birth between ARIES and the BiB white British individuals).However this may be due to a thresholding effect; see section "Stability of correlations across time, datasets, and ethnic groups".As strongly co-methylated pairs (R>0.5 and R<-0.5) reflect less than 15% of all correlations, they can be viewed more clearly in Table 1 than Figure 1.Cis co-methylation structure across the genome (as measured by the 450k array) To investigate the influence of physical distance on the cis co-methylation structure in more detail, decay plots were created across all autosomes (as measured by the 450k array) separating positive and negative correlations.In line with previous work, cis correlations within 10kb across the whole genome reveals a smooth decay that reduces to background correlation (about 0.125) at approximately 3kb (28,30,34) and the decay is identical across population cohorts and across ethnic groups (32).This confirms that it is unlikely to be driven by LD, partly because the decay is over a vastly smaller distance (see LD decay plot in (54) for comparison), and partly because the decay is identical for two different ethnic groups.Furthermore, the decay shows a high degree of heterogeneity and is identical between birth and adolescence, meaning cis correlation does not solely depend on physical distance (Figure 2, supplementary figure 2).In contrast to positive correlations, we also found for the first time that negative correlations are not distance dependent as immediately adjacent sites show the lowest correlations and have a slight peak between 2.5 and 4kb.

Figure 2: Decay plots of pairwise cis correlations (n=3,114,257, calculated using the biweight mid-correlation) from all filtered sites within 10kb of each other on the 450k array, across all autosomes (in ARIES 7 year olds;
plots for all datasets can be found in supplementary figure 2).The variance in each bin (represented here by the bin standard deviation) was added to the plot to demonstrate the heterogeneity around the binned estimates.

Stability of correlations across time, datasets, and ethnic groups
To assess whether strong correlations between DNAm sites (R>0.8,n=2,483,055 to 13,145,092 across the 5 datasets) differ across time, cohorts, and ethnic groups, we plotted mean difference (i.e. for each CpG plotting mean correlation vs difference in correlation between two groups).We found the 95% confidence intervals include a mean correlation change of zero in all tests, so we do not find strong evidence of a difference in the strength of either cis or trans correlations between any of our datasets.This suggests that correlations R>0.8 between DNAm sites are relatively stable (Table 2; supplementary figure 3).There are smaller mean changes in correlation between birth and 7 years (-0.015 for cis; 0.006 for trans) than between 7 years and adolescence (0.023 for cis; 0.028 for trans), but there is insufficient evidence to suggest that co-methylation changes with age.
Correlations are consistently stronger in the BiB white British individuals compared to ARIES at birth (0.085 for cis; 0.088 for trans), which is reflected in the higher proportion of correlations R>0.5 in BiB identified above.In contrast, the mean difference between BiB white British and Pakistani groups is very small (0.003 for cis; 0.005 for trans), suggesting that between-cohort differences are stronger than between ethnic groups.Supplementary figure 3 shows that there are a small number of DNAm sites at which there are large changes in trans correlation between birth and 7 years in ARIES, between ARIES and BiB white British individuals, and between the two ethnic groups in BiB.

Influence of genetic factors on co-methylated sites
To assess whether co-methylated sites are influenced by genetic factors, we used twin heritabilities of DNAm sites (1), and we assessed the proportion of correlations which had zero, one, or two of the DNAm sites in each correlating pair associated with an mQTL (see Methods).Highly heritable DNAm sites tend to be strongly co-methylated (r>0.9), in cis but not trans, and this is consistent across datasets, with a mean heritability of 77% for DNAm sites correlated r>0.9 in cis.Looking at this in more detail, the heritability of both sites in highly correlating pairs (r>0.9)tend to be matched only for cis-correlating pairs, suggesting that very strong co-methylation is related to heritability in cis but not in trans (see Figure 3B).This may be related to genetic variants ( 55), as at least 84% of cis correlations >0.9 have both DNAm sites associated with a cis mQTL (not necessarily the same mQTL) across the 5 datasets (see Figure 3 and supplementary figure 4).To identify whether co-methylation is influenced by mQTL, we firstly assessed the proportion of correlations which had zero, one, or two of the DNAm sites in each correlating pair associated with an mQTL.At least 84% of cis correlations >0.9 have both DNAm sites associated with a cis mQTL across the 5 datasets, in line with the heritability findings above.Trans correlations >0.9 are most likely to have neither DNAm site associated with a cis mQTL in 4 of the datasets (48-55%; in the BiB white British group only 31% are associated with 0 mQTLs).This suggests that strong trans correlations are between sites that are less likely to have been associated with an mQTL (although it is possible they have shared genetic aetiology we have not detected -trans sites are less heritable and there are fewer detected trans mQTLs at present (56,57)).There are slight increases in the proportion of cis correlations that have both DNAm sites associated with an mQTL in the ARIES adolescents.This is noticeable for cis correlations between -0.8 and -0.6, and between 0.8 and 0.9 (see Figure 4 and supplementary figure 5).
To further quantify the impact genotype may have on correlations between DNAm sites in close proximity, we assessed the extent to which cis SNPs impact the decay of correlation between cis correlating DNAm sites within 10kb.To do this we adjusted DNAm data for the strongest cis SNP (as identified by the largest mQTL study to date, GoDMC ( 56)), and reproduced the decay plot with this adjusted data (for ARIES, 11719 sites (54.8%) had an mQTL; for BiB, 11108 sites (56%)had an mQTL).
We find that in both ARIES and BiB, there is limited impact of the strongest cis SNP on correlation structure; sites in close proximity are most affected, with a maximum reduction in bin correlation of 0.06 in ARIES 7 year olds, 0.04 in the BiB white British participants, and 0.035 in the BiB Pakistani participants.On average the correlation between sites within 1kb drops by 0.01 to 0.02, after which it plateaus to a correlation reduction of around 0.005.Of course, additive effect of multiple mQTLs may reduce correlation further than this, but here we see limited evidence of the strongest SNP affecting correlation structure.This is, to our knowledge, the first illustration of the direct impact of genetic variants on co-methylation between nearby sites.This may suggest that cis co-methylation also depends on environmental effects, which may act on the same pathways and in the same direction as genetic effects (58).

Influence of environment on co-methylated sites
Environmental influences on individual DNAm sites have been separated as common (or shared) environment and unique environment (which includes measurement error) though twin studies (1).
Consistent with the high heritability estimates for sites that correlate r>0.9 in cis, the common and unique environment estimates for those sites are low.In trans, almost all sites which correlate with another DNAm site at r>0.9 are influenced on average 41% by common environment, and 49% by unique environment (which includes measurement error and interindividual stochastic variation).
Because DNAm sites involved in strong co-methylation in trans are influenced by environmental factors, strong trans co-methylation may be driven by environmental influences.

Overview
Because cis and trans sites correlated >0.9 have distinct features in terms of genetic and environmental influences, we conducted an in-depth analysis to identify the biological mechanism of this co-methylation.The number of correlations R>0.9 in each dataset are detailed in Table 3.
Throughout the following section for simplicity we show results for the ARIES 7 year olds, as results were very similar across all five datasets (See Supplement for other datasets).influences such as transcription factors to be hypomethylated (11,59), this would fit the behaviour of the highly trans-correlated sites seen here (see Figure 6 for illustration).Trans co-methylation structure The architecture of co-methylation between sites that are either distant or on different chromosomes has not previously been well characterised.We find that in all datasets these sites are distributed across the genome, and are interconnected (illustrated by the circos and network plots in Figure 7 and supplementary figures 8 and 9).There is a much higher number of correlations r>0.9 in BiB than in ARIES (as shown in Table 1 and the section on overall co-methylation structure).This can be explained by correlations in BiB being higher than in ARIES across the whole distribution (see Table 1 and Table 2); something that may be due to array or sample type effects, or phenotypic plasticity.The connections between the sites correlating r>0.9 resemble scale-free networks (60,61) with a small number of 'hub' nodes having large numbers of connections.All datasets except ARIES 15-17 years have a power-law alpha between 2 and 3, which is indicative of a scale-free network (60, 62) (however at 15-17 years the number of nodes is too small to assess scale-free properties appropriately).The degree distribution plots are found in supplementary figure 7.
To test the preservation of network architecture for DNAm sites correlating in trans >0.9, we tested preservation of both the DNAm sites (nodes) and the connections between them (edges) between the five datasets.To do this we utilised the Cytoscape Network Analyzer to test the intersection between pairs of networks.Using the Binomial test we find strong evidence of preservation of both the nodes and the edges of the networks for all four comparisons; between ARIES and BiB white British cord blood, between BiB white British and Pakistani cord blood, between birth and 7 years in ARIES, and between 7 years and 15-17 years in ARIES.Results are detailed in Table 4.
Figure 7: A: circos plot illustrating genomic distribution of trans correlations r>0.9 in ARIES 7 year olds (circos plots for all datasets can be found in supplementary figure 8).B: cytoscape network plot illustrating network connectivity of trans correlations r>0.9 in ARIES 7 year olds (cytoscape network plots for all datasets can be found in supplementary figure 9).

Enrichments of cis co-methylated sites
To illustrate the potential utility of strong cis co-methylation, enrichment analyses were conducted to assess the enrichment for DNAm sites being located within specific transcription factor binding sites (TFBS), chromatin states, and genomic regions.All pairs of DNAm sites correlating R>0.9 and within 1Mb of each other were included in this analysis.Enrichments were virtually identical for all 5 datasets.Cis correlating sites are strongly enriched for chromatin states associated with poised promoters (PromP; OR=3.9 to 4.8, p=1.1e-09 to 2.4e-17), and less strongly enriched for weak enhancers (EnhW1; OR=2.1 to 2.7, p=1.4e-03 to 8.4e-05, and EnhW2; OR=1.9 to 2.2, p=0.03 to 0.02).(shown in supplementary figure 11).As such, co-methylation of trans sites may relate to active transcription.This analysis was performed with the 860 inter-chromosomal trans correlations r>0.9 in the ARIES 7 year olds, and the 3577 in the BiB white British dataset.We find a strong enrichment of interchromosomal Hi-C contacts in the real correlation data as compared to 1000 permutations of the data, with no permutation set having a higher count of overlaps than the real data in either ARIES or BiB (n=46 in ARIES and n=654 in BiB; p=<2.2e-308 for both datasets) (Figure 10).This suggests that correlation of DNAm sites across chromosomes is at least in part likely to be related to interchromosomal contacts, which would mean that coordinated methylation states between DNAm sites on different chromosomes could be functionally relevant to inter-chromosomal chromatin contacts.subgroups.It suggests that co-methylation of DNAm sites may be related to specific aspects of genome regulation, where sites in cis are fundamentally different to co-methylation of sites which are distant or on different chromosomes.This includes the novel observation that cis co-methylation structure is not generally substantially reduced when adjusting for the strongest cis-mQTL.

Cis correlating
Through enrichment analyses we find that co-methylation is likely to be related to short RNA transcription associated with RNA polymerase III in cis; this might indicate that co-methylation is driven by environmental effects or TFs, which could be correlated with genetic effects (58).We have demonstrated in humans that trans-correlating DNAm sites are likely to represent interchromosomal contacts or regulation by multiple transcription factors, and thus they are likely to represent shared regulation.This is consistent with recent published evidence of correlation between DNAm sites in inter-and intra-chromosomal chromatin contact regions in mice ( 16).
This work suggests that DNA co-methylation is relatively stable across the groups that we have included in this study.In all groups, co-methylation is weak (R <0.2,>-0.2) between the vast majority of DNAm sites (83-87%), with a greater proportion of cis than trans co-methylated sites having very strong (>0.9)co-methylation, illustrating the role of close physical proximity in co-methylation.To test how well co-methylation between specific sites replicated, we restricted to correlations >0.8 in each dataset, and found that the magnitudes of correlation for sites in cis and trans are similar between ARIES and BiB White British (cord blood), and between White British and Pakistani (cord blood) in BiB.Cis and trans correlations were also similar in ARIES between birth and age 7 (with cis and trans mean differences close to the null), and between age 7 and 15 there appears to be a slight increase in correlation.However, we acknowledge that some of these differences in means had wide confidence intervals.The stability we have identified is important because it means these DNAm sites might also be reliably co-methylated in other datasets -this would mean co-methylation structure could have the potential to be more broadly applied to EWAS and DMR analyses.However this would need to be tested in older age groups, in more social groups including a greater range of ethnic groups, and in individuals born and residing outside of the UK.
We demonstrate that although cis co-methylation is distance-based, there is a large degree of variation.This is likely to reflect the fact that we find strong cis correlations are related to genomic regions (most strongly promotors and 5'UTRs).Although we show that DNAm sites that have strong cis correlations are highly heritable and are associated with cis mQTLs, we have shown that adjusting for the strongest cis mQTL does not substantially impact cis co-methylation structure; this is consistent with co-methylation structure not mirroring LD.Although we show there are differences in co-methylation between the five datasets we use in this study, we found that the biological meaning behind cis co-methylation structure is consistent across studies.From our enrichment analysis it appears that cis co-methylation is enriched for being located only at binding sites of transcription factors essential for RNA polymerase III transcription; this is supported by the recent demonstration that cis mQTL SNPs overlap TFBS (7,56).Coordinated DNAm states of locations in close proximity are therefore likely to be involved in regulation of the transcription of short RNAs involved in essential cellular functions (such as protein synthesis and transport) (65,66), and, most relevant to the datasets we use in this study, RNA polymerase III transcription has been shown to be a determinant of growth of both the cell and the organism (67).
We also found consistent biological enrichment between datasets for DNAm sites at least 1Mb apart that are strongly co-methylated.Strong correlations between DNAm sites on different chromosomes are shown here to be strongly enriched for inter-chromosomal contacts.Coordinated methylation states have been shown previously between contacting inter-chromosomal regions of the genome in mice (16); we demonstrate that this phenomenon can be identified using existing DNAm microarray and Hi-C data in humans.This is very much in line with recent work which shows co-methylated DNAm sites on the same chromosome are enriched for chromosomal loop contact sites (36).One might in fact expect the true number of inter-chromosomal contacts between highly co-methylated inter-chromosomal DNAm sites to be higher than reported here, as methods such as Hi-C do not pick up many inter-chromosomal contacts due to the greater distance between them than between cis contact sites (68).The enrichment for a multitude of TFBS suggests inter-chromosomal comethylation may be related to transcription factor networks, which have key roles in genome regulation (69) and have been shown to mediate inter-chromosomal chromatin contacts (70).
Finally, highly co-methylated trans sites are influenced almost entirely by non-genetic factors -that makes co-methylation an interesting factor to study in relation to environmental influences on genome regulation, given the chromatin and transcription factor enrichments for these sites.
We acknowledge that there are some limitations to this work.The first is that our datasets only span birth to adolescence; we do not know how applicable our results will be to adults.Differences between cohorts (ARIES and BiB cord blood datasets) are likely to have been influenced by the use of different array platforms, smaller numbers of BiB participants, and diversity of sample types in ARIES and phenotypic plasticity which may resulted in more correlations >0.5 in both groups of BiB than in ARIES.This study only considered DNAm in blood -DNAm is cell-type-and tissue-specific, and as such the applicability of these results to DNAm in other tissues may be fairly limited (71); although it is likely to be better for the highly heritable strong cis correlations (1).Of particular importance is a demonstration that trans-chromosomal promoter contacts have been shown to be cell-type specific (72), and so future work may benefit from utilisation of less heterogeneous cell type populations than blood or advanced deconvolution methods to investigate the functions of trans-chromosomal co-methylation.Finally, our study utilised DNA methylation arrays, which measure only 2-4% of DNAm sites and contain an over-representation of specific types of genomic regions, including CpG island regions, promotors, and enhancers.Whilst we have corrected for this overrepresentation in our downstream analyses, future work needs to ascertain whether these conclusions hold in genome-wide data.

Conclusions
DNAm has a stable co-methylation structure in humans that persists to at least adolescence and across social groups (here represented by both cohort and ethnicity).Cis co-methylation is likely to be related to short RNA transcription that is associated with RNA polymerase III.Trans comethylation is highly enriched in regions of inter-chromosomal contacts, and for the binding sites of multiple transcription factors, suggesting that co-methylation may have a role in 3D genome regulation.

Participants
Data were taken from participants of two birth cohorts: the Avon Longitudinal Study of Parents and Children (ALSPAC) (63,64), and the Born in Bradford study (BiB) (73) (Table 5).Detailed descriptions of the cohorts can be found in the Supplementary Methods (63,64,73);.In brief, ALSPAC is a multigenerational cohort study based in the Bristol area, comprising mostly white British participants.The original cohort were 14,541 pregnancies with a predicted delivery date between April 1991 and December 1992.A subsample of participants (known as ARIES) of 1022 mother-child pairs had DNAm data generated at five timepoints: birth, 7 years and 15-17 years in the children, and during pregnancy and 12-18 years later in the mothers.Our study utilises the three child timepoints only.
BiB is a longitudinal, multi-ethnic cohort study based in Bradford, UK.The original cohort were 13,776 pregnancies with a predicted delivery date between March 2007 and November 2010.Like ALSPAC it was set up to investigate factors which influence child health and development, but with a particular focus on child morbidity and mortality, as rates of these have been higher in Bradford than the rest of the UK (73).Bradford has a high rate of economic deprivation -one third of the neighbourhoods in Bradford are in the most deprived 10% of neighbourhoods in England (74,75).
Around 20% of the population are of South Asian descent, and 90% of these individuals are of Pakistani origin (73); the BiB DNAm subsample was specifically designed to be multi-ethnic, and so of those eligible, 500 White British and 500 Pakistani mothers were selected to have DNAm generated for themselves and their children.Correlation of all sites on the 450k array DNAm sites were correlated using the biweight mid-correlation (83-86), a median-based method.As Pearson correlation is mean based it may not be suitable for DNAm data, which may be influenced by genotype and so form clusters.In ARIES, a correlation matrix of 394,842 x 394,842 yields 77,949,905,061 unique correlations when we remove the diagonal of the matrix.In BiB there were 68,374,355,910 unique correlations.Due to size limits in R it is not possible to create a single matrix containing all pairwise correlations.As a solution, the DNAm sites were split into blocks of 25,000, and all blocks were correlated against each other.To assess the features of correlating pairs, the correlations were then split by value, from -1 to 1, in increments of 0.1.This enabled analysis of the features of all correlations of different strengths (ie, do high correlations differ from low correlations?).The process is summarised in Figure 11 and code is available here: (https://github.com/shwatkins/PhD/tree/master/450k_correlation_analysis)

Estimation of decay of cis correlations
To assess the decay of the cis correlation structure, decay plots of cis correlations were created across all chromosomes, and for each chromosome separately based on the method used in (32).All correlations where the DNAm sites on the same chromosome were within 10kb were extracted from the data.As we hypothesised that negative correlations may not have the same structure as positive cis correlations, positive and negative correlations were separated.The pairwise correlations were then binned; to display all chromosomes, specifying at least 4000 pairwise correlations per bin for positive correlations into between 469 and 578 bins, and 1000 pairwise correlations for negative correlations into between 800 and 827 bins.Once the correlations were grouped into bins, the mean pairwise correlation for each bin, the standard deviation of the correlation values in each bin, and the median pairwise distance between the correlating pairs of DNAm sites per bin, were calculated (supplementary figure 2).

Preservation of correlations across datasets
Mean difference plots were used to assess the preservation of high correlations between ARIES and BiB, and between the two ethnic groups in BiB.For each pair of DNAm sites, the mean of the correlation of the two groups is plotted against the difference in correlation between the two groups.95% confidence intervals are calculated using the difference, illustrating confidence in the mean difference estimates (87).

Heritability and environmental contributions to co-methylation
To assess the impact of heritability on DNAm correlations, the estimates of heritability and environmental influences on DNAm created by (1) were used to estimate the proportion of sites in each correlation band that were influenced by genetic, unique environmental, and shared environmental factors.The contribution of these influences were assembled for a unique list of all DNAm sites which featured in each correlation range (-1 to 1, in increments of 0.1).

Overlap of mQTLs with strength of co-methylation
We identified whether, for each correlating pair, neither, one or both of the DNAm sites were associated with an mQTL.For each correlating pair, each DNAm site was assigned 0 if it was not associated with any SNPs in the GoDMC dataset (55), and a 1 if there were one or more SNP associations.The value was summed for the two sites in a correlating pair, which resulted in 0 if neither DNAm site was associated with a SNP, 1 if only one of the DNAm sites was associated with a SNP, and 2 if both DNAm sites were associated with a SNP.This was done separately for cis and trans correlations, and totalled over all pairs in each correlation range, to illustrate the distribution of mQTLs across values of correlation.Please note that this does not identify whether both DNAm sites are associated with the same mQTL.

Removing mQTL influence from cis correlations
To illustrate some of the impact mQTLs have on co-methylation, we adjusted the cis correlation decay plot for the strongest cis mQTL associated with each DNAm site, thereby removing the strongest single genetic influence on DNAm correlations.For this analysis, we used chromosome 10 as an example.The analysis was performed adapting GoDMC analysis scripts (https://github.com/MRCIEU/godmc).This analysis uses an allele count file and a SNP frequency file created through plink2 (88) to adjust the DNAm matrix for the strongest cis-mQTL for each DNAm site, using an additive model.The residuals were then taken forward to calculate the decay of the correlations.The DNAm values which did not have an associated mQTL were not adjusted and were not included in the cis decay plot.

Enrichment analyses of strongly co-methylated sites
To identify whether DNAm sites which form strong correlations overlap with genomic sites of interest, we used the locus overlap package LOLA version 1.10.0(89).LOLA assesses enrichment based on genomic regions rather than genes.A list of unique DNAm sites correlating >0.9 formed the test dataset; all sites in the analysis formed the background.Because binding of transcription factors is enriched in GC-rich areas of the genome, the content of the background set was reduced and matched to the GC content of the test set using frequency quantiles.Genomic locations of the DNAm sites were taken from the IlluminaHumanMethylation450kanno.ilmn12.hg19R package version 0.6.0(90).Start and end sites were computed as -500bp and +500bp from the DNAm site position, respectively.A radius of 1kb was thought to be appropriate overlap because DNAm sites within 1-2kb are highly correlated.We used region sets created by the LOLA team (

Assessing trans correlations for chromatin contacts
Previous work (16) has illustrated that inter-chromosomal regions which connect have correlated DNAm states.To test whether highly correlating regions are enriched for chromatin contacts, we adapted the GoDMC analysis pipeline (https://github.com/MRCIEU/godmc_phase2_analysis/tree/master/13_hi-c) to test for chromatin contact enrichment, using a publicly available chromatin contacts map (92).For each of the pairwise contacts in the dataset (92) a 1kb region for the two contacting areas of the genome was generated.
These are split into files containing all interactions for all possible pairs of chromosomes (for example all contacts between chromosome 1 and chromosome 18).This resulted in 231 files containing all inter-chromosomal contacts on the autosomes.Next, the inter-chromosomal correlations r>0.9 in the relevant dataset (e.g.ARIES 7 year olds) had a 500bp region defined either side of the DNAm sites.We identified which correlating pairs overlapped with the contact regions (92).
To ascertain whether there were more contacts in our data than expected by chance we created a permuted dataset, where from the original data the second DNAm site in the highly correlating pair is replaced randomly with another in the dataset.Broken pairs that match a pair from the original dataset were removed, as are duplicates to avoid double counting.Then overlaps with the HiC data (92) were calculated for the permuted data.This process was repeated 1,000 times, the permuted datasets were merged together, and a distribution of overlap counts was created for the permuted data.This distribution was used to create a p value for the overlap of the permuted distribution chromatin contact overlaps with the number of overlaps in the real data.

Figure 1 :
Figure 1: Bar plot showing the distribution of biweight mid-correlation values between all filtered DNAm sites

Figure 3 :
Figure 3: A: Density plots illustrating the proportion of DNAm variation due to heritability for DNAm sites with

Figure 4 :
Figure 4: (A): Bar plots of the percentage of pairwise correlations in each correlation range that have 0, 1 or 2

Figure 5 :
Figure 5: Density plots illustrating the proportion of DNAm variation due to common environment and unique

Figure 6 :
Figure 6: A: Density plot showing the mean standard deviation of cis and trans DNAm sites with correlations DNAm sites are strongly enriched for a select few TFBS in blood: RNA Polymerase III (Pol3) (OR=10.3 to 18.7, p=2.4e-05 to 4.6e-09), BRF1 (OR=10.5 to 18.5, p=2.9e-05 to 5e-09), and BDP1 (OR=6.8 to 11.4, p=2e-04 to 1.5e-06), with weaker enrichment of TFIIIC-110 and RPC155.Heatmaps of TFBS and chromatin enrichments are shown in Figure8.Cis sites showed strongest enrichment for location in promoters (OR=1.62 to 2.5, p=6.2e-06 to 1.1e-17) and 5'UTR (OR=1.6 to 2, p=0.02 to 9.8e-09), with weaker enrichment in exons (OR=1.4 to 1.6, p=0.02 to 9.7e-06) (shown in supplementary figure10).As BRF1 and BDP1 are essential for RNA polymerase III transcription, these enrichments suggest that coordination of methylation state of sites in close proximity is primarily a feature of active promoter regions involved in regulation of short RNAs essential for cellular function transcribed by RNA polymerase III.

Figure 8 :
Figure 8: Enrichments of cis-correlating sites r>0.9 for A chromatin states and B transcription factor binding

Figure 9 :
Figure 9: Enrichments of trans-correlating sites r>0.9 for A chromatin states and B transcription factor binding

Figure 10 :
Figure 10: A: In ARIES the overlap in the permuted datasets is either 0 or 1, whereas there are 46 overlaps in

Figure 11 :
Figure 11: Overview of creating the correlation matrix and extracting pairwise correlations (numbers are

Table 1 :
Table of the numbers (percentages) of cis and trans biweight mid-correlations in each correlation band, comparing all ARIES and BiB datasets.

Table 3 :
Numbers of cis and trans correlations r>0.9, and the number of unique DNAm sites these correlations are between.Cis and trans sites that have correlations >0.9 display very different patterns of variance.Cis sites are more variable in both mean methylation level and variance, although there is a tendency for hypoand hyper-methylated sites to have smaller variance.Trans correlated sites are typically hypomethylated, and have low variance.As one might expect DNAm sites under trans-acting

Table 4 :
Node and edge intersections for trans-correlating DNAm sites r>0.9.Nodes are DNAm sites and edges are the specific pairwise correlations between two nodes.Intersection means the node, or the specific correlation between two nodes, is present in both datasets A and B. N = number of intersections; p-values assess whether there are more overlaps than expected by chance, using the Binomial exact test.

Table 5 :
Overview of study participants and sociodemographic characteristicsBiB participants were genotyped using either the Illumina HumanCoreExome Exome-24 v1.1 microarray, or the Infinium global screen-24+v1.0array.GenomeStudio 2011.1 was used to preprocess samples; exclusions and imputation procedures are detailed in the supplementary methods.