Imputed DNA methylation outperforms measured loci associations with smoking and chronological age

Multi-locus signatures of blood-based DNA methylation are well-established biomarkers for lifestyle and health outcomes. Here, we focus on two CpGs that are strongly associated with age and smoking behaviour. Imputing these loci via epigenome-wide CpGs results in stronger associations with outcomes in external datasets compared to directly measured CpGs. If extended epigenome-wide, CpG imputation could augment historic arrays and recently-released, inexpensive but lower-content arrays, thereby yielding better-powered association studies.


Main
Over the past decade, Illumina array technology has led the profiling of DNA methylation (DNAm) in large cohort studies 1,2. This began with the measurement of ~27,000 CpG loci (27K array), followed by ~450,000 (450K array) and most recently 800,000+ CpG sites (EPICv1 and EPICv2). The vast majority of content on the smaller 27K and 450K arrays is also contained on the EPIC arrays 3,4. By leveraging the widespread correlations across the methylome 5,6, it may be possible to derive an imputation framework to augment existing datasets with missingness or those that were generated using historic arrays.
The requirement for such a tool is further emphasised by the recent development and launch of the Illumina Methylation Screening Array (MSA) 7. The MSA is the most affordable DNAm array to date and was designed specifically for application in large biobank studies. The MSA assesses DNAm at 269,094 sites, of which 145,318 are present on the EPICv2 array 8. The MSA therefore contains new content with enrichment for sites linked to regulatory and cell-specific chromatin states 8. Clearly, this will create major challenges when trying to replicate findings or meta-analyse data across array technologies. However, if one can identify ways to impute content, this will not only benefit cohorts with existing data, but will also afford an opportunity to assess DNAm at greater scale, via a less expensive method, prior to boosting content through imputation. Finally, given that imputation considers CpG information from multiple loci, this averaging process may lead to fewer spurious outliers at well-imputed sites, resulting in better-powered association studies and improved multi-CpG biomarkers, such as epigenetic clocks 9.
Building an accurate DNAm imputation server is therefore of immense value to the research community. However, developing this tool is non-trivial both in terms of scale (~1 million unique CpG sites to consider) and determining imputation quality for traits that are both dynamic and influenced by multiple factors.
To highlight the potential of the work, we present pilot findings for two CpG loci that are established blood-based correlates of chronological age (cg16867657, ELOVL2) 10,11 and smoking behaviour (cg05575921, AHRR) 12,13. Similar to polygenic risk scores, methylation-based predictors can be derived as linear, weighted additive combinations of CpGs. Hereafter, we refer to these as Epigenetic Scores (EpiScores). EpiScores for the two CpGs were derived using data from 18,869 volunteers from the Generation Scotland cohort. DNAm was profiled from blood samples collected between 2006 and 2011, when individuals were aged between 17 and 99 years (11,098, 58.8% female; Supplementary Figure 1). The EPICv1 array was used to measure DNAm; full details of the processing and quality control are presented elsewhere 14 and briefly summarised in Online Methods. There were methylation estimates available for 752,722 CpGs after quality control. These were subset to loci present on the 450K array to maximise backwards compatibility, and further filtered to the target locus (cg16867657 or cg05575921) and the 200,000 most variable probes (after excluding the target CpG) for computational efficiency and to remove invariant CpG sites. In a final quality control step, each CpG M-value was pre-adjusted for sex, analysis batch and the first 10 genetic principal components 15 via linear regression in R, with the resulting residuals taken forward for the main analysis.
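The preprocessing steps described above (M-value scale, selection of the most variable probes, and pre-adjustment for covariates by taking regression residuals) can be sketched as follows. This is a minimal standard-library Python illustration, not the authors' R pipeline. All function names, the clamping constant `eps` and the toy probe IDs are assumptions made for the example, and categorical covariates such as sex or batch would first need numeric (dummy) coding.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta-value (0-1) to an M-value, log2(beta / (1 - beta))."""
    b = min(max(beta, eps), 1.0 - eps)  # clamp to avoid log(0); eps is an assumed constant
    return math.log2(b / (1.0 - b))

def variance(values):
    """Sample variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def top_variable_probes(m_values, target, k):
    """Return the k most variable probe IDs, excluding the target CpG."""
    candidates = [p for p in m_values if p != target]
    return sorted(candidates, key=lambda p: variance(m_values[p]), reverse=True)[:k]

def residualise(y, covariates):
    """Residuals of y after ordinary least squares on an intercept plus covariate columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in covariates] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    coef = [A[r][p] / A[r][r] for r in range(p)]
    return [y[i] - sum(b * x for b, x in zip(coef, X[i])) for i in range(n)]

# Toy example with hypothetical probe IDs and M-values.
toy = {
    "cg_target": [0.5, -0.5, 0.2, -0.2],
    "cg_a": [0.0, 0.0, 0.0, 0.1],
    "cg_b": [-2.0, 2.0, -2.0, 2.0],
    "cg_c": [-1.0, 1.0, -1.0, 1.0],
}
selected = top_variable_probes(toy, target="cg_target", k=2)
adjusted = residualise([5.0, 7.0, 9.0, 11.0], [[1.0, 2.0, 3.0, 4.0]])
```

In this toy example `selected` keeps the two probes with the largest variance, and `adjusted` is numerically zero because the outcome is an exact linear function of the single covariate.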
Elastic net penalised linear regression was used to derive EpiScores using the biglasso package (version 1.3.7) 16 in R (version 4.0.3). The target CpG (cg16867657) was specified as the outcome variable with the 200,000 most variable CpGs as the predictors. This was then repeated with cg05575921 as the outcome. Twenty-fold cross-validation was applied to obtain the optimal lambda (shrinkage parameter) that minimised the mean cross-validated error. The resulting models had 65 non-zero coefficients for both cg16867657 and cg05575921. These coefficients are presented in Supplementary Table 1.
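To make the model-fitting step concrete, the sketch below fits an elastic net by cyclic coordinate descent in standard-library Python. It is an illustration of the general technique rather than the biglasso implementation used here. It assumes the glmnet-style objective (1/2n)·RSS + λ[α‖β‖₁ + (1−α)/2·‖β‖₂²], a mixing parameter α = 0.5 (the value used in the study is not stated above), and a fixed λ in place of the 20-fold cross-validation search. The simulated data and all names are hypothetical.

```python
import random

def soft_threshold(z, gamma):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net(X, y, lam, alpha=0.5, n_iter=200, tol=1e-8):
    """Cyclic coordinate descent for elastic net; returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]  # centred predictors
    col_ss = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    resid = [yi - y_mean for yi in y]  # residuals for beta = 0
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # invariant predictor carries no information
            # Correlation of predictor j with the partial residual (excluding j).
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_ss[j] + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_means))
    return intercept, beta

# Simulated example: 8 predictors, only columns 0 and 3 carry signal.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0.0, 0.1) for row in X]
intercept, beta = elastic_net(X, y, lam=0.1, alpha=0.5)
nonzero = [j for j, b in enumerate(beta) if b != 0.0]
```

The non-zero entries of `beta` play the role of the EpiScore weights. In practice λ would be chosen by cross-validation, as in the analysis described above.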
The predictors were then tested in external datasets. The age-related CpG EpiScore was tested in the publicly available dataset used by Hannum et al. 17 to derive one of the first epigenetic clocks (GSE40279, 450K array). After downloading the data (n=656 individuals aged 19 to 101 years), the model weights were applied (65/65 CpGs present) and the EpiScore was derived. The measured CpG and CpG EpiScore were highly correlated with each other (Pearson r = 0.90, P = $8.1 \times 10^{-238}$). Figure 1 shows that the imputed CpG EpiScore yielded a stronger, more significant correlation with chronological age than the measured CpG: Pearson $r_{\text{EpiScore}}$ = 0.88 (P = $8.4 \times 10^{-214}$) versus $r_{\text{cg16867657}}$ = 0.83 (P = 7.4x10 - [...]). A further dataset (GSE246337), which contained 500 individuals from the Mass General Brigham (MGB) Biobank, was also considered. This cohort was evenly divided by sex and represented ages 18-99 years and a broad range of ethnicities. The dataset was subset to 437 individuals with no missing data for age or the 59 CpGs. Here, the CpG-EpiScore correlation was 0.96 (P = $2.0 \times 10^{-239}$) with age correlations of $r_{\text{EpiScore}}$ = 0.94 (P = $3.7 \times 10^{-204}$) and $r_{\text{cg16867657}}$ = 0.93 (P = $1.6 \times 10^{-185}$). When the cohort was stratified into age decades, the EpiScore-CpG correlation decreased (r range = 0.64-0.81) but remained larger than the within-strata associations between either variable and age (Supplementary Table 2).
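Applying the trained weights to an external dataset and comparing the result with the measured CpG amounts to a weighted sum followed by a Pearson correlation. The sketch below shows one way this could look in standard-library Python. The weights and probe IDs are hypothetical placeholders (the real coefficients are in Supplementary Table 1), and skipping CpGs that are absent from the test array is an assumption about how missing predictors are handled.

```python
import math

def episcore(sample, weights, intercept=0.0):
    """Weighted sum of CpG values; CpGs missing from the sample are skipped (assumption)."""
    return intercept + sum(w * sample[cpg] for cpg, w in weights.items() if cpg in sample)

def coverage(sample, weights):
    """Return (number of model CpGs present in the sample, total model CpGs)."""
    present = sum(1 for cpg in weights if cpg in sample)
    return present, len(weights)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical weights and one hypothetical sample lacking one model CpG.
weights = {"cg_a": 0.5, "cg_b": -0.25, "cg_c": 1.0}
sample = {"cg_a": 2.0, "cg_b": 4.0}
score = episcore(sample, weights)
present, total = coverage(sample, weights)
r_demo = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Reporting `coverage` alongside the score mirrors the "65/65 CpGs present" check described above.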
The smoking-related CpG EpiScore was tested in 895 individuals from the Lothian Birth Cohort 1936 19. All individuals were born in 1936 and had a mean age of 70 years (SD = 0.8), with 442 (49.4%) females, when blood samples were obtained. DNAm was profiled from these samples using the Illumina 450K array 20. Details of quality control are presented in the Online Methods. Smoking pack years information was obtained by self-report questionnaires and calculated as years smoked (age stopped minus age started smoking) multiplied by the number of 20-cigarette packs smoked per day. This information was available for 881 of the 895 participants and underwent a log(pack years + 1) transformation to reduce skew. Of the 65 CpGs, 64 were available in LBC1936 for the projection of the EpiScore.
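The pack years definition and its transformation can be written out as a short worked example. This sketch assumes the natural logarithm (the base is not stated above) and takes packs per day as a direct input. The function names are illustrative only.

```python
import math

def pack_years(age_started, age_stopped, packs_per_day):
    """Years smoked multiplied by the number of 20-cigarette packs smoked per day."""
    return (age_stopped - age_started) * packs_per_day

def log_pack_years(py):
    """log(pack years + 1) transform to reduce skew; natural log is assumed."""
    return math.log(py + 1.0)

# Example: smoked one pack per day from age 20 to age 40.
py_example = pack_years(20, 40, 1.0)
transformed = log_pack_years(py_example)
```

Adding 1 before taking the logarithm keeps a value of zero pack years at zero after transformation.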
The Pearson correlation between the measured cg05575921 and its EpiScore was r = 0.87 (P = $4.8 \times 10^{-275}$) in the whole population and r = 0.61 in the sub-group of n = 410 never-smokers. This is plotted in Figure 1 alongside boxplots of the measured CpG and EpiScore against self-reported smoking status. The measured CpG and its EpiScore associations with smoking pack years were r = -0.62 (P = $8.8 \times 10^{-94}$) and r = -0.60 (P = $2.4 \times 10^{-87}$), respectively.
The EpiScore also showed a better classification of current versus never smokers (assessed by self-report, n = 109 and 410, respectively) in the same population: area under the receiver operating characteristic curve of 0.971 versus 0.954. A sensitivity analysis training the EpiScores using DNAm beta-values in place of M-values made minimal difference to the results (Supplementary Table 3).
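The area under the receiver operating characteristic curve used for this comparison equals the probability that a randomly chosen case scores higher than a randomly chosen control. The sketch below computes it by direct pairwise comparison in standard-library Python. Because cg05575921 methylation is negatively associated with smoking, values are negated so that higher scores indicate smoking. The methylation values shown are invented for illustration and are not study data.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical methylation values: current smokers tend to be lower.
current = [0.55, 0.60, 0.70]
never = [0.85, 0.80, 0.65, 0.90]

# Negate so that a higher score means "more likely a current smoker".
auc_demo = auc([-v for v in current], [-v for v in never])
```

The same function could be applied to the measured CpG and to its EpiScore to compare their discrimination on identical samples.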
Together, these findings show that imputation of CpG methylation from other CpG sites leads to stronger and more statistically significant associations with two important outcomes for health research: age and smoking. While the imputation success at the selected sites is partly driven by their well-established associations with age and smoking, these findings motivate further work to assess how well the approach generalises across all CpGs present on Illumina arrays. In addition, family structure and relatedness were not accounted for within the Generation Scotland training cohort, which may have led to information leakage across folds and overfitting. However, we tested the resulting EpiScores in external datasets where the DNAm was also processed and normalised independently. Further tests need to be carried out to ensure that the resulting signatures translate across diverse populations.
Here, the LBC test cohort contained individuals of Scottish ancestry while the Hannum dataset contained a mixture of European and Hispanic ancestry individuals (n = 426 and 230, respectively) and GSE246337 contained a mix of European-, African-, Asian-, and Hispanic-ancestry individuals. Subsetting to CpGs that are commonly found on the 450K, EPIC and MSA arrays prior to training EpiScores would maximise the gains for all cohorts.
Further subsetting this list to loci that have similar patterns (e.g., mean and SD by age and sex) across populations, as well as exploring the properties of well-imputed sites (e.g., by genomic location or SNP-based heritability), will further inform the generalisability of the findings. Future studies should also focus on incorporating genotypic contributions to CpG variability 21 or more flexible imputation approaches that can capture non-linear patterns.
In conclusion, the imputation of array-based CpG methylation and plasma proteins is feasible and can lead to larger and more statistically significant effect sizes in association studies for complex traits.