Re-analysis of publicly available methylomes using signal detection yields new information

Hafner, Alenka; Mackenzie, Sally

doi:10.1038/s41598-023-30422-4

Article
Open access
Published: 27 February 2023

Scientific Reports volume 13, Article number: 3307 (2023)

Abstract

Cytosine methylation is an epigenetic mark that participates in regulation of gene expression and chromatin stability in plants. Advancements in whole genome sequencing technologies have enabled investigation of methylome dynamics under different conditions. However, the computational methods for analyzing bisulfite sequence data have not been unified. Contention remains in the correlation of differentially methylated positions with the investigated treatment and exclusion of noise, inherent to these stochastic datasets. The prevalent approaches apply Fisher’s exact test, logistic, or beta regression, followed by an arbitrary cut-off for differences in methylation levels. A different strategy, the MethylIT pipeline, utilizes signal detection to determine cut-off based on a fitted generalized gamma probability distribution of methylation divergence. Re-analysis of publicly available BS-seq data from two epigenetic studies in Arabidopsis and applying MethylIT revealed additional, previously unreported results. Methylome repatterning in response to phosphate starvation was confirmed to be tissue-specific and included phosphate assimilation genes in addition to sulfate metabolism genes not implicated in the original study. During seed germination plants undergo major methylome reprogramming and use of MethylIT allowed us to identify stage-specific gene networks. We surmise from these comparative studies that robust methylome experiments must account for data stochasticity to achieve meaningful functional analyses.

Introduction

Cytosine methylation has been described as the fifth letter of the genetic code¹, bestowing an additional level of information to the genome through regulation of gene expression^2,3 and physical conformation of the DNA molecule⁴. However, the language of the methylome ⁵ has remained much more elusive to interpretation than that of the underlying nucleotide sequence. If we follow the DNA code analog, we must achieve proficiency in methylome decoding in at least three ways: (i) reading at single site resolution, (ii) interpretation of downstream effects of single position methylation status, and (iii) understanding the meaning of different methylation patterns in their local and global DNA context. While it is currently feasible to read cytosine methylation at single base resolution, owing to advances in whole genome sequencing through bisulfite conversion⁶, and to interpret high-density methylome changes, such as in transposable elements (TEs) during major developmental events or in methylation machinery mutants^2,3,7, our proficiency in decoding of the methylome remains lacking.

Much of the early analysis of methylome variation focused on high-density methylation changes within defined intervals across the genome^8,9,10. This methodology was likely a consequence of experimental emphasis on datasets deriving from DNA methylation machinery mutants, which produced extremely robust methylation signal^2,11. These units of change were termed differentially methylated regions (DMRs) and varied across studies for window size, requisite differentially methylated position (DMP) number, and uniform directionality for hyper/hypo-methylation^12,13,14. DMR analysis can identify genomic sites likely to undergo, or be released from, gene silencing¹⁵ and, therefore, often produce datasets rich in TE and heterochromatic genomic intervals. With these analytical approaches, however, the function of low density, intragenic methylation repatterning during development and in response to environmental stimuli remains unapproachable. While there is an extensive understanding of the evolutionary and mechanistic origin of gene body methylation (GbM)¹⁶, debate on its functionality remains¹⁷.

The plant methylome is often described as stochastic¹⁸. Variation is thought to arise from thermodynamic fluctuations of cellular machinery and the DNA molecule itself^5,19,20,21,22. There is inherent stochasticity to methylome remodeling with each cell division, giving rise to ‘spontaneous epimutations’ in all methylation contexts^18,23,24. Noise in BS-seq datasets is also amplified by tissue pooling as the plant epigenome appears to be developmental stage-, tissue- and cell-type specific^25,26,27,28,29,30,31. Therefore, a full picture of environmental or developmental epigenetic responses is only possible by separating treatment signal from noise in the system. This type of approach would make interpretation of methylome data feasible at the organ or organism level, even as single cell BS-seq becomes feasible in plants³², as it is in animals³³.

Stochastic methylome variation is often dismissed as information-free noise, inherent to the study of biological systems³⁴. However, recent advances in computational biology demonstrate that interpreting existing methylome data at single read³⁵ or single cytosine^32,36 resolution yields new information. Single cytosine changes in methylome patterns have been shown to have important phenotypic effects^37,38 and we are beginning to recognize the importance of gene body methylation in plants^16,39. These observations point to a gap between the prevalent methodology and our current understanding of methylome biology.

Signal detection with machine learning is one approach to methylome data that bridges this gap. The MethylIT tool, based on physical statistics approaches, has been effective in identifying treatment-associated differential methylation in multiple study systems^5,36,40,41. To test the relative efficacy of a signal detection-based method versus conventional methods for methylome data analysis, datasets produced in studies with robust experimental design and depth of sequencing can be re-analyzed with MethylIT, potentially increasing their utility. Here, we apply MethylIT, a signal detection pipeline, to two open access datasets of high quality; a study of methylome remodeling in response to phosphate starvation⁴² and an analysis of epigenetic changes during normal seed germination²⁹. This report aims to demonstrate that applying novel approaches to existing high-quality methylome data can reveal meaningful additional information.

Results and discussion

Using signal detection approaches in methylome analysis avoids arbitrary filtering and accounts for stochasticity

Bisulfite deamination of DNA, by conversion of cytosine into uracil to be read as thymine, combined with whole-genome sequencing, has enabled the study of methylome variation in many organisms where an assembled reference genome is available. The cost of whole genome bisulfite sequencing (WGBS) continues to decline, enabling plant biologists to add epigenetic components to developmental and environmental response studies.

Upon completion of the bisulfite sequencing (BS-seq) runs, primary data must be converted to methylation counts for each cytosine in the genome. The first step, an overall quality check and the trimming of sequencing adaptors, is most commonly conducted with Trim Galore!, which performs both functions⁴³. The resulting short sequence reads are then aligned to the reference genome based on the three different bases that result from bisulfite conversion. A prevalent aligner for this step is Bismark⁴⁴, and the output file contains methylated and unmethylated counts for each cytosine in the genome as well as its context.
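The per-cytosine counts described above can be read into a simple record before any downstream analysis. The following is a minimal sketch only: it assumes a tab-separated report with the columns chromosome, position, strand, methylated count, unmethylated count, context and trinucleotide context. That column order, the `CytosineCount` record and the helper names are assumptions made for illustration and should be checked against the aligner's own documentation.

```python
from typing import NamedTuple, Optional


class CytosineCount(NamedTuple):
    chrom: str
    pos: int
    strand: str
    meth: int     # reads supporting a methylated cytosine
    unmeth: int   # reads supporting an unmethylated cytosine
    context: str  # CG, CHG or CHH


def parse_report_line(line: str) -> CytosineCount:
    # Assumed column order (hypothetical for this sketch):
    # chrom, pos, strand, meth, unmeth, context, trinucleotide
    fields = line.rstrip("\n").split("\t")
    chrom, pos, strand, meth, unmeth, context = fields[:6]
    return CytosineCount(chrom, int(pos), strand, int(meth), int(unmeth), context)


def methylation_level(site: CytosineCount) -> Optional[float]:
    # Fraction of reads that support methylation; None when the site is uncovered.
    total = site.meth + site.unmeth
    return site.meth / total if total else None


example = parse_report_line("Chr1\t1001\t+\t8\t2\tCG\tCGA")
example_level = methylation_level(example)
```

Returning `None` for uncovered sites keeps them distinguishable from sites that are covered but fully unmethylated.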

Selection of an analysis pipeline occurs at this stage, which can make a fundamental difference to final data output, both for differentially methylated positions (DMPs) and regions (DMRs). There are more than 20 bioinformatic tools available¹⁴ and they differ significantly in the statistical approaches implemented²⁵. Without advanced statistical mathematics training, it is often easiest to adopt the most user-friendly, but not necessarily most powerful, options. Figure 1a compares MethylIT (a signal-detection pipeline) to the generalized pipeline of multiple other approaches. Figure 1b shows the methylation counts with DMRs (hierarchical clustering approach, blue panel) and treatment-associated DMPs (MethylIT, green panel) for two genes in a seed germination study²⁹. This example demonstrates both the divergence of results when different methods are used and the bias of using a DMP-density-based approach (Fig. 1b, right).

The number of samples or biological replicates varies widely between methylome studies. The limiting factors include the cost per sample for WGBS and the amount of tissue needed (seedlings are often pooled into one sample). Whereas many common methylome analyses (Fig. 1a, blue panel) require only two samples (one control and one treatment), signal detection approaches require multiple (generally 3–5) biological replicates for each experimental condition to allow assessment of background noise (stemming from thermodynamic fluctuation) in the control group. It would not be valid to compare two samples with no replication using signal detection because treatment-associated DMPs would be indistinguishable from DMPs arising from stochastic variation. For this reason, control samples are pooled, and their centroid becomes the reference methylome, representing the background noise not associated with treatment.
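The idea of a reference methylome built from pooled controls can be sketched as follows. This is an illustration rather than the MethylIT implementation: it assumes each control is a dictionary mapping a site to its (methylated, unmethylated) counts, keeps only sites present in every control, and takes the per-site mean of the counts as the centroid. The function name and data layout are hypothetical.

```python
def reference_centroid(controls):
    """Per-site mean of (meth, unmeth) counts over all control replicates.

    controls: list of dicts mapping site -> (meth, unmeth).
    Only sites covered in every control are kept (an assumption of this sketch).
    """
    shared = set(controls[0])
    for sample in controls[1:]:
        shared &= set(sample)
    n = len(controls)
    reference = {}
    for site in sorted(shared):
        meth = sum(sample[site][0] for sample in controls) / n
        unmeth = sum(sample[site][1] for sample in controls) / n
        reference[site] = (meth, unmeth)
    return reference


controls = [
    {("Chr1", 100): (8, 2), ("Chr1", 250): (0, 10)},
    {("Chr1", 100): (6, 4), ("Chr1", 250): (1, 9)},
    {("Chr1", 100): (7, 3)},  # site 250 is missing in this replicate
]
reference = reference_centroid(controls)
```

With three replicates, each control can then be compared against this reference to estimate how much divergence arises without any treatment.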

The first step in identifying differentially methylated sites from all methylation counts is deciding which cytosines provide sufficient coverage to be included in further analysis. This step is largely arbitrary, and the importance of applying a noise-filtering solution here has been discussed before¹³. However, utilizing a smoothing kernel assumes that neighboring sites/regions exhibit correlated methylation levels, which holds only for CG and CHG but not CHH methylation contexts¹³. Both pipelines described in Fig. 1 use an arbitrary cut-off filter for coverage, followed by the identification of differentially methylated positions.
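A coverage filter of the kind described here reduces to a single threshold on the total read count per cytosine. The sketch below assumes the same site-to-counts dictionary layout as a convenience; the default of four reads is a placeholder, not a value taken from either pipeline.

```python
def filter_by_coverage(sites, min_coverage=4):
    """Keep sites whose total read count (meth + unmeth) reaches min_coverage.

    min_coverage=4 is a hypothetical default chosen only for illustration.
    """
    return {
        site: (meth, unmeth)
        for site, (meth, unmeth) in sites.items()
        if meth + unmeth >= min_coverage
    }


sites = {
    ("Chr1", 100): (8, 2),
    ("Chr1", 101): (1, 1),   # only two reads, dropped
    ("Chr1", 250): (0, 4),   # unmethylated but sufficiently covered, kept
}
kept = filter_by_coverage(sites, min_coverage=4)
```

Because the threshold is chosen by the user, changing it changes which cytosines can ever be called differentially methylated, which is the arbitrariness the text refers to.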

Some pipelines, including MethylSig, DMRcaller, and BSmooth^13,46,47, implement tiling bins at this stage. This partitioning of the genome into arbitrary-sized regions pools reads to amplify small methylation level differences to significant levels. Most methods apply an appropriate statistical method after filtering for coverage, primarily Fisher’s exact test, beta-binomial or logistic regression when multiple replicates are considered^13,46,47,48. This step yields a list of potential DMPs that are differentially methylated at the level of significance desired. “True” DMPs are a subgroup of potential DMPs that satisfy arbitrary criteria for differences in methylation levels between samples being compared. This cut-off varies widely between tools and users, ranging from 10 to 40%^12,13,14, and often depends on the cytosine methylation context frequency distribution to minimize some of the bias. After a final list of DMPs is obtained, an optional step is a correction, in the event of greater variability in data than is assumed by the distribution, by determining the treatment effect in the statistical model. What follows is defining a DMR according to the number or density of DMPs present in a region. A region can be a genomic feature, a binned window size (e.g., 100 bp), or a sliding window. However, deciding on the number of DMPs sufficient for a DMR is a controversial step and another arbitrary filter.
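The conventional two-step call described above (a significance test followed by a fixed methylation-difference cut-off) can be illustrated for a single cytosine. This sketch implements a two-sided Fisher's exact test directly from the hypergeometric distribution using the standard library; the 20% difference threshold and the 0.05 significance level are placeholder values within the ranges mentioned in the text, and the function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x methylated reads in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def call_dmp(control, treatment, alpha=0.05, min_diff=0.2):
    """control/treatment: (meth, unmeth) counts at one cytosine."""
    (a, b), (c, d) = control, treatment
    diff = c / (c + d) - a / (a + b)  # treatment level minus control level
    p_value = fisher_exact_two_sided(a, b, c, d)
    return (p_value < alpha and abs(diff) >= min_diff), diff, p_value


is_dmp, diff, p_value = call_dmp((18, 2), (6, 14))
```

Both `alpha` and `min_diff` are set by the user, so the same counts can pass under one set of thresholds and fail under another.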

MethylIT uses an information thermodynamics-derived approach to provide greater resolution of treatment-associated signals without consideration of DMP density⁵. The first step that distinguishes this pipeline is the calculation of the difference in methylation levels between each sample, including the controls, and the reference methylome (the centroid of the controls). In MethylIT, these differences are estimated in the form of Hellinger divergence and total variation distance. Next, the divergences of all samples are fitted to a generalized gamma probability distribution model, yielding separate fitted probability distributions for the control and treatment groups⁵. Others have also proposed using information theory approaches for analyzing the methylome³². In MethylIT, the potential treatment-associated signal lies above the 95th percentile of the fitted probability distribution of methylation divergence, which becomes the initial cut-off for each individual sample. The following step uses machine learning to distinguish true DMPs by estimating the optimal cut-off for classifying DMPs into control and treatment groups (separately for all three methylation contexts, using several performance metrics). In contrast to the prevalent methods, MethylIT assesses each potential DMP based on the fitted probability distribution of methylation divergence in each biological replicate separately, accounting for methylation heterogeneity and inherent stochasticity^40,41. This type of filtering avoids the application of arbitrary cut-offs that can eliminate meaningful DMPs or include DMPs that are the result of stochastic fluctuation, both of which impact outcomes.
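The two divergence measures and the percentile-based initial cut-off can be sketched in a few lines. The definitions below are the textbook total variation distance and squared Hellinger distance between two Bernoulli distributions given by methylation levels; MethylIT's own estimator may weight these by coverage, so this should not be read as its exact formula. The cut-off uses an empirical nearest-rank percentile as a stand-in for the quantile of a fitted generalized gamma distribution, which is not fitted here. All function names are hypothetical.

```python
from math import sqrt


def total_variation(p, q):
    """Total variation distance between Bernoulli(p) and Bernoulli(q)."""
    return abs(p - q)


def hellinger_squared(p, q):
    """Squared Hellinger distance between Bernoulli(p) and Bernoulli(q)."""
    return 0.5 * ((sqrt(p) - sqrt(q)) ** 2 + (sqrt(1 - p) - sqrt(1 - q)) ** 2)


def percentile_cutoff(values, pct=95):
    """Nearest-rank empirical percentile (pct is an integer percentage).

    Stand-in for the quantile of a fitted distribution (assumption of this sketch).
    """
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def potential_signal(divergences, pct=95):
    """Sites whose divergence from the reference exceeds the percentile cut-off."""
    cutoff = percentile_cutoff(list(divergences.values()), pct)
    return {site for site, value in divergences.items() if value > cutoff}


# Toy divergences for 20 sites: site i has divergence i / 100.
toy = {("Chr1", i): i / 100 for i in range(1, 21)}
flagged = potential_signal(toy, pct=95)
```

In this toy input only the single largest divergence lies above the 95th percentile, which shows that the cut-off is derived from the sample's own divergence distribution rather than from a fixed methylation difference.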

Methylome analysis ends with feature annotation. The prevalent method yields a list of DMRs of arbitrary size that contain an arbitrary number of DMPs validated in the previous step. MethylIT also provides a list of DMRs; however, these are identified with the best generalized linear model selected from Poisson and Negative Binomial regression analyses, comparing control and treatment groups, and filtered with minimum methylation counts per cytosine and individual. The final step in both methodologies is to overlap or assess the proximity of DMR genomic addresses with genetic features of interest, including genes, TEs, and exons. Whereas prevalent methods look for overlaps of DMRs with annotated features, MethylIT statistically assesses all features of the selected category (i.e., genes, TEs, exons) as potential DMRs with no fixed set of DMRs to overlap with annotated features.

The resulting number of differentially methylated genes (DMGs) is generally higher using MethylIT than with the prevalent methodology (Table 1), which may raise concerns about insufficient filter stringency. However, the downstream functional analysis combined with MethylIT yields the same core gene networks with or without the most stringent filter cut-offs¹⁴. Figures 2e and 3d demonstrate the modest overlap between the original studies’ DMGs and those from MethylIT, which is the result of MethylIT (i) excluding DMRs caused by stochastic variation, (ii) adding DMRs with lower density DMPs that are nevertheless treatment-associated, and (iii) only searching for DMGs and not other differentially methylated genomic regions for the sake of intelligibility (e.g., a gene can include differentially methylated exons without being identified as a DMG). Additionally, we did not increase stringency simply to decrease the number of output DMGs, as that can exclude biologically meaningful results (Fig. 1b).

Table 1 Summary of DMG numbers identified in the original studies of phosphate starvation⁴² and seed germination²⁹ (DMR) and their overlap with DMGs identified using the MethylIT pipeline (SD).

By comparing pipelines and their data outputs (Fig. 1), we can conclude that the different methods address fundamentally different questions about the methylome. The prevalent, “percentage of methylation”-gated methods endeavor to describe global changes in methylome levels. These changes generally include the percentage of methylated cytosines in each context, densely hypo- or hypermethylated or heterochromatic regions, TE silencing, and the global impact of epigenetic machinery mutants. MethylIT, as a signal detection method, aims to read global changes in methylome patterns at single cytosine resolution. The method, therefore, treats hypo- and hyper-methylation at each site equally, while evaluating each DMP based on the fitted probability distribution of each individual. The approach allows tracking of subtle changes in methylome repatterning without consideration of changes in methylome levels and discriminates variation within the treatment condition from the stochastic background effects present in both treatment and control conditions. Where most common approaches neglect to count DMPs that appear outside of “CpG islands”⁴⁹, which have been shown to have important downstream and phenotypic effects^37,38, the MethylIT procedure, by identifying only treatment-associated DMPs, incorporates all data regardless of methylation context or DMP density. Here, we focus our analysis on DMGs (not differentially methylated TEs or exons), particularly to demonstrate the importance of finding and interpreting intragenic methylation.

Signal detection combined with functional annotation analysis reveals additional information

Phosphate starvation

A 2015 study of epigenetic responses to phosphate starvation in Arabidopsis revealed that methylome repatterning occurs in response to low phosphate (Pi), with altered expression of a small number of differentially methylated genes responsive to low phosphate conditions⁴². We selected this study for re-analysis with the MethylIT signal detection pipeline based on several features of its experimental design. The depth of sequencing was sufficient, each condition had three individual biological replicates with no pooling, controls were rigorous and present in both short- and long-term starvation conditions (7 days and 16 days), and data were collected for roots and shoots separately.

We conducted several pairwise comparisons between different experimental treatments with MethylIT. In all computations, the reference methylome was pooled from three control samples (Fig. 1a), which were always the shorter or no starvation condition. For shoot and root datasets separately, we compared methylomes for 7 days of low phosphate with 7 days high phosphate, 16 days of low phosphate with 16 days high phosphate, and 7 days high phosphate with 16 days high phosphate. The latter analysis was done to identify differentially methylated genes that result from normal development so that they could be later subtracted from the low phosphate treatment datasets. This analysis, using standard settings (see Supplementary Table S1), yielded a higher number of DMGs than the original DMR-based analysis (see Table 1 and Supplementary Table S2). Yong-Villalobos et al. (2015) identified DMPs using an F-test and subsequently DMRs as regions where DMP density exceeded global DMP density. Yong-Villalobos et al. point out the arbitrary nature of defining a DMR and hence conduct further analysis using both DMRs and all DMPs. For valid comparison with MethylIT, we use their reported list of DMRs overlapped with TAIR10 genes (see Supplementary Table S2). As a standard part of our analysis, DMGs were functionally analyzed with DAVID GO⁵⁰ and Cytoscape⁵¹. To compare both methylome analysis pipelines, the DMGs identified in the original study were functionally analyzed alongside the MethylIT output.
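The subtraction of development-associated DMGs and the comparison against the original study's gene list are plain set operations. The sketch below uses placeholder gene identifiers (`geneA` and so on), not genes from either dataset, and the function names are hypothetical.

```python
def treatment_specific_dmgs(treatment_dmgs, development_dmgs):
    """DMGs from the low vs. high phosphate comparison that are not also
    DMGs in the 7-day vs. 16-day high phosphate (development) comparison."""
    return set(treatment_dmgs) - set(development_dmgs)


def shared_dmgs(signal_detection_dmgs, dmr_based_dmgs):
    """DMGs reported by both pipelines."""
    return set(signal_detection_dmgs) & set(dmr_based_dmgs)


# Placeholder identifiers for illustration only.
low_vs_high_pi = {"geneA", "geneB", "geneC", "geneD"}
development_only = {"geneC", "geneE"}
original_study = {"geneA", "geneF"}

pi_specific = treatment_specific_dmgs(low_vs_high_pi, development_only)
overlap = shared_dmgs(pi_specific, original_study)
```

Subtracting before comparing keeps genes that change with normal development from inflating the apparent treatment response.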

Both analyses found tissue-specific methylome responses to phosphate starvation that were more pronounced after longer starvation treatment (Table 1). Figures 2a and 2b show that the pronounced biological processes identified among DMGs were similar yet distinct between the two pipelines. Cellular response to phosphate starvation, identified in both analyses, was only present as an enriched category using signal detection and was more pronounced when development-associated DMGs (differential methylation not associated with phosphate treatment) were subtracted. A subset of genes was unique to each analysis (Fig. 2e), which is the result of MethylIT only including treatment-associated DMGs and excluding stochastic variation. More DMGs overlapped with differentially expressed genes (DEGs) among the signal detection DMGs than among the DMR-DMGs from the original study (Supplementary Table S3). The GO term categories that were uniquely enriched in the DMR-based analysis largely contained DMGs that were also detected using MethylIT but were not the top categories in the MethylIT DMG context (Supplementary Table S4). For example, whereas heme transport was a pronounced category in the original study, its relative importance was diminished in the shoot with MethylIT, despite signal detection also finding these same heme-related genes. Both analyses yielded sulfate metabolism genes, which were also differentially expressed, but this category was not prominent in the original study dataset. A connection between sulfate homeostasis and phosphate starvation has been previously described^52,53.

The MethylIT-derived shoot DMG network produced in Cytoscape confirmed the phosphate, sulfate, carbon fixation and photosynthesis DMGs as vital to the response, together with other stress-responsive genes (Fig. 2a, c). Root DMGs indicated a distinctive growth response to phosphate starvation, with anisotropic cell growth, longitudinal axis specification, cellulose biosynthesis, and protein localization to cortical microtubules. This network analysis also pointed to altered gene expression and a potential plastid response (Fig. 2b, d). Core hub genes produced in Cytoscape from the DMR-based analysis were fewer in number and did not match categories enriched in GO term analysis to the same extent (Supplementary Fig. S1, S2).

This sample comparison demonstrates the power of using signal detection, in combination with several functional analyses, to uncover treatment-associated and biologically meaningful differential methylation patterns. MethylIT confirmed the presence of biological processes identified in the original study and added additional resolution with meaningful connections to other processes. Crucially, the robust biological and developmental controls (sufficient replicates, tissue-specificity, and comparison of high phosphate treated plants after 7 and 16 days) allowed the staggering number of DMGs to be converted into a Pi-starvation-associated subset of manageable size. Network analysis allowed potential key players in this tissue-specific stress response to be revealed.

Seed germination

In 2017, three studies published in Genome Biology investigated methylome remodeling during embryogenesis and germination^27,28,29. Narsai et al. (2017) analyzed five developmental stages and reported progressive demethylation of the Arabidopsis genome during the seed-to-seedling transition. This transition coincided with changes in mRNA and siRNA populations. We selected this dataset for re-analysis with MethylIT based on its robust biological replication, depth of sequencing, and limited discussion of the functional identity of DMRs determined. BS-seq read files were provided for dry seed, seed after 48 h stratification, 6 h after light exposure, 24 h after light exposure, and 48 h after light exposure. The investigators conducted methylome analysis using HOME (v0.1) with default parameters for time series analysis, and the added cut-off for the difference in methylation levels was 20%. In contrast to the original study’s hierarchical clustering, MethylIT analysis was conducted pairwise, with consecutive stages of development serving as the reference (control) methylome for the identification of DMGs in each following stage. As the original study did not include DMG analysis, we overlapped their reported 12,654 DMRs with TAIR10 genes to obtain 891 DMGs (Supplementary Table S5). As in the phosphate starvation study, the seed germination data was likewise functionally analyzed with DAVID GO⁵⁰ and Cytoscape⁵¹ alongside the MethylIT output.

Using standard MethylIT settings (Supplementary Table S1), our re-analysis yielded a comparable number of DMGs to Narsai et al. (2017) in each individual stage (Supplementary Table S5), with the largest number of DMGs present in the first developmental transition and an overall larger number of unique DMGs (Table 1). More DMGs overlapped with differentially expressed genes (DEGs) and genes showing isoform variation during seed germination when the signal detection pipeline was used (Supplementary Table S6). The re-analysis identified the dry-seed-to-stratified-seed transition as the major stage of methylome repatterning in seed germination, both in the number of DMGs and the number of unique biological processes revealed by GO term analysis (Table 1, Fig. 3a, d, Supplementary Table S7). Accompanying the increase in a subset of DEGs and miRNAs at this stage²⁹, methylome changes were targeted to the regulation of gene expression and chromosome and chromatin remodeling (Fig. 3a). In the following three developmental transitions, we observed differential methylation in different GO-term categories, with germinating seed response to light exposure as prominent. Regulation of gene expression and chromatin organization declined in prominence as methylation targets, and methylation itself became an enriched pathway (Fig. 3a).

The identity of the key network hubs identified by Cytoscape also changed with each developmental stage (Fig. 3b, Supplementary Fig. S3-5). Only 22.6% of DMGs present in the first stage hub were also reported in the original study (Fig. 3c, circled in blue) with similar outcomes in other stages (Supplementary Fig. S3-5). In contrast, 80.4% of core hub DMR-DMGs in the original study were also identified with MethylIT (Fig. 3b, circled in green). Many of the key players (core hub genes) are also members of the DNA damage repair pathway, which is linked with chromatin remodeling in the seed⁵⁴. In the DMR-DMGs, regulation of gene expression and cellular metabolic processes were present as categories of enriched Biological Process GO terms; however, the resolution of the signal-detection analysis was absent, as demonstrated by the smaller number of unique GO terms.

When expression data are overlapped with DMR data (Supplementary Table S6), the enriched GO terms in each developmental transition are more informative than those from either dataset alone (Supplementary Table S7). For example, the GO category “photosynthesis” is enriched threefold in DMGs identified by MethylIT after the seed is first exposed to light (stratified seed vs. 6 h in light transition) and is not enriched in DMR-DMGs or RNA-seq data individually. In the overlap of MethylIT DMGs and RNA-seq genes, however, it is fivefold enriched in the stratified seed versus 6 h in light transition and threefold enriched after the next stage of 24 h in light.

This re-analysis again highlights how signal detection can add novel insights to existing datasets and increase their utility. Epigenetic reprogramming in the early stages of seed germination appears to be targeted to gene regulation and chromatin organization genes, which was not evident in the original analysis. As the seed is exposed to light, numerous genes related to a variety of metabolic processes related to photosynthesis become differentially methylated, including those responsible for cytosine methylation itself.

Considerations for meaningful methylome analysis

Statistically meaningful methylome analyses require data coming from well-designed experiments; a poor dataset cannot be salvaged by stronger signal detection and/or machine learning algorithms. Figure 4 summarizes several considerations for an experimental scheme that will yield methylome datasets that can be used in signal detection pipelines and are likely to be amenable to reuse. Considering the often-prohibitive cost of processing robust numbers of biological replicates for BS-seq, we advise at least three biological replicates in the control group be prioritized over more time points or treatment conditions. This replication of controls is vital for allowing the calculation of background variation caused by methylome stochasticity in absence of the investigated treatment. With MethylIT as a signal detection pipeline, it is not appropriate to estimate a reference methylome without three control samples. Although the same number of replicates for the treatment group is advisable in order to confer confidence to treatment-associated DMPs, MethylIT can run the analysis with just one treatment group replicate.

Methylome analyses can yield hundreds or even thousands of differentially methylated genes and finding biological meaning in them can require multifaceted analysis. Searching the datasets for genes associated with the experimental condition investigated, i.e. an ad hoc approach, may uncover a novel epigenetic component of the system but does not, in itself, reveal new connections. However, when GO-term enrichment analysis is accompanied by gene network modeling, a genome-wide picture of the methylome response is constructed. The integration of k-means clustering of protein–protein interaction networks combined with gene ontology enrichment analysis has been shown to be a powerful approach to interpreting epigenetic data^36,55. Figures 2a and 3a show GO-term enrichment obtained with the DAVID Functional Annotation tool⁵⁰. Figures 2b, 3b, and 3c show the Cytoscape output for core gene network hubs that were obtained through unsupervised machine learning k-means clustering from DMGs identified with MethylIT and STRING protein–protein interactions data with Cytoscape. Even though categories of terms overlapped with those from GO-term enrichment analyses, the core gene hubs were not the most GO term-enriched categories, pointing to the importance of multifaceted analysis. These examples from phosphate starvation and seed development datasets demonstrate the need for multi-omics functional analysis that takes advantage of existing knowledge about biological connections for the true decoding of the methylome.

Conclusions

We believe that signal detection, here applied in the form of MethylIT, provides three key advantages over many prevalent methylome analysis approaches:

1. The ability to resolve data to interpretable networks
2. The ability to derive data with unambiguous outcomes (significance of enrichment or p-value discrimination) that provide confidence in conclusions with a minimum of “cherry picking”
3. The ability to derive meaningful new information that was not available using conventional methods, here supporting new pathway connections for phosphate starvation behavior and clearer stage discrimination for epigenetic and developmental transitions during seed development

The above-presented examples also reveal that it is the combination of gene expression and methylome data that is more informative than either dataset separately. Once signal detection-derived DMGs are overlapped with RNA-seq data, a smaller number of key biological processes and gene networks emerge. Crucially, they are at the intersection of the studied phenomenon (phosphate starvation and seed germination) and epigenetics, with C-5 methylation of cytosine always being one of the top scoring GO term enriched categories. MethylIT added numerous functionally important genes to the DMG list in both datasets, while removing stochastic variation-associated ones.

Because MethylIT does not rely on DMP density for the detection of DMRs, the bias of assuming only methylation-dense regions as biologically impactful (present in most methods) is eliminated. Instead, signal detection allows the identification of only treatment-associated DMPs, regardless of the proximity or direction of neighboring DMPs. This allows meaningful analysis of intragenic methylation repatterning, which remains the frontier of methylome analysis as it eludes density-dependent approaches. Our re-analysis also demonstrates the power of data reuse with computational approaches that were not available when data were generated, highlighting the importance of FAIR principles in increasing the utility of expensive datasets, like methylomes.

Methods

DMGs from original paper’s pipeline

To avoid any changes due to discrepancies in the reference genome annotation versions, we overlapped the DMRs reported in the original studies with version 38 of the TAIR10 genome annotation to obtain DMGs.
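As a rough illustration of this overlap step, the sketch below maps DMRs onto gene intervals in plain Python. The coordinates, gene names, and the `Interval`, `overlaps`, and `dmrs_to_dmgs` helpers are hypothetical. The overlap rule is also an assumption: the text does not say whether flanking regions were included, so any overlap of at least one base counts here.

```python
from collections import namedtuple

# Hypothetical container for a genomic interval (1-based, closed coordinates)
Interval = namedtuple("Interval", ["chrom", "start", "end", "name"])

def overlaps(a, b):
    """True if two intervals on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end

def dmrs_to_dmgs(dmrs, genes):
    """Return the sorted names of genes overlapped by at least one DMR."""
    return sorted({g.name for g in genes for d in dmrs if overlaps(d, g)})

# Toy coordinates invented for illustration; not taken from TAIR10
genes = [
    Interval("Chr1", 1000, 2000, "geneA"),
    Interval("Chr1", 5000, 6000, "geneB"),
    Interval("Chr2", 1000, 2000, "geneC"),
]
dmrs = [
    Interval("Chr1", 1900, 2100, "dmr1"),  # overlaps the 3' end of geneA
    Interval("Chr2", 3000, 3100, "dmr2"),  # intergenic in this toy example
]

dmgs = dmrs_to_dmgs(dmrs, genes)
print(dmgs)
```

The nested loop is quadratic and only suits small inputs; a genome-scale analysis would normally rely on an interval tree or a dedicated genomic-ranges library.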

DMGs from MethylIT (signal detection) pipeline

Raw sequencing reads were obtained from the GEO repository for both datasets. They were trimmed with Trim Galore and aligned to the TAIR10 reference genome. DMPs were identified using MethylIT (version 0.3.2.4). Standard settings were used as described on the GitHub page (https://genomaths.github.io/methylit/articles/MethylIT.html), excluding modifications described in Supplemental Table S1. To identify the DMGs, we selected loci with at least three DMPs and a minimum DMP density of 3 per 10 kbp, followed by group comparison using a likelihood ratio test to select loci with log2 fold change > 1 and adjusted p-value < 0.05. Gene annotation was done using version 38 of TAIR10.
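The DMG selection thresholds described above can be sketched as a simple filter. This is not MethylIT code: the function names, the locus records, and all numeric values are hypothetical, and the test statistics are assumed to have been computed upstream. Applying the fold-change threshold to the absolute value of log2 fold change is also an assumption, since the text states only "> 1".

```python
def dmp_density_per_10kb(n_dmps, length_bp):
    """Number of DMPs scaled to a 10 kbp window."""
    return n_dmps * 10_000 / length_bp

def passes_dmp_filter(n_dmps, length_bp, min_dmps=3, min_density=3.0):
    """At least three DMPs and a density of at least 3 DMPs per 10 kbp."""
    return n_dmps >= min_dmps and dmp_density_per_10kb(n_dmps, length_bp) >= min_density

def is_dmg(locus, min_log2fc=1.0, max_padj=0.05):
    """Combine the DMP filter with the fold-change and adjusted p-value cut-offs.

    Assumption: the fold-change cut-off is applied to |log2FC|.
    """
    return (
        passes_dmp_filter(locus["n_dmps"], locus["length_bp"])
        and abs(locus["log2fc"]) > min_log2fc
        and locus["padj"] < max_padj
    )

# Toy loci invented for illustration; not results from either dataset
loci = {
    "locusA": {"n_dmps": 5, "length_bp": 2000, "log2fc": 1.8, "padj": 0.01},
    "locusB": {"n_dmps": 2, "length_bp": 1500, "log2fc": 2.5, "padj": 0.001},  # too few DMPs
    "locusC": {"n_dmps": 3, "length_bp": 20000, "log2fc": 1.5, "padj": 0.01},  # density 1.5
    "locusD": {"n_dmps": 6, "length_bp": 3000, "log2fc": 0.4, "padj": 0.001},  # small change
    "locusE": {"n_dmps": 4, "length_bp": 4000, "log2fc": -2.0, "padj": 0.2},   # not significant
}

selected = sorted(name for name, locus in loci.items() if is_dmg(locus))
print(selected)
```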

Functional DMG analysis

DAVID Functional Annotation tool (6.8)⁵⁰ was used for GO term enrichment analysis of DMGs. GO terms with > fourfold (phosphate starvation) and > tenfold (seed germination) enrichment were plotted on heatmaps, using the ggplot2 package.
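A minimal sketch of the fold-enrichment cut-off follows. It assumes fold enrichment is the ratio of a term's frequency in the DMG list to its frequency in the background gene set; the term names, counts, and helper functions are hypothetical and do not reproduce DAVID output.

```python
def fold_enrichment(list_hits, list_total, pop_hits, pop_total):
    """Frequency of a term in the gene list divided by its background frequency."""
    return (list_hits / list_total) / (pop_hits / pop_total)

def enriched_terms(terms, list_total, pop_total, min_fold):
    """Keep GO terms whose fold enrichment exceeds min_fold."""
    kept = {}
    for term, (list_hits, pop_hits) in terms.items():
        fold = fold_enrichment(list_hits, list_total, pop_hits, pop_total)
        if fold > min_fold:
            kept[term] = fold
    return kept

# Toy counts: term -> (hits in DMG list, hits in background); not real results
toy_terms = {"term_1": (12, 150), "term_2": (10, 600), "term_3": (8, 180)}

# The two thresholds mirror those used for the two datasets
fourfold = enriched_terms(toy_terms, list_total=200, pop_total=27000, min_fold=4)
tenfold = enriched_terms(toy_terms, list_total=200, pop_total=27000, min_fold=10)
print(sorted(fourfold), sorted(tenfold))
```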

DMGs were also functionally analyzed using Cytoscape (3.9.1)⁵¹. The STRING database was used to construct the protein–protein interaction network from imported DMGs. The core hub was identified using the clusterMaker plug-in with the k-means cluster function. Three clusters were calculated with 500 iterations, using Euclidean distance as the metric and the following node attributes as features: betweenness centrality, closeness centrality, average shortest path length, clustering coefficient, degree, and eccentricity. The core hubs depicted in the figures were identified as the cluster with the highest centrality. The size of the node corresponds to the degree of connectivity score and the edge transparency corresponds to the STRING database score.
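The core-hub idea can be sketched with a small k-means in plain Python. This is not the clusterMaker implementation: the node names, the two toy features (degree and a betweenness-like score), the farthest-first initialization, and the rule of picking the cluster with the highest mean degree are all assumptions made for illustration.

```python
import math

def farthest_first(points, k):
    """Deterministic initialization: start at the first point, then repeatedly
    add the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return [list(c) for c in centroids]

def kmeans(points, k=3, iterations=500):
    """Lloyd's k-means with Euclidean distance; stops early once labels are stable."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy node features (degree, betweenness-like score); invented for illustration
nodes = {
    "hubA": (12, 0.90), "hubB": (11, 0.85), "hubC": (10, 0.80),
    "midA": (5, 0.40), "midB": (6, 0.45),
    "leafA": (1, 0.05), "leafB": (2, 0.10), "leafC": (1, 0.02),
}

names = list(nodes)
labels, centroids = kmeans([nodes[n] for n in names], k=3, iterations=500)

# Assumption: "highest centrality" is read as the highest mean degree (feature 0)
core_label = max(range(3), key=lambda c: centroids[c][0])
core_hub = sorted(n for n, lab in zip(names, labels) if lab == core_label)
print(core_hub)
```

With real network attributes on very different scales, the features would likely need standardizing before clustering, since Euclidean distance is otherwise dominated by the largest-valued attribute.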

Data availability

The data that support the findings of this study are available in the supplementary material of this article. These data were derived from the following resources available in the public domain on the GEO repository: phosphate starvation data under accession GSE72770 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72770) and seed germination data under accession GSE94459 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE94459).

References

Lister, R. & Ecker, J. R. Finding the fifth base: genome-wide sequencing of cytosine methylation. Genome Res. 19, 959–966 (2009).
Zhang, H., Lang, Z. & Zhu, J.-K. Dynamics and function of DNA methylation in plants. Nat. Rev. Mol. Cell Biol. 19, 489–506 (2018).
He, L. et al. DNA methylation-free Arabidopsis reveals crucial roles of DNA methylation in regulating gene expression and development. Nat. Commun. 13, 1335 (2022).
Ngo, T. T. M. et al. Effects of cytosine modifications on DNA flexibility and nucleosome mechanical stability. Nat. Commun. 7, 10813 (2016).
Sanchez, R., Yang, X., Maher, T. & Mackenzie, S. A. Discrimination of DNA methylation signal from background variation for clinical diagnostics. Int. J. Mol. Sci. 20, 5343 (2019).
Rauluseviciute, I., Drabløs, F. & Rye, M. B. DNA methylation data by sequencing: experimental approaches and recommendations for tools and pipelines for data analysis. Clin. Epigenet. 11, 193 (2019).
Underwood, C. J., Henderson, I. R. & Martienssen, R. A. Genetic and epigenetic variation of transposable elements in Arabidopsis. Curr. Opin. Plant Biol. 36, 135–141 (2017).
Zhang, X. et al. Genome-wide high-resolution mapping and functional analysis of DNA methylation in Arabidopsis. Cell 126, 1189–1201 (2006).
Cokus, S. J. et al. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 452, 215–219 (2008).
Lister, R. et al. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133, 523–536 (2008).
Law, J. A. & Jacobsen, S. E. Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nat. Rev. Genet. 11, 204–220 (2010).
Akalin, A. et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol. 13, R87 (2012).
Catoni, M., Tsang, J. M., Greco, A. P. & Zabet, N. R. DMRcaller: a versatile R/Bioconductor package for detection and visualization of differentially methylated regions in CpG and non-CpG contexts. Nucleic Acids Res. https://doi.org/10.1093/nar/gky602 (2018).
Yang, X. & Mackenzie, S. A. Approaches to whole-genome methylome analysis in plants. In Plant Epigenetics and Epigenomics: Methods and Protocols (eds Spillane, C. & McKeown, P.) 15–31 (Springer, 2020). https://doi.org/10.1007/978-1-0716-0179-2_2.
Li, J. et al. Epigenetic memory marks determine epiallele stability at loci targeted by de novo DNA methylation. Nat. Plants 6, 661–674 (2020).
Bewick, A. J. & Schmitz, R. J. Gene body DNA methylation in plants. Curr. Opin. Plant Biol. 36, 103–110 (2017).
Zilberman, D. An evolutionary case for functional gene body methylation in plants and animals. Genome Biol. 18, 87 (2017).
Johannes, F. & Schmitz, R. J. Spontaneous epimutations in plants. New Phytol. 221, 1253–1259 (2019).
Severin, P. M. D., Zou, X., Gaub, H. E. & Schulten, K. Cytosine methylation alters DNA mechanical properties. Nucleic Acids Res. 39, 8740–8751 (2011).
Kaur, P. et al. Hydrophobicity of methylated DNA as a possible mechanism for gene silencing. Phys. Biol. 9, 065001 (2012).
Yusufaly, T. I., Li, Y. & Olson, W. K. 5-methylation of cytosine in CG:CG base-pair steps: a physicochemical mechanism for the epigenetic control of DNA nanomechanics. J. Phys. Chem. B 117, 16436–16442 (2013).
Sanchez, R. & Mackenzie, S. A. Genome-wide discriminatory information patterns of cytosine DNA methylation. Int. J. Mol. Sci. 17(6), 938. https://doi.org/10.3390/ijms17060938 (2016).
Becker, C. et al. Spontaneous epigenetic variation in the Arabidopsis thaliana methylome. Nature 480, 245–249 (2011).
Denkena, J., Johannes, F. & Colomé-Tatché, M. Region-level epimutation rates in Arabidopsis thaliana. Heredity 127, 190–202 (2021).
Robinson, M. D. et al. Statistical methods for detecting differentially methylated loci and regions. Front. Genet. https://doi.org/10.3389/fgene.2014.00324 (2014).
Kawakatsu, T. et al. Unique cell-type-specific patterns of DNA methylation in the root meristem. Nat. Plants 2, 16058 (2016).
Kawakatsu, T., Nery, J. R., Castanon, R. & Ecker, J. R. Dynamic DNA methylation reconfiguration during seed development and germination. Genome Biol. 18, 171 (2017).
Bouyer, D. et al. DNA methylation dynamics during early plant life. Genome Biol. 18, 179 (2017).
Narsai, R. et al. Extensive transcriptomic and epigenomic remodelling occurs during Arabidopsis thaliana germination. Genome Biol. 18, 172 (2017).
Gehring, M. Epigenetic dynamics during flowering plant reproduction: Evidence for reprogramming? New Phytol. 224, 91–96 (2019).
Zhou, M. et al. The CLASSY family controls tissue-specific DNA methylation patterns in Arabidopsis. Nat. Commun. 13, 244 (2022).
Kartal, Ö., Schmid, M. W. & Grossniklaus, U. Cell type-specific genome scans of DNA methylation divergence indicate an important role for transposable elements. Genome Biol. 21, 172 (2020).
Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods 11, 817–820 (2014).
Noble, D. The role of stochasticity in biological communication processes. Prog. Biophys. Mol. Biol. 162, 122–128 (2021).
Harris, K. D. & Zemach, A. Contiguous and stochastic CHH methylation patterns of plant DRM2 and CMT2 revealed by single-read methylome analysis. Genome Biol. 21, 194 (2020).
Kundariya, H., Sanchez, R., Yang, X., Hafner, A. & Mackenzie, S. A. Methylome decoding of RdDM-mediated reprogramming effects in the Arabidopsis MSH1 system. Genome Biol. 23, 167 (2022).
Tirado-Magallanes, R., Rebbani, K., Lim, R., Pradhan, S. & Benoukraf, T. Whole genome DNA methylation: beyond genes silencing. Oncotarget 8, 5629–5637 (2017).
Sobiak, B. & Leśniak, W. The effect of single CpG demethylation on the pattern of DNA-protein binding. Int. J. Mol. Sci. 20, 914 (2019).
Muyle, A. M., Seymour, D. K., Lv, Y., Huettel, B. & Gaut, B. S. Gene body methylation in plants: mechanisms, functions, and important implications for understanding evolutionary processes. Genome Biol. Evolut. 14, evac038 (2022).
Sanchez, R. & Mackenzie, S. A. Information thermodynamics of cytosine DNA methylation. PLoS One 11, e0150427 (2016).
Sanchez, R. & Mackenzie, S. Genome-wide discriminatory information patterns of cytosine DNA methylation. Int. J. Mol. Sci. 17, 938 (2016).
Yong-Villalobos, L. et al. Methylome analysis reveals an important role for epigenetic changes in the regulation of the Arabidopsis response to phosphate starvation. Proc. Natl. Acad. Sci. U.S.A. https://doi.org/10.1073/pnas.1522301112 (2015).
Krueger, F. Trim Galore. (2021).
Krueger, F. & Andrews, S. R. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27, 1571–1572 (2011).
Buels, R. et al. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol. 17, 66 (2016).
Hansen, K. D., Langmead, B. & Irizarry, R. A. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol. 13, R83 (2012).
Park, Y., Figueroa, M. E., Rozek, L. S. & Sartor, M. A. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 30, 2414–2422 (2014).
Dolzhenko, E. & Smith, A. D. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformat. 15, 215 (2014).
Omony, J., Nussbaumer, T. & Gutzat, R. DNA methylation analysis in plants: review of computational tools and future perspectives. Brief. Bioinform. 21, 906–918 (2020).
Sherman, B. T. et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 50, W216–W221 (2022).
Otasek, D., Morris, J. H., Bouças, J., Pico, A. R. & Demchak, B. Cytoscape Automation: empowering workflow-based network analysis. Genome Biol. 20, 185 (2019).
Rouached, H. Multilevel coordination of phosphate and sulfate homeostasis in plants. Plant Signal. Behav. 6, 952–955 (2011).
Puga, M. I. et al. SPX1 is a phosphate-dependent inhibitor of PHOSPHATE STARVATION RESPONSE 1 in Arabidopsis. Proc. Natl. Acad. Sci. U.S.A. 111, 14947–14952 (2014).
Banerjee, S. & Roy, S. An insight into understanding the coupling between homologous recombination mediated DNA repair and chromatin remodeling mechanisms in plant genome: an update. Cell Cycle 20, 1760–1784 (2021).
Lombardo, S. D., Wangsaputra, I. F., Menche, J. & Stevens, A. Network approaches for charting the transcriptomic and epigenetic landscape of the developmental origins of health and disease. Genes 13, 764 (2022).

Acknowledgements

We thank Robersy Sanchez and Hardik Kundariya for their computational advice and thoughtful input. This work was funded by support to the Mackenzie lab from the Huck Institutes of the Life Sciences at Penn State University and Grant NIH R01 GM134056-01 to S.A.M.

Author information

Authors and Affiliations

Department of Biology, The Pennsylvania State University, 362 Frear N Bldg, University Park, PA, 16802, USA
Alenka Hafner & Sally Mackenzie
Intercollege Graduate Degree Program in Plant Biology, The Pennsylvania State University, University Park, PA, USA
Alenka Hafner
Department of Plant Science, The Pennsylvania State University, University Park, PA, USA
Sally Mackenzie

Authors

Alenka Hafner
Sally Mackenzie

Contributions

A.H. and S.A.M. planned and designed the research. A.H. performed the data analysis and prepared the figures. A.H. and S.A.M. wrote the manuscript.

Corresponding author

Correspondence to Sally Mackenzie.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1.

Supplementary Table S2.

Supplementary Table S3.

Supplementary Table S4.

Supplementary Table S5.

Supplementary Table S6.

Supplementary Table S7.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Hafner, A., Mackenzie, S. Re-analysis of publicly available methylomes using signal detection yields new information. Sci Rep 13, 3307 (2023). https://doi.org/10.1038/s41598-023-30422-4

Received: 04 October 2022
Accepted: 22 February 2023
Published: 27 February 2023
DOI: https://doi.org/10.1038/s41598-023-30422-4
