CATaDa reveals global remodelling of chromatin accessibility during stem cell differentiation in vivo

During development eukaryotic gene expression is coordinated by dynamic changes in chromatin structure. Measurements of accessible chromatin are used extensively to identify genomic regulatory elements. Whilst chromatin landscapes of pluripotent stem cells are well characterised, chromatin accessibility changes in the development of somatic lineages are not well defined. Here we show that cell-specific chromatin accessibility data can be produced via ectopic expression of E. coli Dam methylase in vivo, without the requirement for cell-sorting (CATaDa). We have profiled chromatin accessibility in individual cell-types of Drosophila neural and midgut lineages. Functional cell-type-specific enhancers were identified, as well as novel motifs enriched at different stages of development. Finally, we show global changes in the accessibility of chromatin between stem-cells and their differentiated progeny. Our results demonstrate the dynamic nature of chromatin accessibility in somatic tissues during stem cell differentiation and provide a novel approach to understanding gene regulatory mechanisms underlying development.


Introduction
During the development of a multicellular organism, gene expression is tightly regulated in response to spatially and temporally restricted signals. Changes to gene expression are accompanied by concomitant changes to chromatin structure and composition. Therefore chromatin states vary widely across developmental stages and cell types. Functional regions of a genome, including promoters and enhancers, can be identified by their relative lack of nucleosomes. These regions of "open chromatin" can be assayed by their accessibility to extrinsic factors. Consequently, chromatin accessibility profiling techniques are commonly used to investigate chromatin states (reviewed in [1]). Chromatin is highly accessible in pluripotent cell types such as embryonic stem (ES) cells, but is compacted following differentiation [2]. It has been suggested that this open chromatin represents a permissive state to which multiple programmes of gene regulation may be rapidly applied upon differentiation [3].
The nature of chromatin accessibility across different developmental stages in vivo is less well understood. Imaging studies have been used to demonstrate gross changes to chromatin structure, for example changes to the distribution of heterochromatin have been observed in post-mitotic cells [4,5]. Molecular studies investigating chromatin states in vivo during development have tended to utilise heterogeneous tissues due to the fact that profiling the epigenome of individual cell types frequently requires physical isolation of cells or nuclei, which can be laborious and prone to human error [6].
Therefore, there is a lack of information regarding cell-type specific changes to chromatin states in in vivo models. Whilst recently developed methods such as ATAC-seq have become popular and address many of the limitations inherent to earlier techniques such as DNAse-seq (i.e. requires fewer cells and increased assay speed), these techniques still require the physical separation of cells and isolation of genomic DNA before chromatin accessibility is assayed [7].
It has been suggested that ectopic expression of untethered DNA adenine methyltransferase (Dam) results in specific methylation of open chromatin regions whilst nucleosome bound DNA is protected [8][9][10][11]. However, the efficacy of using Dam methylation for chromatin accessibility profiling on a genomic scale is not clear. Furthermore, expression of Dam in a cell-type specific manner, at levels low enough to avoid toxicity and oversaturated signal, has not been possible until now.
Transgenic expression of fusions of Dam to DNA binding proteins is a well-established method used to assess transcription factor occupancy (DNA adenine methyltransferase identification -DamID) [12].
Recently it was demonstrated that DamID could be adapted to profile DNA-protein interactions in a cell type specific manner by utilising ribosome re-initiation to attenuate transgene expression [13][14][15].
This technique is referred to as Targeted DamID (TaDa). Here we take advantage of TaDa to express untethered Dam in specific cell-types to produce chromatin accessibility profiles in vivo, without the requirement for cell separation. We show that Chromatin Accessibility profiling using Targeted DamID (CATaDa) yields comparable results to both FAIRE and ATAC-seq methods, indicating that it is a reliable and reproducible method for investigating chromatin states. By assaying multiple cell types within a tissue, we show that chromatin accessibility is dynamic throughout the development of Drosophila central nervous system (CNS) and midgut lineages. These data have also enabled us to identify enriched motifs from regulatory elements that dynamically change their accessibility during differentiation, as well as to identify functional cell-type specific enhancers. Finally, we show that compared to their differentiated progeny, somatic stem cell Dam-methylation signals are more widely distributed across the genome, indicating a greater level of global chromatin accessibility.

Results
CATaDa produces chromatin accessibility profiles comparable to that of ATAC and FAIRE-seq in Drosophila eye discs. We reasoned that low-level expression of transgenic E. coli Dam, using tissue specific GAL4 drivers in Drosophila, would specifically methylate regions of accessible chromatin exclusively in a cell-type of interest. Detection of these methylated sequences could yield chromatin accessibility profiles for defined cell populations in vivo (Figure 1). To determine if CATaDa produces an accurate reflection of chromatin accessibility, we compared data acquired using this approach with commonly used alternative techniques. A recent study generated ATAC and FAIRE-seq data from Drosophila imaginal eye discs [16].
Using CATaDa, we expressed E. coli Dam in the eye disc of Drosophila third instar larvae so that we could compare Dam methylation profiles to these previously collected data. Chromatin accessibility profiles produced with CATaDa are highly reproducible between replicates (r 2 = 0.947) (Supplementary figure 1). CATaDa profiles showed good agreement with data produced with ATAC-seq and FAIRE-seq. Visual inspection of the data showed that many regions of accessible chromatin identified by ATAC and FAIRE are also represented by CATaDa, whilst condensed regions are reliably inaccessible (Figures 2A,B). The overlap of Dam identified peaks with ATAC and FAIRE peaks is 48.5% and 49.4% respectively (In comparison, 64.8% of ATAC peaks are also identified in FAIRE data).
A Monte Carlo simulation determined that this is a highly significant overlap (p< 1x10 -5 ). We found that CATaDa profiles exhibited features consistent with chromatin accessibility. For example, open chromatin is enriched at transcriptional start sites (TSS) ( Figure 2C). This is despite the relative depletion of GATC sites around TSS (Supplementary figure 2). It was previously shown that ATAC-seq and FAIRE-seq data demonstrated high chromatin accessibility at experimentally validated eye-antennal enhancers.
CATaDa profiles similarly showed increased open chromatin at these regions ( Figure 2B,D). We found that for 57.9% of FlyLight eye enhancers, a corresponding peak was called in CATaDa profiles (333 of 575 enhancers). CATaDa was comparable to FAIRE-seq and ATAC-seq which identified 48% and 68.7% respectively, of validated FlyLight enhancers as peaks ( Figure 2E).  CATaDa profiling shows dynamic changes in chromatin accessibility during differentiation of the nervous system.
In Drosophila, neurons are derived from asymmetrically dividing neural stem cells (NSCs). NSC divisions produce one self-renewing daughter NSC and a ganglion mother cell (GMC), which divides once more to produce neurons or glia [17]. To investigate how local and global chromatin accessibility changes during the process of nervous system differentiation, we expressed Dam in specific cells with GAL4 drivers that cover four different developmental stages within the lineage. These include NSCs (worniu- whilst these peaks are reduced or absent in the progenitor cell types ( Figure 3B). This corresponds with the expression of brp, which is specifically transcribed in neurons and has an important role in synapse function [19]. In contrast, the adjacent gene to brp, Wnt2, displays peaks which are most apparent in the NSC and intermediate cell types. Wnt signalling is known to be important for the control of stem cell populations, therefore these results are also expected [20]. Regions of open chromatin are thought to identify functional regulatory elements such as enhancers.
Therefore, it is to be expected that these regions will be enriched for motifs belonging to transcription factors involved in neurogenesis. Identification of enriched motifs in sequences that were accessible in NSCs showed that expected transcription factor binding sites were highly enriched. For example, the Ebox motif -CAGCNG -which is bound by the NSC proneural factor ase ( Figure 3D) [21,22]. Regions in which open chromatin was specifically enriched in mature neurons yielded novel sequence motifs for which we could not identify known transcription factors.
Gene ontology (GO) analysis of genes at which enriched chromatin accessibility was observed yielded expected biological process terms for each of the cell types examined. For example, terms such as "neuroblast fate determination" and "chromosome segregation" were enriched in stem cells relative to neurons, whilst "regulation of behaviour" and "synaptic vesicle docking during exocytosis" were enriched for differentiated neurons but not NSCs (Fig 3E).
Chromatin accessibility in adult midgut cell types.
Having observed chromatin accessibility changes in the cells of the developing CNS, we asked whether similar patterns would be observed in adult somatic stem cell lineages. The Drosophila midgut contains a pool of cycling intestinal stem cells (ISCs) that persists in the adult to maintain a population of terminally differentiated cells which mediate the absorptive and secretory functions of the organ [23,24]. In contrast to neurogenesis, a single committed immature progenitor cell (enteroblast -EB) is produced from stem cell divisions, which then differentiates without further divisions to produce the mature epithelial cells of the midgut [25]. To examine chromatin accessibility in the cells of the adult midgut, we expressed Dam in the ISCs and EBs, as well as in the terminally differentiated absorptive cells, the enterocytes (ECs)( Fig 4A).
As with the CNS data, we observed predictable changes in chromatin accessibility at loci for genes with variable expression in the lineage. For example, escargot (esg) a transcription factor required for ISC selfrenewal [26], displays multiple peaks of accessible chromatin at the gene body and surrounding region in ISCs and EBs, whilst little signal is observed in the ECs (Fig 4B). In contrast the nubbin locus (encoding EC marker -pdm1), displays peaks predominantly in the EC data, with relatively closed chromatin in the progenitor cell types ( Fig 4C). As previously, hierarchical clustering revealed two major groups in which accessible chromatin was unregulated in either ISCs or ECs ( Figure 4D) component analysis reveals two distinct clusters for the two lineages in which >80% of the variance is explained in the first two principle components (Fig 4E). By examining the overall correlation between all cell types we observed a number of interesting features. Firstly, as expected all cell types correlated most closely with either their direct progeny or progenitor cell ( Figure 4F). Therefore by clustering the data we were able to recapitulate the familial relationship between the cell types of the two lineages.
The greatest similarities were observed between the intermediate progenitors and their cognate stem cells (R 2 = 0.94 / 0.98 for CNS and midgut respectively). Interestingly, the greatest correlation outside of a lineage was between the two stem cell types (R 2 =0.76), whilst differentiated cells exhibited only weak correlation (ISCs vs NSC, R 2 = 0.51). This indicates that somatic stem cell types may utilise a broadly similar chromatin landscape for the maintenance of multipotency, whilst lineage specific variation is relatively small.

Enhancer prediction from Dam Accessibility data
Enhancer activity is closely linked to gene expression, therefore many tissue specific enhancers are required to orchestrate correct spatial and temporal transcription [27]. However, identification of functional enhancers can be challenging. Chromatin accessibility data has previously been used to identify novel enhancers [16,28]. We reasoned that it would be possible to identify genomic regions corresponding to cell-type specific enhancers by comparing dynamically accessible regions between cell types. In support of this, we observed that the sequence covered by the 71C09-GAL4 line used in this study to profile GMCs/newly born neurons, displayed a higher peak specifically at this region than in either the stem cell or differentiated neuron data (Supplementary figure 4). Interestingly, a clear peak can still be observed in the NSC data, without concomitant reporter expression. Therefore, an enrichment of accessible chromatin does not necessarily correspond to an active enhancer in a given cell type. This is consistent with previous observations that DNase hypersensitive regions are often not active enhancers [29,30].
We selected accessible regions with large differences between at least two cell types in the lineage, which satisfied various criteria for us to designate them as putative enhancers (see methods). We then identified available reporter lines from the Vienna tiles (VT) [31] and FlyLight [32] collections of GAL4 reporter lines that contained sequences encompassing our predicted enhancers upstream of a GAL4 reporter, and verified their expression in the tissues of interest. We identified enhancer-GAL4 lines in which reporter expression matched our predictions for enhancer activity. In the CNS Vienna line VT017417 and FlyLight line GMR56E07 both showed expression in the early part of the lineage in the CNS, with GFP reporter expression detectable predominantly in NSCs and GMCs (Fig5-A,B). This is consistent with accessible chromatin readings from our CATaDa data for these cell types in which progenitor cells displayed prominent peaks, whereas differentiated neurons did not. Similarly, we were able to detect functional cell-type specific enhancers in the midgut. The Vienna line, VT004241, showed reporter expression predominantly in Delta positive ISCs (Fig 5C). Therefore, it is possible to use CATaDa data to identify novel cell-type specific enhancers in multiple tissues.  GABAergic, and glutamatergic) and compared to data derived from NSCs.
From these distributions it is apparent that majority of GATC fragments in the genome have very few corresponding mapped reads in all samples, whilst fragments having over ~10 reads per million (rpm) were relatively infrequent. In other words, most of the genome is inaccessible or accessible at very low levels, whilst hyper-accessible regions are comparatively rare in all cell types (Fig 6A). The greatest difference between the distributions of neurons and NSCs was apparent at very low read numbers. We observed that there was a significantly greater proportion of GATC fragments with low read counts but without being completely inaccessible (~1-3rpm), in the NSC data compared to neurons (Fig 6A). This  Figure 6. Global chromatin accessibility is reduced in differentiated neurons. A) Log transformed distribution of read counts at GATC fragments for neuronal (pink) or NSC replicates (blue). In addition to adult neuron data described in previous figures, CATaDa data for cholinergic, glutamatergic, and GABAergic adult neurons are included. NSC data includes extra replicate from [13]. B) Areas under curve for region bound by dotted lines in (A). Corresponding to ~1-3rpm. Data includes extra neuronal and NSC replicates shown in (A), as well as corresponding replicates at L3 for gutamatergic, GABAergic, and cholinergic neurons. Note that area under curve corresponds to proportion of GATC fragments having mapped reads within indicated range (i.e. NSCs have ~38% (median) of all GATC fragments within 1-3rpm mapped reads, compared to ~32% for adult neurons. Results were considered significant at *P < 0.05.

Discussion
Recent studies have provided insights into chromatin accessibility of individual cell-types using accessibility assays coupled with cell sorting [33]. Whilst these strategies have been proven to produce meaningful biological data, they suffer from being technically challenging, particularly with regards to cell isolation. We have demonstrated that CATaDa yields chromatin accessibility profiles for defined cell types in vivo without the need for cell isolation, fixation or the extraction of naked chromatin. In addition to its ease of use, CATaDa also has the advantage that the marking of accessible DNA occurs in vivo. Due to this, there are no artefacts associated with cell disruption, chemical fixation or washing of the chromatin prior to the assay [34]. In addition, as Dam is expressed in vivo over several hours, the profiles produced will reflect dynamic changes to chromatin structure over the entire time period during which Dam is expressed.
CATaDa is limited by its resolution, which is restricted by the frequency of GATC sites in the genome (median spacing of ~200 bp in Drosophila). However, this can be increased by using a modified Dam in conjunction with immunoprecipitation (Dam-IP) [35]. Due to the dependence of Dam for methylation of GATC sequences, biases may be observed at loci which are depleted for GATC. It is worth noting that extensive sequence biases have also been reported for DNAse and ATAC-seq [36]. Although single cell protocols have recently been developed for chromatin accessibility techniques [37,38], for routine experiments it is more usual to require a relatively large number of cells. For example for FAIRE-seq it is recommended to have a minimum of 1x10 6 cells [1,39], whilst DNase-seq typically requires 1x10 7 cells [1,40]. In contrast, DamID experiments can be performed with as few as 1000 cells [41], therefore CATaDa is likely to also be effective with low cell numbers, making it competitive with ATAC-seq (500-50,000 cells) [7,13]. Furthermore, single cell DamID has also recently been demonstrated, indicating that the minimum number of cells required for CATaDa is one [42].
Targeted DamID is rapidly being embraced by the Drosophila community with, at present, over 135 laboratories having requested the reagents and a number of papers already published ( [15,[43][44][45]).
Progress is also being made in adapting it for use in vertebrate models [41]. Whenever the binding of a protein of interest (POI) is investigated with this technique, Dam-only data (representing chromatin accessibility) for the cell type being assayed is also generated, as it is the control for which the Dam-POI is normalised. Therefore, researchers performing DamID experiments can now take advantage of this data, getting a "2-for-1 deal" whenever they use TaDa to profile the binding of a POI. Furthermore, much of this data is already available from published studies that could be readily analysed to provide novel biological insights.  [47]. It has been suggested that in some physiological contexts differentiated cells may revert to replenish stem cell pools [48,49]. This idea may help to explain retention of plasticity in these cells types. Overall, our results from profiling developing cell types illuminate the dynamic nature of chromatin accessibility in differentiation, and hint at organising principles which may apply to all somatic stemcell lineages.

Targeted DamID (TaDa)
To induce tissue specific Dam expression, GAL4 driver lines were crossed to GAL80 ts ; UAS-LT3-NDam virgin females. Embryos were collected for 4 hours then raised at 18ᵒC. Animals were transferred to 29ᵒC at either 7 days after embryo deposition for 24 hours to obtain third instar larval tissues, or three days after eclosion to obtain adult heads or midguts. Fifty brains or thirty antennal-eye discs and midguts were dissected in PBS with 100 mM EDTA for each replicate. 71C09-GAL4 > UAS-LT3-NDam ventral nerve cords (VNCs) were dissected and central brain and optic lobe regions discarded due to presence of observed 71C09-GAL4 expression in a small subset of central brain neuroblasts.
Genomic DNA extraction and sequencing library preparation was performed as described previously [13], with minor modifications -MyTaq (Bioline) was used for PCR amplification of adapter ligated DNA . Libraries were sequenced using Illumina HiSeq single-end 50 bp sequencing. Two replicates of at least 10 million reads were acquired for each cell type. Sequencing data were mapped back to release 6.03 of the Drosophila genome using a previously described pipeline [54].
Samples were imaged using a Zeiss LSM510 confocal microscope.

Peak calling
Peaks were called and mapped to genes using a custom Perl program (available on request). In brief, a false discovery rate (FDR) was calculated for peaks (formed of two or more consecutive GATC fragments) for the individual replicates. Then each potential peak in the data was assigned a FDR. Any peaks with less than a 1% FDR were classified as significant. Significant peaks present in both replicates were used to form a final peak file. Any gene (genome release 6.11) within 5 kb of a peak (with no other genes in between) was identified as a potentially regulated gene.

Motif identification
Regions of differentially accessible chromatin were identified by calculating GATC sites for which a difference of >10 rpm was observed between every replicate between at least two cell types. Hierarchical clustering was performed using Morpheus [55]. Sequences for major clusters showing enrichment for a given cell type were then analysed using MEME-ChIP [56].

Gene ontology analysis
Potentially regulated genes, with a peak height of at least 10 rpm were submitted to GOToolBox [57] for GO analysis. GO term enrichments (frequency in data set divided by expected frequency) were calculated for each cell type.

Cell type specific enhancer prediction
GATC fragments were identified with at least a 10 rpm difference in all replicates between at least two cell types. Any peak >2 kb from a transcriptional start site that did not overlap with coding sequence was designated an enhancer. Enhancers satisfying these criteria, which were covered by an available Vienna VT [31] or Janelia FlyLight [32] GAL4 line were chosen for validation based on the magnitude of the change in accessibility.

Statistical analysis and data presentation
For comparing Dam data with ATAC and FAIRE, Monte Carlo experiments were performed using a custom Perl script (available on request). Comparisons of areas under curve in figure 5 were performed using Welch's ANOVA for heteroscedasticity with Games-Howell post-hoc test in R. For data with equal variances, ANOVA was used with Tukey post-hoc testing (e.g. supplementary figure 6). Results were considered significant at *P < 0.05, **P < 0.01. Principle component analyses and correlation matrix plots were produced using deepTools [58]. Average TSS signal profiles were made using SeqPlots R/ Bioconductor package [59]. All other figures were produced using the ggplot2 package in R.

Supplementary figure 1. Correlation between eye-disc Dam replicates.
Correlation between replicates shows good agreement indicating that the technique has high reproducibility. R 2 = 0.94.