Establishment and maintenance of heritable chromatin structure during early Drosophila embryogenesis

During embryogenesis, the initial chromatin state is established during a period of rapid proliferative activity. We have measured with 3-min time resolution how heritable patterns of chromatin structure are initially established and maintained during the midblastula transition (MBT). We find that regions of accessibility are established sequentially, where enhancers are opened in advance of promoters and insulators. These open states are stably maintained in highly condensed mitotic chromatin to ensure faithful inheritance of prior accessibility status across cell divisions. The temporal progression of establishment is controlled by the biological timers that control the onset of the MBT. In general, acquisition of promoter accessibility is controlled by the biological timer that measures the nucleo-cytoplasmic (N:C) ratio, whereas timing of enhancer accessibility is regulated independently of the N:C ratio. These different timing classes each associate with binding sites for two transcription factors, GAGA-factor and Zelda, previously implicated in controlling chromatin accessibility at ZGA. DOI: http://dx.doi.org/10.7554/eLife.20148.001

The open 'peaks' used below, and .wig formatted versions of the standardized (z-scored) data are deposited in GEO (GSE83851)

Initial (CPM) Normalization
One thing that we want the data to tell us is for any region of the genome, how "accessible" is it? To estimate this, we will calculate the average number of reads coming from fragments derived from "open" chromatin within each 10 base-pair window across the entire genome. The starting data are a mixture of fragments from 'open' chromatin and 'nucleosome associated' chromatin. Although in this case we are only really interested in open chromatin, we observe that the ratio of open:nucleosome fragments varies depending on phase of the cell cycle. This means that if we were to perform simple counts per million (CPM) normalization of the data on only the open fragments, we would inflate the abundance of open fragments in samples with a smaller proportion of open fragments compared with nucleosome fragments. To account for this, we have used the total number of duplicate filtered, high quality reads as the denominator for the CPM normalization, rather than the total number of open fragments alone.
The following approach for counting average coverage per 10 bp window is an adaptation of scripts appearing in the following website: https://bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs. pdf First, we need a set of 10-bp windows from the dm6 build of the genome: library( GenomicRanges) library( GenomicAlignments) library( BSgenome.Dmelanogaster.UCSC.dm6) make.windows = function( binsize) { bins = tileGenome( seqinfo( Dmelanogaster), tilewidth = binsize, cut.last.tile.in.chrom = TRUE) return( bins) } genome.windows = make.windows( binsize = 10) At this point, we should note that dm6 is a pain with its >1800 'chromosomes'. As a matter of practicality, we omit all non-canonical chromosomes. The following string will be used at several points to select only the 'good' chromosomes.
# note that in some cases, you may want chrY as well, as this is helpful for # deciding whether a single embryo is male or female. In practice, you can get # away with simply measuring the X:Autosome ratio of total read coverage per # chromosome length. chrY is only marginally useful, especially since the # overall number of uniquely mapping reads will be significantly lower, and it # will therefore be unlikely that you recover 1:1 read coverage between X and Y # anyway.
good.uns = c( chr2L , chr2R , chr3L , chr3R , chr4 , chrX ) Now that we have defined those variables, we load a dataset. The following line should be changed to reflect a file on your own computer.

load( /Volumes/seq_data/ATAC/Complete_Timecourse_Analysis_Sets/all_bamfiles/GRanges/WT_15042105_NC13_15
# this loads a dataset used in the accompanying manuscript, corresponding to a # single embryo. The next line simply assigns the stored variable name of this # dataset to something more descriptive for the purposes of this demonstration.

all.data = x
We will want to use the total number of reads in all.data as the denominator for CPM normalization.

Read−length distribution for 'all.data'
length ( You can see that that a significant proportion of the fragments are present within the small, open read (< or = 98 bp) range of the distribution. Although it appears that there may be more small reads to the right of the red line, elsewhere I have fit these distributions to a single exponential and three gaussians and found that 98 bp is the greatest size that I can use, in this dataset, to declare a region to be open, and have less than a 5% likelihood of a read actually belonging to the +1 nucleosome distribution to the right. With different datasets, this cutoff may be different. I believe Buenrostro 2013 used a 100 bp cutoff which would probably be fine as well.

Standardization of Data
In the associated manuscript, what I've done is performed ATAC seq on multiple related biological samples, and part of the analysis requires comparison between these replicates.
The following line loads a GRangesList object that the user will have to generate independently, which consists of the output of the above section (CPM normalized average coverage in 10 bp windows), applied to each of the timepoints measured for wild-type embryos in the associated manuscript. Each timepoint consisted of at least three independent biological replicates, and for the purposes of analysis, all replicates for each timepoint were pooled together prior to CPM normalization.

# this opens a saved object named wt.open
As part of the analysis in the associated manuscript, I have called peaks on these data and these represent regions of interest for further analysis. We'd like to compare reads in each of these regions.
The following line loads these up in GRanges format. The user will have to point the next line to the appropriate file on their own computer. Like with any genomic dataset, the relative proportion of 'high coverage' 10 bp windows is a small fraction of the total number of 10 bp windows. The majority of these high coverage windows are captured in the peaks list. In order to compare how the counts within peaks compare across timepoints, I have chosen to further standardize the coverage scores.
Standardization transforms the coverage values from counts per million to (effectively) the number of standard deviations above or below the population mean. To do this, I calculate the average number of CPM in a set of noisy or background regions, and then scale the entire dataset so that these background regions have a mean value of zero and a standard deviation of one.
The following section illustrates how this is done in practice.  .open.peaks))))), fix = center )) # this models a set of noise peaks whose centers are the regions we sampled # above, and whose widths are randomly distributed over the mean and sd of the # log2 widths of the true peaks list. The log2 widths of the peaks approximates # a normal distribution.