HMMRATAC, The Hidden Markov ModeleR for ATAC-seq

ATAC-seq has been widely adopted to identify accessible chromatin regions across the genome. However, current data analysis still utilizes approaches originally designed for ChIP-seq or DNase-seq, without taking into account the transposase digested DNA fragments that contain additional nucleosome positioning information. We present the first dedicated ATAC-seq analysis tool, a semi-supervised machine learning approach named HMMRATAC. HMMRATAC splits a single ATAC-seq dataset into nucleosome-free and nucleosome-enriched signals, learns the unique chromatin structure around accessible regions, and then predicts accessible regions across the entire genome. We show that HMMRATAC outperforms the popular peak-calling algorithms on published human and mouse ATAC-seq datasets.


23
and those spanning 1, 2 or 3 nucleosomes [5]. Non-overlapping and non-adjacent cutoffs 114 were used. For example, lengths below 100 bp represent nucleosome-free signals and 115 lengths between 180 and 250 bp represents mono-nucleosome signals. We consider such 116 cutoff-based strategy not a general practice for the following two reasons. First, different 117 species or different cell lines may have different nucleosome spacing and these differences 118 may not be properly understood [12,13]. Secondly, since the fragments whose size falls 119 between the non-overlapping and non-adjacent cutoffs are discarded (e.g. those between 120 100 and 180 bp long), the recall (or sensitivity) to identify accessible regions will be 121 reduced (Additional file 1: Fig. S2 and explained later). As a fact, about 15% of the entire 122 data generated from the original paper are ignored with their strategy. We addressed the 123 above two challenges in a probabilistic approach in HMMRATAC. those below a minimum length (see method section for more details). It should be noted 214 that many of the nucleosomes are not adjacent to the center state. These may represent 215 non-regulatory nucleosomes that ATAC-seq is able to identify or may be artifacts due to 216 increased transposition frequency. 217

218
Recapitulating chromatin architecture at functional elements 219 220 We hypothesized that HMMRATAC could identify chromatin architecture around 221 accessible regions, or ideally, nucleosomes near the regulatory elements. To prove this, 222 we explored the distribution of histone modification ChIP-seq signal around our identified 223 accessible regions. We downloaded and analyzed datasets released by ENCODE in the 224 signal enrichment only, as opposed to HMMRATAC which incorporates the structure of 251 the element, they both falsely identified this region as accessible, while HMMRATAC 252 identified the region as being a nucleosome state. 253

254
In order to understand the chromatin features that are enriched in each state across the 255 genome, as the unique result HMMRATAC is able to conclude from a single ATAC-seq 256 data, we plotted the overlap of the 3 hidden states with chromatin states and histone mark 257 calls from ChIP-seq in the GM12878 cell line based on ENCODE data (Fig 3a; Additional 258 file 1: Table S1). We found that the chromatin states "Active Promoters", "Strong 259 Enhancer", "Insulator", "Poised Promoter" and "Weak Promoter" states mainly overlaps 260 with our center state, and the "Heterochromatin", "Repetitive", "Repressed", 261 "Transcription Elongation", and "Weak Transcription" barely overlaps with the center 262 state. Interestingly, the "Strong Enhancer" state had a similar amount of overlap with the 263 nucleosome state as the center state. Combined with our observation that our nucleosome 264 state showed elevated levels of active histone marks near accessible regions, we 265 hypothesized that this state may represent regulatory nucleosomes, those marked by active 266 histone modifications and adjacent to accessible regions. We plotted the H3K4me1 active 267 histone mark around the centers of our center states as well as around the centers of the 268 adjacent nucleosome states, those directly flanking our center states (Fig 3b) calculates enrichment based on the resulting Gaussian distribution. Although none of them 294 utilize the intrinsic and unique features in ATAC-seq data we described before, MACS2 295 has been used to analyze ATAC-seq data in many published works [6,14,[22][23][24][25][26], and tested the performance of the four algorithms using the two ATAC-seq datasets in 298 GM12878 cell line [5] as the previous sections. The first proxy of the true set that we compared the algorithms to is referred to as "active 310 regions." These sites were defined through an integrative chromatin state segmentation 311 using chromHMM [28,29] algorithm conducted by ENCODE project. 8 histone 312 modifications (H3K27me3, H3K36me3, H4K20me1, H3K4me1, H3K4me2, H3K4me3, 313 H3K27ac and H3K9ac), 1 transcription factor (CTCF) ChIP-seq and 1 whole genome input 314 datasets in the GM12878 cell line were integrated through an unsupervised machine 315 learning approach and each genomic location was annotated as one of 15 mutually-316 exclusive states. Two resulting states, "active promoters" and "strong enhancers", were 317 merged together to create our list of "active regions". Additionally, the state annotated as in terms of precision, recall, and false positive rate in identifying these "active regions" 321 while avoiding "heterochromatin" (Fig. 4 top row). The area under the curve (AUC) was 322 then calculated for the recall vs. false positive rate curves (AU-ROC) and for the precision-323 recall curves (AU-PRC) (Table 1), after adding extreme points to the unreachable ends of 324 the curves. HMMRATAC was proved to have better AUC than the two other methods. It 325 worth noting that HMMRATAC calls larger peaks than the other two methods due to its 326 merging of the adjacent nucleosome states to the center state. However, despite the larger We calculated the number of reproducible accessible regions, defined as overlapping 360 accessible regions with an IDR score <= 0.05, between each pair of three replicates for 361 each algorithm. We show that when using the individual replicate data (Additional file 1: 362 Table S2) with a smaller number of cells used in ATAC-seq. We tested how well each algorithm 369 performed at recapitulating the accessible regions called from the 50,000 cell per-replicate 370 data and the 500 cell per-replicate data. We found, similar to the results from pair-wise 371 replicate testing, that HMMRATAC was able to identify a similar number of reproducible 372 accessible regions, with IDR <= 0.05, using datasets with a different number of cells 373 (Additional file 1: Table S2). These data indicate that HMMRATAC is comparable to 374 MACS2, and F-Seq in identifying reproducible accessible regions. H3K9me3. We chose five separate data sets generated by that group: RPMI treatment after 386 modification as the "real negatives" (See Method section for more details), we compared 389 the performance of the algorithms in the same way we compared their performance in 390 human data (Fig 4 2  Identifying transcription factor footprints, wherein the DNA protected by a binding protein 420 shows resistance to nonspecific DNA digestion, is an area of active research. Although 421 footprint identification may be possible with certain datasets or for certain transcription 422 factors, we believe that HMMRATAC identified summits, or the position with the most 423 insertion events within the open region, are more robust to identify potential transcription 424 factor binding sites. In fact, it has been previously shown that DNase-seq signal intensity 425 at a binding motif is a better predictor of transcription factor binding than the strength of a 426 footprint [34], and we believe this principle applies to ATAC-seq as well. However, a 427 major reason for this result from He et al., was that many footprints were artifacts caused 428 by the enzymes' sequence bias. The Tn5 transposase also has a sequence bias, which 429 would need to be corrected before an effective footprint analysis could occur. Several 430 methods exist [35,36] that could correct the sequence bias from an ATAC-seq library. 431 HMMRATAC could be used downstream of such sequence corrections, to reduce bias' in 432 peak calling, or upstream of the corrections, to identify the reads within peaks that need to 433 be corrected for footprint or other downstream analysis.
It is possible to extend HMMRATAC to identify differentially accessible regions between 436 two or more conditions. Most of the approaches for identifying such differences from 437 ATAC-seq data would suffer from the same problems mentioned before, namely, that these 438 methods only consider the relative strength of the signal and the differences in signal 439 intensity between conditions. As we've shown here, the incorporation of the structure of 440 regulatory chromatin into a model could increase the accuracy and sensitivity of calling 441 accessible regions. We believe our strategy of "decomposition and integration" can be 442 adopted in a differential accessibility analysis pipeline, by studying the differential NFRs 443 and nucleosomal signals separately and then combining together. published ATAC-seq datasets increases rapidly, researchers are still using methods 453 originally developed for ChIP-seq to analyze the data. We present HMMRATAC as a 454 dedicated algorithm for the analysis of ATAC-seq data. We take advantage of the nature 455 of the ATAC-seq experiment that due to the lack of size-selection while preparing the DNA 456 library in its protocol, DNA fragments associated with well-positioned nucleosomes shouldn't be applied to those ATAC-seq data generated with a stringent size-selection on 459 DNA fragments. Different from current methods where only the read enrichment is 460 considered, HMMRATAC separates and integrates ATAC-seq data to create a statistical 461 model that identifies the chromatin structure at accessible and active genomic regions. We for regions whose read coverage fold changes over background falls within a certain range. 542 The default lower and upper limit of fold changes, as used in the Result section, is set 543 between 10 and 20. These regions are then extended by 5KB in either direction. It is also 544 possible for users to provide a BED file containing predefined training regions. The initial 545 HMM has proportional initial probabilities, which are not updated during training, as well 546 as proportional transition probabilities, which are updated during training. The emission 547 probabilities are calculated with k-means clustering [39], by separating the training regions 548 into 3 clusters and calculating the means and covariance matrices for the four signal tracks for each cluster. We model the emissions as a multivariate Gaussian distribution since this 550 assumption has been successfully adopted in other HMM-based genome segmentation 551 methods such as hiHMM [40,41] and Segway [42,43] Once the HMM has been created and refined, the Viterbi algorithm [17] is used to classify 561 every genomic position into one of the three states of our model. In our experience, Viterbi 562 has trouble when encountering a very high coverage region. Therefore, regions whose z-563 scored coverage is above a certain cutoff, either a default or user-defined value, are masked 564 by the Viterbi algorithm. Additionally, it is possible to include a list of genomic regions 565 to mask along with the extremely high coverage regions, such as annotated blacklisted 566 sites. We have found that 91% of the high coverage regions that are masked from the 567

Viterbi algorithm in the GM12878 cells are annotated blacklist regions [44]. A genome-568
wide BedGraph of all state annotations can be reported by HMMRATAC, although this is 569 not the default behavior. By default, HMMRATAC will take all regions classified as 570 belonging to the center state and whose length is above a default or user-defined threshold 571 and merge them with the nucleosome states located both upstream and downstream of the open state. We generally consider the nearby nucleosomes to be part of the regulatory 573 region ( Fig. 3a and b), and therefore the region is then reported as a peak in gappedPeak 574 format. 575 576 It should be noted that at all stages of the algorithm, HMMRATAC does not store the read 577 or coverage information in memory. Any information needed by the program is read from 578 the desired file and used at the specific time. This was incorporated to reduce the memory 579 usage of HMMRATAC. However, because of this feature, it is not recommended to run 580 HMMRATAC simultaneously with the same data files. It is, however, possible to process 581 multiple different data sets in parallel. In terms of system requirements, the only required 582 system component is to have Java 7 or higher installed. We have tested HMMRATAC on 583 OpenJDK 7, standard edition JRE 8 and the standard edition JRE 10. Finally, in terms of 584 runtime, HMMRATAC took approx. 4.9 hours to process the GM12878 data set and an 585 average of 5.6 hours to process the monocyte data in our server equipped with AMD 586 Opteron(tm) Processor 6378 and 500G memory. 587 588 Post-processing and output 589 590 When the report peak option is set for HMMRATAC (option -p True), the peak region is 591 reported in gappedPeak format. The total peak region reported is a region that was 592 classified as being in the center state of the model (and is longer than a default or user-593 supplied length cutoff) along with the nucleosome states, both upstream and downstream.
It is also possible to report the average read coverage, the median coverage, and the fold 596 change above the genomic background, the zscore of the average coverage or all values. 597 Additionally, those high coverage regions that are masked from Viterbi decoding (see 598 above), are added back to the final peak file. They are annotated by HMMRATAC as 599 "High Coverage Accessible regions". The user may then keep or discard these regions and 600 explore them as they see fit. We tested the performance of HMMRATAC and other methods against three "gold 632 standard" datasets. The first one, which we call "Active Regions", is made by combining 633 two chromatin state annotations generated in GM12878 cells using chromHMM [28,29] 634 (wgEncodeBroadHmmGm12878HMM). These two annotations are "Active Promoters" 635 and "Strong Enhancers". 636

637
The other two "gold standard" test sets that were used in this study were DNaseI 638 hypersensitive sites (wgEncodeAwgDnaseUwdukeGm12878UniPk) and FAIRE 639 accessible regions (wgEncodeOpenChromFaireGm12878Pk) [11,30]. We found that there Regions" became the "gold standard" set for each monocyte condition. Any H3K9me3 667 peak that did not overlap the "Active Region" set, was kept as the "real negative" set for 668 that condition. The accession numbers for all above histone modification ChIP-seq data 669 are listed in Additional file 2. Version 2.1 of MACS was used to call peaks using the sorted paired-end BAM files created 674 as described above. With each run the input file format (option -f) was set to "BAMPE" 675 and the option for keeping duplicate reads (option --keep-dup) was set to "all", as duplicate 676 reads had already been discarded in pre-processing. MACS2 was run with the local lambda 677 option turned off (option --nolambda) and was run with q-value cutoffs (option -q) set to 678 The standard deviation cutoff (option -t) was set to 0.1, 1, 4 (default), 6, 7, 8, 10, 15 and 686 20. 687 688

Calls from HMMRATAC 689
HMMRATAC was run using the sorted BAM files, BAM index files and genome-wide 690 coverage files created as described previously. The upper limit fold-change for choosing 691 training regions (option -u) was set to 20 with corresponding lower limits (option -l) set to 692 10. These settings are the HMMRATAC defaults. HMMRATAC was also run using 693 blacklisted sites for hg19 [11,15], which were masked from the program and all subsequent 694 steps. For the precision, recall and false positive rate calculations, the HMMRATAC peaks 695 were filtered by minimum score cutoffs of 0, 5, 10, 15, 20, 25, 30, 35, 40 and 45. 696 Additionally, for two monocyte samples (accession numbers GSM2325689 -RPMI_4h 697 and GSM2325681 -BG_d1) HMMRATAC did not produce a typical model with the 698 default settings (-u 20 -l 10). Instead, the models produced showed the nucleosome state 699 as having the highest NFR signal. We re-ran these samples with more conservative settings 700 (-u 40 -l 20), at which time HMMRATAC produced models that showed the characteristic 701 pattern that we have seen with the other samples. These peak results from HMMRATAC 702 were then filtered by the same cutoffs as the other samples and tested in the same way. All 703 the transition and emission parameters learned by HMMRATAC from GM12878 50k cells, 704 described by Li et al. [33]. Peaks were called using the following parameters: the p-value 710 for MACS2 was set to 0.6 for 50,000 cell per-replicate merged data and 0.6 for 500 cell 711 per-replicate merged data, the standard deviation for F-Seq was set to 0.1 for both merged 712 datasets, and the fold-change range for HMMRATAC was set to 10-20 (default) for both 713 merged datasets. The parameters for each individual 50,000 cell replicate was the same for 714 F-seq and HMMRATAC and the MACS2 q-value setting was set to 0.5. These parameters 715 were also the most permissive settings used throughout this paper, including in our PRC 716 and ROC curves. The MACS2 setting was changed for the individual replicates because 717 at the permissive settings (-p 0.6 or -p 0.3) it called the entire genome as a peak, so we 718 chose the third most permissive setting instead. The peaks, called by each method, were 719 then sorted by their respective scores (-log(p-value) for F-seq, the signal value for MACS2 720 and HMMRATAC) and the top 100k peaks were retained. These top peaks were the inputs 721 into the IDR software. 722 723 For calculating the 500 to 50,000 cell per-replicate reproducibility, each method was run 724 using the merged file(s) described previously, with the respective parameters as described 725 previously. For each method, each pair of resulting peak files, from the 500 cell per-726 replicate and 50,000 cell per-replicate datasets, were run through the IDR algorithm using 727 default parameters and each methods' respective score column as the rank. Reproducible 728 peaks between datasets were considered to have an IDR value equal to or below 0.05. 729 previously, using the individual replicate file(s) as their input. Each possible pair of 732 replicates was then run through the IDR algorithm with default parameters. For each 733 replicate pair, the number of peaks whose local IDR score was equal to or below 0.05 was 734 calculated. Peaks with IDR scores equal to or below 0.05 in each of the pairwise 735 comparisons were retained and used to calculate the overlap between the three methods 736 (Additional file 1: Fig. S8).  Because we can't reach the extreme sides of the curves while tuning cutoff values, before 804 we calculated AUC values, we added theoretical extreme points to ROC -( 0, 0) and ( 1, 805 1), and to PRC -( 0, 1) and (1, Ltrue/( Ltrue + Lfalse)). Ltrue is the total length of real positive 806 set, and the Lfalse is the total length of real negative set.