Human-specific CpG ‘beacons’ identify human-specific prefrontal cortex H3K4me3 chromatin peaks

Background : Targeted recruitment of chromatin-modifying enzymes to clusters of CpG dinucleotides contributes toward the formation of accessible chromatin. By interprimate comparison we previously identified the set of nonpolymorphic human-specific CpGs (CpG ‘beacons’) and revealed that these loci were enriched for human disease traits. Due to their human-specific CpG density change, extreme CpG ‘beacon’ clusters ( ≥ 20 CpG beacons/kb) were predicted to identify permissive chromatin peaks within the human genome. Aim: We set out to explore these sequence-defined regions for evidence of an active chromatin signature. Results : Using available comparative primate epigenomic data from neurons of the prefrontal cortex, we show that these CpG ‘beacon’ clusters are indeed enriched for being human-specific H3K4me3 peaks ( χ 2 : p < 2.2 × 10 -16 ) and thus predictive of permissive chromatin states. These sequence regions had a higher predictive value than previous selective analyses. We also show that both human-specific H3K4me3 and CpG ‘beacon’ clusters are increased within current and ancestral telomeric regions, supporting an association with recombination, which is higher towards the distal ends of chromosomes. Conclusion : Therefore, CpG-focused comparative sequence analysis can precisely pinpoint chromatin structures that contribute to the human-specific phenotype and further supports an integrated approach in genomic and epigenomic studies.


Background
Clusters of CpG dinucleotides, termed CpG islands (CpGis), act as a genomic platform for gene expression by providing an accessible environment for the binding of the basal transcription machinery [1,2]. They facilitate this setting, due to both their distinctive density and predominately unmethylated state, by recruiting specific factors that modify the three dimensional structure of the local chromatin [3,4]. Hence these islands reside within an expanse of repressed genome, forming active or transcriptionally competent promoters [5] and possess the characteristic permissive H3K4me3 histone signature [6]. This chromatin configuration is itself associated with expression, but also may further enable the recruitment of components required for this activity [7,8].
Due to this direct effect on 3D chromatin structure by CpG dinucleotides [5], and the potential for regulatory modulation it creates, we previously examined the primate-alignable portion of human genome for nucleotide sequence change that has led to the formation or maintenance of human-specific CpGs [9]. This was performed by sequence comparison of six primates (Enredo-Pecan-Ortheus [EPO] alignment including: human, chimpanzee, gorilla, orangutan, rhesus macaque and marmoset [10,11]). The resultant nonpolymorphic human-specific CpGs were termed human CpG 'beacons' and identify prospective novel human-specific regulatory regions. These CpG 'beacons' are now shown to be markedly enriched in the dynamic portion of the human DNA methylome [12]. We then Human-specific CpG 'beacons' identify human-specific prefrontal cortex H3K4me3 chromatin peaks identified 21 regions of extreme CpG 'beacon' clusters (empirical p < 1 × 10 -3 ) within the human genome. These were shown to colocate with four causative genes in monogenic mental/developmental disorders (ANKRD11 [13], CHL1 [14], EHMT1 [15] and VLDLR [16,17]), and genes implicated with complex traits such as autism and psychiatric disease (ANKRD11 [18], DLGAP2 [19] and DPP10 [18,20]), as well as reidentifying the noncoding RNA, HAR1A [21], implicated in cortex development, by precisely hitting its CpGi. These loci were significantly statistically enriched for neurological trait-related genes. As methylation is inversely correlated with CpGi density [22,23], we were also able to indicate the functional consequences of species-specific CpG density change, through comparative DNA methylome analysis in human, chimpanzee and macaque, with expected changes in methylation in orthologous CpGi.
A recently published comparative chromatin analysis for human-specific H3K4me3 within the prefrontal cortex (PFC) between human, chimpanzee and macaque identified 410 loci with a human-specific peak gain [24]. This dataset provided the ideal opportunity to test whether the hypothesized structural effect of CpG 'beacon' clusters did in fact lead to humanspecific chromatin modification. Furthermore, these data are derived from a cortical region fundamental in human-derived higher functioning [25,26].
Therefore, we tested this hypothesis and identified that CpG 'beacon' clusters are indeed enriched for these human-specific accessible chromatin regions. Furthermore they are increasingly likely to precisely locate the most significant fraction of these peaks. This supports the idea that human species-specific CpG 'beacons' have played a critical role in human evolution, as they locate significant regions of humanspecific functional epigenomic change in the genome, and that CpG-focused comparative sequence analysis between the closely related primate family can be highly informative.

Data
Human-specific H3K4me3 PFC peak locations were taken from the supplementary data (Supplementary  Table S3) of Shulha et al. [24].
Location of human Chip-seq H3K4me3 PFC peaks were accessed from Maunakea et al. [27], via the UCSC Genome Browser Table Browser: Regulation; UCSF Brain DNA Methylation; ucsfChipSeqH3K-4me3BrainCoverage. The sequencing reads are available through the NCBI SRA (study accession number SRP002318) [28]. The CpG 'beacon' data is browser viewable within UCSC [29].

Analysis
All statistical calculations and simulations were performed in R [30]. Wilcoxon test was used for hs-H3K4me3 and CpG 'beacon' comparison between loci with 0 CpG 'beacons' and those with ≥20, using the FDR p-values calculated by Shulha et al. for human versus chimpanzee and macaque regions [24]. For this FDR comparison analysis, those loci with an FDR p = 0 were capped at a p = 1 × 10 -200 . Genome intersect comparisons were via the BEDTools package command intersectBed [31]. Genomic representation plot was constructed with Circos [32]. Genomic enrichment calculations performed with GREAT, by the region-based binominal method, with regional default for associated genomic regions, but with extension reduced to 100 kb [33]. A locus was considered to reside in a distal, or telomeric region, if it overlapped the first or last chromosome band (coordinates via Ensembl Biomart). All human genomic coordinates given are in the GRCh37/hg19 build.

Results
Primate comparative sequence analysis predicts human-specific chromatin state A set of 410 human-specific peaks of the activating chromatin modification, H3K4me3, were identified from a total of 34,639 human peaks, by a comparative analysis of the PFC of human, chimpanzee and macaque by Shulha et al. [24]. Within the 21 extreme CpG 'beacon' peak regions we previously identified [9], available data from Maunakea et al. had shown that 8 possessed H3K4me3 PFC peaks [9,27]. Comparison of these 8 peaks with the Shulha et al. dataset revealed 5 to be identified as human-specific H3K4me3 (hs-H3K4me3) peaks. Therefore the H3K4me3 signal in these CpG 'beacon' clusters is strongly enriched for being present only in human and not in chimpanzee or macaque (χ 2 : p < 2.2 × 10 -16 ). These five hs-H3K4me3 peaks were precisely located by extreme CpG 'beacon' clusters (≥20 CpG 'beacons'/kb) within or near the genes; ANKRD11, CHL1, DPP10, HAR1A and PSD4 (see Randomization via ShuffleBed [31] and genome structure correction (GSC) enrichment [34,35] were highly significant (both p < 1 × 10 -5 ). However, both the extreme CpG 'beacon' clusters and H3K4me3 peaks are not likely to be found with equal chance throughout the entire genome, but are highly likely to reside within CpG-dense regions. To account for this, we permutated the 21 loci, representing random extreme CpG clusters, only from the total of 18,665 human PFC H3K4me3 peaks, out of the total of future science group www.futuremedicine.com 37,902 (identified from Maunakea et al. [27], as the total set genome co-ordinates are not available from Shulha et al. [24]) that overlap CpG-dense regions (22,374 Ensembl CpGis). With 10,000 permutations this did not exceed the observed five peaks (average: 0.466 ± 0.677 overlapping peaks per genome simulation, empirical p < 1 × 10 -5 ). Therefore while as much as ∼83.4% of brain PFC CpGi's possess the H3K4me3 signature, only ∼1.8% have human-specific H3K4me3, whereas by contrast, the CpG 'beacon' clusters are statistically strongly enriched with ∼23.8% having human-specific H3K4me3 (five out of 21 clusters).

Further statistical overlap at lower CpG cluster thresholds
CpG 'beacon' H3K4me3 overlaps were not only restricted to the most extreme clusters (≥20 CpG 'beacons'/kb) but also found at lower levels. At a CpG 'beacons' density level of 15/kb, 11 out of a total of 63 cluster regions overlap with hs-H3K4me3 (simulation average: 1.387 ± 1.164) and 20 out of a total of 254 cluster regions overlaps at CpG 'beacons' density level of 10/kb (simulation average: 5.604 ± 2.347). Permutation analysis therefore for both of these results are highly statistically unlikely (all empirical p < 1 × 10 -5 ).
When we initially calculated the set of the extreme CpG 'beacon' clusters we took a conservative genomewide empirical cut-off of p < 1 × 10 -3 , which was reached at a density level of 20/kb CpG 'beacons'. The permutation however does not become nonsignificant (p > 0.05) until a density level of <17/kb (p = 0.210) and furthermore even at the 17/kb level the binominal genomic enrichment results via GREAT [33] for cognition is significant (p = 7.6 × 10 -6 , FDR Q = 3.36 × 10 -2 ). The H3K4me3 peak overlap at 17/kb is 7, out of a possible 36, (empirical p < 1 × 10 -5 , see Table 1). While the lower CpG 'beacon' density levels of 15/kb and 10/kb are not statistical outliers in their own right, this still does not negate their potential functional importance, and the enrichment for hs-H3K4me3 in these loci strongly supports this.
Additionally, the defined CpG 'beacons' were identified only within the ∼80% of the human genome with rigorous primate homology from the EPO sequence block alignments [10,11] that included, at least, human, chimpanzee and one other primate sequence (abbreviated as the 'h1c1o1' genome). Of the total 410 hs-H3K4me3 peak regions, only 257 lie within this majority of the genome, therefore, an under-representation (∼20% loss genome but 34.15% reduction in peaks; χ 2 : p < 2.2 × 10 -16 ). Thus the primate nonhomologous regions contain more identified hs-H3K4me3 peaks. This may be due to stronger genetic divergence within these regions having a confuture science group Human-specific CpG 'beacons' identify human-specific prefrontal cortex H3K4me3 chromatin peaks Research Article   [24], we find that the most extreme CpG 'beacon' clusters that overlap with hs-H3K4me3 peaks have a significantly higher likelihood of being truly human-specific peaks. These loci have more significant FDR p-values in comparison with both potential chimpanzee and macaque loci (loci with CpG 'beacons' = 0, compared with those with ≥20, Wilcoxon p = 3.57 × 10 -3 for human vs chimpanzee and p = 2.50 × 10 -2 for human vs macaque, see Figure 3A & B). Consequently, while the 410 peaks have variable certainty as to being truly human-specific, as represented by these FDR values, the most likely hs-H3K4me3 lie within extreme CpG clusters. Therefore the conservatively calculated enrichment within these hs-H3K4me3 would be even stronger if false-positive hs-H3K4me3 peaks were excluded. Additional datasets from the other closely related primates, including orangutan and gorilla, would aid the reduction in Outer ring: chromosomes with banding patterns. Dense edge ring: Enredo-Pecan-Ortheus [10,11] h1c1o1 coverage of these chromosomes. Blue peaks: extreme CpG 'beacon' peak locations. Human-specific H3K4me3 location with false discovery rate log p-values compared to chimpanzee (orange), macaque (green) and overlapping chimpanzee and macaque (olive). Location of ANKRD11, CHL1, DPP10, HAR1A and PSD4 indicated with colocating CpG 'beacon' clusters and human-specific H3K4me3. The highlighted yellow wedge is the location of the chromosome 2qFus (2q13-2q14.1) with enrichment of both extreme CpG 'beacon' clusters and of human-specific H3K4me3 peaks (χ 2 : p = 1.045 × 10 -5 ).
false-positive human-specific peaks, however these are currently not publicly available.

Double peaks & telomeric bias
Peaks of hs-H3K4me3 were also identified by Shulha et al. to occur in pairs or groups within 0.5-1 Mb, with an approximately two-to three-fold enrichment. Recent ENCODE data has identified an average peak region of long range interaction of promoters, enhancers and CTCF to be approximately 120 kb upstream of the TSS [36]. This grouping propensity is also seen within the CpG 'beacon' overlap peaks (i.e., DPP10 future science group Human-specific CpG 'beacons' identify human-specific prefrontal cortex H3K4me3 chromatin peaks Research Article

CpG 'beacons' CpG 'beacons'
and EXOC2, among others; Supplementary Table 1). These double peaks had an average genomic distance between them of ∼126 kb. Shulha et al. showed evidence of these close double peak regions having physical interaction by ChIA-PET. ChromHMM segmentation analysis [37] found that colocating with H3K4me1, a high level of H3K4me3 (61-80%) was also in Strong Enhancer Type 4, thus potentially implicating some of these regions in this role. We previously identified a distal chromosome or telomeric bias in CpG 'beacon' clusters (CpG 'beacon' cluster 20/kb = 52.3% in terminal chromosome bands [9]), due to the role recombination-associated biased gene conversion (BGC) plays in their formation. This being the biochemical bias in the DNA repair mechanism that occurs in heteroduplexed DNA, formed during crossing over, leading to clustered weak-to-strong 'biased' substitutions (AT to GC) [38,39]. Therefore, we assessed whether hs-H3K4me3 also have a telomeric bias which could be contributing to these peak groupings. We calculated there is in fact a >twofold enrichment within these telomeric regions compared to all human H3K4me3 peaks (χ 2 : p < 2.2 × 10 -16 ).
We also previously identified a strong clustering bias of CpG 'beacon' clusters, within the historic chromosome 2qFus region, due to past recombination in these archaic telomeric regions [40]. Examining the hs-H3K4me3 peaks, these also show an enrichment for this 2qfus region (2q13-2q14.1 -chr2: 110,200,001-118,800,000) with an average 0.699 peaks/Mb compared with ∼0.131 peaks/Mb across the genome and an approximately sixfold enrichment over the expected hs-H3K4me3 compared to all the human H3K4me3 peaks present here (χ 2 : p = 1.045 × 10 -5 ; see Figure 2). Hence this further supports the model of recombination-associated BGC increase in CpGs leading to the formation of H3K4me3 peaks. Cohen et al. also recently proposed BGC islands, as one of the categories of CpG islands, due to their formation by this process [41].
CpG 'beacons' predict human-specific H3K4me3 peaks greater than other comparative sequence change Shulha et al. compared the identified hs-H3K4me3 peaks with known regions of accelerated humanspecific sequence change, such as the human accelerated regions (HARs) [21] and found only minimal overlap (1/49 loci = HAR1A), as well as nine further studies of positive selection collated together by Akey [42], which also showed low intersection. Due to this inadequate overlap they concluded that comparative sequence analysis was poorly informative of potential functional epigenomic differences [24] and thus that important changes in chromatin structure and func-tion cannot be identified by comparative genomic analysis alone. Furthermore, we also compared the hs-H3K4me3 regions with the human accelerated conserved noncoding (HACNS -0/992 loci) [43] and accelerated noncoding (ANC -6/1356 loci) [44] sets and these also show minimal colocation. However, as shown, the specific CpG dinucleotide change, indicated by CpG 'beacons' through six primate comparison, enabled a far better prediction of hs-H3K4me3 than any of these methods. Therefore important changes in chromatin structure and epigenomic function can be identified by specific comparative genomic analysis.
We simulated these data by permutation with random selections sets of one to 1000 loci from the total of 18,665 human peaks (1000 × randomization each) that then overlap the 410 hs-H3K3me3 and plotted these results on a scatter plot also with the observed overlaps for all these mentioned studies (blue squares) and the CpG 'beacon' data (red and green Squares) (see Figure 4). This clearly shows that while the selection studies perform either the same or worse than the simulation values, the CpG 'beacon' clusters overlap is far in excess of the maximum simulation results (all empirical p < 1 × 10 -5 ), with those adjusted for the reduced numbers within the primate EPO alignment performing the best (green squares).
The only study that approaches the success of the CpG 'beacons' is the expanded set of 202 HARs (∼85.6 kB), as while the top 49 HARs perform poorly, when this set is extended further these loci begin to capture more CpG 'beacon' clusters, such that ∼85.7% of the hs-H3K4me3-overlapping HAR_202 regions are CpG 'beacon' clusters of ≥10/kb (12 out of 14). This overlap, as discussed previously, indicates the strong role of BGC in the formation of both CpG 'beacon' clusters and regions of extreme species-specific dinucleotide divergence in the human genome, driven by recombination [9,38]. Thus, further indicating how this process, with a by-product of modulating sequence, can affect genomic 3D structure in these extreme locations.
The simulation also clearly displays why the comparison against the 9 positive selective studies as used by Shulha et al. [24] is inappropriate to identify human-specific change, as these studies use human population variation and linkage disequilibrium data that is too recent to identify significant genetic changes that have led to variation between primates. Furthermore, the other comparative studies lose power in comparison with the 'beacon' analysis, as they regard all nucleotide change as equal, are focused wider than an interprimate analysis, and in those that center exclusively on nontranscript overfuture science group lapping sequence, such as the ANC regions [44], may exclude many promoter hs-H3K4me3 regions that reside 5´ within first exons, or intragenic promoter CpGi loci. The inaccuracy of using human population data can be starkly seen by the fact that the genome space associated with two or more selective studies from Akey is 244.7 Mb, compared with only 29.9 kb for the extreme CpG 'beacon' ≥20/kb peaks regions, or ∼0.01% of this, and yet the latter has 23.8% successful overlap with hs-H3K4me3 peaks versus 2.36% of the former.

Discussion
DNA methylation is essential for mammalian development and is almost exclusively restricted to the CpG dinucleotide in differentiated cells [45]. Therefore, this epigenomic mechanism requires the presence of the CpG genetic sequence template in order to facilitate the modification's dynamic functionality. Although a lower level of non-CpG cytosine methylation has been identified to occur in the human brain, predominately at CACC motifs [46]. We previously identified, by interprimate comparative analysis, human-specific CpG 'beacon' clusters and found these to be enriched for human disease and phenotype loci. The facilitated epigenetic role that these clusters of CpGs provide is not restricted to DNA methylation, but may also be involved in enabling other modifications of DNA, including hydroxymethylation [47]. Furthermore, we predicted that these clusters would influence chromatin structure, due to their significant human-specific change in CpG density. Using the PFC hs-H3K4me3 data from Shulha et al. [24] we have shown strong evidence of this. This sequence relationship is only associated with H3K4me3, and presently only tested in PFC, but further analysis, when other comparative epigenomic tissue signatures become available, may be revealing.
Mechanisms have evolved to reduce the local functional impact of transposable elements, such as Alu elements [48], which while possessing high CpG content, are usually methylated [49]. However sequence divergence of homologous loci through highly dissimilar species-specific recombination [50], driven by the rapidly evolving recognition motif of PRDM9 [51,52], can lead to BGC-associated increase in local GC content [53]. This can concurrently increase CpG dinucleotides and form increasingly dense [54] or new CpGi, within these hotspots [55], even without requiring the force of positive selection [41]. These CpGi regions, recruit chromatin-modifying enzymes, such as Cfp1, as has been experimentally shown by CpG island inclusion experiments in a mouse model by Thomson et al. [3]. This, in turn, leads to the targeted formation of acces-sible H3K4me3 chromatin structure in the appropriate genomic regions [56]. This chromatin signature may not itself be causative in expression, but is indicative of a permissive promoter formation. Furthermore independent of transcription status, both CpG content and CpGi width correlate with local nucleosome depletion [57].
Dreszer et al. has previously identified outlying regions of high BGC with loci of extreme sequence diverge, and these regions also have a telomeric or distal bias, due to the higher level of recombination that occurs there [38,58]. CpG 'beacon' clusters also share these properties [9] and thus link recombination with H3K4me3 formation (see Hypothesis in Figure 5). Therefore the extreme CpG increase identified in future science group Human-specific CpG 'beacons' identify human-specific prefrontal cortex H3K4me3 chromatin peaks Research Article CpG 'beacon' clusters identify hs-H3K4me3 peaks more successfully than extreme human nucleotide change or studies of selection. Scatter plot of permutated data for random selection of sets of one to 1000 loci from the total of 18,665 human peaks (1000× randomization each) that then overlap the 410 hs-H3K3me4 loci. Upper line: Loess regression line of maximum simulation value; golden points: average for each random selection with average Loess regression line overlaid in grey; red squares: CpG 'beacon' clusters; green squares: CpG 'beacon' clusters adjusted for the fact that the primate EPO alignment only contains 257 hs-H3K4me3 peaks; blue squares: selective studies -HACNS (HACNS -0/992) [43] and ANC (ANC -6/1356) [44], Select_x2, _x3 and _x4+ are overlaps with studies combined by Akey [42] with loci colocating in 2, 3 or 4 or more studies. HARs are the 49 top HARs [21]. HAR_202 is an expanded set of 202 HARs, with this set the only nonbeacon analysis to perform well, as this larger grouping has a strong overlap with CpG 'beacons' (∼85.7% of the hs-H3K4me3 overlapping CpG 'beacons' ≥10/kb). ANC: Accelerated noncoding; CB_10: CpG 'beacon' clusters ≥10/kb; CB_15: CpG 'beacon' clusters ≥15/kb; CB_20: CpG 'beacon' clusters ≥20/kb; HAR: Human accelerated regions; HACNS: Human accelerated conserved noncoding; hs-H3K4me3: Human-specific H3K4me3.
human, within these loci, is not due to severe concurrent loss of CpGs in all of the other primates. Thus we have shown by our primate-specific comparison that these sequence-directed or associated epigenetic changes, identified in murine models, in fact occur in human, and furthermore within the significant brain tissue of the human PFC.
Comparing DNA sequence can identify humanspecific epigenomic variation that can increase the understanding of brain function unique to humans. Epigenetics complements genetics, and is not in conflict with it, justifying an integrated approach in genomic and epigenomic analyses for disease studies [59][60][61]. Structural effects on the genome through chromatin modification can in fact be deduced by direct sequence comparison, especially when the importance of species-specific CpGs 'beacons' in the formation of H3K4me3 is acknowledged and the additional power of comparison between multiple closely-related primates is employed.

Conclusion
We have shown that CpG 'beacon' clusters can precisely locate hs-H3K4me3 and, furthermore, are more enriched in the strongest human-specific peaks. This supports our work that the most extreme CpG 'beacon' peaks map to tissue-invariant peaks of accessible H3K4me3 chromatin, specific only to the human genome, and implicated in the formation of the human phenotype [9].

Future perspective
The integration of genomic, epigenomic and transcriptomic sequence data will enable an increasingly comprehensive understanding of biology. Rapid advances in sequencing technology will speed this powerful combinatorial analysis and facilitate detailed exploration of the evolution of malignant, population and species-specific variation. Increased understanding of how genetic sequence change, such as in CpGs and transcription factor motifs, may be associated with active or passive epigenomic alterations will aid deciphering these data. Furthermore, through these methods, the identification of functional differences, between humans and other closely related primates, will help explain the human-specific phenotype and increase the understanding of human pathophysiology. The histone methylase PRDM9 directs recombination via motif-specificity, with resultant biased gene conversion within these regions increasing GC content and therefore CpG numbers. Dense CpG regions are able to recruit chromatin-modifying factors such as Cpf1, which lead to the formation of permissive H3K4me3 chromatin domains. Due to the strong association of PRDM9 with recombination [51], ancestral alleles of this rapidly evolving DNA-binding factor [52] are hypothesized to be implicated in this recombination-associated biased gene conversion [39].