Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA

Despite critical roles in chromosome segregation and disease, the repetitive structure and vast size of centromeres and their surrounding heterochromatic regions impede studies of genomic variation. Here we report the identification of large-scale haplotypes (cenhaps) in humans that span the centromere-proximal regions of all metacentric chromosomes, including the arrays of highly repeated α-satellites on which centromeres form. Cenhaps reveal deep diversity, including entire introgressed Neanderthal centromeres and equally ancient lineages among Africans. These centromere-spanning haplotypes contain variants, including large differences in α-satellite DNA content, which may influence the fidelity and bias of chromosome transmission. The discovery of cenhaps creates new opportunities to investigate their contribution to phenotypic variation, especially in meiosis and mitosis, as well as to more incisively model the unexpectedly rich evolution of these challenging genomic regions.


One Sentence Summary:
Genomic polymorphism across centromeric regions of humans is organized into large-scale haplotypes with great diversity, including entire Neanderthal centromeres.

Main Text:
The centromere is the unique chromosomal locus that forms the kinetochore, which interacts with spindle microtubules and directs segregation of replicated chromosomes to daughter cells (1). Human centromeres assemble on a subset of large blocks (many Mbp) of highly repeated (171 bp) α-satellite arrays found on all chromosomes. These repetitive arrays and the flanking segments, together the Centromere Proximal Regions (CPRs), play critical roles in the integrity of mitotic and meiotic inheritance (2). In somatic tissues, chromosome instability, including loss and gain of chromosomes, plays large and complex roles in aging, cancer (3), and human embryonic survival (4). Sequence variation in CPRs can affect meiotic pairing (3,4), kinetochore formation (5,6) and nonrandom segregation (3,4,6). Aneuploidy in the germline, typically arising during meiosis, is a large component of genetic disease (9). Further, the unique asymmetry of transmission in female meiosis, where only one parental chromosome is transmitted, presents the opportunity for the evolution of strong deviations from Mendelian segregation ratios (meiotic drive) (10,11). The large selective impact of recurrent meiotic drive is one potential cause of the evolutionarily rapid divergence of satellite DNAs and centromeric chromatin proteins (12), reduced polymorphism in flanking regions and high levels of aneuploidy (13). The challenges inherent to assessing genomic variation in these repetitive and dynamic regions remain a significant barrier to incisive functional and evolutionary investigations.
Recognizing the potential research value of well-genotyped diversity across human CPRs, we hypothesized that the low rates of meiotic exchange in these regions (14) might result in large haplotypes in populations, perhaps even spanning the α-satellite arrays.
To test this, we examined the Single Nucleotide Polymorphism (SNP) linkage disequilibrium (LD) and haplotype variation surrounding the centromeres among the diverse collection of genotyped individuals in Phase 3 of the 1000 Genomes Project (15). Figure 1a depicts the predicted patterns of strong LD (red) and associated unbroken haplotypic structures surrounding the gap of unassembled satellite DNA of a metacentric chromosome. Unweighted Pair Group Method with Arithmetic Mean (UPGMA) clustering on 800 SNPs immediately flanking the chrX centromeric gap in males (Fig. 1c) reveals a clear haplotypic structure that spans the gap and extends, as predicted, to a much larger region (≈7 Mbp, Fig. 1b). Similar clustering of the imputed genotypes of females also falls into the same distinct high-level haplotypes (Fig. S1).
This discovery of the predicted haplotypes spanning CPRs (hereafter referred to as cenhaps) on chrX and all the metacentric chromosomes (Fig. S2) opens a new window into their evolutionary history and functional potential.
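The clustering step described above can be illustrated with a minimal sketch. It assumes phased haplotypes coded as 0/1 alleles and a simple Hamming distance between them. The toy matrix, the distance choice, and the cut into three groups are illustrative assumptions, not the settings used in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy haplotype matrix (invented values): rows are phased haplotypes,
# columns are biallelic SNPs flanking a centromeric gap (0 = ref, 1 = alt).
haplotypes = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 1, 1],
    [1, 1, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 1, 0, 1],
])

# Pairwise Hamming distance: fraction of SNPs at which two haplotypes differ.
distances = pdist(haplotypes, metric="hamming")

# SciPy's "average" linkage is UPGMA.
tree = linkage(distances, method="average")

# Cut the tree into at most three groups (an arbitrary choice for this toy data).
cenhap_labels = fcluster(tree, t=3, criterion="maxclust")

for row, label in zip(haplotypes, cenhap_labels):
    print("".join(map(str, row)), "-> group", label)
```

With real data the rows would be the phased haplotypes over the SNPs flanking the gap, and the number of groups would be chosen by inspecting the tree rather than fixed in advance.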

The pattern of geographic differentiation across the inferred chrX CPR (Fig. 1) exhibits higher diversity in African samples, as observed throughout the genome (15). Despite being fairly common among Africans today, a distinctly diverged chrX cenhap (cenhap 1, highlighted in purple, Fig. 1b,c) is rare outside of Africa. Examination of the haplotypic clustering and estimated synonymous divergence in the coding regions of 21 genes included in the chrX cenhap region (see Table S1) yields a parallel relationship among the three major cenhaps and an estimated Time of the Most Recent Common Ancestor (TMRCA) of ≈700 KYA (Fig. 1d) for this most diverged example. While ancient, putatively introgressed archaic segments have been inferred in African genomes (16,17), this cenhap stands out as genomically (if not genetically) large. The persistence of such ancient cenhaps is inconsistent with the predicted hitchhiking effect of sequential fixation of new meiotically driven centromeres (12). Further, the detection of near-ancient segments spanning the centromere contrasts with the observation of substantially more recent ancestry across the remainder of chrX and with the expectation of reduced archaic sequences on chrX (18). A large block on the right in Fig. 1b, where recombination has substantially degraded the haplotypic structure, comprises SNPs at exceptionally high frequency in Africans. Its history in "anatomically modern humans" (AMH) may be shared with the ancient cenhap in Africa. Many distal recombinants are observed outside of Africa that likely contribute to associations of SNPs in this region with a diverse set of phenotypes, including male pattern hair loss and prostate cancer (19,20).
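One common way to turn synonymous divergence into a date is to scale it against an outgroup calibration under a molecular clock. The sketch below assumes that approach, which may differ in detail from the authors' Methods. The 6.5 MY human-chimpanzee calibration is the value stated for Fig. 1d, the function name is hypothetical, and the divergence values are invented placeholders.

```python
def scale_tmrca(d_within: float, d_outgroup: float, t_outgroup_years: float) -> float:
    """Scale a within-species divergence to years, assuming a constant rate.

    d_within         -- synonymous divergence between two cenhap lineages
    d_outgroup       -- synonymous divergence to the outgroup over the same sites
    t_outgroup_years -- assumed split time from the outgroup
    """
    if d_outgroup <= 0:
        raise ValueError("outgroup divergence must be positive")
    return t_outgroup_years * d_within / d_outgroup


# Placeholder divergences (not values from the paper), calibrated with the
# 6.5 MY human-chimpanzee TMRCA assumed for Fig. 1d.
HUMAN_CHIMP_TMRCA_YEARS = 6.5e6
estimate = scale_tmrca(d_within=0.001, d_outgroup=0.01,
                       t_outgroup_years=HUMAN_CHIMP_TMRCA_YEARS)
print(f"TMRCA estimate: {estimate / 1e3:.0f} KYA")
```

The estimate is only as good as the clock assumption and the number of synonymous sites available, which is why a region with few coding bases cannot be dated confidently this way.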
This deep history of the chrX CPR raises the possibility of even more ancient lineages on other […]
[…] Ancestral in the Neanderthal, is a measure of the proportion of the cenhap lineage shared with Neanderthals and further supports the conclusion that this chr11 cenhap is an introgressed archaic centromere. Fig. 2b shows these mean counts for each SNP class by cenhap group, confirming that the affinity to Neanderthals is slightly stronger than to Denisovans. A second basal African lineage separates shortly after the Neanderthal (cenhap 2, highlighted in purple, Fig. 2a). It is unclear if this cenhap represents an introgression from a distinct archaic hominin in Africa or a surviving ancient lineage within the population that gave rise to AMHs.
The relatively large expanses of these cenhaps and unexpectedly sparse evidence of recombination could be explained by either relatively recent introgressions or cenhap-specific suppression of crossing over with other AMH genomes in this CPR (e.g., an inversion). As with chrX above, the clustering of cenhaps based on coding synonymous SNPs (Fig. 2d) […]
[…] (Table S2), this cenhap likely codes for Neanderthal-specific determinants of smell and taste.
Similarly, in the second ancient cenhap found primarily within Africa (cenhap 2), eight of these ORs harbor 14 amino acid replacements, of which only two are shared with cenhap 1 (see Table S2).
The frequencies of the Neanderthal cenhap in Europe, South Asia and the Americas (0.061, 0.032 and 0.033, respectively), and of the second ancient cenhap in Africa (0.036), are sufficiently high that together they contribute more than half of the amino acid replacement diversity in these 34 ORs among the 1000 Genomes (see Table S5). Thus, a substantial part of the variation in chemical perception among AMH may be contributed by these two ancient cenhaps.
The most diverged, basal clade on chr12 (Fig. 2c, indicated in brown) is common in Africa, but, like the most diverged chrX cenhap, is not represented among the descendants of the out-of-Africa migrations (26). The great depth of the lineage of this cenhap is further supported by comparison to homologous archaic sequences (21,22,23). Consistent with the hypothesis that this branch split off before that of Neanderthals/Denisovans, members of this cenhap share fewer matches with derived SNPs on the Neanderthal and Denisovan lineages (DM) and exhibit strikingly more ancestral non-matches (AN) than other chr12 cenhaps (see Fig. 2b). This putatively archaic chr12 cenhap represents a large and obvious example of the genome-wide introgressions into African populations inferred from model-based analyses of the distributions of sequence divergence (16,17). The small out-of-Africa cenhap nested within a mostly African subclade (indicated in blue in Fig. 2c) appears to be a typical Eurasian archaic introgression with higher affinity to Neanderthals (DM/(DN+DM) = 0.91 and DM/(DM+AN) = 0.90) than to Denisovans (Fig. 2b). This bolsters the conclusion that the basal African cenhap represents a distinct, older and likely introgressed archaic lineage. Unfortunately, there are too few coding bases in this region to support confident estimation of the TMRCAs of these ancient chr12 cenhaps. Based on the numbers of SNPs underlying the cenhaps, this basal cenhap is twice as diverged as the apparent introgressed Neanderthal cenhap, placing the TMRCA at ~1.1 MYA, assuming the Neanderthal TMRCA was 575 KYA (23). While there is no direct evidence of recent introgression, the large genomic scale of this most diverged cenhap (relative to apparent exchanges in other cenhaps) is consistent with recent admixture with an extinct archaic in Africa, although, again, suppression of crossing over is an alternative explanation.
[…] Fig. 1c show substantial differences (Fig. 3a).
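The two affinity ratios and the divergence-scaled date quoted for the chr12 cenhaps are simple arithmetic, sketched below. DM and AN are read as the counts of derived matches and ancestral non-matches described in the text. DN is not defined in this excerpt, and the sketch assumes it counts derived non-matches. The function names and example counts are hypothetical.

```python
def affinity_ratios(dm: float, dn: float, an: float) -> tuple[float, float]:
    """Return (DM/(DN+DM), DM/(DM+AN)) for one cenhap group.

    dm -- mean count of SNPs matching the derived archaic allele
    dn -- mean count of derived non-matches (assumed meaning of DN)
    an -- mean count of ancestral non-matches
    """
    return dm / (dn + dm), dm / (dm + an)


def scaled_tmrca_kya(relative_divergence: float,
                     reference_tmrca_kya: float = 575.0) -> float:
    """Scale a reference TMRCA by a relative divergence (e.g. a ratio of SNP counts)."""
    return relative_divergence * reference_tmrca_kya


# Invented counts, chosen only to show the calculation.
ratio_derived, ratio_ancestral = affinity_ratios(dm=90, dn=10, an=10)
print(f"DM/(DN+DM) = {ratio_derived:.2f}, DM/(DM+AN) = {ratio_ancestral:.2f}")

# A cenhap twice as diverged as the Neanderthal cenhap, with the 575 KYA reference.
print(f"Scaled TMRCA: {scaled_tmrca_kya(2.0):.0f} KYA")
```

Doubling the 575 KYA reference gives 1,150 KYA, in line with the ~1.1 MYA quoted above. The scaling treats SNP counts as proportional to time, so it inherits the uncertainty of that assumption.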
α-satellite array sizes in cenhap-homozygous females are […]
[…] Fig. 1b, Fig. 3b and Fig. S4), it is a potential explanation for particular instances of cenhaps with small estimated array sizes, e.g., the relatively low chrX-specific α-satellite content in the highly […]
The potential impact of sequence variation in CPRs and their associated satellites on centromere and heterochromatin functions has been long recognized but difficult to study (10). Both binding of the centromere-specific histone, CENPA (30) and kinetochore size (8) are known to scale with the size of arrays and to fluctuate with sequence variation in satellite DNAs (7). Through these interactions with kinetochore function and other roles for heterochromatin in chromosome segregation (3,31), α-satellite array variations can affect mitotic stability in human cells (32), as well as meiotic drive systems in the mouse (11). Meiotic drive has been cited as the likely explanation for the saltatory divergence of satellite sequences and the excess of nonsynonymous divergence of several centromere proteins, some of which interact directly with the DNA (12).
However, the high levels of haplotypic diversity and deep cenhap lineages we observe (Fig. S2) conflict with the predictions of a naïve turnover model based on strong directional selection yielding sequential fixation of new driven centromeric haplotypes. The inherent frequency-dependence of meiotic drive (34), associative overdominance (33), a likely tradeoff between meiotic transmission bias and the fidelity of segregation of driven centromeres (13), and the expected impact of unlinked suppressors (34), are plausible explanations for the surprising levels of cenhap variation.
The identification of human cenhaps raises new questions about the evolution of these unique genomic regions, but also provides the resolution and framework necessary to quantitatively address them. Our results transform large, previously obscure and shunned genomic regions into genetically rich and tractable resources, revealing unexpected diversity and immense archaic centromere introgressions. Most importantly, cenhaps can now be investigated for associations with variation in evolutionarily important chromosome functions, such as meiotic drive (35) and recombination (14), as well as disease-related functions, such as aneuploidy in the germline (9) and in development (4), cancer and aging (3).
[…] Table S2), assuming the TMRCA of humans and chimpanzee is 6.5 MY (see Methods and legend for Fig. 1d).