Annotation of suprachromosomal families reveals uncommon types of alpha satellite organization in pericentromeric regions of hg38 human genome assembly

Centromeric alpha satellite (AS) is composed of highly identical higher-order DNA repetitive sequences, which make the standard assembly process impossible. Because of this the AS repeats were severely underrepresented in previous versions of the human genome assembly showing large centromeric gaps. The latest hg38 assembly (GCA_000001405.15) employed a novel method of approximate representation of these sequences using AS reference models to fill the gaps. Therefore, a lot more of assembled AS became available for genomic analysis. We used the PERCON program previously described by us to annotate various suprachromosomal families (SFs) of AS in the hg38 assembly and presented the results of our primary analysis as an easy-to-read track for the UCSC Genome Browser. The monomeric classes, characteristic of the five known SFs, were color-coded, which allowed quick visual assessment of AS composition in whole multi-megabase centromeres down to each individual AS monomer. Such comprehensive annotation of AS in the human genome assembly was performed for the first time. It showed the expected prevalence of the known major types of AS organization characteristic of the five established SFs. Also, some less common types of AS arrays were identified, such as pure R2 domains in SF5, apparent J/R and D/R mixes in SF1 and SF2, and several different SF4 higher-order repeats among reference models and in regular contigs. No new SFs or large unclassed AS domains were discovered. The dataset reveals the architecture of human centromeres and allows classification of AS sequence reads by alignment to the annotated hg38 assembly. The data were deposited here: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hgt.customText=https://dl.dropboxusercontent.com/u/22994534/AS-tracks/human-GRC-hg38-M1SFs.bed.bz2.

Centromeric alpha satellite (AS) is composed of highly identical higher-order DNA repetitive sequences, which make the standard assembly process impossible. Because of this the AS repeats were severely underrepresented in previous versions of the human genome assembly showing large centromeric gaps. The latest hg38 assembly (GCA_000001405. 15) employed a novel method of approximate representation of these sequences using AS reference models to fill the gaps. Therefore, a lot more of assembled AS became available for genomic analysis. We used the PERCON program previously described by us to annotate various suprachromosomal families (SFs) of AS in the hg38 assembly and presented the results of our primary analysis as an easy-to-read track for the UCSC Genome Browser. The monomeric classes, characteristic of the five known SFs, were color-coded, which allowed quick visual assessment of AS composition in whole multi-megabase centromeres down to each individual AS monomer. Such comprehensive annotation of AS in the human genome assembly was performed for the first time. It showed the expected prevalence of the known major types of AS organization characteristic of the five established SFs. Also, some less common types of AS arrays were identified, such as pure R2 domains in SF5, apparent J/R and D/R mixes in SF1 and SF2, and several different SF4 higher-order repeats among reference models and in regular contigs. No new SFs or large unclassed AS domains were discovered. The dataset reveals the architecture of human centromeres and allows classification of AS sequence reads by alignment to the annotated hg38 assembly. The data were deposited here: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hgt. customText=https://dl.dropboxusercontent.com/u/22994534/AS-tracks/human-GRC-hg38-M1SFs. bed [3]. Each core is flanked by layers after layers of sequence formed by divergent monomeric or dimeric arrays devoid of homogeneous HORs [3][4][5][6]. These layers are composed of slightly different types of monomers and represent the "dead" remnants of the centromeres of our pre-great ape ancestors, which had no chromosome-specific HORs, but rather monomeric or dimeric AS identical in all chromosomes with the possible exception of the Y [3,6,7]. The farther from the "live" homogeneous core, the older and more divergent the dead layers are [6] and more signs of "post-mortal" damage such as deletions, inversions and insertions of mobile elements they display [3,6]. The structure of the flanking pericentromeric regions is more or less symmetrical and each specific layer is often present on both sides of homogeneous core, which performs the centromeric function and forms a kinetochore [6]. The dead divergent layers cannot function as a centromere, but form pericentromeric heterochromatin [8].
With the exception of the Y chromosome, functional HOR arrays can be classed into three "new" suprachromosomal families (SFs 1, 2 and 3), each residing on a number of chromosomes. The older non-HOR AS is divided into the two large groups SF5 and SF4. SF5 is evolutionarily younger and immediately ancestral to the new families. On most chromosomes it directly flanks the functional HOR arrays [3]. SF4 group contains all the older layers of non-HOR AS. Recently it has been subdivided into a number of SFs, most of which have not yet received formal names pending finalization of a new classification system. They are called dead AS layers and are color-coded [6]. Here we refer to the old SF4 as the SF4 + umbrella group, which includes the yellow layer (SF4 proper) and all the older layers defined in [6]. The new SF 1-3, SF5 and SF4+ groups are all composed of their own classes of monomers ( [3] and Table 1) recognizable by the PERCON program [7]. In this work, we do not annotate the colored layers within SF4 + (monomeric group M1 +), as their classification has yet to be completed.
In previous assemblies of the human genome, most of the HOR AS was absent and the core was occupied by a centromeric gap. In the latest hg38 assembly, the gap has been filled with so-called "reference models", which are somewhat arbitrary representations of AS HOR domains. Reference models are not real DNA sequences like traditional GenBank contigs, but instead are collections of all WGS reads, that match a certain HOR, put into a contig by the stochastic approach of using a generative Markov process, which is not expected to recreate the true long-range linear order across the entire array [1,9]. They can however be very helpful in mapping the AS deep sequencing or WGS reads to the human genome assembly.
Due to the complex pattern of intra-and inter-chromosomal identities in the pericentromeric regions of the acrocentric chromosomes 13, 14, 21 and 22, the mapping protocol used for the new assembly was apparently unable to determine which reference model belonged to which chromosome and what were the precise locations of the AS sequences on the chromosomes. Thus, all the HOR domains, which are present on at least one of these chromosomes, were put together in a single block, and this block was placed into the former centromeric gap on each chromosome. The same block of 13 reference models arranged in about the same order appears on all four chromosomes, but individual reference models have different names on every chromosome. Note that this block includes two live centromeres (paired domains 13/21 and 14/22), of which only one is actually present on any particular chromosome. Also, the identical sets of 3 AS reference models (of which only one is alive) appear on chromosomes 5 and 19 (paired domain 5/19), and the live model from this set also appears on chromosome 1 where the HOR is very similar to 5/19 paired domain and apparently cannot be distinguished by reference model assembly process (see Tables 2 and S1).

AS classification used by PERCON in the context of the human genome
AS was classed into five suprachromosomal families (SFs 1-3, SF4+ and SF5) according to monomeric classes in the sequence (Table 1), as described earlier [7]. Of those, SFs 1-3 are the new families of homogeneous HORs residing in functional centromeres in all autosomes and the X. In many chromosomes, on the periphery of the live HOR domain, much smaller domains formed by different new family HORs may also be present [3,10]. These could be the remnants of formerly functional centromeres, which have been recently replaced by other new family HOR domains and have been heavily deleted since their death. Such damaged dead centromeric domains are expected to appear on both sides of a live centromere and to be somewhat less homogeneous and less regular. On the other hand, they could be just occasional amplifications of a piece of AS, the HORs which have never had centromeric function which we termed pseudocentromeres. If such pseudocentromeric HOR has amplified a piece of AS residing in a segment duplication (SD) or a piece of some atypical border sequence, it may appear as an AS domain with unexpected location or composition. Also, if a piece of a damaged old centromere is amplified, it may once again appear as  The table summarizes the data reviewed in [3] and reported in [6]. a Used in this paper. b In live domains ancestral arrangement can only be observed or deduced from monomer order within a HOR unit.
homogeneous and regular as a live centromere. Minor HOR domains may belong to the same SF as the live HOR on a given chromosome (e.g. D18Z1 and D18Z2) or to a different SF (e.g. D1Z7 and D1Z5) [3,11]. Sometimes, a peripheral small HOR domain may contain just a slight variation of the live HOR. Such variants are usually 93%-97% identical to the main HOR. At least in one case (D17Z1-B [12]), such divergent variant has been demonstrated to be active as a centromere in some individuals (centromeric epiallele) [13]. The new families have only a very small proportion of non-HOR AS presumably represented by stray pieces and domain border sequences. SF4+ and SF5 groups are mostly formed by divergent non-HOR AS, which represents the dead centromeres of our primate ancestors [3,6,7]. SF5 is the youngest dead SF located distally, right next to homogeneous cores [2,3,9]. Usually, SF5 domains are formed by irregular alternation of R1 and R2 monomers and contain no HORs [14]. However, several exceptions were reported, such as low copy-number HOR domains on chromosomes 4, 7, 5, 19 and acrocentrics [15,16,21]. These low copy number HORs were perceived as occasional small scale amplifications in a recombination-prone tandem array of a dead centromere [15], i.e. pseudocentromeres that have never had centromeric function. However, it has to be tested if such HORs might occasionally play the role of transient short-lived centromeres. Here, we have found several more low copy number SF5 HORs among reference models (Table 3). Finally, SF4+ classification group represents all the more distal and older dead families. As described in [6], it contains SF4 proper, which is next and distal to SF5, pooled with all the other yet older and more distal families including SF6 and others which have not been named yet and are called dead layers and are color-coded (see Table 1). SF4+ sequences are divergent and contain no live HORs except for the relic Y chromosomespecific HOR family, which belongs to SF4 proper, reportedly one of the smaller functional HOR domains in human genome [2]. However, PERCON annotation revealed many more SF4+ HORs both among AS reference models and regular contigs (see Tables 2, S1, and S3 and Discussion section).
The 12 monomeric classes recognized by PERCON group into 2 ancestral types, A and B (see Table 1), which may be differentiated by several variable nucleotide positions in 35-51 region of the monomer (the A/B box) [14,17]. In the B type, this region binds the well studied CENP-B protein and in the A type it reportedly binds a certain pJalpha protein the identity of which has not been established.

PERCON program
The PERCON program (formerly various parts of it were called the PERCON, DIST and BREVN) was utilized for AS classification as described in [7]. It was executed in a number of steps. The first step was the identification of AS performed in two phases. Phase 1 implemented a fast database homology search with the aid of octanucleotide dictionaries. Briefly, the distance ro = f/f (random) between two fragments was calculated where f = log((N1 + N2) / 2nc); N1 and N2 were the sizes of octanucleotide dictionaries of the compared fragments; nc was the size of the intersection dictionary shared between the two fragments, and f (random) was the f value calculated for a random sequence. The ASC consensus monomer (see below) was used for generation of the N2 dictionary. The threshold value of ro = 0.6 was determined experimentally, above which no AS sequences retained in a sample. Ro = 0 for identical sequences. At phase 2, all the sequences, which cleared the threshold, were checked by an automatic dot matrix procedure. The program produced a quantitative parameter roughly reflecting the probability of a random generation of the best diagonal observed in a given dot matrix. The sequences which cleared a certain experimentally determined threshold were retained for further analysis.
The second step was AS monomer identification and SF classification. Monomers were identified by repeated alignment to the same AS consensus test sequence (ASC; shown in Fig. S1), a modified version of ALPHA-ALL consensus derived from consensus sequences of all 12 known monomeric classes [14]. Position 1 in the monomer by tradition was arbitrarily assigned to the first nucleotide of the BamHI site in chromosome X-specific HOR DXZ1 [18]. The N positions in the A/B box of ALPHA-ALL were set to the A configuration as shown in Fig. S1. The repeated alignment was performed by a modified Smith-Waterman-Gotoh algorithm [19,20] and was stopped when relative alignment score (rs) of the monomers obtained became 0.29 or less. rs was calculated as an alignment score divided by reward for a match multiplied by the length of alignment. The alignment score was the number of matches multiplied by reward for a match minus the number of gaps multiplied by a gap opening penalty minus the number of nucleotides in gaps multiplied by a penalty for gap extension minus a number of mismatches multiplied by a penalty for mismatch. At the very ends of the monomer the alignment was not always precise and small gaps of up to 5 bp often separated the adjacent monomers. This did not affect monomer classification. Next, every monomer was classed into one of the 12 known standard monomer classes (J1, J2, D1, D2, W1-W5, R1, R2 and M1 [3]) or defined as unclassed (Um) or random (Xm) by a simple Bayesian classification procedure that utilized consensus matrices of the 12 classes of monomers together with the random matrix (shown in Fig. S1). The program estimated the probability of a hypothesis that a given monomer belonged to one of the known monomeric classes and, if it met the threshold (typically 0.9), assigned the classification. Otherwise, the monomer was deemed "unclassed" (e.g. chimeric monomers where half belongs to one class and half to another class). Altogether, 14 groups were identified by PERCON. Long sequences were processed in consecutive 5 kb windows, which overlapped by about 200 bp. Independently, the region of the A/B box (positions 35-51) was classed in every monomer in the sequence (A box, B box, X for random, U for unclassed and Q if, in the truncated monomer, the box region was not present). Classification was performed by Bayesian classifier in the same way as for the whole monomers. The matrices for the A and B boxes were generated by summation of all the consensus matrices of individual monomeric classes that belonged to type A and type B (shown in Fig. S1). Note that the A/B classification did not assess the functionality of the B-box in CENP-B binding-it just determined to which ancestral type the monomer belonged. In some rare cases, the box classification and the monomer classification may contradict each other (e.g. R2 monomer which has a B-box). At least some studied instances of such monomers are hybrids with the box region coming from one class and the rest of the monomer from the other (data not shown).

UCSC Browser Track
The track was created by PERCON program developed by V.A. Shepelev and I.A. Alexandrov [7]. AS monomers were identified by PERCON similarity search, extracted and distributed into the classes characteristic of the 5 SFs by a Bayesian classifier. Program output contained detailed information on AS monomers, including monomeric class, result of independent typing of the A/B box, genomic coordinates and strand orientation, which were used for the annotation track. For incomplete monomers with length less than 140 bp, the monomer class was shown in lowercase letters; for longer monomers in uppercase letters. These data generated for hg38 human genome assembly were transformed into a Browser Extensible Display (BED) format suitable for viewing as a custom annotation track in the UCSC Human Genome Browser (http://genome.ucsc.edu/). To convert PERCON output to BED format we wrote an AWK script (available at https://github.com/ enigene/prcn2BED) which also color-codes the monomers according to the monomeric type. After using the prcn2BED, the resulting BED file was processed by a second AWK script (available at https://github. com/enigene/remisct) to remove duplicate segments resulting from overlap of the 5 kb windows. Of two overlapping monomers, the longer one remained intact and the shorter one was trimmed. The resulting annotation track is available at http://genome.ucsc.edu/cgi-bin/ hgTracks?db=hg38&hgt.customText=https://dl.dropboxusercontent. com/u/22994534/AS-tracks/human-GRC-hg38-M1SFs.bed.bz2. The track is self-explanatory, as in the full-view mode each monomer is marked with respect to its monomeric class and the A/B type according to Table 1.
By using the Table Browser, the track data can be analyzed in text format and filtered or transformed to generate various statistics. For instance, different classes of monomers can be counted per individual chromosome or chromosomal region. Also, the overlaps with other tracks can be created and retrieved as a new track, DNA sequence, or in text format.

Overall statistics of AS in hg38 assembly
The overall statistics of AS was collected from the UCSC Browser Track using Table Browser   RepeatMasker records that had no overlap with PERCON AS SF track constituted 3628 bp or 0.01% of total detection. The latter records were all small fragments shorter than one monomer. The size of DNA occupied by monomers of each SF determined in hg38 assembly is shown in Fig. 1 where it is compared to the data obtained in the analysis of WGS database. One million HuRef WGS reads (PRJNA19621) obtained by random DNA fragmentation were processed by PERCON in the same way as described previously for analysis of the BAC ends [7]. There was no dramatic difference in SF profiles between the sets, except the proportion of unclassed monomers in WGS (6%) was predictably higher than in the assembly (2.5%). Both a large number of truncated monomers at the ends of WGS reads and the low quality of sequence at the ends of Sanger reads contributed to this difference (Fig. S2). To correct for these factors we evaluated the effect of trimming the bad ends using the LUCY program with default parameters [22] and of filtering the dataset for monomers 140 bp or longer (such monomers are marked with upper case letters in PERCON output). The results are shown in Fig. S2, where it can be seen that each step reduced both the number of unclassed monomers and the total AS detection. To combine good detection with more accurate SF measurements we used the SF proportions obtained in the double-filtered sample to divide the AS amount obtained in the unfiltered sample. After such correction (see legend to Fig. S2 for more details) the proportion of unclassed monomers dropped to 2% and the WGS data appeared to be in fairly good concordance with the assembly. The only significant difference was a larger SF3 in the assembly. The reasons for this discrepancy we did not investigate. A more detailed discussion of the possible sources of unclassed monomers was provided in [7]. The proportions of SFs in human genome shown in Fig. 1 are, as much as we know, the first accurate estimate obtained in a non-biased sample. The results differ significantly from the ones in [7] which was our previous attempt to SF-class a large sample of AS fragments. The difference is presumably due to restriction enzyme bias in the older sample. The above statistics suggest that monomeric maps for most human AS sequences can be obtained without running PERCON, but simply by finding the same exact sequence or a very similar copy of it in the hg38 assembly.

Annotation of AS HOR reference models
For quick reference, we also provide tables of SF-annotated AS reference models as they appear in the assemblies of individual human chromosomes. Although annotation of individual HORs is beyond the scope of this work, we indicate the known live HORs, which are the largest reference models (except for the Y chromosome). Altogether 109 reference models are used in hg38 assembly (Table S1). After correction for identical models in different chromosomes, 66 unique models remain (Table 2). Of these, 18 unique models represent 22 live centromeres of autosomes, as chromosomes 13/21, 14/22 and 1/ 5/19 share the same live reference models. Two additional models represent live centromeres of sex chromosomes. All of them are classed in SFs as previously reported for respective live centromeres and can be recognized by high identity to the known centromeric sequences from the list shown in [3] (see Tables 2 and S1). Of the remaining 46 models, 24 contain variant live HORs, dead HORs or pseudocentromeres of the new families (SF1-3), and 10 contain SF5 HORs which were not known to form live centromeres and were traditionally perceived as pseudocentromeres. Previously, such HORs were reported on chromosomes 4, 5, 7, 19 and acrocentrics [15,16,21]. However, recent evidence shows that live centromeres of orangutan are likely formed by SF5 HORs [23], so some SF5 HOR domains may in fact be dead centromeres. Finally, 12 other unique pericentromeric reference models were classed as SF4 +. Except for the live SF4 + centromere of chromosome Y, SF4 + HORs were not widely reported in man. The only example known to us was the pTRA-2 AS clone [21] which was classed as SF4 [3] and shown to form a cluster of 75 HORs in the short arm of chromosome 21 [24] (see the Discussion section below). In particular, SF4 + HORs are present in chromosomes 15, 20 and the 13/14/21/22 group.
Due to some problem in the reference model assembly process, 5 unique reference models (marked in Tables 2 and S1) were assembled incorrectly, with a reverse order of monomers in a HOR (K. Miga, personal communication). The corrected versions of these reference models were obtained from K. Miga and used throughout our analysis.
Note that non-HOR AS is not supposed to appear in reference models.

Discussion
Inspection of the AS assembly track shows that PERCON adequately and comprehensively classes AS monomers and that they are organized predominantly into the major known modes characteristic of the known SFs. No long arrays of unclassed monomers are observed. SF1 and SF2 sequences are uniformly arranged in arrays with J1J2 and D1D2 dimeric periodicities, the remnants of W1W2W3W4W5 pentameric order can be discerned in SF3 sequences, and SF5 clusters demonstrate irregular alternation of R1 and R2 monomers. Although HORs are not annotated in the track, these repeats can be easily seen in SF3 and SF5 due to reiteration of complex patterns of W1-5 or R1 and R2 monomeric classes. In more dull dimeric sequences, HORs can often be seen due to some irregularity which occurs once or twice in a HOR, like the occasional Um, R1 or R2 monomer often found in SF2 HORs, or some other breach of dimeric periodicity like D1D1 dimer in D18Z2. In most cases, these structural peculiarities faithfully reflect the features of individual HORs reported in literature, but often the reiteration of HORs in reference models appears to be imperfect or even dramatically disturbed. Whether this reflects the true genomic arrangement or some problem in the algorithm of reference model formation remains to be investigated.
We noted a few previously unreported or poorly-reported minor modes of AS organization, as follows: (i) relatively long clusters For WGS dataset (1 million reads), the number of AS monomers identified by PERCON was multiplied by the average length of a monomer in this dataset (146 bp) and normalized to the genome size (shown as "HuRef raw"). The same amount of AS divided in proportions obtained in the same sample with bad ends trimmed and filtered for monomers 140 bp or longer (average monomer length 168 bp) is shown as "HuRef corrected" (see Fig. S2 for details). For the assembly, the length of all monomers identified by PERCON in each category was summarized directly from PERCON track using the Table Browser. In both datasets, the real amounts are slightly underestimated in a similar manner, as small gaps which PERCON often leaves between monomers due to imperfect alignment of the ends are not taken into account. composed entirely or almost entirely of R2 with very few or no R1; (ii) mixed occurrence of J1 and J2 monomers alternating with R1 and R2 over relatively long regions, or the same kind of pattern with D1D2/ R1R2 alternation; (iii) SF4+ HORs, which appear to be no less common than long known SF5 HORs. Below we will briefly comment on these unusual modes.
Pure R2 clusters were predicted by our scenario of SF5 formation and of the origin of the new families [3,6,7,14], but were actually found only recently due to their relative rarity [23]. We proposed that, in the common ancestor of orangutan and man, the centromeres were formed by pure R2 arrays and this was the last generation of type A centromeres in the human lineage. These centromeres, like all previous AS centromeres, had no CENP-B binding sites. At some point, R1 (type B) monomer, which had a CENP-B binding site, but was otherwise very similar to R2, formed due to several point mutations. The presence of CENP-B endowed this monomer with an ability to spread by irregular transposition-like process only within live R2 centromeres resulting in fixation of the B-box which ruined the regularity. These irregular AB centromeres were less efficient than regular ones and were rescued in African apes by amplification of AB type HORs of the new families, which restored regularity. In the orangutan branch, the regularity was restored by amplification of diverse SF5 HORs (different in different chromosomes), some of which were pure R2, some had both R1 and R2 in an irregular pattern, and some had R1R2 inner dimers [23]. The non-HOR pure R2 clusters listed in Table 3 are likely the pieces of pure R2 arrays which removed from the bulk centromeres due to some chromosomal rearrangements (and therefore died) before the R1 invasion and were thus immune to R1 insertion. The same logic suggests that the gradient of R1 density in human SF5 arrays documented in Table S2 reflects the gradual displacement and death of irregular SF5 domains and can be perceived as a clock timing the R1 invasion.
Presumably, the few R1-rich SF5 regions in the human genome (Table S2) are the ones which were functional (and thus exposed to R1 invasion) for the longest time. The large pure R2 HOR domain in chromosome 20 (GJ212117 and the adjacent regular contig FP565326) would hint that regular SF5 centromeres may have played some role in the early evolution of African apes before they were replaced by the new families. For such old, dead HORs, one would expect somewhat decreased homogeneity and the presence in chimpanzee and gorilla genomes. We found inter-HOR identity of 98.5% and no high identity hits in great ape WGS datasets (data not shown). Thus, the HOR is unlikely to be a relic of an era of SF5 centromeres, but is probably a more recent pseudocentromere which appeared via amplification of a piece of a dead pure R2 array.
Mixed SF1/SF5 and SF2/SF5 domains documented in Table 4 occupy relatively large space in the assembly because they are amplified in chromosomes 3 and 20, while non-HOR mixed regions are a tiny fraction. Potentially, three explanations of such mixes are feasible. They could be: (i) former pure SF1 or SF2 domains that were heavily deleted with formation of a large number of hybrid monomers (e.g. half D1, half D2) which are usually classed as Um or either R1 or R2, depending on configuration of the A/B box in the monomer (data not shown). Solitary (one per HOR) apparent SF5, but actually hybrid D1/D2 monomers are often present in SF2 HORs (data not shown); (ii) mixes of genuine new SF and SF5 monomers, which perhaps formed on the border of SF5 and live centromeres by recombination across the border, and; (iii) clusters formed by early versions of D and J monomers, which differed from their R type progenitors much less than the later "mature" versions that formed most of the HORs in respective SFs. (i) and (iii) should be regarded as "apparent" or "false" mixes and classed as respective new SF. (ii) is a "genuine" mix. All three types can potentially be HOR-amplified, which would greatly enlarge their genomic proportion. The (iii) seems to be true at least in mixed HOR arrays in chromosomes 3 and 20, as most of R1 and R2 monomers in respective HORs possess some of the mutations characteristic of J1 and J2 or D1 and D2, respectively (see detailed analysis in Supplementary note 1). This makes the arrays only "apparent mixes" and allows classing them as SF1 and SF2, respectively. High homogeneity and absence in African apes indicate that both mixed HOR arrays are likely pseudocentromeres. However, as both have divergent ancestral internal periodicities (see Supplementary note 1), these pseudocentromeres may have been formed by recent amplification of more ancient, dead HOR centromere material. Options (i) and (ii) may apply to other mixed domains which have no clear HORs. They warrant careful detailed investigation. So far, only two different non-orthologous live Y chromosome SF4+ HORs were known (DYZ3 in man and BACs AC195625, AC156580, AC163730, AC163738 in chimpanzee) [27,28]. Also, evidence for live SF4 + HORs in gibbons has been reported [6,29,30]. Additionally, the D21Z5 SF4 + HOR, also known as pTRA-2, (X55367, corresponds to GJ212124) was convincingly demonstrated on the sequence level [21,24]. The latter appears to be a segmental duplication also present on other acrocentric chromosomes. In the human assembly, in addition to DYZ3, we class 12 reference models as SF4+. Eight of them are part of the group assigned to chromosomes 13/14/21/22, but not all of them necessarily reside on all of these chromosomes. For instance, another member of this group, the SF2 D13Z1/D21Z1 live HOR forms the live centromeres of chromosomes 13 and 21, but is not massively present in chromosomes 14 and 22 [3]. The same applies to chromosomes 14/22 live HOR which is absent in chromosomes 13 and 21. Also, not all of them are confirmed by regular contigs (see Table S2). Another four SF4 + models reside on chromosomes 15 and 20 (2 on each). The GJ212105 HOR is also located in regular contigs ABBA01015870 and AL837517 in the same area on 20q. The latter contig has 54 kb HOR domain with HOR length 1872 bp. The high HOR similarity of 99.9%-100% is very unusual, which implies recent amplification. The large size of chromosome 15 reference models (GJ212036 and GJ212042) raises an interesting possibility that they are dead SF4 + HOR centromeres. Such centromeres were reported only in gibbons [30], but not in great apes, with the notable exceptions of human and chimpanzee Y chromosomes. In these cases, however, the SF4+ centromeres function in chromosomes that do not have more recent generations of AS, such as SF5 or the new families, which presumably would compete for centromeric function more efficiently than SF4+ [6]. Arguably, in chromosome 15, which has SF2 functional centromere and some SF5 as well, the SF4+ HOR domains may only be the dead centromeres that date back to the time before the great apes, when SF5 and the new families did not exist. However, high homogeneity (99%) of these HORs, both in the reference models and in regular contigs (listed in Table S3) and their absence in genomes of chimpanzee and gorilla (data not shown) argue that they are recent mega-scale pseudocentromeres.
Our theoretical scenario of AS centromere evolution [6] states that only a functional centromere can stably maintain the mega-amplified state and homogeneity characteristic of live centromeres due to the involvement of hypothetical kinetochore-associated recombination machine (KARM). There are at least two possibilities to reconcile the existence of mega-scale pseudocentromeres in the short arm of chromosome 15 with this view. First, it could be just a rare occasion of a very large and very recently amplified array, which is now experiencing shrinking and degradation that will become evident with time. The second, and more intriguing, possibility is that the short arms of acrocentric chromosomes possess their own special recombination machine involved in amplification and homogenization of the arrays of ribosomal RNA genes located there [31]. The machinery could also be used by some DNA repeats in the surrounding heterochromatin, which would contribute to generation and maintenance of a heterochromatic milieu of the short arms of acrocentric chromosomes apparently required for proper functioning of the nucleolus organizer regions.