Comparative genomics reveals insight into the evolutionary origin of massively scrambled genomes

Ciliates are microbial eukaryotes that undergo extensive programmed genome rearrangement, a natural genome editing process that converts long germline chromosomes into smaller gene-rich somatic chromosomes. Three well-studied ciliates include Oxytricha trifallax, Tetrahymena thermophila, and Paramecium tetraurelia, but only the Oxytricha lineage has a massively scrambled genome, whose assembly during development requires hundreds of thousands of precisely programmed DNA joining events, representing the most complex genome dynamics of any known organism. Here we study the emergence of such complex genomes by examining the origin and evolution of discontinuous and scrambled genes in the Oxytricha lineage. This study compares six genomes from three species, the germline and somatic genomes for Euplotes woodruffi, Tetmemena sp., and the model ciliate O. trifallax. We sequenced, assembled, and annotated the germline and somatic genomes of E. woodruffi, which provides an outgroup, and the germline genome of Tetmemena sp. We find that the germline genome of Tetmemena is as massively scrambled and interrupted as Oxytricha’s: 13.6% of its gene loci require programmed translocations and/or inversions, with some genes requiring hundreds of precise gene editing events during development. This study revealed that the earlier diverged spirotrich, E. woodruffi, also has a scrambled genome, but only roughly half as many loci (7.3%) are scrambled. Furthermore, its scrambled genes are less complex, together supporting the position of Euplotes as a possible evolutionary intermediate in this lineage, in the process of accumulating complex evolutionary genome rearrangements, all of which require extensive repair to assemble functional coding regions. Comparative analysis also reveals that scrambled loci are often associated with local duplications, supporting a gradual model for the origin of complex, scrambled genomes via many small events of DNA duplication and decay.


Introduction
Organisms do not always contain a single, static genome. Programmed genome editing is a naturally occurring and essential part of development in many organisms, including ciliates (Chen et al., 2014), nematodes (Mitreva et al., 2005), lampreys (Smith et al., 2012), and zebra finches (Biederman et al., 2018). Most of these events involve precise removal and rejoining of large regions of DNA during postzygotic differentiation of a somatic genome from a germline genome. Ciliates are microbial eukaryotes with two types of nuclei: a somatic macronucleus (MAC) that differentiates from a germline micronucleus (MIC). In the model ciliate Oxytricha, the MAC is entirely active chromatin (Beh et al., 2019) and the hub of transcription. The three species that we compare are all spirotrichs, which have gene-sized 'nanochromosomes' in the MAC, present at high copy number (Swart et al., 2013;Chen et al., 2015;Wang et al., 2016;Chen et al., 2019;Lindblad et al., 2019;Vinogradov et al., 2012). The diploid MIC participates in sexual reproduction, but its megabase-sized chromosomes are mostly transcriptionally silent.
Gene loci are often arranged discontinuously in the MIC, with short genic segments called macronuclear destined sequences (MDSs), interrupted by stretches of non-coding DNA called internally eliminated sequences (IESs) ( Figure 1A). During sexual development, a new MAC genome rearranges from a copy of the zygotic MIC genome. MDSs join in the correct order and orientation, whereas MIC-limited genomic regions undergo programmed deletion, including repetitive elements, intergenic regions, and IESs ( Figure 1A). Though analogous to intron splicing, these events occur on DNA. The MDSs for some MAC chromosomes are scrambled if they require translocation or inversion during MAC development ( Figure 1A). Pairs of short repeats, called pointers, are present at MDS-IES junctions in both scrambled and nonscrambled loci (Mitcham et al., 1992;Prescott, 1994). Pointer sequences are present twice in the MIC, at the end of MDS n and the beginning of MDS n+1. One copy of the repeat is retained at each MDS-MDS junction in a mature MAC chromosome ( Figure 1A). These microhomologous regions help guide MDS recombination, but most are non-unique, and the shortest pointers are just 2 bp. Thousands of long, noncoding template RNAs collectively program MDS joining (Nowacki et al., 2008;Lindblad et al., 2017;Yerlici and Landweber, 2014).
Numerous studies have inferred the possible scope of genome rearrangement in different ciliate species using partial genome surveys. In Paramecium, PiggyMac-depleted cells fail to remove MIClimited regions properly, which provided a resource to annotate ~45,000 IESs prior to assembly of a draft MIC genome (Arnaiz et al., 2012). The use of single-cell sequencing has allowed pilot studies to sample partial MIC genomes of diverse species (Chen et al., 2019;Maurer-Alcalá et al., 2018a;Maurer-Alcalá et al., 2018b;Smith et al., 2020). Alignment of tentative MIC reads to either assembled MAC genomes or single-cell transcriptome data predicts over 20 candidate scrambled loci in two basal ciliates, Loxodes sp. and Blepharisma americanum (Maurer-Alcalá et al., 2018b) and hundreds of candidate loci in the tintinnid Schmidingerella arcuata (Smith et al., 2020). Nearly one-third (31%) of approximately 5000 surveyed transcripts may be scrambled in Chilodonella uncinata (Maurer-Alcalá et al., 2018a, Figure 1B), which has four confirmed cases of scrambled genes (Katz and Kovner, 2010;Gao et al., 2014). Transcriptome-based surveys offer less precise estimates and cannot distinguish RNA splicing. Several computational pipelines have been developed to facilitate the inference of genome rearrangement features by split-read mapping in the absence of complete MIC or MAC reference genomes (Denby Wilkes et al., 2016;Zheng et al., 2020;Feng et al., 2020;Seah et al., 2021). By surveying lighter genome coverage prior to full sequencing, these tools provide partial insight into germline architecture. This helps guide selection of species for full genome sequencing and subsequent construction of complete rearrangement maps between the MIC and MAC genomes. High-quality MIC genome reference assemblies are only currently available for three ciliate genera: Oxytricha (Chen et al., 2014), Tetrahymena (Hamilton et al., 2016), and Paramecium (Guérin et al., 2017;Sellis et al., 2021).
Programmed genome rearrangements in Oxytricha exhibit the highest accuracy and largest scale of any known natural gene-editing system, with exquisite control over hundreds of thousands of precise DNA cleavage/joining events. Accordingly, its germline genome structure is arguably the most complex of any model organism (Chen et al., 2014), requiring programmed deletion of over 90% of the germline DNA during development and massive descrambling of the resulting fragments to construct a new MAC genome of over 18,000 chromosomes . This differs from the distantly related Tetrahymena and Paramecium that both eliminate ~30% of the germline genome (Hamilton et al., 2016;Guérin et al., 2017). Paramecium uses exclusively 2 bp pointers and lacks evidence of any scrambled loci. A small number of scrambled loci (4 confirmed out of 2711 candidates) have been reported in Tetrahymena (Sheng et al., 2020, Figure 1B). Tetrahymena and Paramecium diverged from Oxytricha over 1 billion years ago (Parfrey et al., 2011;Bracht et al., 2013), (1) Nonscrambled genes rearrange simply by joining consecutive macronuclear destined sequences (MDSs, blue boxes) and removing internal eliminated sequences (IESs, thin lines). (2) Rearrangement of scrambled genes requires MDS translocation and/or inversion. Pointers are microhomologous sequences (colored vertical bars) present in two copies in the MIC and only one copy in the MAC where consecutive MDSs recombine. (B) Comparison of genome rearrangement features of representative ciliates and the non-ciliate Plasmodium falciparum as an outgroup (phylogenetic information is based on Parfrey et al., 2011;Bracht et al., 2013). Conclusions from this study are shown in bold. * indicates that some scrambled pointers in Euplotes woodruffi are much longer, as discussed in the results. Statistics for pointers ≤30 bp in E. woodruffi are shown. Table information derives from the following sources: 1 - Swart et al., 2013;2 -Lindblad et al., 2019;3 -Chen et al., 2014;4 -Chen et al., 2015;5 -Sheng et al., 2020;6 -Eisen et al., 2006;7 -Hamilton et al., 2016;8 -Aury et al., 2006;9 -Guérin et al., 2017;10 -Arnaiz et al., 2012;11 -Riley and Katz, 2001;12 -Maurer-Alcalá et al., 2018a;13 -Katz and Kovner, 2010;14 -Gao et al., 2014. which leaves a large gap in our understanding of the emergence of complex DNA rearrangements in the Oxytricha lineage.
Open questions include how did the Oxytricha germline genome acquire its high number of IES insertions and how do scrambled loci arise and evolve. Three previous studies tackled these questions at the level of single genes and orthologs, including DNA polymerase α, actin I, and Telomere endbinding protein subunit α (Hogan et al., 2001;Chang et al., 2005;Wong and Landweber, 2006;DuBois and Prescott, 1995). Here, we provide the first comparative genomic analysis of Oxytricha trifallax and two other spirotrichous ciliates, Tetmemena sp. and Euplotes woodruffi. Tetmemena sp. is a hypotrich similar to Tetmemena pustulata, formerly Stylonychia pustulata (Chen et al., 2015), in the same family as O. trifallax ( Figure 1B; Chen et al., 2014;Chen et al., 2015). Hypotrichs are noted for the presence of scrambled genes, based on previous ortholog comparisons (Chen et al., 2015;Hogan et al., 2001;Chang et al., 2005;DuBois and Prescott, 1995; Figure 1B). E. woodruffi, together with the hypotrichous ciliates, belong to the class Spirotrichea ( Figure 1B). Like hypotrichs, Euplotes also has gene-sized nanochromosomes in the MAC genome (Wang et al., 2016;Chen et al., 2019;Chen et al., 2021), but this outgroup uses a different genetic code (UGA is reassigned to cysteine, Meyer et al., 1991), and little is known about its MIC genome. A partial MIC genome of Euplotes vannus was previously assembled, and it contains highly conserved TA pointers (Chen et al., 2019), consistent with previous observations in Euplotes crassus (Klobutcher and Herrick, 1995). This differs from O. trifallax, which uses longer pointers of varying lengths, with scrambled pointers typically longer than nonscrambled ones (Chen et al., 2014, Figure 1B). This observation suggests that longer pointers may supply more information to facilitate MDS descrambling, sometimes over great distances. Therefore, the preponderance of 2 bp pointers in the other Euplotes species could indicate limited capacity to support scrambled genes, and a partial genome survey of E. vannus concluded that at least 97% of loci are nonscrambled (Chen et al., 2019). Early studies of Euplotes octocarinatus, on the other hand, demonstrated its use of longer pointers (that usually contain TA) (Tan et al., 1999;Wang et al., 2005), suggesting that some members of the Euplotes genus may have the capacity to support complex genome reorganization. To investigate the origin of scrambled genomes, we choose E. woodruffi as an outgroup, because it is closely related to E. octocarinatus (Syberg-Olsen et al., 2016) and feasible to culture in the lab.
This study includes the de novo assemblies of the micronuclear genome of Tetmemena sp. and both genomes of E. woodruffi. The availability of MIC and MAC genomes for both species allows us to annotate and compare their genome rearrangement maps and other key features to each other and to O. trifallax. The MIC genome of Tetmemena is extremely interrupted, like Oxytricha. While the E. woodruffi MIC genome is much more IES-sparse, it contains thousands of scrambled genes, whose architecture we compare to orthologous loci in the other species. We infer that the evolutionary origin of scrambled genes is associated with local duplications, providing strong support for a previously proposed simple evolutionary model requiring only duplication and decay (Gao et al., 2015) that allows for the evolutionary expansion of extremely rearranged chromosome architectures.

Germline genome expansion via repetitive elements
Tetmemena sp. and E. woodruffi were both propagated in laboratory culture from single cells. The E. woodruffi MAC genome was sequenced and assembled from paired-end Illumina reads from whole cell DNA, which is mostly MAC-derived. For comparative analysis, the MAC genome of E. woodruffi was assembled using the same pipeline previously used for Tetmemena sp. (Chen et al., 2015). Because MIC DNA is significantly more sparse than MAC DNA in individual cells (Prescott, 1994), MIC DNA was enriched before sequencing (see Methods); however, this leads to much lower sequence coverage of the MIC than the MAC. Third-generation long reads (Pacific Biosciences and Oxford Nanopore Technologies) were combined with Illumina paired-end reads (Methods, see genome coverage in Supplementary file 1) to construct hybrid genome assemblies for Tetmemena sp. and E. woodruffi. Though the final genome assemblies are still fragmented, often due to transposon or other repetitive insertions at boundaries (Figure 2-figure supplement 1), the current draft assemblies cover most (>90%) MDSs for 89.1% of MAC nanochromosomes in Tetmemena, and for 90.0% of MAC nanochromosomes in E. woodruffi. This allowed us to establish near-complete rearrangement maps for the newly assembled genomes of Tetmemena and E. woodruffi, at a level comparable to the published reference for O. trifallax (Chen et al., 2014), which is appropriate for comparative analysis. Table 1 shows a comparison of genome features for the three species. The three MAC genomes are similar in size, with most nanochromosomes bearing only one gene. The size distributions of MAC chromosomes are similar for the three species, though slightly shorter for E. woodruffi, consistent with prior observation via gel electrophoresis (Prescott, 1994, Figure 2-figure supplement 2). Like O. trifallax (Swart et al., 2013), the maximum number of genes encoded on one chromosome is 7-8 (Table 1). Surprisingly, the MIC genome sizes differ substantially: the Tetmemena MIC genome assembly is 237 Mbp, nearly half that of Oxytricha. The E. woodruffi MIC genome assembly is even smaller, approximately 172 Mbp ( Table 1).
The expansion of repetitive elements in the Oxytricha lineage may contribute to the difference in MIC genome sizes (Figure 2A-C). Oxytricha has a variety of tranposable elements (TEs) in the MIC, with telomere-bearing elements (TBEs) of the Tc1/mariner family the most abundant (Chen et al., 2014;Chen and Landweber, 2016, Supplementary file 2). A complete TBE transposon contains three open reading frames (ORFs). ORF1 encodes a 42kD transposase with a DDE-catalytic motif. Though present only in the germline, TBEs are so abundant in hypotrichs that some were partially recovered and assembled from whole cell DNA . The Oxytricha MIC genome contains ~10,000 complete TBEs and ~24,000 partial TBEs, which occupy approximately 15.20% (75 Mbp) of the genome (Figure 2A, Supplementary file 3; Chen et al., 2014;Chen and Landweber, 2016). Tetmemena, on the other hand, has many fewer TBE ORFs and only 48 complete TBEs (Supplementary file 3), comprising 1.83% (4.3 Mbp) of its MIC genome ( Figure 2B). E. crassus has also been reported to have an abundant transposon family called Tec elements (Transposon of Euplotes crassus). Like TBEs, each Tec consists of three ORFs, and ORF1 also encodes a transposase from the Tc1/mariner family (Baird et al., 1989;Krikau and Jahn, 1991;Jahn et al., 1993;Jahn et al., 1989;Klobutcher and Herrick, 1997). The ~57 kD ORF2 encodes a tyrosine-type recombinase (Doak et al., 2003), and the 20kD ORF3 has unknown function (Jahn et al., 1993). Using the three ORFs of Tec1 and Tec2 as queries for search, we identified 74 complete Tec elements in E. woodruffi. Collectively, Tec ORFs occupy 3.6 Mbp, corresponding to only 2.1% of the MIC genome ( Figure 2C). Notably, the transposase-encoding ORF1 is more abundant than the other two TBE/Tec ORFs in all three ciliates (Supplementary file 3), consistent with its proposed role in DNA cleavage during genome rearrangement in Oxytricha (Nowacki et al., 2009). Oxytricha contains three families of TBEs. TBE3 appears to be the most ancient among hypotrichs, based on previous analysis of limited MIC genome data . We constructed phylogenetic trees using randomly subsampled TBE sequences for all three ORFs from Oxytricha and Tetmemena ( Figure 2D-F). This confirmed that only TBE3 is present in the Tetmemena MIC genome, as proposed in Chen and Landweber, 2016. This also suggests that TBE1 and TBE2 expanded in Oxytricha after its divergence from other hypotrichous ciliates. As illustrated in Figure 2-figure supplement 1, the MIC genome contexts of TBEs in Oxytricha and Tetmemena are similar, with many TE insertions within IESs, consistent with either IESs as hotspots for TE insertion or with the model (Klobutcher and Herrick, 1997) that some TE insertions may have generated IESs, as demonstrated in Paramecium (Sellis et al., 2021;Feng and Landweber, 2021). Subsequent sequence evolution at the edges of IES/MDS pointers (DuBois and Prescott, 1995) can give rise to boundaries that no longer correspond precisely to TBE ends. For further discussion of the conservation of TBE locations, see the section, 'Oxytricha and Tetmemena share conserved rearrangement junctions' below.
Additionally, Repeatmodeler/Repeatmasker identified that Oxytricha has more MIC repeats in the 'Other' category than Tetmemena or E. woodruffi (Figure 2, subcategories of repeat content in For each ORF, 30 protein sequences from each species were randomly subsampled and maximum likelihood trees constructed using PhyML (Guindon et al., 2010).
The online version of this article includes the following figure supplement(s) for figure 2:

The E. woodruffi genome has fewer IESs
We used the genome rearrangement annotation tool, Scrambled DNA Rearrangement Annotation Protocol (SDRAP, Braun et al., 2022) to annotate the MIC genomes of Oxytricha, Tetmemena, and E. woodruffi (Methods). Consistent with their close genetic distance, the genomes of O. trifallax and Tetmemena have similarly high levels of discontinuity ( Figure 3A). We annotated over 215,299 MDSs in Oxytricha and over 215,624 in Tetmemena with similar MDS length distributions ( Figure 3A). By contrast, E. woodruffi MDSs are typically longer, which indicates a less interrupted genome ( Figure 3A). We compared the number of MDSs between single-copy orthologs for single-gene MAC  There is a strong positive correlation between number of MDSs for orthologous genes in Oxytricha and Tetmemena (R 2 =0.75, Figure 3B). There is no correlation among number of MDSs between orthologs of E. woodruffi and Oxytricha (R 2 =0.003, Figure 3C), since E. woodruffi orthologs typically contain fewer MDSs.
The E. woodruffi genome is generally much less interrupted than that of Oxytricha or Tetmemena. 39.9% of MAC nanochromosomes in E. woodruffi lack IESs (IES-less nanochromosomes) compared to only 4.1 and 4.4% in Oxytricha and Tetmemena, respectively. The sparse IES distribution (as measured by plotting pointer distributions) in E. woodruffi displays a curious 5' end bias on single-gene MAC chromosomes, oriented in gene direction ( Figure 3E). A weak 5' bias is also present in Oxytricha ( Figure 3D) and Tetmemena (Figure 3-figure supplement 1C). In addition, E. woodruffi IESs preferentially accumulate in the 5' UTR, a short distance upstream of start codons ( Figure 3F). Notably, the median distance between the 5' telomere addition site and the start codon in E. woodruffi is just 54 bp for singlegene chromosomes, approximately half that of Oxytricha (Swart et al., 2013).

E. woodruffi has an intermediate level of genome scrambling
Scrambled genome rearrangements exist in all three species, which we report here for the first time in Tetmemena and the early diverged E. woodruffi. Previous studies have described scrambled genes with confirmed MIC-MAC rearrangement maps for a limited species of hypotrichs (Chen et al., 2014;Chen et al., 2015;Hogan et al., 2001;Chang et al., 2005;Wong and Landweber, 2006;DuBois and Prescott, 1995) and Chilodonella (Katz and Kovner, 2010;Gao et al., 2014) but not in Euplotes. Consistent with the phylogenetic placement of Euplotes as an earlier diverged outgroup to hypotrichs (Lynn, 2008;Gao et al., 2016), the E. woodruffi genome is scrambled, but it contains approximately half as many scrambled genes (2429 genes encoded on 1913 chromosomes, or 7.3% of genes), versus 15.6% scrambled in O. trifallax (3613 genes encoded on 2852 chromosomes) and 13.6% in Tetmemena (3371 genes encoded on 2556 chromosomes). The E. woodruffi lineage may therefore reflect an evolutionary intermediate stage between ancestral genomes with only modest levels of genome scrambling and the more massively scrambled genomes of hypotrichs.   We infer that many genes were likely scrambled in the last common ancestor of Oxytricha and Tetmemena, because these two species share approximately half of their scrambled genes (Supplementary file 4). Furthermore, most scrambled genes are not new genes, since they possess at least one ortholog in other ciliate species (Supplementary file 4, Supplementary file 5).

Scrambled genes are associated with local paralogy
Notably, scrambled genes in all three species generally have more paralogs (Figure 4). We identified orthogroups containing genes derived from the same gene in the last common ancestor of the three species (Methods). For each species, orthogroups with at least one scrambled gene are significantly larger than those containing no scrambled genes (p-value <1e−5, Mann-Whitney U test, Figure 4A-C). This association suggests a possible role of gene duplication in the origin of scrambled genes.
Scrambled pointers are generally longer than nonscrambled ones in all three species (Figure 3figure supplement 2), consistent with prior observations (Chen et al., 2014) and the possibility that longer pointers participate in more complex rearrangements, including recombination between MDSs separated by greater distances (Landweber et al., 2000). Scrambled and nonscrambled IESs also differ in their length distribution (Figure 3-figure supplement 2). Curiously, scrambled 'pointers' in E. woodruffi can be as long as several hundred base pairs (median 48 bp, average 212 bp) unlike the more typical 2-20 bp canonical pointers. These long 'pointers' in E. woodruffi are more likely partial MDS duplications (Figure 4-figure supplement 1A). We also identified MDSs that map to two or more paralogous regions within the same MIC contig (Supplementary file 6), therefore representing MDS duplications and not alleles. Such paralogous regions could be alternatively incorporated into the rearranged MAC product. Moreover, we find that, for all three species, there are significantly more scrambled chromosomes than nonscrambled MAC chromosomes that contain at least one paralogous MDS (chi-square test, p-value <1e−10; Supplementary file 6). An example is shown in Figure 4figure supplement 1A (MDS 7 and 7').
The presence of paralogous MDSs can contribute to the origin of scrambled rearrangements, as proposed in an elegant model by Gao et al., 2015; illustrated in Figure 4-figure supplement 1B. The model proposes that initial MDS duplications permit alternative use of either MDS copy into the mature MAC chromosome. As mutations accumulate in redundant paralogs, cells that incorporate the least decayed MDS regions into the MAC gene would have both a fitness advantage and a better match to the template RNA (Nowacki et al., 2008) that guides rearrangement, thus increasing the likelihood of incorporation into the MAC chromosome. The paralogous regions containing more mutations would gradually decay into IESs, and scrambled pointers eventually be reduced to a shorter length. The extended length 'pointers' that we identified in E. woodruffi may reflect an intermediate stage in the origin of scrambled genes ( Figure 4-figure supplement 1B).
This model may generally explain the abundance and expansion of 'odd-even' patterns in ciliate scrambled genes (Landweber et al., 2000;Burns et al., 2016). As illustrated in Figure 4-figure supplement 1A, the even-and odd-numbered MDSs for many scrambled genes derive from different MIC genome clusters. The model predicts that the IES between MDS n−1 and n+1 often derives from ancestral duplication of a region containing MDS n (Figure 4-figure supplement 2A). To test this hypothesis explicitly, we extracted from all odd-even scrambled loci in the three species all sets of corresponding MDS/IES pairs that are flanked by identical pointers on both sides, i.e., all pairs of scrambled MDSs and IESs, where the IES between MDS n−1 and n+1 is directly exchanged for MDS n during DNA rearrangement (S1 and S2 in Figure 4-figure supplement 2A). To exclude the possibility of alleles confounding this analysis, MDS and IES pairs were only considered if they map to the same MIC contig. In E. woodruffi, the lengths of these MDS/IES pairs strongly correlate (Spearman correlation ρ=0.755, p<1e−5, Figure 4-figure supplement 2B). Moreover, many MDS and IES sequence pairs also share sequence similarity, consistent with paralogy: for 248 MDS-IES pairs of similar length, 90.3% share a core sequence with ~97.5% identity across 8-100% of both the IES and MDS length. The lowest end of these observations is also compatible with an alternative model (Chang et al., 2005) in which direct recombination between IESs and MDSs at short repeats can lead to expansion of odd-even patterns. For Oxytricha and Tetmemena, the MDS and IES lengths for such MDS/IES pairs also display a weakly-positive correlation (p-values and Spearman correlation ρ shown in Figure 4-figure supplement 2D-E). Remarkably, the odd-even-containing loci that are species-specific, and therefore became scrambled more recently, have the strongest length correlation ( Figure 4-figure supplement 2C-E) and more pairs that display sequence similarity (Supplementary file 7) relative to older loci (scrambled in two or more species). This result is consistent with an evolutionary process in which mutations accumulate in one copy of the MDS, gradually obscuring its sequence homology and ability to be incorporated as a functional MDS, and eventually its ability to be recognized by the template RNAs that guide DNA rearrangement. This analysis also suggests that most of the odd-even scrambled loci in E. woodruffi arose recently, because there is greater sequence similarity between MDSs and the corresponding IESs that they replace. Conversely, we infer that most loci that are scrambled in both Oxytricha and Tetmemena became scrambled earlier in evolution, since they display weaker sequence similarity between exchanged MDS and IES regions.
Scrambled and nonscrambled genes display nearly identical expression support (the presence of at least one read in all three replicates) in both Oxytricha (Supplementary file 8) and Tetmemena. E. woodruffi has slightly more expression support for nonscrambled vs. scrambled genes (Figure 4figure supplement 3), which could be explained by more recent acquisition of thousands of scrambled loci in E. woodruffi. In some of those cases the nonscrambled paralogs may still contribute the major function. The distribution of expression levels is similar for scrambled vs. nonscrambled genes in all three species, supporting their authenticity (Figure 4-figure supplement 3), although in a Mann-Whitney U test, the average expression level of three replicates is significantly higher in nonscrambled genes for Oxytricha and E. woodruffi, but not significant for Tetmemena.

Oxytricha and Tetmemena share conserved DNA rearrangement junctions
To understand the conservation of genome rearrangement patterns, we developed a pipeline guided by protein sequence alignment to compare pointer positions for orthologous genes between any two species (Methods, Figure 5A). We compared pointers for 2503 three-species single-copy orthologs. 4448 pointer locations are conserved between Oxytricha and Tetmemena on 1345 ortholog pairs (Supplementary file 9), representing 38.3% of pointers in these orthologs in Oxytricha and 30.9% in Tetmemena. For Oxytricha/E. woodruffi and Tetmemena/E. woodruffi comparisons, 56 and 58 pointer pairs are conserved, respectively. We also identified 23 pointer locations shared among all three species (Supplementary file 9, Figure 5B, Figure 5-source data 1).
To test if these pointer locations are genuinely conserved versus coincidental matching by chance, we performed a Monte Carlo simulation, as also used to study intron conservation (Rogozin et al., 2003). We randomly shuffled pointer positions on CDS regions 1000 times and counted the number of conserved pointer pairs expected for each simulation (Methods). Of the 1000 simulations, none exceeded the observed number of conserved pointer pairs between Oxytricha and Tetmemena (p-value <0.001), suggesting evolutionary conservation of pointer positions (Supplementary file 9). A similar result was obtained for pointers conserved in all three species (Supplementary file 9). However, the numbers of pointer pairs conserved between Oxytricha/E. woodruffi and Tetmemena/E. woodruffi is similar to the expectations by chance (Supplementary file 9). The low level of pointer conservation of either hypotrichs with E. woodruffi may reflect the smaller number of IESs in E. woodruffi; hence, most pointers would have arisen in the hypotrich lineage. Furthermore, E. woodruffi is genetically more distant from the two hypotrichs; hence, the accumulation of substitutions would obscure protein sequence homology, which we used to compare pointer locations. For ortholog pairs between Oxytricha and Tetmemena, scrambled pointers are significantly more conserved than nonscrambled ones (chi-square test, p-value <1e−10, Supplementary file 10). We also find that most pointer sequences differ even if the positions are conserved ( Figure 5B, Figure 5-source data 1, Supplementary file 11), suggesting that substitutions may accumulate in pointers without substantially altering rearrangement boundaries.
Oxytricha and Tetmemena both contain a high copy number of TBE transposons (Chen et al., 2014;Chen and Landweber, 2016;Supplementary file 3). We investigated the level of TBE conservation between these two species. To identify orthologous insertions, we focus on TBE insertions in nonscrambled IESs on single-copy orthologs, which include 1706 Oxytricha TBEs inserted in 1296 nonscrambled IESs (multiple TBEs can be inserted into an IES) and 180 Tetmemena TBEs inserted into 170 nonscrambled IESs. We refer to the pointer flanking a TBE-containing IES as a TBE pointer. No TBE pointer locations are conserved between two species. This suggests that TBEs might invade the  Figure 5C). No Tetmemena TBE pointer is conserved with an Oxytricha non-TBE pointer. This suggests that TBE insertions may preferentially produce new rearrangement junctions instead of inserting into an existing IES.

Intron locations sometimes coincide with DNA rearrangement junctions
Ciliate genomes are generally intron-poor. Oxytricha averages 1.7 introns/gene, Tetmemena has 1.1, and E. woodruffi has 2.2. Among three-species orthologs, intron locations sometimes map near pointer positions (within a 20-bp window, Figure 5B, Figure 5-figure supplement 1). IESs and introns are both noncoding regions that are removed from mature transcripts, though at different stages. A previous single-gene study observed that an IES in Paraurostyla overlaps the position of an intron in Uroleptus, Urostyla, and also the human homolog (Chang et al., 2005). This observation suggested an intron-IES conversion model in which the ability to eliminate non-CDS regions as either DNA or RNA provides a potential backup mechanism. Such interconversion has also been observed between two strains of Stylonychia (Möllenbeck et al., 2006). In the present study, we identified 174 potential cases of intron-IES conversion in the three species ( The observation that E. woodruffi has the most introns but the smallest number of IESs per gene (Figure 3) is consistent with removal of intragenic non-CDS regions as either DNA or RNA. The intronsparseness of ciliates is compatible with a hypothesis that it is advantageous to eliminate noncoding regions earlier at the DNA level, with intron deletion sometimes providing an opportunity for repair if they fail to be excised as IESs (Chang et al., 2005).

Evolution of complex genome rearrangements: Russian doll genes
Genome rearrangements in the Oxytricha lineage can include overlapping and nested loci, with MDSs for different MAC loci embedded in each other (Chen et al., 2014;Braun et al., 2018). When multiple gene loci are nested in each other, these have been called Russian doll loci (Braun et al., 2018). Oxytricha contains two loci with five or more layers of nested genes (Braun et al., 2018). Oxytricha and Tetmemena display a high degree of synteny and conservation in both Russian doll loci. In the first Russian doll gene cluster, one nested gene (green) is present in Oxytricha but absent in Tetmemena ( Figure 6A, Figure 6-figure supplement 1, Figure 6-figure supplement 2), confirmed by PCR (Methods). Oxytricha also has a complete TBE3 insertion in the green gene ( Figure 6A, Figure 6 Figure 6A, Figure 6-figure supplement 1). In Oxytricha, seven orange MDSs ligate across two other loci via an 18-bp pointer (TATA TCTA TACT AAAC TT) to form a two-gene nanochromosome. However, in Tetmemena, telomeres are added to the ends of both gene loci instead, forming two independent MAC chromosomes ( Figure 6A, Figure 6-figure supplement 1). The second    Figure 6A).
Russian doll locus has an example of a long, conserved pointer (orange dotted line) that bridges three other loci (the green and blue scrambled loci and one nonscrambled locus, Figure 6B). Close to this region is a decayed TBE insertion (769 bp) in Oxytricha. None of the E. woodruffi orthologs of both Russian doll loci maps to the same MIC contig, which suggests that the Russian doll clusters arose after the divergence of Euplotes from the common ancestor of Oxytricha and Tetmemena.

Discussion
The highly diverse ciliate clade provides a valuable resource for evolutionary studies of genome rearrangement. However, full assembly and annotation of germline MIC genomes have concentrated on the model ciliates Tetrahymena, Paramecium, and Oxytricha. To provide insight into genome evolution in this lineage, we assembled and compared germline and somatic genomes of Tetmemena sp. and an outgroup, E. woodruffi, to that of O. trifallax. This expands our knowledge of the diversity of ciliate genome structures and the evolutionary origin of complex genome rearrangements.
Dramatic variation in transposon copy number (TBE and Tec elements) from the Tc1/mariner family appears to explain most of the variation in MIC genome size. In many eukaryotic taxa, genome size can differ dramatically even for closely related species, a phenomenon known as the 'C-value paradox' (Thomas, 1971). Our present observations are compatible with previous reports that the repeat content of the genome, especially transposon content, positively correlates with genome size (Elliott and Gregory, 2015).
Oxytricha has three TBE families in the MIC genome, but only TBE3 is present in Tetmemena, consistent with our previous conclusion that TBE3 is ancestral to the base of the transposon lineage in hypotrichous ciliates . Tens of thousands of TBE1/2 transposons then expanded specifically in Oxytricha. Despite a high copy number of TBEs in both Oxytricha and Tetmemena, we find no identical TBE locations in nonscrambled IESs, even among syntenic Russian doll regions. These observations suggest that TBEs may be active in these genomes and contribute to the evolution of genome structure.
In the relatively IES-poor genome of E. woodruffi, IESs accumulate upstream of start codons, similar to the 5' bias of introns in intron-poor organisms (Mourier and Jeffares, 2003). The simplest model to explain 5' intron bias is homologous recombination between a reverse transcript of an intron-lacking mRNA and the original DNA locus to erase introns in the coding region (Mourier and Jeffares, 2003). A similar mechanism could simultaneously erase IESs in coding regions via germline recombination between the MIC chromosome and a reverse transcript; however, they are usually in different subcellular locations. More plausibly, a source for DNA recombination could be a MAC nanochromosome, since they are already abundant at high copy number, but another source could be by capture of a reverse transcript of a long non-coding template RNA that guides DNA rearrangement (Nowacki et al., 2008;Lindblad et al., 2017). Either recombination event in the germline would lead to loss of IESs, while retaining introns, but neither would necessarily provide a bias for IES-loss in coding regions. Any of these infrequent events would be meaningful on an evolutionary time scale, even if developmentally rare. The 5' bias of IESs could also reflect an evolutionary bias for continuous coding regions. Alternatively, upstream IESs might regulate gene expression or cell growth (Sellis et al., 2021), like some introns (Parenteau et al., 2019;Morgan et al., 2019).
This study investigated the evolution of scrambled genes by comparing Oxytricha and Tetmemena to E. woodruffi, as an earlier diverged representative of the spirotrich lineage. While E. woodruffi has approximately half as many scrambled genes as Tetmemena and Oxytricha, its genes are also much more continuous. For example, the most scrambled gene in E. woodruffi, encoding a DNA replication licensing factor (EUPWOO_MAC_28518, 3 kb), has only 20 scrambled junctions. The most scrambled gene in Tetmemena (LASU02015934.1, 14.7 kb, encoding a hydrocephalus-inducing-like protein) has 204 scrambled pointers, and the most scrambled gene in Oxytricha (Contig17454.0, 13.7 kb, encoding a dynein heavy chain family protein, Chen et al., 2014) is similarly complex, with 195 scrambled junctions. Together, these observations are consistent with our interpretation that E. woodruffi reflects an evolutionary intermediate stage, as it contains both fewer scrambled loci and fewer scrambled junctions within its scrambled loci. The observation that the most scrambled locus differs in each species is also consistent with the conclusion that complex gene architectures may continue to elaborate independently.
We observed that scrambled genes in each species tend to have more paralogs than nonscrambled genes. Similarly, in C. uncinata (Maurer-Alcalá et al., 2018a), a distantly related ciliate in the class Phyllopharyngea that also has scrambled genes, scrambled gene families (orthogroups) contain more genes (~2.9) than nonscrambled gene families (~1.3) (Maurer-Alcalá et al., 2018a). Apart from duplications at the gene level, E. woodruffi often contains partial MDS duplications at scrambled junctions, annotated as unusually long 'pointers' (Figure 4-figure supplement 1). We also demonstrate that odd-even scrambled patterns could readily arise from local duplications (Figure 4-figure  supplement 2). These observations are most consistent with a simple model (Gao et al., 2015) in which local duplications permit combinatorial DNA recombination between paralogous germline regions, and mutation accumulation in either paralogs establishes an odd-even scrambled pattern that can propagate by weaving together segments from paralogous sources. Other proposed models include Hoffman and Prescott, 1997 IES-invasion model that suggested that pairs of IESs could invade an MDS, and then subsequently recombine with another IES to yield odd-even scrambled regions; however, a previous examination did not find support for this model (Chang et al., 2005). Prescott et al., 1998 also proposed that some odd-even scrambled loci could arise suddenly via reciprocal recombination with loops of A/T-rich DNA, but this does not exploit paralogy, only the high A/T content in the MIC. We previously proposed a gradual model (Chang et al., 2005;Landweber et al., 2000) in which MDS/IES recombination at short AT-rich repeats (precursors to pointers) could generate and propagate odd-even scrambled patterns. While limited comparisons of orthologs favored the stepwise recombination models (Hogan et al., 2001;Chang et al., 2005;Wong and Landweber, 2006;DuBois and Prescott, 1995), none of the earlier models accounted for the widespread existence of partial paralogy, revealed by genome assemblies.
Local duplications provide a buffer against mutations, allowing paralogous MDSs to repair the MAC locus during assembly of odd/even scrambled genes. Therefore, once an odd/even scrambled locus is established, a consequence is that evolution can only proceed in the direction of accumulating more scrambled junctions, as each new mutation in one paralog necessitates repair via incorporation of the other paralog (Figure 4-figure supplement 1B). This shortens the length of the respective MDSs and increases the number of recombination junctions, creating an evolutionary ratchet that drives the increase in scrambling. The lack of the presence of an error-free, continuous version of this locus in the germline reduces the possibility of losing the scrambled pattern from the MIC genome, relative to the trend toward decreasing MDS lengths as more mutations accumulate in either paralogs, with a resulting increase in the levels of scrambling and fragmentation (Landweber, 2007;Speijer, 2008). The only opportunity to repair a scrambled locus in the MIC would be a rare event that replaces the locus via recombination with a continuous version from the parental MAC, with the source being either parental MAC DNA or a reverse transcript of a template RNA (Nowacki et al., 2008;Lindblad et al., 2017), as discussed above.
Recent exciting reports have also described scrambled genomes in metazoa, including cephalopods Albertin et al., 2022), but those events entail primarily evolutionary shuffling of gene order, without accompanying genome editing or repair. The ciliate lineage is remarkable in having evolved a sophisticated mechanism of RNA-guided genome editing that allows accurate and precise DNA repair of translocations and inversions. The future opportunity to harness this system to develop novel tools for genome editing outside of Oxytricha offers exciting directions.

Methods
DNA collection and sequencing of Tetmemena sp. Tetmemena sp. (strain SeJ-2015;Chen et al., 2015) was isolated as a single cell from a stock culture and propagated as a clonal strain via vegetative (asexual) cell culture. Cells were cultured in Pringsheim media (0.11 mM Na 2 HPO 4 , 0.08 mM MgSO 4 , 0.85 mM Ca(NO 3 ) 2 , 0.35 mM KCl, pH 7.0) and fed with Chlamydomonas reinhardtii, together with 0.1%(v/v) of an overnight culture of non-virulent Klebsiella pneumoniae. Macronuclei and micronuclei were isolated using sucrose gradient centrifugation (Lauth et al., 1976). Genomic DNA was subsequently purified using the Nucleospin Tissue Kit (Takara Bio USA, Inc). Macronuclear DNA was sequenced and assembled in Chen et al., 2015. Micronuclear DNA was further size-selected via BluePippin (Sage Science) for PacBio sequencing, or via 0.6% (w/v) SeaKem Gold agarose electrophoresis (Lonza) for Illumina sequencing. Micronuclear DNA purification and sequencing protocols are described in Chen et al., 2014. DNA collection and sequencing for E. woodruffi E. woodruffi (strain Iz01) was cultured in Volvic water at room temperature and fed with green algae every 2-3 days. We fed cells with C. reinhardtii for MAC DNA collection, and switched to Chlorogonium capillatum for MIC DNA collection. In order to remove algal contamination, cells were starved for at least 2-3 days before collection. Cells were washed and concentrated as in Chen et al., 2014. Because MAC DNA is predominant in whole cell DNA, we used whole cell DNA (purified via NucleoSpin Tissue kit, Takara Bio USA, Inc) for MAC genome sequencing. Paired-end sequencing was performed on an Illumina Hiseq2000 at the Princeton University Genomics Core Facility.
MIC DNA was enriched from whole cell DNA and sequenced via three sequencing platforms (Illumina, Pacific Biosciences, and Oxford Nanopore Technologies). We used conventional and pulse-field gel electrophoresis (PFGE) to enrich MIC DNA: 1. High-molecular-weight DNA was separated from whole cell DNA by gel-electrophoresis (0.25% agarose gel at 4°C, 120 V for 4 hr). The top band was cut from the gel and purified with the QIAGEN QIAquick kit. The purified high-molecular-weight DNA was directly sent to the group of Dr. Robert Sebra at the Icahn School of Medicine at Mount Sinai for library construction and sequencing. BluePippin (Sage Science) separation was used before sequencing to select DNA >10 kb. DNA was sequenced on two platforms: Illumina HiSeq2500 (150 bp paired-end reads) and PacBio Sequel (SMRT reads). 2. High-molecular-weight DNA was also enriched by PFGE. E. woodruffi cells were mixed with 1% low-melt agarose to form plugs according to Akematsu et al., 2017, with addition of 1 hr incubation with 50 μg/ml RNase (Invitrogen AM2288) in 10 mM Tris-HCl (pH7.5) at 37°C for RNA depletion. After three washes of 1 hr with 1× TE buffer, the DNA plugs were incubated in 1 mM phenylmethylsulfonyl fluoride (PMSF) to inactivate proteinase K, followed by MspJI (New England Biolabs) digestion at m CNNR(9/13) sites to remove contaminant DNA ( m C indicates C5-methylation or C5-hydroxymethylation). Previous reports have shown that no methylcytosine is detectable in vegetative cells of Oxytricha (Bracht et al., 2012), Tetrahymena (Gorovsky et al., 1973), and Paramecium (Cummings et al., 1974), suggesting that C5-methylation and C5-hydroxymethylation are rarely involved in the vegetative growth of the ciliate lineage. We also validated by qPCR that the quantity of two randomly selected MIC loci is not changed after the MspJI digestion. On the contrary, algal genomic DNA is significantly digested by MspJI. Based on these results, we conclude that MspJI digestion can be used to remove bacterial and algal DNA with C5-methylation and C5-hydroxymethylation, leaving E. woodruffi MIC DNA intact. The agarose plugs containing digested DNA were then inserted into wells of 1.0% Certified Megabase agarose gel (Bio-Rad) for PFGE (CHEF-DR II System, Bio-Rad). The DNA was separated at 6 V, 14°C with 0.5× TBE buffer at a 120° angle for 24 hr with switch time of 60-120 s. We validated by qPCR that the E. woodruffi MIC chromosomes were not mobilized from the well, while the MAC DNA migrated into the gel. The MIC DNA was then extracted by phenol-chloroform purification. Library preparation and sequencing were performed at Oxford Nanopore Technologies (New York, NY).

RNA sequencing of E. woodruffi and Tetmemena sp.
Three biological replicates of total RNA was isolated from asexually growing E. woodruffi and Tetmemena sp. cells using TRIzol reagent (Thermo Fisher Scientific) and enriched for the poly(A)+fraction using the NEBNext Poly(A) mRNA Magnetic Isolation Module (New England Biolabs). Stranded RNAseq libraries were constructed using the ScriptSeq v2 RNA-seq library preparation kit (Epicentre) and sequenced on an Illumina Nextseq500 at the Columbia Genome Center. For E. woodruffi, the transcriptome was assembled by Trinity (Grabherr et al., 2011), and transcript alignments to the MAC genome were generated by PASA (Haas et al., 2003).

Gene prediction of the E. woodruffi MAC genome and validation of MAC genome completeness
We followed the gene prediction pipeline developed by the Broad institute (https://github.com/ PASApipeline/PASApipeline/wiki); using EVidenceModeler (EVM, Haas et al., 2008) to generate the final gene predictions. EVM produced gene structures by weighted combination of evidence from three resources: ab initio prediction, protein alignments, and transcript alignments (the weight was 3, 3, and 10 respectively). Ab initio prediction was generated by BRAKER2 pipeline (Brůna et al., 2021). Protein alignments for EVM were generated by mapping Oxytricha proteins to the E. woodruffi MAC genome by Exonerate (Slater and Birney, 2005). EVM predicted 33,379 genes on MAC chromosomes with at least one telomere.
We assessed MAC genome completeness using three methods: (1) 28,294 (80.6%) of the 35,099 E. woodruffi MAC contigs have at least one telomere. (2) In the E. woodruffi genes predicted on telomeric contigs, 88.8% of BUSCO (Simão et al., 2015;Manni et al., 2021) genes in the lineage database alveolata_odb10 were identified as complete. Within the 171 BUSCO genes, 135 are complete and single-copy, 17 are complete and duplicated, 7 are fragmented, and 12 are missing. This represents the best Euplotes MAC genome assembly available. (3) We identified 51 tRNA genes encoding all 20 amino acids by tRNAscan-SE (Lowe and Eddy, 1997) in the MAC genome, including two suppressor tRNAs of UAA and UAG.

MIC genome assembly of Tetmemena sp.
The MIC genome of Tetmemena was assembled with a hybrid approach to combine reads from different sequencing platforms. Tetmemena Illumina reads were first assembled by SPAdes (77,33,55,77,99,. PacBio reads were error corrected by FMLRC (Wang et al., 2018) using Illumina reads with default parameters. Corrected PacBio reads were aligned to both the MAC genome and the Illumina MIC assembly with BLASTN. Reads were removed if they start or end with telomeres or are aligned better to the MAC. The remaining reads were assembled with wtdbg2 (Ruan and Li, 2020, parameters '-x rs'). The PacBio assembly was polished by Pilon (Walker et al., 2014) with the '--diploid' option. The Illumina and PacBio assemblies were merged by quickmerge (Chakraborty et al., 2016) with the '-l 5000' option.

MIC genome assembly of E. woodruffi
The MIC genome of E. woodruffi was assembled using a similar procedure as described above for Tetmemena. E. woodruffi reads were filtered to remove bacterial contamination, including abundant high-GC-content contaminants, possibly endosymbionts (Boscaro et al., 2019). Nanopore reads with GC content ≥55% were assembled by Flye (Kolmogorov et al., 2019) with the parameter '--meta' for metagenomic assembly of bacterial contigs. We used kaiju (Menzel et al., 2016) to identify bacteria taxa for these contigs. 9 of 10 top-covered contigs derive from Proteobacteria, from which many Euplotes symbionts derive (Boscaro et al., 2019). Bacterial contamination was removed from Illumina reads if perfectly mapping to these metagenomic contigs by Bowtie2 (Langmead and Salzberg, 2012). The cleaned Illumina reads were then assembled by SPAdes with '-k 21,33,55,77,99,127' (Bankevich et al., 2012). Pacbio raw reads and Nanopore raw reads with GC content <55% were aligned to a concatenated database containing both the MAC genome and the Illumina MIC assembly with BLASTN. Reads were removed if they start or end with telomeres or align better to the MAC. Remaining PacBio/Nanopore reads were assembled by Flye with '--meta' mode. The PacBio-Nanopore assembly was polished by Pilon with the '--diploid' option. Illumina and PacBio-Nanopore assemblies were merged by quickmerge with the '-l 10000' option. Contigs shorter than 1 kb were removed.

MIC genome decontamination
The draft MIC genome of Tetmemena was first mapped to telomeric MAC contigs by BLASTN. MIC contigs containing MDSs were included in the final assembly. The rest of the MIC contigs were filtered by a decontamination pipeline: (1) contigs were aligned to the K. pneumoniae genome, C. reinhardtii genome, and the Oxytricha mitochondrial genome by BLASTN to remove contaminants; (2) the remaining contigs were then searched against the bacteria NR database and a ciliate protein database (including protein sequences annotated in Tetrahymena thermophila: http://www.ciliate.org/system/ downloads/tet-latest/4-Protein%20fasta.fasta; Paramecium tetraurelia: http://paramecium.cgm.cnrsgif.fr; and O. trifallax: https://oxy.ciliate.org) by BLASTX. Contigs with higher bit score to bacteria NR or G+C >45% were removed. The E. woodruffi MIC genome was decontaminated, similarly, with addition of all Chlorogonium sequences (the algal food source) on NCBI and the two Euplotes mitochondrial genomes (E. minuta GQ903130.1, E. crassus GQ903131.1) to filter contaminants.

Repeat identification
The repeat content in the MIC genomes was identified by RepeatModeler 1.0.10 (Smit and Hubley, 2008) and RepeatMasker 4.0.7 (Smit et al., 2013) with default parameters.

Rearrangement annotations
SDRAP (Braun et al., 2022) was used to annotate MDSs, pointers, and MIC-specific regions (minimum percent identity for preliminary match annotation = 95, minimum percent identity for additional match annotation = 90, minimum length of pointer annotation = 2). SDRAP requires MAC and MIC genomes as input. For the SDRAP annotation of Oxytricha, we used the MAC genome from Swart et al., 2013 instead of the latest hybrid assembly that incorporated PacBio reads , because the former version was primarily based on Illumina reads, similar to the MAC genomes of Tetmemena (7, Genbank GCA_001273295.2) and E. woodruffi which are also Illumina assemblies. Oxytricha and Tetmemena MAC genomes were preprocessed by removing MAC contigs with TBE ORFs, considered MIC contaminants . SDRAP is a new program that can output the rearrangement annotations with minor differences from Chen et al., 2014, but most annotations are robust (Figure 3-figure supplement 2). Scrambled and nonscrambled junctions/IESs were annotated by custom python scripts (https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes/tree/main/ scrambled_nonscrambled_IES_pointer).

MIC genome categories
Each MIC genome region is assigned to only one category in Figure 2A-C, even if it belongs to more than one category. The assignment is based on the following priority: MDS, TBE/Tec, MIC genes (only available for Oxytricha, which has developmental RNA-seq data), IES, tandem repeats, other repeats, and non-coding non-repetitive regions. For example, an MIC region can be a TBE in an IES, and it is only considered as TBE in Figure 2A-C.

Ortholog comparison pipeline and Monte Carlo simulations
Orthogroups of genes on telomeric MAC contigs were detected by OrthoFinder with '-S blast' (Emms and Kelly, 2019). Single-copy orthologs were aligned by Clustal Omega (Sievers et al., 2011). Protein alignments were reversely translated to CDS alignments by a modified script of pal2nal (Suyama et al., 2006, https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes/tree/main/Ortholog_ comparison/pal2nal.pl). Two modifications were made in the script: (1) the modified script allows pal2nal to take different genetic codes for three sequences (-codontable 6,6,10); (2) the script also fixed an error in the original pal2nal script in which codontable 10 for the Euplotid nuclear code was the same as the universal code. Visualization of pointer positions and intron locations on orthologs was implemented by a custom python script (https://github.com/yifeng-evo/Oxytricha_Tetmemena_ Euplotes/blob/main/Ortholog_comparison/visualization_of_ortholog_comparison.py). Pointer positions or intron locations are considered conserved if they are within a 20-bp alignment window on the CDS alignment. Protein domains were annotated by HMMER (Finn et al., 2011). We performed Monte Carlo simulations by randomly shuffling pointer locations on the CDS but keeping their original position distribution. This was implemented by a custom python script, which transforms the CDS to a circle, rotates pointer positions on the circle, and outputs the shuffled position on the re-linearized CDS (https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes/blob/main/Ortholog_comparison/shuffle_simulation.py). The null hypothesis of the Monte Carlo test is that pointer positions are conserved by chance. p-Value of Monte Carlo test is given by N expected>observed /N total (N expected>observed is the number of simulations when there are more conserved pointers in the simulation than the observation from real data, N total = 1000 in this study).

PCR validation of Russian doll locus
The complex Russian doll locus on MIC contig TMEMEN_MIC_21461 in Tetmemena was validated by PCR to confirm the Tetmemena MIC genome assembly. Tetmemena micronuclear DNA was purified as described previously and used as template for PCR using PrimeSTAR Max DNA polymerase (Takara Bio). 11 primer sets (Supplementary file 14) were designed to amplify products between 3 kb and 6 kb in length, with overlapping regions between consecutive primer pairs. The resulting PCR products were visualized through agarose gel electrophoresis, and bands of the expected size were extracted using a Monarch DNA Gel Extraction Kit (New England Biolabs). The purified gel bands were cloned using a TOPO XL-2 Complete PCR Cloning Kit (Invitrogen), transformed into One Shot OmniMAX 2 T1R E. coli cells (Invitrogen), and individual clones were grown and their plasmids harvested with a QIAprep Spin Miniprep Kit (QIAGEN). The plasmid ends were Sanger sequenced, as well as the region where the Oxytricha MIC assembly contains inserted MDSs (Genewiz). Sanger sequencing reads were mapped to the Tetmemena MIC contig TMEMEN_MIC_21461 and visualized using Geneious Prime 2021.1.1 (https://www.geneious.com).

Additional files
Supplementary files • Supplementary file 1. Sequencing depth statistics for germline micronucleus (MIC) genome assemblies. *Sequencing data from Chen et al., 2014. **Raw reads were mapped to the MIC genome assembly by Minimap2 and Bowtie2 (Langmead and Salzberg, 2012). Average coverage was calculated with BBmap (sourceforge.net/projects/bbmap/) pileup. sh for macronuclear destined sequence-containing contigs in the MIC genome assembly.
• Supplementary file 2. Subcategories of repeat content in the three species. Repeat content of the three genomes, as annotated by Repeatmasker (Smit et al., 2013) with additional manual annotation of Telomere-Bearing Element (TBE)/Transposon of Euplotes crassus (TEC) elements. The numbers may differ from Figure 2A-C because some repeats are assigned as other germline micronucleus (MIC) categories in the pie charts (Methods). For example, a MIC region which is both an internally eliminated sequence (IES) and satellite, is assigned as IES in Figure 2A-C, but is counted as a satellite in this table.
• Supplementary file 3. Telomere-bearing elements (TBE)/transposon of Euplotes crassus (TEC) elements open reading frames in three species. * Differs from 10,109 in Chen et al.  because we used different versions of BLAST and custom python scripts to identify complete TBEs (see Methods).
• Supplementary file 4. Orthology among scrambled and nonscrambled genes in the three species. * Ciliate database is generated by extracting all protein sequences in phylum Ciliophora (taxid: 5878) from NR database.
• Supplementary file 5. Summary of orthologs in each pair of species. The (i,j) cell shows the number of genes in species i with an ortholog in species j. * Genes with no ortholog detected by OrthoFinder (Emms and Kelly, 2019) in the other two species.
• Supplementary file 6. More scrambled somatic macronucleus (MAC) contigs contain at least one paralogous macronuclear destined sequence that may be involved in alternative rearrangement.
• Supplementary file 8. Genes with expression support in the three species.
• Supplementary file 9. Presence of conserved pointers in three species, with Monte Carlo simulations.
The following datasets were generated: