Russian Doll Genes and Complex Chromosome Rearrangements in Oxytricha trifallax

Ciliates have two different types of nuclei per cell, with one acting as a somatic, transcriptionally active nucleus (macronucleus; abbr. MAC) and another serving as a germline nucleus (micronucleus; abbr. MIC). Furthermore, Oxytricha trifallax undergoes extensive genome rearrangements during sexual conjugation and post-zygotic development of daughter cells. These rearrangements are necessary because the precursor MIC loci are often both fragmented and scrambled, with respect to the corresponding MAC loci. Such genome architectures are remarkably tolerant of encrypted MIC loci, because RNA-guided processes during MAC development reorganize the gene fragments in the correct order to resemble the parental MAC sequence. Here, we describe the germline organization of several nested and highly scrambled genes in Oxytricha trifallax. These include cases with multiple layers of nesting, plus highly interleaved or tangled precursor loci that appear to deviate from previously described patterns. We present mathematical methods to measure the degree of nesting between precursor MIC loci, and revisit a method for a mathematical description of scrambling. After applying these methods to the chromosome rearrangement maps of O. trifallax we describe cases of nested arrangements with up to five layers of embedded genes, as well as the most scrambled loci in O. trifallax.

only involve loss of IESs, and do not require alteration to the order or orientation, relative to the precursor locus. The simplest case is a nanochromosome that originates from a single MDS, i.e., an IES-less locus (Chen et al. 2014).
Scrambled rearrangements involve MDSs in different order and/or orientation. Further, we can establish varying levels of complexity, by classifying and ranking rearrangements that require multiple, sequential topological operations to produce the end product, as studied previously by Burns et al. (2016a).
The scrambled nature of the MIC permits cases in which MDSs for multiple nanochromosomes can interleave with each other (Chen et al. 2014). A special case arises when all MDSs for a MAC chromosome are fully contained between two MDSs (i.e., within an IES) of another nanochromosome, giving rise to nested loci. Interleaving and nesting can also have varying degrees of complexity, or layers of depth, akin to Russian dolls. We note that nested and even Russian doll genes do exist in metazoa, usually as whole genes within introns (Assis et al. 2008).
Most arise from gene duplication and insertion of young genes or transposons into long introns (Sheppard et al. 2016; Wei et al. 2013; Gao et al. 2012).
In this study, we follow up on the previous analyses of Burns et al. (2016a), which described the most commonly recurring scrambled patterns across the genome. Here, we present the most elaborate cases of genome rearrangement in O. trifallax, which highlight the extraordinary degree of topological complexity that can arise from such a highly plastic genomic architecture.

METHODS
Genome sequences
MIC and MAC sequences from Oxytricha trifallax were obtained from the <mds_ies_db> website (Burns et al. 2016b), in the form of annotated tables, and processed as described in Burns et al. (2016a), available from http://knot.math.usf.edu/data/scrambled_patterns/processed_annotation_of_oxy_tri.gff. These were parsed to produce the various types of graphical representations described below using a combination of in-house Python and SQL scripts.

Graphical representations of scrambled loci
These representations highlight different aspects of chromosomal rearrangement topologies. For the analysis of nested chromosomes we filtered the data to consider only cases with scrambled or nonscrambled MDSs for multiple MAC contigs (nanochromosomes) that derive from a single MIC region. Figure 1A shows a schematic view of the MIC (precursor) and MAC (product) versions of such a locus, and the correspondence between MDSs and pointers. In addition, we include a condensed linear sequence representation of the MDSs (Figure 1B), a self-intersecting line corresponding to the MIC region that indicates the topological orientation of the product in a single trace (Figure 1C), and a chord diagram representation of the pointer list for one of the MAC chromosomes (Figure 1D).

Nested chromosomes
We define a locus as nested if all or some of the MDSs for one nanochromosome are surrounded in the MIC by MDSs for a different nanochromosome. Given that nesting can be layered, we define an insertion depth index (IDI) that recursively counts nesting events on an IES of a given nanochromosome. The IDI of the nanochromosome is defined to be the maximum IDI across all of its IESs. For example, in Figures 1 and 2 the IDI values of the orange, blue and red contigs are 0, 1, and 2, respectively. MDSs that map to distinct MAC contigs whose terminal sequences overlap (Chen et al. 2014) were not considered in our analysis. Conversely, the embedding index (EI) represents the maximal depth of an MDS that resides between the MDSs for another MAC chromosome, counting the layers or levels surrounding it. In Figures 1 and 2, the EI value for each MDS in the red contig is 0, for the blue contig is 1 and for the orange contig is 2. The EI of a nanochromosome is defined to be the maximum EI across all of its MDSs. Note that a single MDS can be more deeply embedded than an entire gene locus.

Figure 1
(b) A condensed, linear description of the precursor MIC contig in (a). The notation "1–7 odd" represents MDSs 1, 3, 5, 7, and the black triangle indicates the presence of three nonscrambled MDSs for the orange MAC contig shown in (a). (c) A graph representing the precursor MIC contig in (a) where the vertices (intersection points) indicate the recombination junctions (pointers). The arrowhead indicates the orientation of the precursor MIC contig, reading left to right as shown in (a). The segments highlighted in 3 colors indicate the MAC contigs obtained after joining MDS segments. The uncolored segments correspond to IESs. The vertices of the blue portion correspond to the pointers 1, 2, 3, 4 that join respectively flanking MDSs. The orange portion with a loop corresponds to the middle orange MAC locus with three nonscrambled MDSs. The number 2 indicates removal of two conventional IESs. (d) A chord diagram representing a cyclic arrangement of pointers of MDSs from the blue MAC contig in (a) within its precursor MIC contig. Vertices are labeled in order of pointer appearances in the MIC contig, and chords (line segments within the circle) connect the two copies of the matching pointers.

Figure 2
A specific example of three layers of nesting. The two MIC loci for MAC contigs Contig9583.0 (blue) and Contig6683.0 (orange) are nested between Contig6331.0 (red) on the micronuclear contig ctg7180000067077. Not drawn to scale. The red locus has an IDI of 2, the blue locus 1, and the orange locus 0.
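The IDI and EI definitions above can be illustrated with a minimal Python sketch. It is not the authors' in-house script. It assumes a simplified input (a list giving, in MIC order, the MAC contig label of each MDS), counts a contig as nested in an IES only when all of its MDSs lie inside that single IES, and counts the layers around an MDS as the number of other contigs with MDSs on both sides of it. The function names and the toy locus are hypothetical.

```python
from functools import lru_cache


def spans(mic):
    """Map each contig label to the (first, last) MIC positions of its MDSs."""
    out = {}
    for pos, label in enumerate(mic):
        first, _ = out.get(label, (pos, pos))
        out[label] = (first, pos)
    return out


def insertion_depth_index(mic):
    """IDI per contig; assumes nested contigs sit wholly inside one IES."""
    sp = spans(mic)
    positions = {c: [i for i, x in enumerate(mic) if x == c] for c in sp}

    @lru_cache(maxsize=None)
    def idi(contig):
        best = 0
        pos = positions[contig]
        # Each gap between consecutive MDSs of this contig is treated as one IES.
        for left, right in zip(pos, pos[1:]):
            for other, (lo, hi) in sp.items():
                if other != contig and left < lo and hi < right:
                    best = max(best, 1 + idi(other))
        return best

    return {c: idi(c) for c in sp}


def embedding_index(mic):
    """EI per contig: maximum, over its MDSs, of the number of surrounding contigs."""
    sp = spans(mic)
    ei = {c: 0 for c in sp}
    for pos, label in enumerate(mic):
        depth = sum(1 for other, (lo, hi) in sp.items()
                    if other != label and lo < pos < hi)
        ei[label] = max(ei[label], depth)
    return ei


# Toy locus loosely modeled on Figure 1: red (R) surrounds blue (B),
# which surrounds orange (O).
toy = list("RRRRBBOOOBBRRR")
print(insertion_depth_index(toy))  # {'R': 2, 'B': 1, 'O': 0}
print(embedding_index(toy))        # {'R': 0, 'B': 1, 'O': 2}
```

On this toy input the two indices mirror each other, as in Figures 1 and 2. Interleaved loci, where only some MDSs of a contig fall inside an IES, would need a more careful treatment than this sketch provides.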

Highly scrambled loci
We previously reported that most scrambled loci in O. trifallax contain iterative combinations of specific scrambled patterns, that can account for a large majority (96%) of all scrambled genes (Burns et al. 2016a). For example, the MAC contig represented by red MDSs in Figure 1 is scrambled, with the MDS order in the MIC locus $M_1 M_3 M_5 M_7 \cdots M_6 M_4 M_2$; a cluster of odd MDSs in consecutive order, separated from the corresponding even numbered segments in reverse consecutive order. Scrambled genes with an odd-even pattern were first discovered for a single gene in Mitcham et al. (1992).
Iterating such patterns (inversions, translocations, and odd/even splittings) permits a reduction of complexity for each locus (possibly mimicking the evolutionary or developmental steps that occur in nature). Mathematically, we use double occurrence words (defined next) and chord diagrams to describe genome rearrangements.
The order of pointer sequences in a rearrangement map can be used as words, and because every pointer appears twice in the MIC locus, the list of pointers forms a double occurrence word (DOW). In the simplest case, pointers that flank a nonscrambled IES between two consecutive MDSs, such as all MDSs in the orange locus, appear in DOWs as pairs of identical, consecutive symbols. Hence the central orange locus in Figure 1, $M_1 M_2 M_3$, would be represented by the DOW 1122, where 1 represents the pointer sequence flanking the first IES between $M_1$ and $M_2$, and 2 is the pointer flanking the second IES, joining $M_2$ to $M_3$. More generally, pointer $i$ is the short repeat (microhomology) present at the end of $M_i$ and beginning of $M_{i+1}$. Because this study of more complex loci focuses on scrambled patterns, all pairs of neighboring identical pointers can be ignored, and we have eliminated them for simplification. This may also reflect the rearrangement steps during development, since Möllenbeck et al. (2008) observed simple IES elimination before MDS reordering or inversion.
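As a minimal sketch of how a pointer list can be derived from an MDS map, the Python below uses the convention stated above, that pointer i sits at the end of MDS i and the beginning of MDS i+1. The input format (a list of `(index, inverted)` pairs in MIC order) and the function names are hypothetical. Reversing the pointer order within an inverted MDS is an assumption of this sketch.

```python
def pointer_list(mds_map, n_mds):
    """Pointer list (DOW) for MDSs given in MIC order.

    mds_map: list of (index, inverted) pairs, with indices 1..n_mds.
    MDS i carries pointer i-1 at its start (if i > 1) and pointer i at its
    end (if i < n_mds); an inverted MDS is read in the opposite direction.
    """
    word = []
    for index, inverted in mds_map:
        ptrs = []
        if index > 1:
            ptrs.append(index - 1)
        if index < n_mds:
            ptrs.append(index)
        if inverted:
            ptrs.reverse()
        word.extend(ptrs)
    return word


def drop_conventional(word):
    """Remove neighboring identical pointers (conventional IESs) in one pass."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == word[i + 1]:
            i += 2  # skip both copies of a pointer flanking a nonscrambled IES
        else:
            out.append(word[i])
            i += 1
    return out


# Nonscrambled locus M1 M2 M3, as for the orange contig in Figure 1.
orange = pointer_list([(1, False), (2, False), (3, False)], 3)
print(orange)                     # [1, 1, 2, 2]
print(drop_conventional(orange))  # []
```

The nonscrambled locus yields the DOW 1122, which becomes empty once neighboring identical pointers are dropped.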
For a scrambled example, in Figure 1A and B, with MDSs labeled as numbers in boxes, "1–7 odd" in B corresponds to red MDSs $M_1 M_3 M_5 M_7$, whose corresponding pointer list is 123456. The remaining red boxes appear as numbered 6, 4, 2, which has a pointer list 654321. Therefore, the pointer list corresponding to the red locus is 123456654321 in the MIC contig, a canonical odd-even pattern (and the 66 in the center is a scrambled pointer junction). Usually, in the Oxytricha genome such odd-even patterns appear in the DOW as segments that are repeated or reversed. For example, 1234⋯1234 is a repeat word corresponding to $M_1 M_3 M_5 \cdots M_2 M_4$ and 1234⋯4321 is a return word corresponding to MDS sequence $M_1 M_3 M_5 \cdots M_4 M_2$. The pointer numbers at the top of the scrambled blue MDSs in Figure 1A read 12341243. Thus the DOW representing this MAC contig is 12341243. In this case 12⋯12 is a maximal repeat word inside. The first step of reduction removes these maximal repeat words. The remaining word 3443 is a return word or perfect inverted repeat. In a second step of reduction, we eliminate this return word, leaving the empty word ε. Note that we use this iterative process to characterize the complexity of a scrambled locus, but it may or may not reflect either the pathway for gene descrambling or the evolutionary steps that led to its scrambling.
We describe these patterns by chord diagrams associated with each DOW. In a chord diagram the symbols of a DOW are placed in order from a reference point on a circle (marked by a small bar). The diagram is obtained by connecting two identical symbols placed on a circle with a chord (see Figure 1D). The chord diagrams corresponding to repeat and return words have the form depicted in Figure 3.
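A chord diagram can be computed directly from a DOW, as in the short Python sketch below. Each chord is stored as the pair of positions of a repeated symbol, and two chords cross when their endpoints alternate around the circle. The function names are illustrative and not taken from the authors' scripts.

```python
from itertools import combinations


def chords(word):
    """Return one (first_position, second_position) pair per symbol of a DOW."""
    first, out = {}, []
    for pos, symbol in enumerate(word):
        if symbol in first:
            out.append((first[symbol], pos))
        else:
            first[symbol] = pos
    return out


def crossings(word):
    """Count pairs of chords whose endpoints alternate around the circle."""
    ordered = sorted(chords(word))  # sorted by first endpoint, so a < c below
    return sum(1 for (a, b), (c, d) in combinations(ordered, 2) if c < b < d)


print(crossings("1212"))  # 1: the two chords of a repeat word cross
print(crossings("1221"))  # 0: the chords of a return word are parallel
```

Under this representation the chords of a repeat word cross pairwise, whereas the chords of a return word do not cross each other.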
Repeat and return words can appear nested within one another in Oxytricha's scrambled genome. For example, the word 121342566534 contains the return word 5665 nested within the repeat word 3434. After removal of both 5665 and 3434, the word reduces to 1212 which is again a repeat word, which further reduces to ε. In Burns et al. (2016a) we found that over 90% of scrambled contigs can be reduced by various combinations of these two operations. Therefore their topological complexity can be broken down into simple steps, which may have arisen by layer upon layer of germline translocations that introduce or propagate odd-even patterns (Chang et al. 2005; Landweber 1998).
Here, we identify rare patterns in the genome and exceptional cases of nested genes. We do so by recursively removing repeat and return patterns for the genome-wide dataset analyzed in Burns et al. (2016a), and we retain those contigs whose descriptions cannot be further reduced.
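The iterative reduction described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of Burns et al. (2016a). Here a repeat or return word must have at least two symbols, neighboring identical pointers are removed as "loops", and the leftmost pattern found is removed first. The order of removals may therefore differ from the order in the text, although the worked examples reach the same final word.

```python
def _find_pattern(word):
    """Locate one loop, maximal repeat word, or maximal return word.

    Returns (kind, positions_to_delete) or None if the word is irreducible.
    """
    n = len(word)
    for i in range(n):
        # Position of the second copy of word[i], if it lies to the right.
        p = next((t for t in range(i + 1, n) if word[t] == word[i]), None)
        if p is None:
            continue
        if p == i + 1:
            return "loop", {i, p}
        # Repeat word: both copies are read left to right.
        k = 0
        while i + k < p and p + k < n and word[i + k] == word[p + k]:
            k += 1
        if k >= 2:
            return "repeat", set(range(i, i + k)) | set(range(p, p + k))
        # Return word: the second copy is read right to left, ending at p.
        k = 0
        while i + k < p - k and word[i + k] == word[p - k]:
            k += 1
        if k >= 2:
            return "return", set(range(i, i + k)) | set(range(p - k + 1, p + 1))
    return None


def reduce_dow(word):
    """Iteratively delete patterns; return (remaining word, list of steps)."""
    word, steps = list(word), []
    while True:
        hit = _find_pattern(word)
        if hit is None:
            return word, steps
        kind, positions = hit
        steps.append(kind)
        word = [s for t, s in enumerate(word) if t not in positions]


print(reduce_dow("12341243"))      # ([], ['repeat', 'return'])
print(reduce_dow("121342566534"))  # ([], ['repeat', 'repeat', 'return'])
```

A word that comes back non-empty from `reduce_dow` corresponds, under these assumptions, to a contig whose description cannot be further reduced.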

Data Availability
The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

RESULTS AND DISCUSSION
The Oxytricha trifallax genome contains deeply nested loci
The insertion depth index (IDI) was computed for 15,811 MAC contigs and is summarized in Table 1. Two exceptional MIC loci each contain the nested precursor segments for four other MAC contigs, and this represents the highest level of nesting, IDI = 4 (or 5 nested genes). These two MIC loci are shown in Figures 5A and 5B and their topological representations are shown in Figures 6A and 6B, respectively. A total of 1301 (8%) MAC contigs contain MDSs for other nanochromosomes nested within them in the MIC (IDI of 1 or greater). Among these, scrambling is more common than the genome-wide rate of scrambling (χ² test, P < 0.01). Furthermore, higher IDI values correlate with a higher proportion of scrambled loci (Spearman's ρ = 1, P < 0.05). EI, on the other hand, does not correlate with the presence or absence of scrambling. 22% of MAC chromosomes are scrambled in the MIC for EI values 1, 2, and 3 (Table 1), which is indistinguishable from the genome-wide average (Chen et al. 2014). There is no significant correlation between the number of MDSs per MAC locus and IDI (or EI), but among nested loci, we note that those within inner layers tend to have fewer MDSs than those in outer layers, perhaps as a result of recent insertion.
The evolutionary steps that produce nested genes may favor nanochromosomes that are themselves scrambled, since their longer pointers (Chen et al. 2014) might facilitate rearrangement across greater distances. Additionally, the distance between two MDSs in the MIC that are consecutive in the MAC is typically longer for scrambled vs. nonscrambled cases (Chen et al. 2014), which provides more opportunity for insertion of other MDSs between them. The elements they contain (the innermost Russian dolls), themselves, do not display any bias for being scrambled or nonscrambled.
We further classified loci as interleaving or embedded. Interleaving occurs when some (but not all) MDSs of a nanochromosome reside between MDSs for another MAC contig. A locus is considered embedded when all MDSs for a MAC contig are contained within the MIC locus of another MAC contig (Table 2). Interleaved cases show the highest proportion of scrambled loci.

Highly scrambled and atypical loci
The rearrangement topology of 96% of the scrambled chromosomes in O. trifallax, even large ones with hundreds of scrambled MDSs (Chen et al. 2014), can be explained by combinations of two types of operations: the mixture of odd-even patterns, or repeat and return words (Burns et al. 2016a). However, 176 scrambled loci were identified that could not be reduced by iterative application of these reduction operations. In this section we describe the 22 cases that contain at least 4 scrambled pointers after this reduction. Figure 4 shows an example of one of the 22 contigs, with nested pairs of repeat and return words in blue and green, respectively. The reduction process leaves 6 scrambled pointers with DOW 4, 10, 11, 8, 12, 11, 9, 8, 4, 9, 10, 12 (commas added to separate the symbols). This word has no appearances of a repeat or return pattern. Table 3 lists the predicted genes encoded on these 22 nanochromosomes (Burns et al. 2016a), and Supplemental Material, File S1 describes the MIC loci, the MDSs for these scrambled cases, and their scrambled rearrangement paths as chord diagrams. All but 7 chromosomes encode single genes. The remaining 7 chromosomes each encode two or three genes. Scrambled pointers that cannot be explained as recursive combinations of odd-even patterns (considered irreducible in Burns et al. (2016a)) are shown as dark lines in the chord diagrams (and red nodes in Figure 4). Some of the most complex cases have over 50 scrambled MDSs. The two largest cases are over 100 MDSs; however, nearly all of those MDSs are nonscrambled. Only seven MDSs are scrambled in these loci and they share the same chord diagrams, but these two MAC chromosomes are actually paralogous (75% similar at the nucleotide level), arising from a tandem duplication of a 77 kb region in the MIC genome.

Figure 6
Graphical representations of the recombinations of the MIC loci depicted in Figure 5. Not drawn to scale. The vertices indicate the recombination sites corresponding to alignments of pointers. Loops mark conventional (nonscrambled) IESs, which occur whenever sequential MDSs are present successively on the MIC. The number inside each loop indicates the number of conventional IESs. Colored edges indicate MDSs of MAC contigs. The colors of the MAC contigs are the same as in Figure 5. More colors were added to indicate the nonscrambled contigs corresponding to the triangles in Figure 5.
The simplest case among the 22 loci is just 8 MDSs in the order $M_2 M_3 M_8 M_7 M_1 M_4 M_5 M_6$. After eliminating the 3 conventional (nonscrambled) IESs, this map simplifies to $M_2 M_5 M_1 M_3 M_4$, which requires descrambling at just 4 pointer junctions and contains no ostensible pattern. However, its scrambled pointer list 12413234 contains the pattern 121323 (as a substring after pointer 4 is recombined or removed).
The graph representation of this pattern is a tangled cord, which contains slinky-like coiled circles (Burns et al. 2013). The simplest form of an MDS rearrangement map corresponding to a tangled cord is $M_1 M_2 M_3$. Its pointer list 1212 is also a repeat word, however, that equally describes the odd-even map $M_1 M_3 M_2$, as well as $M_2 M_1 M_3$, which is not odd-even because of the inversion. Therefore, a tangled cord is defined recursively by the DOW, or pointer list, that corresponds to the MDS rearrangement map: 1212 is the simplest case. The next is 121323, and 12132434 etc. The DOW for the pointer list grows by adding a new pair of identical symbols after the last and penultimate symbols of the previous DOW.
For another example, the tangled cord pointer list 121323 describes both of the following rearrangement maps, which appear among the other 154 MIC contigs with fewer than 4 scrambled pointers remaining after eliminating all odd-even layers: $M_1 M_2 M_4 M_3$ and $M_2 M_1 M_4 M_3$ in the MIC loci for MAC contigs 567.1 and 17193.0, respectively.
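The recursive definition of the tangled cord can be written as a short Python sketch. The generator follows the rule stated above (a new symbol is added after the penultimate and after the last symbol of the previous DOW). The brute-force containment check is only one possible reading of "contains the pattern": it deletes all other pointers and relabels the rest by first appearance, which is practical only for short pointer lists. Function names are hypothetical.

```python
from itertools import combinations


def tangled_cord(n):
    """DOW of the tangled cord on n symbols (n >= 2): 1212, 121323, ..."""
    word = [1, 2, 1, 2]
    for new in range(3, n + 1):
        # Insert the new symbol after the penultimate and after the last symbol.
        word = word[:-1] + [new, word[-1], new]
    return word


def normalize(word):
    """Relabel symbols 1, 2, 3, ... in order of first appearance."""
    labels = {}
    return [labels.setdefault(s, len(labels) + 1) for s in word]


def contains_tangled_cord(word, n):
    """True if deleting all but some n symbols leaves the tangled cord on n symbols."""
    word = list(word)
    target = tangled_cord(n)
    for kept in combinations(sorted(set(word)), n):
        restricted = [s for s in word if s in kept]
        if normalize(restricted) == target:
            return True
    return False


print(tangled_cord(3))                         # [1, 2, 1, 3, 2, 3]
print(tangled_cord(4))                         # [1, 2, 1, 3, 2, 4, 3, 4]
print(contains_tangled_cord("12413234", 3))    # True (after removing pointer 4)
```

The last call reproduces the simplest case above, where removing pointer 4 from 12413234 leaves 121323.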
We find that the tangled cord can be detected in all but one (contig101.0) of the 22 loci in this study. This includes the cases with the fewest scrambled MDSs (Table 3 and File S1). Such embedding of the tangled cord appears highly prevalent among these 22 loci, but whether the pattern appears more often than would be expected by chance remains to be tested (Jonoska et al. 2017).

PERSPECTIVES
Our results show that a great diversity of scrambled gene maps are present in the germline genome of O. trifallax. The presence of such highly nested architectures was a surprise and suggests that layers upon layers of MDS and gene translocations constantly alter the genome, while the detection of highly scrambled patterns reveals architectures that transcend simple twists and turns of the DNA. We present new metrics of topological genome complexity that go beyond the linear nature of eukaryotic chromosomes and consider their deeply structured and layered history. While several models of genome rearrangement have been reported, the unprecedented levels of rearrangements in this system necessitate additional descriptive and mathematical tools, some of which have applications in biology (Assis et al. 2008) while others generate new mathematical concepts, as in Burns et al. (2013). From this initial description, several open questions arise: Do nested architectures affect other genomic properties? Is there any relationship between either gene expression, or the rearrangement pathway, and chromosome nesting? A preliminary analysis of RNA-seq data (Swart et al. 2013) suggests weak or no correlation between the IDI and the temporal order of gene expression during development. We anticipate that future studies of DNA rearrangements and transcriptional dynamics will provide insights into these questions. Furthermore, how do these patterns arise in evolution? Do nested and highly scrambled patterns accumulate gradually or in a catastrophic event, like chromothripsis (Maher and Wilson 2012)? Are the combinations of patterns serendipitous, or is there a biological process that drives the introduction of higher levels of scrambling? Future studies of related organisms should address population variation, and measure the level of variation, in chromosome structures at different scales of evolutionary divergence. In particular, surveys of the orthologous loci for the notable cases studied here in earlier diverged spirotrichous ciliates (as in Chang et al. (2005) and Hogan et al. (2001)) have the potential to reveal much about the evolutionary steps that gave rise to such complex, intertwined genome architectures.

Table 2 notes
1: ... others, means that whenever one of its MDSs is found within an IES of another MAC contig, all of the MDSs reside within the same IES.
2: A MAC contig is interleaved with another if at least one MDS of each contig is found on an IES of the other. To say that a MAC contig can only be found interleaved with others means that whenever one of its MDSs is found on an IES of another MAC contig, that contig also has an MDS on an IES of the original contig.

ACKNOWLEDGMENTS
This work has been supported in part by the NSF grant CCF-1526485 and NIH grants GM109459 and GM59708. Rafik Neme is supported by the Pew Latin American Fellowship (Biomed LAF-2017-A-01888).