Characterization of Simple Sequence Repeats (SSRs) in Ciliated Protists Inferred by Comparative Genomics

Simple sequence repeats (SSRs) are prevalent in the genomes of all organisms. They are widely used as genetic markers and are insertion/deletion mutation hotspots that directly influence genome evolution. However, little is known about these important genomic components in ciliated protists, a large group of unicellular eukaryotes with an extremely long evolutionary history and high genome diversity. Taking advantage of the recent publication of multiple ciliate genomes, we explored perfect SSRs with motif sizes of 1–100 bp and at least three motif repeats in nine species of two ciliate classes, Oligohymenophorea and Spirotrichea. We found that homopolymers are the most prevalent SSRs in these A/T-rich species, with AAA (lysine, a charged amino acid; also seen as an SSR with a one-adenine motif repeated three times) being the codon repeated at the highest frequency in coding SSR regions, consistent with the widespread alveolin proteins rich in lysine repeats found in Tetrahymena. Micronuclear SSRs are universally more abundant than macronuclear ones of the same motif size, except for the 8-bp-motif SSRs in extensively fragmented chromosomes. Both the abundance and A/T content of SSRs decrease as motif size increases, while the abundance is positively correlated with the A/T content of the genome. Also, smaller genomes have lower proportions of coding SSRs out of all SSRs in Paramecium species. This genome-wide and cross-species analysis reveals the high diversity of SSRs and reflects the rapid evolution of these simple repetitive elements in ciliate genomes.


Introduction
Simple sequence repeats (SSRs), also known as tandem repeats, are abundant components present in all known genomes. They are major contributors of genome repetivity and are associated with transposable elements [1][2][3][4]. Homopolymer runs and microsatellites are two well-known representatives of SSRs. These repeats are usually insertion/deletion (indel) mutation hotspots that cause replication slippage of DNA polymerases. They could lead to high genome instability thus causing certain diseases, for example Lynch syndrome, a hereditary non-polyposis colorectal cancer in humans [5][6][7][8]. The high indel mutation rate of SSRs increases genetic variation between individuals in a population, making SSRs suitable tools for developing genetic markers and for studies of population genetics in a variety of organisms; tandem repeats of amino acids may also facilitate rapid generation of morphological variation [9][10][11][12][13][14].
Ciliates are microbial eukaryotes with high species and genomic diversity, and are characterized by nuclear dimorphism [15][16][17][18][19][20][21][22][23]. The macronucleus is transcriptionally active whereas the micronucleus [37]
Table 1 abbreviations: A/T, A/T content of the genome; Class, the taxonomic class in which the species is; G, genome size; MAC, macronucleus; MIC, micronucleus; n, number of overlapping genes; N50, scaffold N50; Platform, genome sequencing platform; TNG, total number of genes in the genome; a, not including internally eliminated sequences (IES)-less genes; b, genes only predicted in non-maintained macronuclear chromosomes, which are lost after macronuclear differentiation.

Analysis of Simple Sequence Repeats (SSRs)
Perfect SSRs with motif size 1-100 bp (each motif has ≥3 repeats; no SSR with motif size >100 bp was detected in any genomes involved in this study) were detected with a Perl program originally developed by Dr. Way Sung, University of North Carolina, Charlotte. This program applies a greedy algorithm to find the maximum number of repeats. For motifs nested in one SSR, which are rare, only the smallest motif was counted. Details are described in Sung et al. [38]. Codons in SSRs were iterated from coding sequences of each genome, with both the strand and starting codon position taken into account. All statistical tests were carried out in R 3.4.4 [39]. Plotting was performed using R packages ggplot2 and ggpmisc.
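The detection criteria described above (perfect repeats, motif size 1-100 bp, at least three repeats, greedy extension, smallest motif reported first) can be illustrated with a minimal Python sketch. This is not the Perl program of Sung et al. [38]; the function name `find_perfect_ssrs`, its parameters, and the decision to skip motifs containing non-ACGT characters are assumptions made only for illustration.

```python
VALID_BASES = set("ACGT")

def find_perfect_ssrs(seq, max_motif=100, min_repeats=3):
    """Return (start, motif, repeats) for each perfect SSR, scanning left to right.

    Illustrative sketch only: at each position the smallest motif size is
    tried first, and a tract is extended greedily to its maximum repeat count.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    i = 0
    while i < n:
        found = False
        # A motif of size k needs at least k * min_repeats bases from position i.
        largest_k = min(max_motif, (n - i) // min_repeats)
        for k in range(1, largest_k + 1):
            motif = seq[i:i + k]
            # Assumption: ambiguous bases (e.g., N) never form part of an SSR.
            # Every longer motif starting at i contains the same base, so stop.
            if not set(motif) <= VALID_BASES:
                break
            repeats = 1
            j = i + k
            while seq[j:j + k] == motif:  # greedy extension of the tract
                repeats += 1
                j += k
            if repeats >= min_repeats:
                hits.append((i, motif, repeats))
                i = j  # continue after the tract, so nested motifs are not recounted
                found = True
                break
        if not found:
            i += 1
    return hits

print(find_perfect_ssrs("CCAAAAGTGTGTCC"))
```

Because motif sizes are tried in ascending order, a tract such as AAAAAA is reported once as a six-fold repeat of A rather than as a three-fold repeat of AA, which mirrors the smallest-motif rule stated above.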

Results
The detailed genomic features of the nine ciliate species are shown in Table 1. All genomes are A/T-rich (A/T content: 68.30%-84.09%; Table 1) with a wide range of genome sizes and total gene numbers. The species belong to one of two ciliate classes: Oligohymenophorea (Ichthyophthirius multifiliis, Paramecium biaurelia, P. caudatum, P. sexaurelia, P. tetraurelia, Pseudocohnilembus persalinus, Tetrahymena thermophila) and Spirotrichea (Oxytricha trifallax, Stylonychia lemnae). Most macronuclear chromosomes in the two spirotricheans are extremely fragmented and amplified during genome rearrangement.

Size Distribution and A/T Content of SSRs
SSRs are abundant in all macronuclear genomes, accounting for ~7.59% to 11.97% of the whole genome (Table 2; Figure 1). Such abundance is strongly correlated with the genome-wide A/T content (Pearson's r = 0.94, p = 0.0002). This confirms that the more polarized the A/T content, the more repetitive the genome. Here, we define a motif as the shortest repeating unit of any given SSR. SSRs with motif sizes 1-10 bp are more abundant than those with longer motifs, especially mononucleotide repeats as homopolymer runs, such as (A)n, (C)n, (G)n, and (T)n (Table 2; Figure 1). In addition to these homopolymer motifs, there are another 166 motifs with sizes of 2-6 bp that are shared in all nine species (Supplementary Table S1). These motifs form similar microsatellite sequences, but their distribution and repeat number do not show specific relevance to each other.
Table 2 note: All numbers are percentages, except for those in the r1, r2, and RPG columns.
The number of repeats decreases as the motif gets larger (Figure 2). Interestingly, there are peaks at 8-bp motifs in the two spirotricheans, O. trifallax and S. lemnae, with (G)4(T)4 or (A)4(C)4 at the ends of scaffolds being the majority (50.22% and 70.92%, respectively; Figure 1). These repeat motifs are known telomeric sequences that are added mostly to the ends of the gene-sized chromosomes by telomerases during macronuclear development. However, there are extremely rare internal telomeric repeats, defined as (G)4(T)4 or (A)4(C)4 motifs repeated at least twice in contigs with telomeric repeats at both ends and not located at the first or last 10% of the contigs. In S. lemnae, 36 possible internal telomeres are distributed in 36 gene-sized chromosomes; in O. trifallax, 39 in 38 chromosomes (Supplementary Table S2). However, the presence of 1000-1500 internal telomeres in the micronuclear polytene chromosomes has been previously reported in S. lemnae [40,41]. This indicates that most internal telomeres are eliminated or rearranged during macronuclear development, or unknown internal telomeric sequence difference exists between the macronucleus and micronucleus, as previously reported in T. thermophila [42]. In addition, both species have numerous extremely short, gene-sized (i.e., <1 kbp) chromosomes. This is consistent with the assertion that extreme genome fragmentation and amplification increases genome repetivity. By contrast, motifs larger than 10 bp are rare, especially in the two spirotricheans, the assembly scaffolds of which are extremely short (Table 1).
The A/T content of SSRs is significantly higher than that of the corresponding genomes (one-sided paired t-test, t = -21.563, df = 8, p = 1.13 × 10⁻⁸; Tables 1 and 2) and they are strongly correlated (r = 0.90, p = 0.0008). The higher A/T content of SSRs is likely due to the dominance of A/T homopolymers in SSRs (Table 2). This domination also elevates the median A/T content of SSRs in all nine species almost to 1.0 (Figure 2). A/T content generally decreases as motif size gets larger (Figure 2; Table 2).
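The working definition of an internal telomeric repeat used in this section (a (G)4(T)4 or (A)4(C)4 unit repeated at least twice, in a contig with telomeric repeats at both ends, and lying outside the first and last 10% of the contig) can be sketched as follows. The function name, the regular expression, and in particular the `end_window` parameter that decides what counts as "at the end" of a contig are assumptions for illustration; the text does not specify how terminal telomeres were recognized.

```python
import re

# Two or more tandem copies of either telomeric unit.
TELOMERE_TRACT = re.compile(r"(?:GGGGTTTT){2,}|(?:AAAACCCC){2,}")

def internal_telomeric_repeats(contig, end_window=40, edge_fraction=0.10):
    """Return (start, end) of telomeric tracts lying in the central part of a contig.

    Illustrative sketch: end_window (bases from either contig end in which a
    terminal telomere must occur) is a hypothetical parameter.
    """
    contig = contig.upper()
    n = len(contig)
    tracts = [(m.start(), m.end()) for m in TELOMERE_TRACT.finditer(contig)]
    # Require telomeric tracts at both ends of the contig.
    has_left = any(start < end_window for start, _ in tracts)
    has_right = any(end > n - end_window for _, end in tracts)
    if not (has_left and has_right):
        return []
    # Keep only tracts that lie entirely outside the first and last 10%.
    lower, upper = edge_fraction * n, (1 - edge_fraction) * n
    return [(s, e) for s, e in tracts if s >= lower and e <= upper]

filler = "ACGT" * 100
toy_contig = "CCCCAAAA" * 3 + filler + "GGGGTTTT" * 2 + filler + "GGGGTTTT" * 3
print(internal_telomeric_repeats(toy_contig))
```

The toy contig is constructed by hand and does not correspond to any chromosome in Supplementary Table S2.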

Association between SSRs and Genome Architecture
It is known that repetitive elements contribute to the generation or positional rearrangement of overlapping genes [43,44]. For example, in mosquitoes, overlapping events are significantly associated with the amount of microsatellite sequence in the overlapped genes; the microsatellite sequences might have facilitated crossover events that led to the positional rearrangement of neighboring genes [44]. We therefore asked whether ciliate genomes with more SSRs have more overlapping genes. The proportion of overlapping genes and the proportion of SSRs in the genome are not significantly correlated (Pearson's r = 0.55; p = 0.12), giving no significant support to the assertion that SSRs elevate the number of overlapping genes. Nonetheless, the possibility that this lack of correlation is an artifact of insufficient annotation quality cannot be excluded. It is noteworthy that only three species have overlapping genes, and the two with the most overlapping genes, i.e., Paramecium tetraurelia and Tetrahymena thermophila, have the best-annotated/maintained genomes (Table 1).
We also asked whether SSRs in the macronuclear and micronuclear genomes follow the same size distributions. Due to the paucity of available micronuclear genomes, only O. trifallax and T. thermophila are included in this analysis. In O. trifallax, for the same motif size, there are more SSRs in the micronuclear genome than in the macronuclear genome, except for those with 8-bp motifs (Figure 3). Of the macronuclear 8-bp-motif SSRs, 50.22% are in telomeres, probably because the chromosomes are extensively fragmented and amplified during macronuclear development. In O. trifallax, 8-bp-motif SSRs account for about 9.46% of all non-homopolymer SSRs in the macronuclear genome, whereas this proportion is only 0.04% in the micronuclear genome. By contrast, in T. thermophila, a species with low levels of genome rearrangement, micronuclear SSRs are universally more abundant than the macronuclear SSRs, i.e., there is higher repetivity in the micronuclear than the macronuclear genome (Figure 3).
In order to show more specific SSR patterns, we picked two genes in the T. thermophila mating type gene family (MTA6 and MTB6; each contains one internally eliminated sequence (IES); NCBI accession numbers: KC405252.1, KC405257.1), which are well-studied and have clear gene structural annotations [45]. For each gene, we ran the SSR pipelines and aligned the macronucleus-destined sequences (MDSs) in the micronuclear genome with those in the macronuclear genome (Supplementary Table S3). Consistent with the genome-wide comparison shown in Figure 3, after taking into account all sites of both genes, the macronuclear genes have fewer SSRs than the micronuclear ones. We also parsed out micronuclear intronic SSRs of the two genes and aligned them with those in the macronuclear introns. These conserved SSRs (at least in the two focal genes) include not only homopolymers such as 5′-AAAAAAAA-3′ and 5′-AAAAA-3′, but also microsatellites such as 5′-AATAATAAT-3′, 5′-ATATAT-3′, and 5′-TATATA-3′. The specific functions of these SSRs are unclear, and they could be motifs associated with the rearrangement process.
Analyzing SSRs in MDSs shared by both the MIC and MAC copies of the MTA6 and MTB6 genes, we found that ~50% of SSRs have a higher copy number in the macronucleus than in the micronucleus, with the remaining ~50% being equal in the two nuclei. As mentioned above, the total number of SSRs in the two full-length genes is higher in the micronucleus than in the macronucleus, implying that IESs greatly elevate the repetitiveness of the micronuclear genome. This observation from the two genes might extend to the whole-genome level, although a robust test with fully annotated macronuclear and micronuclear genomes would be needed. We also found a few SSRs unique to the macronuclear MDSs (i.e., not present in the corresponding MIC genes), for example, 5′-CTCCTCCTC-3′, 5′-CTGCTGCTG-3′, 5′-GCTGCTGCT-3′, 5′-TCTCTC-3′, and 5′-TGCTGCTGC-3′ in MTA6, and 5′-AACAACAAC-3′, 5′-AGCAGCAGC-3′, 5′-AGTAGTAGT-3′, 5′-CTTCTTCTT-3′, 5′-GAGAGA-3′, and 5′-TGGTGGTGG-3′ in MTB6 (Supplementary Table S3), suggesting that novel SSRs might be created during the rearrangement process.
Since some tandem repeats with 10-20 bp repeat units are involved in genome rearrangement [46], we searched for SSRs with repeat motifs of 10-20 bases in the micronuclear and macronuclear genomes of both Tetrahymena thermophila and Oxytricha trifallax (Supplementary Table S4). These SSRs are more abundant in the micronucleus than in the macronucleus (42 in the micronucleus vs. 25 in the macronucleus of T. thermophila, of which 10 are shared with mostly the same sequence and length in both genomes; 368 vs. 8 in O. trifallax, of which 4 are shared; Supplementary Table S4) and are distributed evenly along the scaffolds/chromosomes in both genomes. We also compared these SSRs to those previously published. Interestingly, two identical 19-mer SSRs were detected in two different micronuclear scaffolds (5′-ATTATTTCTTTTTACATTT-3′; Supplementary Table S4). These are known tandem repeats in Tlr1 (Tetrahymena long repeat 1; a member of a gene family with 20-30 DNA elements encoding a polynucleotide transferase [45]), which is involved in genome rearrangement of T. thermophila [47]. This example and the identification of other 10-20 bp SSRs support the quality of the genomes and the fidelity of the analysis, and provide unexplored SSR candidates that possibly function in the genome rearrangement process of both T. thermophila and O. trifallax.

SSRs in Coding Regions
SSRs are evenly distributed in gene regions, without upstream or downstream biases (Table 2, RPG). As is shown in Figure 4, the top four codons in SSRs of all nine species are AAA (codes for lysine, a charged amino acid), TTT (phenylalanine, a hydrophobic amino acid), GGG (glycine, a hydrophobic amino acid), and CCC (proline, a hydrophobic amino acid). This is consistent with the observation that the vast majority of SSRs are homopolymers. In order to identify codons that are frequently repeated in coding regions, or possibly most tolerated by the gene, we analyzed codons that are repeated more than 10 times. Isoleucine (hydrophobic), asparagine (hydrophilic), leucine (hydrophobic), tyrosine (hydrophilic), and glutamic acid (charged) codon repetitions are the most abundant in most species. Ichthyophthirius multifiliis, Paramecium biaurelia, P. sexaurelia, and P. tetraurelia are the four species with the highest numbers of repeated codons (Table 3). Of the oligohymenophoreans, P. caudatum seems to have extremely rare repeated codons. This result suggests that in the four Paramecium species included in the present study, the relative abundance of coding SSRs is strongly correlated with genome size (adjusted R² = 0.98, p = 0.006; Tables 1 and 2). However, when all nine species were analyzed, the correlation is not significant (adjusted R² = 0.13, p = 0.19).
Table 3. Total counts of SSRs with codon repeats (≥10) in the nine ciliate genomes.
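The codon-level analysis described in this section depends on reading each coding sequence in frame. A minimal Python sketch of how in-frame runs of a repeated codon could be tallied is given below. It assumes the input is a single coding sequence already oriented 5′ to 3′ on the coding strand and starting at the first codon position; the function name `codon_runs` and the adjustable `min_run` threshold are illustrative assumptions, not the pipeline used in this study.

```python
from itertools import groupby

def codon_runs(cds, min_run=10):
    """Return (codon, first_codon_index, run_length) for in-frame codon repeats.

    Illustrative sketch: cds is assumed to be in frame; any trailing partial
    codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    runs = []
    index = 0
    # groupby collapses consecutive identical codons into one group.
    for codon, group in groupby(codons):
        length = sum(1 for _ in group)
        if length >= min_run:
            runs.append((codon, index, length))
        index += length
    return runs

toy_cds = "ATG" + "AAA" * 12 + "GAT" + "CAG" * 3 + "TAA"
print(codon_runs(toy_cds))             # runs of at least 10 identical codons
print(codon_runs(toy_cds, min_run=3))  # lower threshold, for comparison
```

Splitting the sequence into codons before grouping is what makes the count frame-aware: the same nucleotide tract shifted by one base would be read as a different set of codons.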

Discussion
In this study, we investigated perfect SSRs in nine ciliate species for which high-quality genomic data are available in order to determine their size distribution, A/T content, repeated codons, and their association with other genomic features. Nevertheless, characterization of SSRs is not the equivalent of a comprehensive investigation of genome repetivity since similar studies have yet to be carried out on large repetitive elements, e.g., transposable elements.
A/T content generally decreases as motif size increases (Figure 2; Table 2), which is consistent with the observation that minisatellites (motif size > 10 bp) are GC-rich in other organisms [48]. In the macronuclear genomes of all nine ciliates in this study, we also found that the A/T content of each motif is associated with the A/T content of its flanking region (the two nucleotides flanking each SSR; Pearson's r ≈ 1, p < 2.20 × 10⁻¹⁶), which indicates the origin of non-dispersal repeats.
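As an illustration of the motif versus flanking-region comparison, the following Python sketch computes the A/T fraction of an SSR motif and of its flanking nucleotides. The text does not state whether "the two nucleotides flanking each SSR" means one base on each side or two on each side; the sketch assumes one on each side and exposes this as the hypothetical parameter `flank`. Function names are likewise assumptions.

```python
def at_fraction(s):
    """Fraction of A or T bases in s (s must be non-empty)."""
    s = s.upper()
    return sum(base in "AT" for base in s) / len(s)

def motif_and_flank_at(seq, start, motif_len, repeats, flank=1):
    """Return (A/T fraction of the motif, A/T fraction of the flanking bases).

    Illustrative sketch: flank is the number of bases taken on each side of
    the SSR tract (an assumption; see the lead-in). The SSR is assumed not to
    sit at the very end of seq, so at least one flanking base exists.
    """
    end = start + motif_len * repeats
    motif = seq[start:start + motif_len]
    flanking = seq[max(0, start - flank):start] + seq[end:end + flank]
    return at_fraction(motif), at_fraction(flanking)

toy_seq = "CCAAAAGTGTGTCC"
print(motif_and_flank_at(toy_seq, 2, 1, 4))  # (A)4 tract starting at index 2
print(motif_and_flank_at(toy_seq, 6, 2, 3))  # (GT)3 tract starting at index 6
```

Pairs of values computed this way for every SSR in a genome could then be passed to a correlation test such as the Pearson test reported above.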
We found that A/T content is strongly associated with SSR abundance. In comparison with other protists, the level of SSR content in ciliates is similar to that of the malaria pathogen Plasmodium falciparum (~9% of the genome is SSRs; A/T content 80.67%) [49], while it is much lower than that of Trypanosoma cruzi (~30% of the genome is SSRs; A/T content 48.30%) [50]. This suggests that the positive correlation between A/T content and SSR abundance is not a general rule in protists and implies that diverse mechanisms shape the evolution of repetitive genomic elements.
Amino acid repeats in proteins are known to play important roles in pathogenesis, cell interaction, motility, cytoskeleton and morphological evolution [13,51,52]. In parasitic ciliates such as Ichthyophthirius multifiliis and Cryptocaryon irritans, amino acid repeats are important components of the cell surface immobilization antigens (i-ags), which are targets of host antibodies, and codons for amino acid repeats are usually also repeated at the DNA level [53][54][55]. These repeats could cause unequal crossover, creating new alleles and thus increasing antigen diversity. Such recombinogenic expansion of surface antigens might be an adaptive strategy to increase the survival of parasitic ciliates when facing the harsh environment of host secretions. Therefore, the unstable nature of SSRs/tandem repeats could be partially advantageous for ciliate genome evolution, especially for parasitic species.
Across all the ciliate species in this study, the most abundant 3-bp SSRs in coding regions are AAA repeats, which code for lysine. Lysine repeats are the most abundant amino acid repeats in the pellicle alveolins of the alveoli, which are important cellular structures in ciliates for occupying diverse habitats and reflect highly divergent protein evolution [51,56,57,58]. This finding suggests that the SSR motifs are conserved in ciliates with different morphology and life histories. Homopolymers are prone to occur in non-coding regions (Table 2, coding SSR proportion column). It has previously been suggested that homopolymers in non-coding regions can be involved in protein binding, e.g., as upstream promoter elements [59], which implies that the presence of SSRs might be a key factor in driving genome evolution in ciliates. In addition, repeated codons (≥10 repeats) are rare, potentially as a result of stronger selection against gene mis/dysfunction caused by repetivity in smaller genomes.
In ciliates, the macronucleus is resorbed in each sexual cycle, and its evolution is driven more by epigenetic mechanisms than by classical genetic mechanisms. Relating macronuclear SSRs to the genome evolution of ciliates thus seems to be difficult; however, the macronuclear genome structurally corresponds to the macronucleus-destined sequences in the micronucleus, and the haploid genome sizes of the macronucleus and micronucleus do not usually differ much in most ciliates. In other words, studying the roles of macronuclear SSRs in genome evolution amounts to subsampling the short repetitive elements in the MIC genome (as shown in Figure 3), under the assumption that short repeats outside internally eliminated sequences (IESs) are conserved in both the MAC and MIC, although this might not always be true, especially in species with highly fragmented and scrambled genes. A full picture of SSRs in genome evolution will require well-annotated micronuclear genome sequences from more species.

Conclusions
This genome-wide and cross-species analysis reveals general features of ciliate SSRs and demonstrates the association between SSRs and the unique genome architectures of ciliates. SSRs might thus be an important driver in genome evolution of this large, charismatic group of microbial eukaryotes.