Transposable elements drive intron gain in diverse eukaryotes

Significance Introns are a crucial part of eukaryotic genomes, but their origins are poorly understood. Some lineages exhibit large-scale gains in introns extremely rapidly. This pattern might be explained by a type of genetic element, Introners, that creates copies of itself that insert into many genes across the genome. We searched thousands of eukaryotic genomes for Introners and found them in 5% of all species. Introners evolved convergently from many distinct genetic elements, most are consistent with DNA-based transposable elements, and they are disproportionately common in the genomes of aquatic organisms. We propose that horizontal transfer of transposons in aquatic taxa contributes to the biased and highly punctate evolution of intron gains across eukaryotes.

There is massive variation in intron numbers across eukaryotic genomes, yet the major drivers of intron content during evolution remain elusive. Rapid intron loss and gain in some lineages contrast with long-term evolutionary stasis in others. Episodic intron gain could be explained by recently discovered specialized transposons called Introners, but so far Introners are only known from a handful of species. Here, we performed a systematic search across 3,325 eukaryotic genomes and identified 27,563 Introner-derived introns in 175 genomes (5.2%). Species with Introners span remarkable phylogenetic diversity, from animals to basal protists, representing lineages whose last common ancestor dates to over 1.7 billion years ago. Aquatic organisms were 6.5 times more likely to contain Introners than terrestrial organisms. Introners exhibit mechanistic diversity but most are consistent with DNA transposition, indicating that Introners have evolved convergently hundreds of times from nonautonomous transposable elements. Transposable elements and aquatic taxa are associated with high rates of horizontal gene transfer, suggesting that this combination of factors may explain the punctuated and biased diversity of species containing Introners. More generally, our data suggest that Introners may explain the episodic nature of intron gain across the eukaryotic tree of life. These results illuminate the major source of ongoing intron creation in eukaryotic genomes.
intron | splicing | genome structure | evolution | comparative genomics The forces shaping intron-exon structures of eukaryotic genes remain among the longest-standing mysteries of molecular biology. Eukaryotic genomes contain from zero to hundreds of thousands of spliceosomal introns (1). Given the diverse roles of introns in gene expression and genome stability, from transcription enhancement to transcript surveillance to alternative splicing to R-loop avoidance (2)(3)(4)(5), these differences may have important functional implications. Intron numbers per gene and per genome exhibit complex phylogenetic patterns, indicating massive recurrent changes in intron numbers through evolution, and comparative analyses attest to important roles for both intron deletion (loss) and creation (gain) (1,6,7). Despite decades of debate, no consensus has emerged as to either the proximal or ultimate explanations for these patterns.
Diverse molecular mechanisms of de novo intron creation are known, but their relative contributions to genome evolution across the tree of life remain poorly understood. Proposed mechanisms of de novo intron creation include inexact double strand break repair (8), mitochondrial DNA insertion (9), internal gene duplication (10), and "intronization" of exonic sequence (11). In addition to these ad hoc intron creation mechanisms, the intron-generating transposable elements (TEs) known as Introners represent a mechanism that could explain the high and episodic frequency and genome-wide scale of intron gains observed across eukaryotic lineages. These poorly understood TEs create introns de novo through insertion into exons (12)(13)(14)(15)(16)(17). Introners have only been described in five eukaryotic lineages, and even among these cases, the precise molecular mechanisms remain obscure. Some Introner families show clear signatures of DNA TEs (12,15), while others may be novel RNA-propagated elements (14,16). More importantly, determining the extent to which Introners are a primary source of ongoing intron gain is essential for interpreting the evolution of genome structure and function and requires a broad survey that spans the eukaryotic tree of life.
By performing a systematic search and in-depth analysis of intron gain across all available eukaryotic genomes, we identified primary shared drivers of intron gain in diverse eukaryotic lineages. Our search identified 27,563 Introner-derived introns from 548 distinct families, with Introners found in 175/3,325 (5.2%) of studied genomes. Introner-containing species span remarkable phylogenetic diversity, from copepods to poorly understood basal protists, representing lineages whose last common ancestor dates back to ~1.7 billion years ago (18). Unexpectedly, aquatic organisms were 6.5 times more likely to contain Introners, and 74% of Introner-containing aquatic genomes harbored multiple distinct Introner families. Overrepresentation in aquatic organisms could reflect higher rates of lateral gene transfer. While we find that Introners are efficiently spliced, preferential presence in lowly expressed genes suggests that new insertions are costly. Most Introner families exhibit one or more signatures of DNA-based propagation. Our study indicates that susceptibility to acquire weakly deleterious Introners by lateral gene transfer might play the central role in a taxon's tendency to gain introns.

Results and Discussion
Introners Are Widespread Across Eukaryotes. Our survey across all available eukaryotic genomes indicates that Introners are abundant in diverse lineages. To search for Introners, we developed a pipeline to systematically identify groups of introns with similar sequences, for which the region of sequence similarity extends to near the splice boundary at both ends. This approach allows flexibility in identifying introns created by TE insertions through potentially complex mechanisms while excluding most cases where inter-intron similarities reflect secondary insertion of TEs or evolution of microsatellites within preexisting introns. We then applied this pipeline to 2,805 genomes representing 1,700 species with available genome annotations in Genbank (SI Appendix, Table 2). After extensive quality control (see Methods), our search revealed sets of Introners in 48 species grouping into eight distinct taxonomic groups representing six major eukaryotic groups. Although here we refer to each of these elements as "Introners," we do not mean to imply any direct evidence for shared homology among the various intron-generating TE families we describe here.

Introners Are Disproportionately Common in Aquatic Lineages.
Introners are disproportionately common in aquatic lineages, suggesting an important relationship between external environment and rates of Introner evolution. To our surprise, 7/8 Intronercontaining taxonomic groups (all except pezizomycotina fungi) inhabit aquatic environments. To further explore aquatic diversity, we analyzed 520 partial genomes from aquatic organisms, representing 71 distinct genera (19) (SI Appendix, Table 3). This revealed 25 additional Introner-containing taxonomic groups, yielding a total of 32 separate taxonomic groups. Each presumably represents independent acquisition/evolution of Introners ( Fig. 1), suggesting a highly punctuated pattern of Introner presence across the eukaryotic tree with little phylogenetic signal (P < 0.001, abouheif's C mean permutation test). Within this combined dataset, among 1,597 species for which aquatic/non-aquatic status was assignable, 17.0% of aquatic species (39/230) but only 2.6% of non-aquatic species (35/1367) exhibited at least one Introner family, confirming that aquatic habitat is significantly correlated with Introner presence (P < 10 −5 , two-sided Fisher's exact test). Our results imply that aquatic environments are correlated with the evolution of Introners and that aquatic environments may be an important driver of intron gain.
A test of environment association that accounts for phylogenetic relationship strengthens the conclusion that the genomes of organisms in aquatic environments are disproportionately likely to harbor Introners. Because some Introner-containing lineages are closely related, the apparent marginal correlation between aquatic environments and Introner presence might be an idiosyncratic result of shared ancestry rather than an independently associated factor. We therefore retrieved a phylogeny of all eukaryotic species in our study, and we found that a model where the rate at which a lineage evolves Introners depends on the environment is a significantly better fit than a model where the rate of Introner gain is independent of the environment (P < 4.1 × 10 −4 Pagel's test, SI Appendix, Table 4 and SI Appendix, Figs. S1 and S2, See Methods). This analysis therefore indicates that the strong correlation between environment and rates of Introner gain is not strictly a product of shared ancestry, but rather may reflect an important determinant of rates of intron gain.

Frequent Convergent Evolution of Introners from DNA
Transposons. Introner abundance varies substantially across Introner-containing lineages and even between extremely closely related organisms. We identified 27,563 Introner-derived introns within 175 Introner-containing genomes, representing 548 putatively separate Introner families defined based on sequence similarity (SI Appendix, Table 6). Introner families exhibited substantial diversity in copy number per genome (5 to more than 2,000), length (median length 30-654 nucleotides), GC content (20.9 to 83.3%), and percent of rare GC and GA 5′ splice sites (0 to 50.0% and 0 to 40.0%, respectively) (SI Appendix, Table 6). Introners themselves generally exhibit lower GC content than exons in host genomes. This may be consistent with the observation that DNA transposons are GC poor relative to host eukaryotic protein-coding regions (20,21). Genomes differed in the number of predicted Introner families, from one to 43 families ( Fig. 1 and SI Appendix, Table 5). We also found large differences between closely related organisms. For example, among three Micromonas species, an unknown species (isolate TARA_MED_95_MAG_00390) had no detectable Introners, M. commoda had two families (44 total Introners), and M. pusilla had seven families (3,566 total Introners) with no sequence similarity to the M. commoda Introners. In Florencialla, some Introner families were found in all five isolates and others in only a subset. We found similar patterns in the highest quality genomes, suggesting that data quality issues do not drive these results. Our findings suggest that Introner presence and content are extremely evolutionarily labile, consistent with rapid changes in intron abundance observed across eukaryotic lineages.
Diverse molecular functions for intron gain suggest that many nonautonomous transposons have convergently evolved into Introners. Some previously characterized Introner families exhibit signatures of DNA cut-and-paste TEs, such as 3-9 bp repeats flanking their insertion site (target site duplications or TSDs) and/ or inverted repeats at their 5′ and 3′ ends (terminal inverted repeats or TIRs), while others lack such signatures (12)(13)(14)22). Among Introner families for which TSD and/or TIR presence/ absence could be ascertained (both characteristics of DNA TEs), 49/130 show evidence for TSDs and 72/177 show evidence for TIRs. Conversely, other families lack these features, including cases where Introners have 100% end-to-end sequence identity. For example, in M. pusilla both TSD/TIR families and non-TSD/ non-TIR families are present (SI Appendix, Table 6). Introners also show remarkable diversity in mechanisms of splice-site recruitment. Among families with TSDs, we found a variety of orientations of splicing boundaries with respect to the TSDs. These included cases where the Introner carries the 3′ splice site and the 5′ splice site is recruited from the TSD (Fig. 2B), cases in which the reverse is true (Fig. 2C), or more complex cases in which either both splice sites are entirely or partially recruited from the TSD or from neighboring exonic sequence (Fig. 2E). While most families followed previous reports in which insertion and splicing do not lead to a change in mRNA sequence length (12,13), we also found cases where Introner insertion is associated with insertion or deletion of one or more neighboring codons (Fig. 2F). This exceptional range of molecular mechanisms suggests that Introners have independently evolved from diverse autonomous transposable elements and may explain recurrent bursts of intron gain across diverse eukaryotic lineages (7, 23).

The Majority of Introners May Propagate via DNA-Based
Mechanisms. Previous studies proposed different mechanisms for Introner mobilization. Some algal Introners appear to be miniature inverted-repeat transposable elements based on observation of TSDs, TIRs, and biased insertion into nucleosome linker regions (12). In contrast, fungal Introners lack TSDs and TIRs and are highly biased toward gene regions (insertion into nucleosome linkers was not studied) and have been interpreted as novel RNAbased elements that propagate through reverse-splicing of RNA copies of spliced Introners (14,16,17).
Although we observe exceptional molecular diversity (above), most Introner families exhibit at least one signature consistent with DNA transposition and ascomycetes are an outlier. Among 130 families for which both TSD and TIR presence/absence could confidently be determined, 59.2% have either TSDs, TIRs, or both (78.5% when we exclude ascomycetes fungi, see below). Presence of separate DNA-and RNA-based families predicts that putative DNA-based signatures are positively associated with each other across families and are negatively associated with the putative RNA-based signature of bias toward genes. However, TSDs are present in equal fractions of TIR-containing and non-TIR-containing non-ascomycetes families (51.7% (30/58) and 48.7% (19/40), respectively; P = 0.99 two-sided Fisher's exact test), and we did not find an association between TSD or TIR presence and nucleosome linker bias (SI Appendix, Table 7). Furthermore, there is little association between TSD or TIR presence and bias toward insertions in genes (SI Appendix, Table 7, see Methods). We also found no difference when comparing families that differed in TSD/TIR presence when accounting for species, suggesting that correlations were not obscured by unaccounted for interspecific differences (SI Appendix, Table 7). Ascomycete Introner families are an outlier, with no TIRs or TSDs. Together, our results suggest that most Introners propagate via DNA transposition.

Introners Show Insertional Preferences at Various Genomic
Scales. We next investigated the signatures of Introner insertion by studying Introner insertion positions at the level of genome region, nucleotide content, and chromatin structure. Introners are enriched in genes in 161/175 Introner-containing genomes ( Fig. 1 and SI Appendix, Table 5). This pattern echoes some DNA elements, for example piggyBac elements in Drosophila, which also preferentially insert into coding regions (24). Overrepresentation of Introners within genes could also reflect an insertional bias toward GC-rich regions, since genes are typically more GC-rich than intergenic regions. GC-bias has also been previously reported for DNA TEs (25,26). We find that Introner families enriched in genes also tend to show biased insertion into GC-rich motifs (P < 0.0001 binomial test, SI Appendix, Figs. S3-S5, SI Appendix, Table 5). One possibility is that TEs that exhibited a preexisting bias for insertion in GC-rich genic regions experience greater selection for efficient splicing to reduce gene disruption thereby creating new Introners. Finally, some putatively DNA-based Introners in M. pusilla and Aureococcus anophagefferens preferentially insert in nucleosomes linker regions (i.e., between nucleosomes ref. 12), similarly to many DNA transposons (27). Computationally predicted nucleosome occupancy profiles showed a bias toward linker region insertion for 78.8% (104/132) of Introner-containing genomes for which prediction was possible ( Fig. 1 and SI Appendix, Figs. S6 and S7, SI Appendix, Table 5 and 6).

Negative Selection Shapes the Distribution of Introner
Insertions. Introners exhibit a range of molecular phenotypes indicative of negative selection on the majority of new insertions. To evaluate splicing efficiency, we estimated the percent-splicedin (PSI) of Introners and other introns by comparing the relative read depths within adjacent exons and across each intron. Here, if an intron is very efficiently spliced, we should find few or no reads mapping across the intron. We find that observed Introners are generally more efficiently spliced than are other introns (P = 5.9 × 10 −3 , binomial test, Fig. 3 A and B and SI Appendix,  Fig. S8). Because observed Introners may not reflect splicing of all new insertions, it is likely that more deleterious insertions could be removed by selection and thus be absent from sequenced genomes. This pattern suggests that high-frequency Introners may have limited mis-splicing-related fitness costs due to negative selection purging Introners that are frequently mis-spliced. Similarly, we find that Introner insertions are biased toward lowly expressed genes relative to other introns (P = 2.4 × 10 −2 binomial test, Fig.  3C and SI Appendix, Fig. S9), as could be expected if some Introner insertions impose transcription-or splicing-associated costs. This bias is unlikely to result from an insertion site preference given preferential Introner insertion into GC-rich genes, which are typically more highly expressed (28). These data therefore suggest that negative selection impacts the range of Introner insertions we observe.

A New Model for the Evolutionary Forces Governing Intron Gain.
Our results suggest that a range of previously proposed models for the major forces governing intron gain requires reconsideration. We find no clear support for previous influential proposals that organismal complexity or small population size promotes intron gain (29). Most strikingly, among animals and land plants, two organismally complex groups with typically small effective population sizes, we find Introners in only two taxonomic groups (both animals), despite accounting for one-quarter (835/3,325) of studied genomes (P < 0.00001, two-sided Fisher's exact test). This dearth of Introners is all the more striking given that these lineages are intron-rich, are generally more slowly evolving at the sequence level which facilitates Introner discovery, and widely use introns in gene regulation (1). The large number of observed introns in these lineages is more likely the result of low rates of ancestral Intron loss rather than more recent intron gains (30). Furthermore, we find no evidence for a pattern of Introner gain in species whose biology predicts small population size, such as parasites or vertebrates.
We propose that the distribution of Introners reflects the propensity of lineages to acquire new genetic elements via horizontal gene transfer (HGT) (SI Appendix, Fig. S10). All Intronercontaining lineages except Ascomycetes and one species of blastocyst are aquatic organisms with aquatic organisms mostly represented by a remarkably diverse array of unicellular organisms. We propose that this pattern reflects greater rates of HGT for these species. Indeed, aquatic environments generally favor HGT (31)(32)(33)(34), and aquatic unicellular in particular have been shown to exhibit high rates of HGT possibly because they often live in dense microbial communities and require interactions with other species for their ecology (35). A variety of studies have detailed large-scale lateral gene transfer in aquatic protists, including several that have Introners (35)(36)(37)(38). The only two non-aquatic Intronercontaining lineages, ascomycetes and blastocysts, have also been reported to have large amounts of HGT (39,40). We do not find evidence of HGT of Introners identified in this study. However, DNA transposons more generally are well adapted for HGT and have frequently made jumps between highly diverse lineages (41). The introduction of a new DNA transposon unfamiliar to a host might also intensify selection for the independent evolution of Introners. Newly acquired TEs often evade host-mediated TE silencing mechanisms and thereby have the freedom to mobilize at high frequencies, representing a major cost to host fitness (41). The ability of a TE to be spliced could alleviate some of these costs. Thus, HGT could not only explain the highly punctated pattern of Introner presence across distantly related taxa, but also favors the independent evolution of Introners in new lineages.

Conclusion
The proximate and ultimate origins of introns remain a fundamental question in biology. Here, we demonstrate that Introners generate new introns on genomic scales in a remarkable diversity of eukaryotic lineages. Despite many similarities, the extensive molecular diversity that underlies Introner transposition reveals that a vast range of transposon families has independently evolved into Introners. In light of these findings, frequent horizontal transfer of TEs and the extreme aquatic biased distribution of species harboring Introners, we propose that a crucial factor governing lineages' tendency to gain introns over time is exposure to transfer of TEs from diverse unrelated eukaryotic organisms.

Methods
Accessing Genomic Data. We performed our systematic search for Introners on all annotated genomes in Genbank (last accessed 9:24 AM April 10, 2020). We used FTP links available through a CSV file downloadable from NCBI (SI Appendix, Table 1) to systematically access and download genomic data for our analyses.
We filtered out genomes that lacked annotation files or for which annotations were dubious based on low gene number (SI Appendix, Table 2). Fasta files and genome annotations were downloaded for the Tara Oceans Eukaryotic Genomes project, from https://www.genoscope.cns.fr/tara/. Genomes without GFF annotation files were filtered out, yielding a total of 520 genome assemblies.
Identifying Highly Similar Introns. To find candidate Introners, for each genome, we first extracted all annotated intron-exon structures, that is, genomic sequences corresponding to annotated protein-coding sequences spanning from translation start to stop codons, and with exon and intron sequences indicated by upper and lower case. We then extracted each intron along with up to 20 nucleotides of flanking sequence (requiring a minimum of ten flanking exonic nucleotides). We then searched for introns with sequence similarity to each other. Candidate similar pairs were identified by an all-against-all blast (42) of introns-plus-flanking sequences within each species, with a minimum e-value of 10 −5 . Previous results and our preliminary findings reveal that there exist several reasons that two introns can have extensive sequence similarity other than creation by the same Introner family; therefore, we performed several filtering steps to eliminate false positives. We filtered for two alternative reasons for intron-intron sequence similarity. First, many introns with extensive sequence similarity to each other owe this similarity to secondary insertion of transposable elements or to microsatellites within the intron interior. Conversely, introns from paralogous genes or gene regions can be similar due to duplication of a longer region. These two cases share a frequent signature, namely that the region of intron-intron sequence does not correspond to (roughly) the whole of both introns, but instead is either a subportion of the intron (in the first case) or extends beyond the intron (in the second case). Therefore, we required that the region sequence similarity between the two introns extends to near the exon-intron boundary. After iterative manual scrutiny, we chose to require that sequence similarity begins within a 15 base region spanning five exonic bases or ten intronic bases, and that this be the case for both introns for both 5′ and 3′ intronic boundaries.
After initially requiring end-to-end blast hits, we learned that secondary indels including secondary TE insertion led to many false negatives. Consequently, we applied a different strategy where we judged similarity between each pair of introns based on pairwise similarity of the two ends, either similarity between corresponding ends (5′ with 5′, 3′ with 3′, as expected from same-orientation insertion) or opposing ends (5′ with 3′, 3′ with 5′, as in opposite-orientation insertion). Up to 100 intronic nucleotides were assessed for each end. We required that the similarities were in the expected orientation (i.e., the interiors of the two introns lining up together).
Recent paralogous gene duplications can result in sequence similarity between intron sequences. While the requirement that nucleotide-level sequence similarity begins an end near the intron-exon boundary removes most such cases, we also observed cases where rapid exonic evolution (or simple chance substitution) led to paralogous sequences being retained (the clearest case involved the introns of the huge gene families fast-evolving var genes of Plasmodium species). To filter remaining false positives by introns in paralogous genes, we first translated all Introner-containing genes from DNA to protein sequence. Then we used diamond (version 0.9.24) (43), with default options except minimum e-value of 10 −20 , to identify and remove from the list cases of sequence similarity between encoded proteins for intron pairs with similar sequences.
Within each genome with remaining similar intron pairs, we then used pairwise similarities and a greedy algorithm to group introns with at least one remaining pairwise similarity into Introner families. Families with at least four introns were retained. We acknowledge that since we used a homology-based approach to identify Introners, we are more likely to identify Introners in lineages with slower evolutionary rates. Nonetheless, this possible source of bias apparently had very little effect on our results since we did not find any evidence for Introners in lineages which are generally slowly evolving in sequence such as land plants and mammals and instead primarily found Introners in lineages which evolve relatively rapidly such as green algae.
Filtering Assembly Errors. We also filtered out putative Introner families identified as a result of genome assembly problems. For example, sequencing adapters included in genome assemblies might sometimes be annotated as introns, in which case genome assemblies including many sequencing adaptors could have multiple similar annotated intronic sequences. We used blast searches to examine putative Introner for the presence of Illumina adaptors and removed families which contained them. We also performed an Internet search on each putative Introner family consensus sequence and subsections of each consensus sequence to ensure that Introner families did not embody or contain any known sequences associated with genome assembly or sequencing methods.
Filtering Introners with Low Complexity. We filtered putative Introners families with low sequence complexity since sequence similarity between these potential Introners could have resulted from alternative mechanisms than transposition, e.g., microsatellite expansion. To do this, we manually examined putative Introner family consensus sequences and looked for an abundance of short repetitive sequence motifs.
Finalized Introner Sequences. After filtering, we possessed a set of finalized Introner sequences sorted in fasta files by species and family within species. Fasta files for each Introner family in each species can be found at https://github.com/ lgozasht/Introner-elements.

A Representative Set of Genomes for Downstream Analyses Based On
Introner Content. We next scrutinized patterns of Introner family presence/ absence across all Introner-containing genomes. In particular, the presence of clusters of closely related organisms represented within the TARA Oceans genomes led to cases of multiple genomes with very similar Introner complements-both at the level of families and of specific insertions. At the same time, as mentioned in the main text, closely related genomes sometimes exhibit overlapping but non-identical sets of Introner families. Genomes containing identical sets of Introner families were grouped, leading to 16 groups, mostly including two genomes (13 groups), but ranging up to nine genomes.
Proportion of Introners Inside of Genes. Since Introners in intergenic regions cannot be annotated as introns, we developed a systematic method to re-cover them conditional on a known Introner family detected as described above. We employed multiple alignment using fast Fourier transform (MAFFT) (44) to conduct multiple sequence alignments for each Introner family in each species. Next, we generated a consensus sequence for each Introner family using a positional nucleotide frequency matrix. We required that greater than 50% of Introners possess the same nucleotide at a particular position for that base to be included in our consensus sequence. We BLASTed each consensus to its corresponding reference genome and filtered duplicate and self-hits.
We used a permutation test to interrogate possible enrichment for Introners in genes. If a genome is more gene dense, a transposon is more likely to land in a gene. To correct for this, we generated 1,000 permutations for the probability that a particular Introner will insert into a gene by chance by randomizing the Introner positions across the genome such that n = the number of total insertions and p = gene density. We compared these with our actual values to test for insertional enrichment in genic regions.

GC Content Analysis.
Genes are generally GC-rich relative to intergenic regions (45), and transposable elements have been shown to display preference for GC-rich regions (46,47). To test whether Introners that are enriched in genes also demonstrate a bias for GC-rich regions, we employed a permutation test. We calculated the GC content for the concatenated ten base pairs (bp) upstream and downstream of each insertion (20 bp total). We then generated 10,000 permutations for the GC content of randomly resampled 20 bp regions from the same gene in which each respective Introner was found. By using a relatively small window size of 20 bp, we hoped to more accurately capture specific insertion site biases by limiting the noise introduced by surrounding sequences, although we also used a window size of 100 bp and obtained similar results. We compared our observed GC proportions to the randomly sampled distribution and found that Introners in many species are also enriched within GC-rich regions. Across all species, we found a significant correlation between insertional enrichment in genes and insertional enrichment in GC-rich regions, suggesting that Introners may favor GC-rich regions rather than simply an insertion preference for genic regions per se (P < 0.0001; binomial test). Here, we constructed this comparison as a one-sided test where we asked if the proportion of species whose insertion preferences exceeded background genic GC content differed from random expectations. Note also that conditioning on the specific genes into which Introners insert is very conservative because any strong GC content skew within a gene would be reflected in the null distribution.

Nucleosome Occupancy Prediction and Analysis.
Since the vast majority of genomes in our survey lack available epigenetic data and most cannot be reliably cultured, we applied an in silico predictive approach to interrogate whether Introners insert into nucleosome linker DNA in other lineages. We used the program, NuPoP (48), to predict the nucleosome occupancy for the 10 kb spanning and surrounding the insertion sites of all Introners in each species for which we possessed at least four Introners with contiguous sequence assembled 5 kb upstream and 5 kb downstream of insertion sites. We required at least 5 kb around each Introner to ensure that we maximize the accuracy of predictions. The reason is that this approach is based on a hidden Markov model and therefore requires moderate sequence lengths to produce reliable results. We ran NuPoP with flags species=0 and model=4 first with Introners and then again with Introners computationally removed from the gene sequence as a control (SI Appendix, Fig. S6). We observe a pattern across most genomes of decreased nucleosome occupancy for the 100 bp surrounding the 5′ splice sites of Introners relative to background regions (P = 8.26e−07; binomial test) and are able to reproduce the pattern previously reported for Introners in Micromonas and Aerococcus using nucleosome profile data (SI Appendix, Figs. S6 and S7). We observe a decreased nucleosome occupancy within Introner sequences relative to background regions even more often (P = 4.00e−10; binomial test), suggesting that Introners insert into and inhabit nucleosome linker regions. When we perform the same comparison with Introners removed to replicate surrounding regions, we observe the opposite pattern, in which the nucleosome occupancy is higher (P = 9.13e−12; binomial test), suggesting that nucleosomes often flank Introner sequences (SI Appendix, Fig. S7). We used a Gaussian generalized linear model (GLM) through R to look for an association between the number of Introners and predicted nucleosome occupancy within Introners relative to background nucleosome occupancy of each Introner family in each species with formula delta_nuc_occup ~ number_ of_Introners. We did the same association for Introner length and species with formulae: delta_nuc_occup ~ Introner_length and delta_nuc_occup ~ species. We find that the number of Introners in each family and Introner length are both good predictors of nucleosome occupancy within Introners relative to background (P < 0.0001 and P = 0.04), with smaller families with shorter sequences having lower delta_nuc_occup. We postulate that this association may be due to reduced accuracy of predictions on small sample sizes. This may explain why in smaller, shorter Introner families we sometimes do not observe as clear of a pattern of low nucleosome occupancy in Introners. We note, however, that species is an even better predictor than the aforementioned variables (P < 0.0001) and that our ability to predict nucleosome profiles accurately with a sequence motif-based hidden Markov model (HMM) could also be limited in highly divergent species relative to those used for training the HMM initially or in low-quality genomes. Indeed, when we filter for species in which we observed delta_nuc_occup < 0 and use a GLM with the formula: delta_nuc_occup ~ GC_content, we find that GC content explains better than species (P = 0.0015).
Identifying TSDs and TIRs. We searched for evidence of TSDs in and around each Introner family in each species for which there were at least 15 genic Introners in the family. Most clearly, positions that are part of a TSD manifest as similarity between 5′ and 3′ ends within individual Introner from a single insertion site, but not globally between introns (i.e., corresponding nucleotides 5′ and 3′ ends match for intron A, but do not necessarily match between introns A and B). However, TSDs often include core GY/AG splice-site nucleotides, in which case nucleotides will match across introns as well. Thus to assess TSD presence/absence, we calculated both absolute match between corresponding 5′ and 3′ nucleotides (i.e., nucleotides the same distance from the splice site, for instance the first nucleotide of the intron and the first nucleotide of the downstream exon), as well as relative match, calculated as the fraction of matches within individual introns divided by the expected value calculated from 1,000 random 5′/3′ pairs. For each Introner family within each species, these values were assessed to look for TSD patterns including at least one nucleotide position with locus-specific match (e.g., NAGgy…nag, where N/n show significant match within but not between insertion sites). For patterns that included only across-intron matches (e.g., families where all or nearly all introns are preceded by an AG (AGgy…xyag), it is not possible to distinguish between the AG representing a TSD (Introner sequence = gy..xyag, TSD = AG), or the AG representing an insertion site without a TSD (Introner sequence = gy.. xyag or AGgy…xy, no TSD). In some instances, one of the two duplicate motifs was part of an extended TIR (see below). Otherwise, TSD presence/absence was called as ambiguous.
TIRs were searched for manually by searching the consensus sequence within a region extending from −20 to 20 nucleotides of the intron. This generally yielded either a clear extended TIR (≥6 nucleotides) or no evidence of a TIR. The few cases with partial or short TIRs were called as ambiguous. These calls are available at https://github.com/lgozasht/Introner-elements Examples used in Fig. 2 A-F Comparing Orthologs Between Aureococcus Isolates. We used BLAST to identify homology between Introner-containing genes in Aureococcus sp. (isolate TARA_AOS_82_MAG_00129) and Aureococcus anophagefferens (Genbank Acc. GCA_000186865.1). We used MAFFT (44) to perform a multiple sequence alignment (MSA) between each Introner-containing gene in Aureococcus sp. and its match in Aureococcus anophagefferens with the lowest e-value given the e-value < 0.01. We also performed an msa between translated proteins corresponding to these genes. The example shown in Fig. 2 stems from an alignment between an Introner-containing gene in Aureococcus sp. and Aureococcus anophagefferens Phosphoenolpyruvate carboxylase kinase 1 (NCBI Acc. XM_009038642.1) at both the nucleotide and protein level. In this example, Aureococcus sp. exhibits an Introner insertion at position 545 in this gene, which resulted in the addition of four amino acids relative to Aureococcus anophagefferens.
Identifying Potential Mobilizing Elements. We searched for transposase-encoding autonomous elements that may mobilize Introners with similar terminal sequences. For each Introner family in each species, we constructed position weight matrices of length 22 bp at the 5′ and 3′ ends of the elements. We then searched each respective species' genome for matches to the 5′ probability weight matrix (PWM) using PoSSuMsearch (49) and searched the downstream 10,000 bp of each match using the 3′ PWM. We also searched for matches among predicted repetitive elements found using RepeatModeler (50). Open reading frames were found between the 5′ and 3′ pairs of PWM matches, and their translated amino acid sequences were used to search a database of transposases (a subset of UniProt) using BLASTP (SI Appendix, Table 8).
Assessing Homology Between Introners in Different Species. We used BLAST to search for homology between consensus sequences of Introners in different species. We performed an all vs. all BLAST of Introner consensi and found no evidence of homology except between relatively recently diverged species (within the same genus). However, we did observe cases for TARA metagenomes (for which only the genus is reported) in which isolates within the same genus possess different Introner families. We treated those isolates as separate species throughout our study.

Checking for Associations Between TSD and TIR Presence and Introner
Architecture and Distribution. We tested for associations between TSD and TIR presence and other Introner statistics both on the subset of genomes that possessed Introner families with and without TSDs/TIRs and across all species. To do this, we used generalized linear models through R (SI Appendix, Table 7). To test for an association between TSD and TIR presence and insertional preference in genes, we fit a GLM with the following format: (Introners_in_genes, Introners_outside_genes) ~ TSDs and cbind(Introners_in_genes, Introners_outside_genes) ~ TIRs using a binomial family link function. As a control, performed the same analysis with species instead of TSD or TIR presence using a GLM of the format: (Introners_in_genes, Introners_outside_genes) ~ species.
We found that the term species better explains our data than TSD or TIR presence using Akaike Information Criterion (AIC).
To test for an association between TSDs and TIR and canonical splice-site usage, we fit a GLM of the form: canonical_splice_site_usage ~ TSDs and canonical_splice_ site_usage ~ TIRs under the Gaussian family link function. Again we performed the same association for species and found that species better explains our data than TSD or TIR presence. To test for an association between TSD and TIR presence and number of Introners, we used a GLM of the form:

number_of_Introners ~ TSDs number_of_Introners ~ TIRs
under a Gaussian link function. We find that TSDs and TIRs explain our data poorly in this case. To test for an association between TSD and TIR presence and delta nucleosome occupancy, we a GLM of the format: background_nuc_occup-nuc_occup_of_Introner ~ TSDs background_nuc_occup-nuc_occup_of_Introner ~ TIRs again using the Gaussian family link function. We performed the same association with species and again found that species better explained our data.
RNA-seq Analysis. For each species with identified Introners (genus level TARA metagenomes excluded), we searched the Sequence Read Archive (SRA) database for RNA-seq data, prioritizing sequencing runs conducted on the same individual from which the reference genome was assembled (SI Appendix, Table 9). We aligned the RNA reads to the reference genome using STAR (51), calculated the depth at each site using samtools (52), and identified splice junctions using leafcutter (53). We then used custom Python scripts to identify, for each intron, the number of splicing events that used the annotated splice junctions, as well as the number of splicing events that used non-canonical junctions within 50 nucleotides on either side of the annotated junction. We then used the R package lme4 (54) to construct generalized linear model of the form: (proper_splices,missplices) ~ Introner + depth + length (proper_splices,missplices) ~ depth + length to correct for the depth and length of each intron. To ensure that the i.e., variable (whether or not the intron is an Introner) was significantly correlated with splicing behavior, we calculated the likelihood ratio of the two models using the AIC (55). If the model containing the i.e., variable was a better fit, and if coefficient for the i.e., variable was positive, Introners in this species exhibit more canonical splicing than non-Introner intron.
We used a similar approach for percent spliced in (PSI; Fig. 3), using GLMs of the form: (spliced_in,proper_splices + missplices) ~ Introner + depth + length (spliced_in,proper_splices+missplices) ~ depth + length, ensuring that whether an intron is an Introner is significantly correlated with PSI using AIC likelihood ratios, and considering cases with negative coefficients for the Introner variable as cases where Introners are more likely to be spliced in than non-Introner introns (Fig. 3).
To identify whether Introner-containing genes were more lowly expressed than other genes in each species, we used two methods. First, we used a permutation test. We calculated the average read count for Introner-containing genes and then selected 10,000 sets of randomly sampled genes, where the probability that a given gene was sampled was proportional to its length. The number of random samples in which the average read count of the random sample was greater than the average read count of the Introner-containing genes acted as our P-value (SI Appendix, Table 9). Additionally, we also performed independent Mann-Whitney U tests comparing reads per kilobase of exon per million distributions for Introner-containing genes and other intron-containing genes. For this analysis, we also removed transcripts with less than 10 mapping reads.
Correlating Aquatic Lifestyle and Introner Presence. For phylogenetic tests, we downloaded the global tree from open tree of life (56) and pruned the tree to retain only species that we considered in this analysis. In the case of TARA metagenomes, for which we only possessed the genera, we randomly selected one species within each genus to represent all isolates. Our tree can be found at https://github.com/lgozasht/Introner-elements/blob/main/ pruned_tree2.nwk.gz. The open tree of life maintains arbitrary branch lengths (branch lengths of 1). We recognize that arbitrary branch lengths can imply that more total evolution has taken place between the root and the tips of the tree for those species with more ancestors (57). Thus, to test for whether or not phylogenetic signal could explain the distribution of Introners across species, we used abouheif's C mean test through the R package adephylo (58,59). This method only considers topological relationships among species and is thus robust to branch length ambiguities (60). We obtained a P value < 0.001 across 1,000 permutations, suggesting that phylogenetic signal in itself poorly explains the observed distribution of Introners across the eukaryotic tree. To evaluate a correlation between aquatic lifestyle and the presence of Introners, we used Pagel's test through the R package phytools (57, 61) using the aforementioned global phylogeny as an input (SI Appendix, Fig. S2). Pagel's test estimates rates of evolution for two given traits, constructs models in which the two traits evolve independently, co-dependently, or with one trait dependent upon the other, and model fits are compared using AIC (SI Appendix, Fig. S1). To prepare the input data for these tests, we manually annotated each species considered as aquatic or not using the following criterion: To be considered an aquatic species, a species must spend most of its life surrounded by water (SI Appendix, Table 4). A species also cannot be an obligate parasite of a terrestrial organism even though in such a case the species may live primarily within an aqueous solution.
Data, Materials, and Software Availability. Previously published data were used for this work (NCBI).