Integrating Phylogenetics With Intron Positions Illuminates the Origin of the Complex Spliceosome

Abstract Eukaryotic genes are characterized by the presence of introns that are removed from pre-mRNA by a spliceosome. This ribonucleoprotein complex is comprised of multiple RNA molecules and over a hundred proteins, which makes it one of the most complex molecular machines that originated during the prokaryote-to-eukaryote transition. Previous works have established that these introns and the spliceosomal core originated from self-splicing introns in prokaryotes. Yet, how the spliceosomal core expanded by recruiting many additional proteins remains largely elusive. In this study, we use phylogenetic analyses to infer the evolutionary history of 145 proteins that we could trace back to the spliceosome in the last eukaryotic common ancestor. We found that an overabundance of proteins derived from ribosome-related processes was added to the prokaryote-derived core. Extensive duplications of these proteins substantially increased the complexity of the emerging spliceosome. By comparing the intron positions between spliceosomal paralogs, we infer that most spliceosomal complexity postdates the spread of introns through the proto-eukaryotic genome. The reconstruction of early spliceosomal evolution provides insight into the driving forces behind the emergence of complexes with many proteins during eukaryogenesis.


Introduction
The spliceosome is a dynamic ribonucleoprotein (RNP) complex that assembles on the pre-mRNA to remove introns, intervening sequences between the exons. The exons are spliced together to form a mature mRNA. Like the complex, the exon-intron structure of protein-coding genes is characteristic of eukaryotes. Transcription and splicing occur in the nucleus, which physically separates these processes from protein translation. Failure of correct splicing generally results in non-functional proteins.
The composition of the spliceosome changes during the splicing cycle (Wilkinson et al. 2020). It consists of five small nuclear RNAs (snRNAs) U1, U2, U4, U5, and U6, which are bound by multiple proteins to form small nuclear RNPs (snRNPs), and several additional subcomplexes and factors. In the splicing reaction, the 5′ splice site first reacts with the adenosine branch point, forming a lariat structure. Subsequently, the exons are ligated and the lariat intron is released. The components of the spliceosome orchestrate different activities in a precisely ordered manner: they recognize the splice sites and the branch point sequences, prevent a premature reaction, perform the splicing reaction, and assemble, remodel, or disassemble the complex. The spliceosome is one of the most complex molecular machines in eukaryotic cells and a complex spliceosome was present in the last eukaryotic common ancestor (LECA) (Collins and Penny 2005).
Eukaryotes have two types of introns that are recognized by different spliceosome complexes. A vast majority of introns are of the U2-type and are recognized by the major spliceosome; the U12-type introns comprise a small minority (Moyer et al. 2020). The minor spliceosome specifically recognizes the U12-type introns and most proteins of the major spliceosome as well as U5 snRNA are also part of the minor spliceosome (Turunen et al. 2013;Bai et al. 2021). The other snRNAs have a minor-spliceosome equivalent (U11, U12, U4atac, and U6atac) and a few minor-spliceosome-specific proteins have been identified, especially in the U11/U12 di-snRNP (Turunen et al. 2013). The minor spliceosome and U12-type introns were also present in LECA (Russell et al. 2006).
In sharp contrast to a probably intron-rich LECA (Csuros et al. 2011;Vosseberg et al. 2022) with a complex spliceosome, prokaryotic genes lack spliceosomal introns, which must have emerged at some time during eukaryogenesis. Spliceosomal introns and the key spliceosomal protein PRPF8 are thought to derive from self-splicing group II introns in prokaryotic genomes. This is based on similarities in the splicing reaction, function, and structure of the RNAs involved, as well as the homology inferred between the spliceosomal protein PRPF8 and the single protein encoded by group II introns, the intron-encoded protein (IEP) (Zimmerly and Semper 2015). Recent work has suggested that the emergence of intragenic introns might have been an early event during eukaryogenesis (Vosseberg et al. 2022). The evolutionary histories of a few gene families in the spliceosome have been described (Anantharaman et al. 2002;Veretnik et al. 2009;Califice et al. 2012) and they suggest that gene duplications played a pivotal role in the emergence of the complex spliceosome. Yet, a detailed picture of the origins of the full spliceosome, one of the most complex machines to emerge during eukaryogenesis, is lacking. This paper details in-depth phylogenetic analyses to reconstruct the spliceosome in LECA and the evolutionary histories of these LECA proteins in the prokaryote-toeukaryote transition. Subsequent integration of the phylogenetic trees with the positions of introns allows to investigate the relation between the origin of the spliceosome and the emergence of intragenic introns. Our findings underline the role of gene duplications in establishing the complex LECA spliceosome and we detected a strong evolutionary link with the ribosome. The intron analyses suggest that the emergence of a complex spliceosome occurred late, relative to the spread of introns.

Complex Composition of the LECA Spliceosome
To infer the evolutionary origin of the LECA spliceosome, it is first necessary to establish which proteins were likely present in the spliceosome in LECA. The most recent systematic inventory of the composition of LECA's spliceosome stems from 2005 (Collins and Penny 2005), and since then, multiple additional proteins, such as the minor-spliceosome-specific proteins, have been traced back to the eukaryotic ancestor. In conjunction with the enormous increase in genomic data, this provides ample reasons to update the reconstruction of the composition of the LECA spliceosome. We carried out this reconstruction by performing homology searches with spliceosomal proteins of humans (supplementary Table S1, Supplementary Material online) and baker's yeast (supplementary Table S2, Supplementary Material online), two species whose spliceosomes are well-studied. We used a strict definition of the spliceosome, which excludes proteins that function in related processes such as the coupling of splicing with transcription and the regulation of splicing. If we identified orthologs in multiple Opimoda and Diphoda species (Derelle et al. 2015) (see Materials and Methods), we inferred that a spliceosomal protein was ancestral. With these criteria, 145 spliceosomal orthogroups (OGs) could be traced to LECA (fig. 1, supplementary Table S3, Supplementary Material online). This number is nearly twice as large as the previously estimated 78 spliceosomal proteins in LECA (Collins and Penny 2005), a consequence of the expanded genomic sampling of eukaryotic biodiversity and increased knowledge on eukaryotic spliceosomes. The inferred number of spliceosomal LECA OGs is slightly lower than the number of spliceosomal proteins in humans (164, only one LECA OG missing) and substantially larger than the number of proteins in the yeast spliceosome (99,86 LECA OGs present). In addition to these proteins, five major spliceosomal snRNAs and four minor-spliceosome-specific snRNAs were also present in LECA.
Unresolved Origin of PRPF8 From Intron-encoded Protein and Additional Group II Introns in Asgard Archaea As described above, the U5-snRNP protein PRPF8 is a remnant of self-splicing group II introns. The prokaryotic origins of this system could, in principle, be inferred from the phylogenetic affinity of IEP and the spliceosomal PRPF8 protein, as the reverse transcriptase (RT)-like domain in PRPF8 is homologous to the RT domain in IEP (Dlakić and Mushegian 2011;Qu et al. 2016;Zhao and Pyle 2016). However, phylogenetic analysis of this domain is hindered by the high sequence divergence of PRPF8, and to a lesser extent its paralog telomerase, relative to prokaryotic RT domains. In our analyses, the nuclear homologs of IEP are not clearly associated with a particular IEP type and their exact phylogenetic position in the IEP tree is unresolved (supplementary fig. S1a, Supplementary Material online).
Group II introns occur predominantly in bacteria. A recent study showed that most complete archaeal genomes do not contain group II introns, with the exception of Methanomicrobia (Miura et al. 2022). We detected group II introns in several Asgard archaeal genomes, which are from multiple different IEP types (supplementary fig. S1b, Supplementary Material online). This finding expands the set of observed IEP types in archaea to also include ML, D, E, CL2A, and a separate CL type. The presence of these "bacterial" mobile elements in Asgard archaea is in good agreement with the diverse mobile elements that were recently found in circular Heimdallarchaeum genomes and the proposed continuous influx of bacterial genes in Asgard archaea (Wu et al. 2022). This so far unappreciated wide diversity of self-splicing group II introns in Asgard archaea might indicate the presence of such elements in the archaeal ancestor of eukaryotes.

Expansion of the Emerging Spliceosome Through Extensive Gene Duplication
All other 144 spliceosomal OGs do not have a homolog in group II introns. We performed phylogenetic analyses to infer their respective evolutionary origins (supplementary Table S4 MBE spliceosomal genes increased the number of spliceosomal proteins with a factor of 1.4. The ancestral spliceosomal units themselves also originated in most cases from duplication, but then, from a gene with another function in the proto-eukaryotic cell. For 33 ancestral units, we could not detect other homologs and these were, therefore, classified as proto-eukaryotic inventions. One single spliceosomal OG, AAR2, was surprisingly found to be one-on-one orthologous to a gene in a limited number of prokaryotes, including Loki-and Gerdarchaeota (Supplementary Information). Over a hundred proteins seemed to have been recruited to the emerging spliceosome at different points during eukaryogenesis. Subsequent duplications of these proteins resulted in an even more complex spliceosome in LECA.
Eukaryotic genomes are chimeric in nature, with genes originating from the Asgard archaea-related host, the alphaproteobacteria-related protomitochondrion or other prokaryotes by means of horizontal gene transfer. The eukaryotic spliceosome mirrors this general trend. It contains considerable numbers of genes from archaeal and bacterial origin, making it a chimeric complex in phylogenetic origin ( fig. 2b). The largest group, however, is comprised of genes for which we could not detect ancient homologs in prokaryotes and possibly originated de novo. This suggests that novel eukaryote-specific folds played a major role in shaping the emerging spliceosome. It is noteworthy that none of the acquisitions from bacteria could be traced back to alphaproteobacteria. This argues against a direct contribution of the mitochondrial endosymbiont to the spliceosome.  OGs that are only present in the minor spliceosome are displayed as minor-spliceosome-specific. The main differences between the major and minor spliceosomes are the presence of a U11/U12 di-snRNP instead of U1 and U2 snRNPs and the replacement of U4 and U6 snRNA with U4atac and U6atac snRNA. Two candidate minor-spliceosome-specific proteins that we identified in this study are shown in the dotted box. snRNP, small nuclear ribonucleoprotein; snRNA, small nuclear RNA; NTC, Prp19-associated complex; NTR, Prp19-related complex; RES, retention and splicing complex; EJC, exon-junction complex; RRM, RNA recognition motif; ZnF, zinc finger; PPIase, peptidylprolyl isomerase; HAT, half-a-tetratricopeptide repeat; Ub, ubiquitin; HEAT/ARM, HEAT or armadillo repeats.

Spliceosomal Proteins Originated Predominantly
Phylogenetics and Intron Positions Illuminate Spliceosomal Origin · https://doi.org/10.1093/molbev/msad011 MBE origin. The U5 snRNP protein EFTUD2 is a paralog of elongation factor 2 ( fig. 3a), which catalyzes ribosomal translocation during translation elongation. The archaeal ortholog performs the same translocation function, yet, also probably plays a role in ribosome biogenesis that is similar to the other proto-eukaryotic paralog EFL1 (Lo Gullo et al. 2021). The U4/U4atac-binding proteins SNU13 and PRPF31 (Nottrott et al. 2002) can be linked to the C/D-box snoRNP ( fig. 3b and c). SNU13 is also part of this snoRNP (Watkins et al. 2000) and PRPF31 originated from a C/D-box snoRNP protein. The archaeal orthologs NOP5 and RPL7Ae are part of the functionally equivalent C/D box sRNP ( fig. 3d), which is involved in ribosome biogenesis by modifying rRNA (Aittaleb et al. 2003;Breuer et al. 2021). The eukaryotic DDX helicases, of which six are part of LECA's spliceosome, evolved from prokaryotic DEAD and RHLE proteins, which also function in ribosome assembly (Charollais et al. 2004

FIG. 2.
Evolutionary history of spliceosomal proteins before LECA. (a) Annotations of the parent nodes of spliceosomal OGs. These parent nodes are shown in red in the example trees below. If the sister OG of a spliceosomal OG is also spliceosomal, the parent node was classified as a within-spliceosome duplication. If no other homologs outside the OG could be detected, the parent node was classified as invention. In case of a non-spliceosomal sister OG, to-and from-spliceosome duplications were distinguished based on the function of other homologs. If the sister group of the OG consisted of prokaryotic sequences, it was classified as an acquisition. The Lsm and Sm heptamer rings that bind U6 or U6atac snRNA and other snRNAs, respectively, are also of archaeal origin. The archaeal homologs, called Sm-like archaeal proteins (SmAPs), are poorly characterized RNA-binding proteins that might function in tRNA processing and RNA degradation (Lekontseva et al. 2021). The SmAP genes are located directly adjacent to ribosomal protein RPL37e (Mura et al. 2013), emphasizing the potential link with translation. The eukaryotic Lsm ring is involved in different forms of RNA processing besides splicing (Mura et al. 2013), including rRNA maturation (Kufel et al. 2003). During eukaryogenesis, the Lsm ring gained a U6(atac) binding function and was recruited into the spliceosome. Subsequent gene duplications resulted in two types of heteromeric rings of Lsm/Sm proteins in the spliceosome (supplementary fig. S3, Supplementary Material online, Supplementary Information). Phylogenetics and Intron Positions Illuminate Spliceosomal Origin · https://doi.org/10.1093/molbev/msad011 MBE A substantial fraction of the LECA spliceosome OGs contains an RNA recognition motif (RRM) ( fig. 1). The proteins in this family perform diverse functions, as this domain can not only bind RNA, but is also involved in protein-protein interactions (Maris et al. 2005). RRM proteins were likely acquired from a bacterium during eukaryogenesis, as proteins with this domain are present in some bacteria. Although the tree is largely unresolved due to the short length of the motif, multiple recruitments into the spliceosome can be observed, some followed by intraspliceosome duplications (supplementary fig. S4a, Supplementary Material online). Functions of other RRM proteins that are closely related to the spliceosome OGs include transcription, splice site selection, and mRNA degradation. Some OGs contain multiple RRMs, pointing at a rich history of domain and gene duplications before LECA in this family.
Many Minor-spliceosome-Specific Proteins are Closely Related to a Major Spliceosome Protein The major and minor spliceosomes share many subunits (Turunen et al. 2013;Bai et al. 2021) and this was very likely also the case in LECA. We inferred 13 minor-spliceosomespecific proteins in LECA ( fig. 1). Six of these are closely related to a major-spliceosome-specific protein. The RRM proteins SNRNP35 and ZRSR have a major spliceosome equivalent as their sister paralog (supplementary fig. S4b and c, Supplementary Material online). RNPC3 is closely related to SNF but probably not as its sister paralog (supplementary fig. S4d, Supplementary Material online). The sister paralog of RNPC3 is the poorly characterized RBM41. Its phylogenetic profile, however, corresponds with minor spliceosome OGs (de Wolf et al. 2021). If RBM41 is part of the minor spliceosome, the RNPC3-RBM41 duplication would represent the only identified duplication within the minor-spliceosome-specific OGs. The phylogenetic position of the other minor spliceosome OGs with an RRM is unresolved (supplementary fig. S4a, Supplementary Material online). ZMAT5 and SCNM1 are members of the U1-type zinc finger family. The equivalent of ZMAT5 in the major spliceosome is SNRPC ) and SCNM1 functions as a combination of SF3A2 and SF3A3 (Bai et al. 2021 This protein has not been characterized either, yet its phylogenetic profile strongly suggests a function in the minor spliceosome. A peculiar observation that we made for all major/minor pairs mentioned above is that the branch in the phylogenetic tree leading from duplication to the minor-spliceosome-specific OG is considerably shorter than the one leading to the major-spliceosome-specific OG (supplementary fig. S5d, Supplementary Material online). This means that these major-spliceosome-specific OGs have diverged more from the ancestral preduplication state and suggests that the function of the minor-spliceosome-specific SNRNP35, ZRSR, RNPC3, ZMAT5, SCNM1, TXNL4B, and possibly WDR25, better reflect the ancestral state.

Substantial Intron Spread Predating Spliceosomal Duplications
In a previous study, we investigated the spread of introns in proto-eukaryotic paralogs (Vosseberg et al. 2022). Intron positions that are shared between genes that duplicated during eukaryogenesis are likely shared because they were present in the gene before it duplicated. By analyzing intron positions in spliceosomal OGs, we can relate duplications in the primordial spliceosome to the spread of the elements that they function on, the introns. Therefore, we applied the same approach as in our previous study to the paralogs in the spliceosome. 45% of duplications that probably resulted in a novel spliceosomal gene had at least one intron traced back to the preduplication state (13 of the 29 to-spliceosome duplications). For 46% of the within-spliceosome duplications, we detected shared introns between paralogs in the spliceosome (18 out of 39).
The presence of introns in ancestral genes that themselves likely did not function in the spliceosome is strikingly illustrated by the DDX and DHX helicases, with three to seven introns traced back to before the first duplication after the acquisition from prokaryotes ( fig. 3). Introns shared between spliceosomal paralogs are also found in the LSM, PPIase, and WD40 families (supplementary fig.  S3a, Supplementary Material online, supplementary Table S5, Supplementary Material online). The U5 snRNP proteins SNRNP200 and EFTUD2, which interact with PRPF8, shared multiple introns with paralogs outside the spliceosome and likely contained introns before they became part of the spliceosome ( fig. 3, supplementary Table S5, Supplementary Material online). These numbers and cases suggest that introns were already present in a substantial number of ancestral genes before the corresponding proteins were recruited into the spliceosome and, subsequently, duplicated within the spliceosome.

Duplication and Subfunctionalization Completed Multiple Times After Eukaryogenesis: U1A/U2Bʺ
A notable difference between the LECA spliceosome and the human and yeast spliceosomes is the presence of two proteins in both human and yeast stemming from a single SNF protein in LECA. In early studies, the single SNF protein in Drosophila melanogaster was seen as the derived state and two separate proteins, U1A and U2Bʺ, were proposed to represent the ancestral state  Phylogenetics and Intron Positions Illuminate Spliceosomal Origin · https://doi.org/10.1093/molbev/msad011 MBE Drosophila SNF has a dual role in the spliceosome. It is part of the U1 snRNP, where it binds U1 snRNA, and part of the U2 snRNP, where it binds U2 snRNA and U2A' (Weber et al. 2018). In humans and yeast, U1A and U2Bʺ have subfunctionalized and perform the respective functions as indicated by the snRNP in their name. To assess whether a similar subfunctionalization has occurred in other lineages where SNF had duplicated, we looked for patterns of recurrent sequence evolution in the different paralogs with our previously published pipeline (von der Dunk and Snel 2020). Two fates could be distinguished, which we refer to as U1A and U2Bʺ based on the fates in model organisms. This distinction was based on a diffuse, mainly U1A-specific signal. Upon inspection of the two fate clusters and comparison with single SNF orthologs, the fate separation seemed to be predominantly based on recurrent substitutions in the first RRM of U1A and the recurrent loss of the second RRM in U2Bʺ ( fig. 4). We inferred 16 RRM loss events in U2Bʺ-fate proteins (supplementary fig. S6b, Supplementary Material online). These recurrent sequence changes allow us to predict which inparalog is likely to have a U1A function and which one has a U2B" function in organisms where detailed biochemical studies are lacking. Besides these remarkable findings on recurrent sequence evolution, the repeated post-LECA duplications suggest that the complexification of the spliceosome by duplication during eukaryogenesis could, in part, have been driven by the same process as happened multiple times after LECA.

A Chimeric Complex Spliceosome that Postdates the Initial Proliferation of Introns Through the Genome
The spliceosome is one of the most complex molecular machines in present-day eukaryotes. In this study, we reconstructed the composition of the spliceosome in LECA and traced the sometimes byzantine evolutionary histories of these 145 inferred spliceosomal proteins prior to LECA. Previous research has established that the core of the spliceosome-the U2, U5, and U6 snRNAs and PRPF8-as well as the spliceosomal introns themselves evolved from self-splicing group II introns (Zimmerly and Semper 2015). Proteins of archaeal and bacterial origin were added to this core, especially proteins that performed a function in ribosome biogenesis or translation. For many proteins, we could not detect other homologous proteins, suggesting that the primordial spliceosome expanded with spliceosome-specific folds. Subsequent expansions resulted from the numerous gene duplications that we observed. These duplications enabled us to assess the extent of intron positions that were shared between paralogs and likely predated the duplication event (Vosseberg et al. 2022). Our ancestral intron position reconstructions support the presence of introns in almost half of the proteins before their recruitment into the spliceosome. This suggests that introns were already widespread through the genome when most components of the complex spliceosome emerged. The increase in spliceosomal complexity did not coincide with the initial widespread increase in intron numbers, but followed it instead. Additional introns were probably inserted in spliceosomal genes after duplication. We propose a scenario in which intragenic introns emerged early in eukaryogenesis and the complex spliceosome emerged relatively late.
The group II introns that gave rise to the spliceosomal introns are commonly proposed to have come from the protomitochondrion (Cavalier-Smith 1991; Martin and Koonin 2006). Notwithstanding the extent of horizontal gene transfer of organellar group II introns among eukaryotes (Zimmerly et al. 2001), group II introns were probably present in the mitochondria in LECA (Kim et al. 2022). Our analysis did not yield sufficient phylogenetic signal to confidently position PRPF8 in the IEP tree. However, the identification of multiple intron types in Asgard archaea makes an alternative scenario plausible, in which group II introns were present in the archaeal genome before mitochondrial endosymbiosis (Vosseberg and Snel 2017;Vosseberg et al. 2022).
Proteins involved in the assembly and functioning of another large RNP in the cell, the ribosome, became part of the primordial spliceosome, supplemented with other RNA-binding proteins. The evolutionary link with the ribosome emphasizes the comparable composition as a RNP with catalyzing RNA molecules (ribozymes). In contrast with the other spliceosomal snRNAs, the U1/U11 and U4/U4atac snRNAs did probably not originate from the introns themselves. However, an evolutionary link with translation and rRNA processing is present for these snRNAs too. U1/U11 snRNA likely evolved from a tRNA (Hogeweg and Konings 1985). The evolutionary histories of SNU13 and PRPF31 and similarities between U4 and C/D-box RNAs suggest that the U4(atac) snRNP evolved from a C/D-box snoRNP (Watkins et al. 2000).
The contribution of gene duplications in shaping the LECA spliceosome is in line with the central role of duplications in establishing eukaryotic features during eukaryogenesis (Makarova et al. 2005;Vosseberg et al. 2021). Gene duplications were key for the emergence of spliceosome-specific proteins from proteins that were part of other complexes as well as for expanding proteins that were already part of the spliceosome. This pattern has also been observed for the kinetochore ). These kinetochore proteins, however, came from a wider variety of cellular processes compared with the spliceosome. The origin of another eukaryotespecific complex, the nuclear pore, compares well with the spliceosome regarding the chimeric prokaryotic ancestry of its components (Mans et al. 2004). This is unlike complexes and processes that predated eukaryogenesis, such as transcription and translation, which have a more consistent phylogenetic signal (

Origin of Two Types of Introns and Two Types of Spliceosomes
Two types of introns were present in the LECA genome, U2 and U12, which were removed from the primary transcripts by the LECA major and minor spliceosome, respectively (Russell et al. 2006). The far majority of introns were probably of the U2-type (Vosseberg et al. 2022). Different scenarios have been postulated for the emergence of two types of introns (Burge et al. 1998). In some scenarios, different intron types diverged from an ancestral set of introns, either in the same proto-eukaryotic lineage or two separate lineages that later fused. An alternative scenario proposes that the two types of introns originated from two separate introductions of group II introns in the genome. Previously, we called the separate introductions scenario unlikely based on the observed U12-type introns that are shared between proto-eukaryotic paralogs (Vosseberg et al. 2022). The enormous overlap in composition between the major and minor spliceosomes (Turunen et al. 2013;Bai et al. 2021) refutes separate origins of these complexes from different group II introns. Many minor-spliceosome-specific proteins have a close homolog in the major spliceosome and the minor-spliceosome-specific snRNAs have equivalents in the other spliceosome type. This suggests that the divergence between the major and minor spliceosomes occurred relatively late in pre-LECA spliceosome evolution, after the addition of U1 and U4 snRNA and U1 and U2 snRNP proteins. The minor-spliceosome-specific proteins were estimated to have accumulated fewer substitutions after the duplications that separated major-and minor-spliceosome-specific OGs. This suggests that the latter better reflect the ancestral situation. The U12-type introns and the minor spliceosome might, therefore, have originated earlier than the abundant U2-type introns and the major spliceosome.

Evolution of Spliceosomal Complexity
During eukaryogenesis, the recruitment of proteins and gene duplications resulted in an increase in spliceosomal complexity. Spliceosomal evolution after LECA is, in most eukaryotic lineages, dominated by simplification. A clear example is the minor spliceosome, which was lost recurrently at least 23 times (Supplementary Information). Certain lineages have experienced a substantial loss of spliceosomal genes that were part of the LECA spliceosome (supplementary fig. S7, Supplementary Material online). Only 59% of the LECA OGs are present in Saccharomyces cerevisiae, for example. Reduced spliceosomes have also been described in red algae and diplomonads (Hudson et al. 2019;Wong et al. 2022).
The most prominent example of a more complex spliceosome after LECA is the duplication of SNF in at least 22 lineages. To the best of our knowledge, this is the highest number of independent gene duplications in eukaryotes reported so far. It is slightly more than the 16 MadBub duplications (Tromer et al. 2016) and the 20 EF1β/δ duplications that were described before (von der Dunk and Snel 2020). We detected patterns of recurrent sequence evolution in the resulting paralogs, pointing at similar fates of these paralogs across eukaryotes. Given the described fates of the SNF paralogs in vertebrates, fungi, plants, and Caenorhabditis elegans, a similar subfunctionalization into dedicated U1 and U2 snRNP proteins in other lineages with duplications is likely.
The recurrent loss of the second RRM in proteins with a predicted U2Bʺ fate suggests that the function of this RRM is mainly restricted to the U1A role. Whereas the function of the first RRM has been described as binding to U1 and U2 snRNA, the function of the second RRM has remained elusive (Williams et al. 2013). The observation of recurrent loss of this RRM in specifically U2Bʺ proteins provides possible directions for further molecular research.
The dual-function SNF protein seems to be poised for duplication and subsequent subdivision of the roles in the U1 and U2 snRNP. It is tempting to speculate that the recurrent duplication of SNF indicates that this specific gene duplication and subsequent subfunctionalization could, in principle, have occurred during eukaryogenesis instead. Because it did not happen to be duplicated then, it could be seen as "unfinished business" during eukaryogenesis. The cases of independent gene duplications after LECA might be used as a model for proto-eukaryotic gene duplications. Because these duplications happened relatively recently, experiments based on ancestral protein reconstructions can be performed more reliably, as has been done for the SNF family in deuterostomes (Williams et al. 2013;Delaney et al. 2014). These experiments can provide insight into the role of adaptive or neutral evolution (Finnigan et al. 2012) in creating the complex spliceosome (Vosseberg and Snel 2017).

Investigating the Emergence of the Complex Eukaryotic Cell
Our study provides a comprehensive view on the origin of the numerous proteins in this complex molecular machine, also in relation to the spread of the introns it functions on. Further studies on the spliceosome composition in diverse eukaryotes have the potential to identify more spliceosomal proteins in LECA. New developments in detecting deep homologies (Jumper et al. 2021;Monzon et al. 2022) could reveal additional links for the spliceosomal proteins that we classified as inventions in this study. Phylogenetic analyses combined with intron analyses on the numerous other complexes that emerged during eukaryogenesis could further illuminate their origin and, thereby, the major transition from prokaryotes to eukaryotes.

Data
We used a diverse set of 209 eukaryotic and 3,466 prokaryotic (predicted) proteomes, as compiled for a previous study (Vosseberg et al. 2021) from different sources Phylogenetics and Intron Positions Illuminate Spliceosomal Origin · https://doi.org/10.1093/molbev/msad011 MBE (Huerta-Cepas et al. 2016;Zaremba-Niedzwiedzka et al. 2017;Deutekom et al. 2019). Proteins from 167 of the eukaryotic species had been grouped in OGs using different approaches (Deutekom et al. 2021). To illuminate the evolutionary history of some protein families (see below), we made use of the widely expanded set of Asgard archaeal genomes that has come available since. By including genomes from numerous studies (Liu et al. 2018(Liu et al. , 2021Tully et al. 2018;Huang et al. 2019;Seitz et al. 2019;Imachi et al. 2020;Farag et al. 2021;Sun et al. 2021;Zhao and Biddle 2021;Wu et al. 2022), the number of Asgard archaeal proteomes in our expanded set amounted to 133 in total. If no predicted proteome was available, the genomes were annotated with Prokka v1.13 (Seemann 2014) for the genomes from (Liu et al. 2018;Seitz et al. 2019) or v1.14.6 with the metagenome option for the genomes from (Farag et al. 2021;Liu et al. 2021).

Reconstructing LECA's Spliceosome
To infer the composition of the spliceosome in LECA, we searched for orthologs of proteins in the well-studied Homo sapiens and Saccharomyces cerevisiae spliceosome complexes in other eukaryotic proteomes. A list of human and budding yeast spliceosomal proteins was obtained from the UniProt database (The UniProt Consortium 2019) on February 26, 2020, only including manually reviewed proteins (supplementary Table S1, 2, Supplementary Material online). Proteins that are involved in other processes (such as transcription and polyadenylation) and splice site selection and splicing regulation were removed. The list was supplemented with human spliceosomal proteins from recent literature (Bai et al. 2021;de Wolf et al. 2021;Sales-Lee et al. 2021). Initial evolutionary scenarios of these proteins were inferred based on the approach of Van Hooff et al. (2019). In short, the human and yeast protein sequences were searched against our in-house eukaryotic proteome database (Deutekom et al. 2019) with blastp (Altschul et al. 1990 (Katoh and Standley 2013). These alignments were trimmed with trimAl v1.4.rev15 (gappyout option) (Capella-Gutiérrez et al. 2009) and a phylogenetic tree was inferred with IQ-TREE v1.6.4 (Nguyen et al. 2015) using the LG + G4 model to establish the initial scenario (van Hooff et al. 2019): (i) easy, in case of orthologs in a diverse set of eukaryotes including at least two Opimoda and two Diphoda species (Derelle et al. 2015); (ii) ancient (pre-LECA) duplication, when the set of homologs also includes clades of more distantly related homologs across eukaryotes; (iii) lineage-specific (post-LECA) duplication, when the spliceosomal function likely originated after LECA; (iv) taxonomically limited, with homologs in a limited set of eukaryotes. The latter cases were further studied by checking hits in the complete set of eukaryotes. For SNRNP27, CASC3, and WBP11, hits to the more sensitive Pfam models PF08648, PF09405, and PF09429 (Finn et al. 2016) detected before (Vosseberg et al. 2021) were used instead of the BLAST-based homologs.
In case of an easy or ancient duplication scenario, a LECA OG was defined. If members of this LECA OG were present in both the human and yeast spliceosomes, it was classified as a LECA spliceosome OG. Yeast LIN1 (CD2BP2 ortholog) and PRP24 (SART3 ortholog) and human LUC7L and LUC7L2 (LUC7 orthologs) were not in the initial set, but their ortholog was. These were included in the original list because these were also clearly described as spliceosomal in the literature. If an ortholog was not present in yeast, spliceosomal annotations for orthologs in S. pombe, A. thaliana (both in the UniProt database) or Cryptococcus neoformans (Sales-Lee et al. 2021) were checked. If an ortholog was not present in humans, the function of the A. thaliana ortholog was investigated. If these orthologs were not characterized, they were classified as spliceosomal in LECA if their close paralog was also in the spliceosome, or if they only had an annotated spliceosomal function. If their main function was in the spliceosome or if they were not well-characterized, they were classified as possibly spliceosomal. In case of multiple functions, the OG was discarded. The reconstruction of spliceosome OGs in LECA is summarized in supplementary Table S3, Supplementary Material online.

Inferring pre-LECA Evolutionary Histories
To trace the pre-LECA histories of the inferred spliceosomal LECA proteins, we performed phylogenetic analyses of these proteins with other eukaryotic OGs and with prokaryotic proteins that are homologous to the spliceosomal proteins. We started by analyzing the domain composition of the proteins and looking for these domains or fulllength proteins in trees that we created for a previous study (Vosseberg et al. 2021). Additional phylogenetic analyses were performed for the families described below. Multiple sequence alignments were made with MAFFT v7.310 (Katoh and Standley 2013) and, subsequently, trimmed to remove parts of the alignment of low quality with trimAl v1.4.rev15 (Capella-Gutiérrez et al. 2009) (Minh et al. 2020) with the best substitution model among nuclear models including LG + C{10,20,30,40,50,60} mixture models identified by ModelFinder (Kalyaanamoorthy et al. 2017). Mixtures models with an F-class were not considered, as recently recommended (Baños et al. 2022). Branch supports were calculated with 1,000 ultrafast bootstraps Vosseberg et al. · https://doi.org/10.1093/molbev/msad011 MBE (Hoang et al. 2018) and the SH-like approximate likelihood ratio test (Guindon et al. 2010). Topologies were compared using the approximately unbiased test (Shimodaira 2002) with 10,000 replicates.
Duplications that resulted in spliceosomal OGs were functionally annotated based on the function of other homologous sequences and the tree topology using Dollo parsimony. In two cases for which we could only detect one paralogous OG with a non-spliceosomal function, the preduplication ancestor was annotated with the non-spliceosomal function. Clades containing only within-spliceosome duplications were collapsed to obtain ancestral spliceosomal units.

IEP-PRPF8
Representative sequences of prokaryotic and organellar IEP sequences and other prokaryotic RT-containing sequences were chosen from two datasets (Candales et al. 2012;Toro and Nisa-Martínez 2014) and supplemented with four Asgard archaeal IEP sequences (Zaremba-Niedzwiedzka et al. 2017). We also selected slowly evolving representatives for PRPF8 and TERT. For the tree that included PRPF8 and TERT, separate alignments were made for the prokaryotic and organellar IEP (E-INS-i algorithm), PRPF8 and TERT sequences (both with L-INS-i). We extracted the RT fingers-palm and thumb domains from these alignments based on a published structural alignment (Qu et al. 2016). The extracted domains were aligned and a tree was inferred. A constrained tree search with a monophyletic PRPF8 and TERT clade was additionally performed.
We  (Buchfink et al. 2015) searches on the expanded Asgard set. Proteins assigned to COG3344 were combined with the selection of IEP sequences; non-IEP COG3344 hits were discarded based on a preliminary phylogenetic tree.

AAR2
Only three prokaryotic AAR2 homologs were detected in the initial dataset based on hits to the PF05282 model (Vosseberg et al. 2021), one in Limnospira maxima and two in Lokiarchaeum. We used the same approach to detect additional hits in the expanded set of Asgard archaea by running hmmsearch (HMMER v3.3.2 (Eddy 2011)) with the Pfam 31.0 hidden Markov models (Finn et al. 2016) using the gathering thresholds. Additionally, hmmsearch with the PF05282.14 model was performed on the EBI server (https://www.ebi.ac.uk/Tools/hmmer/search/hmmsearch) against the UniProtKB database on April 21, 2022.

EFTUD2
The EF2 family has undergone multiple duplications in archaeal and eukaryotic evolution resulting in two orthologs in the last Asgard archaeal common ancestor and three in LECA (Narrowe et al. 2018). The latter are represented in eukaryotic eggNOG families (euNOGs) KOG0467, KOG0468, and KOG0469. To increase the phylogenetic resolution, we used a ScrollSaw-inspired approach (Elias et al. 2012;Vosseberg et al. 2021;van Wijk and Snel 2022) to select slowly evolving sequences from four main eukaryotic clades (Amorphea, Diaphoretickes, Discoba, and Metamonada). Asgard archaeal sequences assigned to COG0480 were aligned with E-INS-i. The alignment was trimmed with trimAl (-gt 0.5) and a tree was inferred using the LG + G4 model. Hodarchaeal representatives and other Asgard sequences from the same Asgard archaeal OG (see Supplementary Information) were combined with the eukaryotic sequences.

PRPF31 and SNU13
For PRPF31, the sequences in the PF01798 tree were replaced with the corresponding full-length sequences to increase the phylogenetic signal. Based on the PF01248 tree, which includes SNU13, we chose two slowly evolving Opimoda and two Diphoda sequences (Derelle et al. 2015) per OG, supplemented with the archaeal RPL7Ae sequences. Full-length sequences were used for subsequent phylogenetic inference.

DDX Helicases
Slowly evolving eukaryotic DDX helicase sequences were selected using the ScrollSaw-based approach on the sequences that were assigned to the euNOGs that are part of the COG0513 cluster (Makarova et al. 2005). An alignment of these sequences was created (E-INS-i, trimAl -gt 0.5) and a phylogenetic tree was inferred with FastTree v2.1.10 (LG model) (Price et al. 2010). From this tree, we selected per OG the sequence on the shortest branch for each of the four eukaryotic clades (if present and not on a deviating long branch). The selected sequences were split into the two inferred acquisitions and combined with prokaryotic COG0513 representatives.

DHX Helicases
A similar approach as for the DDX helicases was applied to the COG1643 cluster (Makarova et al. 2005). The initial tree was based on an alignment created with E-INS-i and trimAl (gappyout option) and made using the LG + F + R8 model in IQ-TREE. An unclear clade with multiple OGs was reduced and sequences from the missing DHX40 OG were added.

LSM
To elucidate the pre-LECA history of the Lsm/Sm proteins, we initially made a tree combining the eukaryotic sequences from LECA OGs in the Sm-like Pfam clan (PF01423, PF12701, and PF14438). We selected slowly evolving sequences as described for the DDX and DHX helicases from the resulting tree (alignment with FFT-NS-I, trimming with trimAl (-gt 0.1), tree with the LG + G4 model). LSM14 and ATXN2 were not included in the selection because of their divergent nature. The full-length sequences in the expanded Phylogenetics and Intron Positions Illuminate Spliceosomal Origin · https://doi.org/10.1093/molbev/msad011 MBE set of Asgard archaea that were PF01423 hits were used for the SmAP tree. We selected representatives from the different clades and combined these with the full-length versions of the previously selected eukaryotic sequences. We also performed a constrained tree search with one monophyletic eukaryotic clade.

RRM and TXNL4
We identified LECA OGs in the PF00076 (RRM) tree based on automatic annotation and manual assessment (i.e., a high support value and substantial pre-LECA branch length). Per OG, the Opimoda and Diphoda sequences on the shortest branch were selected. For the different subtrees, we selected full-length sequences in the OGs from H. sapiens, A. castellanii, A. thaliana, Aphanomyces astaci, Monocercomonoides sp., and N. gruberi. For RBM41, the Selaginella moellendorffii sequence was included to replace the missing A. thaliana ortholog. To illustrate the relationship between TXNL4A and TXNL4B in the larger thioredoxin family, we used orthologs from the same species as chosen for the RRM subtrees.

U1-type Zinc Finger
Slowly evolving sequences from the euNOGs in the smart00451 cluster (Makarova et al. 2005), supplemented with the SCNM1 euNOG ENOG410IW6J, were selected with the aforementioned ScrollSaw-based approach. These sequences were aligned with the E-INS-i algorithm and the resulting alignment was trimmed with trimAl (-gt 0.25). Based on the inferred tree with the VT + R4 model, we selected the shortest branching sequences per OG from each of the four eukaryotic groups.

WD40
The ScrollSaw-based approach was also applied to the euNOGs in the COG2319 cluster (Makarova et al. 2005), using bidirectional best hits between the Opimoda and Diphoda species instead, because of the size of this protein family. An alignment of the selected sequences was made (E-INS-i, trimAl gappyout) and a tree was inferred (LG + R4 model). Per OG, the shortest branching Opimoda and Diphoda sequences were chosen. PPWD1 and some potential sister OGs based on the BLAST trees were not in the COG2319 cluster. We followed a similar approach to identify slowly evolving sequences for these euNOGs (KOG0882, ENOG410IQTX, −0KD7K, and −0IF90), using a different gap threshold (50%) and substitution model (LG + R3). Based on the BLAST trees and the COG2319 cluster tree, we identified potential sister OGs and inferred a tree with these OGs and the spliceosomal OGs.

Ancestral Intron Position Reconstructions
We performed ancestral intron position reconstructions for the identified pre-LECA paralogs in the entire clade or only for the spliceosomal OGs and sister OGs (supplementary Table S5, Supplementary Material online), depending on the number of OGs in an acquisition or invention. To establish the content of the OGs, we started with the euNOG assignments. If the taxonomic distribution of the euNOG was limited, we continued with the Broccoli (Derelle et al. 2020) OG assignments (Deutekom et al., 2021). A phylogenetic tree of the OG was inferred to check for the presence of non-orthologous or dubious sequences and remove these (E-INS-i, trimAl -gt 0.5 or -gappyout, FastTree -lg). After cleaning up the OGs, a final E-INS-i alignment was made. Except for the alignment with PRPF8 and TERT, which was based on the RT domain (see "IEP-PRPF8' above), full-length sequences were used for this alignment. Intron positions were mapped onto the alignment using the method described before (Vosseberg et al. 2022). LECA introns were inferred with Malin (Csűrös 2008) using the intron gain and loss rates that we previously estimated for the KOG clusters (Vosseberg et al. 2022). Pre-duplication introns were inferred using Dollo parsimony.

Recurrent Duplication and Subfunctionalization of SNF
To identify post-LECA duplications, SNF sequences were aligned with E-INS-i and this alignment was trimmed with Divvier. The SNF tree was inferred with the LG + C50 + R6 model and manually reconciled with the species tree to annotate gene duplication events. We looked at potential duplications in more detail by remaking trees of specific parts of the tree, including additional species from our original set (Deutekom et al. 2019). Prior to making the final alignment, we removed additional in-paralogs, probable fission events or partial annotations and the sequences from Guillardia theta, which had likely acquired a third copy from its endosymbiont. The final alignment was made with the E-INS-i algorithm. This alignment and the annotated duplication events were used as input for our previously published pipeline to identify patterns of recurrent sequence evolution after independent gene duplications (von der Dunk and Snel 2020).